Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

Trustworthy AI

One-line summary

"An AI system that is simultaneously reliable, safe, fair, transparent, accountable, and privacy-preserving." A frame that combines the NIST AI RMF (2023) and the EU AI Act (in force since August 2024, generally applicable from 2 August 2026). Trustworthiness is a multi-dimensional balance rather than a single property, so the key is to state the trade-offs explicitly.

Core concepts

Seven trustworthiness characteristics (NIST AI RMF)

Valid & Reliable: accurate on the intended task and stable in the deployment environment.
Safe: avoids physical, psychological, and environmental harm.
Secure & Resilient: defends against adversarial attacks, data poisoning, and prompt injection.
Accountable & Transparent: states who is responsible and how decisions are made.
Explainable & Interpretable: discloses reasoning at a level suited to each stakeholder.
Privacy-Enhanced: data minimization, differential privacy (DP), federated learning.
Fair, with Harmful Bias Managed: measures and mitigates disparate impact.

EU AI Act risk tiers (generally applicable from August 2026)

Unacceptable: social scoring, real-time remote biometric identification in public spaces (banned, with narrow exceptions).
High-risk: medical, hiring, credit, education. Conformity assessment and CE marking are required.
Limited risk: chatbots, deepfakes. Transparency obligation (disclose that the content or counterpart is AI).
Minimal risk: spam filters, video game AI. Voluntary codes of conduct.
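
The tier list above can be read as a lookup from use case to obligations. The following is a minimal sketch built only from the examples named in that list; the names `RiskTier`, `USE_CASE_TIER`, and `obligations_for` are hypothetical, and a real classification depends on the Act's annexes and legal review rather than on a keyword table.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# Hypothetical lookup that mirrors only the examples in the tier list above.
USE_CASE_TIER = {
    "social scoring": RiskTier.UNACCEPTABLE,
    "medical": RiskTier.HIGH,
    "hiring": RiskTier.HIGH,
    "credit": RiskTier.HIGH,
    "education": RiskTier.HIGH,
    "chatbot": RiskTier.LIMITED,
    "deepfake": RiskTier.LIMITED,
    "spam filter": RiskTier.MINIMAL,
    "video game ai": RiskTier.MINIMAL,
}

# Obligations as summarized in the list above (not a legal checklist).
OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: ["do not deploy"],
    RiskTier.HIGH: ["conformity assessment", "CE marking"],
    RiskTier.LIMITED: ["disclose that the user is interacting with AI"],
    RiskTier.MINIMAL: ["voluntary code of conduct"],
}


def obligations_for(use_case: str) -> list[str]:
    """Return obligations for a known use case; unknown cases are flagged for review."""
    tier = USE_CASE_TIER.get(use_case.strip().lower())
    if tier is None:
        return ["unclassified: needs legal review"]
    return OBLIGATIONS[tier]
```

Unknown use cases deliberately fall through to a review flag instead of defaulting to the minimal tier, since a wrong default here is the costly direction.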

Governance lifecycle (NIST AI RMF core functions)

Map: context, stakeholders, risk identification.
Measure: quantitative and qualitative metrics.
Manage: mitigation, monitoring, incident response.
Govern: policies, roles, accountability (cuts across the other three functions).

Applications

High-risk deployment: a healthcare diagnosis AI needs dual conformity with FDA requirements and the EU AI Act.
LLM production: prompt injection defense, PII redaction, and an output filter.
Hiring algorithm: NYC Local Law 144 (bias audit) plus EEOC compliance.
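
As an illustration of the PII redaction step named for LLM production, here is a minimal regex-based sketch. The two patterns and the `redact_pii` helper are assumptions made for this example; they cover only simple email and phone shapes, and a production system would more likely rely on a vetted PII detection library.

```python
import re

# Assumed, deliberately simple patterns; they will miss many real-world PII formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3,4}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_pii("Mail jane.doe@example.com or call 010-1234-5678."))
```

Redaction of this kind would typically run on both the prompt sent to the model and the text it returns, so that neither logs nor responses carry the raw values.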

💻 Patterns

Bias measurement (group fairness)

from fairlearn.metrics import (
    MetricFrame, demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics={"accuracy": accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=df_test["gender"],
)
print(mf.by_group)

dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=df_test["gender"])
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=df_test["gender"])
print(f"DP diff: {dpd:.3f}, EO diff: {eod:.3f}")
# Rule of thumb: |DP diff| > 0.1 suggests possible disparate impact and warrants a closer look
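
To show what the demographic parity number measures, the sketch below recomputes it by hand on a toy sample: the selection rate (share of positive predictions) per group, the gap between the highest and lowest rate, and the ratio of the two, which is the quantity behind the four-fifths rule and bias-audit impact ratios. The helper names and the toy data are made up for illustration, and the ratio helper assumes at least one group has a positive prediction.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Share of positive predictions (label 1) within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_diff(y_pred, groups):
    """Largest minus smallest group selection rate (0 means parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


def impact_ratio(y_pred, groups):
    """Smallest over largest selection rate; assumes the largest rate is non-zero."""
    rates = selection_rates(y_pred, groups)
    return min(rates.values()) / max(rates.values())


# Toy data: group A is selected 3 times out of 4, group B once out of 4.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(selection_rates(y_pred, groups))
print(demographic_parity_diff(y_pred, groups))
print(impact_ratio(y_pred, groups))
```

Here the rates are 0.75 and 0.25, so the difference is 0.5 and the ratio is one third, well below the 0.8 level that the four-fifths rule treats as a warning sign.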

LLM output guardrail (Llama Guard 3)

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-Guard-3-8B"
guard = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

def is_safe(user_msg: str, assistant_msg: str) -> tuple[bool, str]:
    # Build and tokenize the moderation prompt in one step so that
    # special tokens are not added a second time by a separate tokenizer call.
    inputs = tok.apply_chat_template(
        [{"role": "user", "content": user_msg},
         {"role": "assistant", "content": assistant_msg}],
        return_tensors="pt",
        return_dict=True,
    )
    out = guard.generate(**inputs, max_new_tokens=20)
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    verdict = tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
    # The model answers "safe", or "unsafe" followed by a line of category codes.
    return verdict.startswith("safe"), verdict

Differential privacy (Opacus)

from opacus import PrivacyEngine
import torch

# MyModel and loader (a torch DataLoader) are assumed to be defined elsewhere.
model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=loader,
    target_epsilon=3.0, target_delta=1e-5, epochs=10, max_grad_norm=1.0,
)
# ε = 3 is a moderate privacy budget; a smaller ε gives a stronger guarantee, usually at some cost in accuracy
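
To make the `max_grad_norm` parameter more concrete, here is a conceptual sketch of the clip-then-add-noise step at the heart of DP-SGD, written in plain Python. It is not how Opacus is implemented internally; the function names are hypothetical, gradients are plain lists of floats, and the noise scale is taken as `noise_multiplier * max_grad_norm`.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-sample gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(clipped) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
print(private_mean_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single sample can move the update, and that bound is what lets calibrated noise translate into an (ε, δ) guarantee.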

Model card (Hugging Face)

# README.md frontmatter
# Only language and license follow the Hub's standard metadata schema here;
# the other fields use a custom structure and are often written in the card body instead.
language: en
license: apache-2.0
intended_use:
  primary: "English sentiment classification (product reviews)"
  out_of_scope: ["clinical text", "non-English", "financial advice"]
training_data:
  source: "Amazon reviews 2018-2024 (50M samples)"
  known_biases: ["English-skewed", "tech product overrepresented"]
metrics:
  accuracy: 0.92
  demographic_parity_diff: 0.04
limitations:
  - "Weak sarcasm detection (F1 0.61)"
  - "Long reviews (>1000 tokens) are truncated"
ethical_considerations:
  - "Do not use for hiring or loan decisions"

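Explainability (SHAP for tabular)

import shap

# model is assumed to be a fitted tree-based, single-output model and X_test a pandas DataFrame.
# For multi-class classifiers, shap_values may hold one array per class and needs indexing by class first.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Individual explanation for the first row (in a notebook, call shap.initjs() first)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

# Global feature importance
shap.summary_plot(shap_values, X_test)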

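Adversarial robustness test

from textattack.attack_recipes import TextFoolerJin2019
from textattack.attack_results import FailedAttackResult
from textattack.models.wrappers import HuggingFaceModelWrapper

wrapped = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(wrapped)
# test_samples: iterable of (text, ground_truth_label) pairs
results = [attack.attack(text, label) for text, label in test_samples]
# A sample is robust only if the model was right and the attack failed to flip it.
# Skipped samples (already misclassified) and successful attacks both count against it.
robust_acc = sum(isinstance(r, FailedAttackResult) for r in results) / len(results)
print(f"Robust accuracy: {robust_acc:.2f}")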
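
The arithmetic behind a robust-accuracy figure is easy to mix up with attack success rate, so here is a small worked sketch in plain Python. It assumes each sample has already been reduced to one of three outcome labels ("failed", "successful", "skipped"); the `summarize_attack` helper and the sample counts are invented for illustration.

```python
from collections import Counter


def summarize_attack(outcomes):
    """Summarize per-sample attack outcomes.

    "failed"     -> model was correct and stayed correct under attack
    "successful" -> model was correct but the attack flipped the prediction
    "skipped"    -> model was already wrong, so no attack was attempted
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    attacked = counts["failed"] + counts["successful"]
    return {
        "original_accuracy": attacked / total,
        "robust_accuracy": counts["failed"] / total,
        "attack_success_rate": counts["successful"] / attacked if attacked else 0.0,
    }


outcomes = ["failed"] * 6 + ["successful"] * 2 + ["skipped"] * 2
print(summarize_attack(outcomes))
```

With 6 failed, 2 successful, and 2 skipped attacks, original accuracy is 0.8, robust accuracy is 0.6, and the attack success rate is 0.25, because the success rate is taken over attacked samples only while both accuracies are taken over all samples.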

Decision criteria

Situation	Approach
High-risk system on the EU market	AI Act conformity assessment required
Internal-only LLM	Model card + basic guardrail (Llama Guard)
Medical, hiring, credit	Full coverage of the NIST characteristics + third-party audit
Generative content	Watermark or provenance metadata (C2PA) + deepfake disclosure

Default: any production AI needs at least a model card, bias measurement, and an output filter.
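
One way to enforce this default is a small release gate that refuses to ship when any of the three minimum artifacts is missing. The sketch below is hypothetical: the artifact keys and the shape of the `release` dictionary are assumptions for illustration, not part of any standard.

```python
# Minimum artifacts named in the default rule above (key names are assumed).
REQUIRED_ARTIFACTS = ("model_card", "bias_measurement", "output_filter")


def missing_artifacts(release: dict) -> list[str]:
    """Return required artifacts that are absent or empty in a release record."""
    return [name for name in REQUIRED_ARTIFACTS if not release.get(name)]


release = {"model_card": "README.md", "bias_measurement": None}
print(missing_artifacts(release))
```

A CI job could fail the deployment whenever the returned list is non-empty.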

🔗 Graph

Parent: AI Ethics · AI Governance
Variants: Responsible AI · Ethical AI
Applications: NIST AI RMF · EU AI Act · ISO 42001
Adjacent: Model Card · Differential Privacy · Adversarial Robustness · AI Red Team

🤖 LLM usage

When to use: multi-layer defense against prompt injection, PII leaks, and biased output in a production LLM. When not to use: prototypes and internal demos, where it is over-engineering.

❌ Anti-patterns

Checkbox compliance: writing a model card and stopping there, with no ongoing monitoring.
Single-axis focus: chasing fairness alone and sacrificing accuracy, without acknowledging the trade-off.
Privacy theater: calling data "anonymized" while it remains vulnerable to re-identification.
Hallucinated explainability: an LLM-generated explanation that does not match the model's actual reasoning.
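
The privacy-theater item can be checked mechanically. A minimal sketch, assuming records are dictionaries and that the caller names the quasi-identifier columns: it computes k-anonymity, the size of the smallest group sharing the same quasi-identifier values. A result of 1 means at least one record is unique and therefore a re-identification candidate, even with names removed. The records and column names below are invented.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values.

    Assumes records is non-empty.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented "anonymized" records: names are gone, but quasi-identifiers remain.
records = [
    {"zip": "04524", "birth_year": 1990, "sex": "F", "diagnosis": "flu"},
    {"zip": "04524", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
    {"zip": "04524", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
]

print(k_anonymity(records, ["zip", "birth_year", "sex"]))
```

Here the third record is the only one with its combination of zip code, birth year, and sex, so k is 1 and the dataset should not be described as anonymized.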

🧪 Verification / duplicates

Verified (NIST AI RMF 1.0, 2023; EU AI Act, OJ L 2024/1689; ISO/IEC 42001:2023).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: NIST RMF + EU AI Act 2026 enforcement, added Llama Guard 3 pattern
