2nd/10_Wiki/Topics/AI_and_ML/Trustworthy-AI.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
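Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections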

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-trustworthy-ai
title: Trustworthy AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Responsible AI, Ethical AI
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: ai-ethics, governance, safety, regulation, NIST, EU-AI-Act
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: NIST-AI-RMF

Trustworthy AI

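One-line summary

"An AI system that satisfies six axes at once: reliable, safe, fair, transparent, accountable, and privacy-preserving." This is the frame shared by the NIST AI RMF (2023) and the EU AI Act (in force since 2024, with most provisions applying from August 2026). Trustworthiness is not a single property but a multi-dimensional balance; the key is to make the trade-offs explicit.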

Key points

The seven trustworthiness characteristics (NIST AI RMF)

  • Valid & Reliable: accurate on the intended task and stable in the deployment environment.
  • Safe: avoids physical, psychological, and environmental harm.
  • Secure & Resilient: defends against adversarial attacks, data poisoning, and prompt injection.
  • Accountable & Transparent: states who is responsible and how decisions are made.
  • Explainable & Interpretable: discloses reasoning at a level suited to each stakeholder.
  • Privacy-Enhanced: data minimization, differential privacy (DP), federated learning.
  • Fair, with Harmful Bias Managed: measures and mitigates disparate impact.

EU AI Act risk tiers (most obligations apply from August 2026)

  • Unacceptable: social scoring, real-time remote biometric identification in public spaces (largely banned, with narrow exceptions).
  • High-risk: medical, hiring, credit, education. Conformity assessment and CE marking are required.
  • Limited risk: chatbots, deepfakes. Transparency obligation (disclose that the content or counterpart is AI).
  • Minimal risk: spam filters, video game AI. Voluntary codes of conduct.
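
The tier list above can be sketched as a lookup table. This is an illustrative simplification, not a legal reference: `RiskTier`, `USE_CASE_TIERS`, and `obligations_for` are hypothetical names, and real classification depends on the Act's annexes and legal review rather than on a keyword lookup.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# Obligations as summarized in this note (assumption: simplified wording).
OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: ["prohibited"],
    RiskTier.HIGH: ["conformity assessment", "CE marking"],
    RiskTier.LIMITED: ["transparency disclosure"],
    RiskTier.MINIMAL: ["voluntary code of conduct"],
}

# Hypothetical mapping of the example use cases from this note to tiers.
USE_CASE_TIERS = {
    "social scoring": RiskTier.UNACCEPTABLE,
    "hiring": RiskTier.HIGH,
    "credit": RiskTier.HIGH,
    "chatbot": RiskTier.LIMITED,
    "spam filter": RiskTier.MINIMAL,
}


def obligations_for(use_case: str) -> list:
    """Return the obligations for a known use case; unknown ones need manual review."""
    tier = USE_CASE_TIERS.get(use_case.lower())
    if tier is None:
        raise KeyError(f"unclassified use case: {use_case!r}; needs manual review")
    return OBLIGATIONS[tier]
```

Raising on unknown use cases, instead of defaulting to minimal risk, keeps an unclassified system from silently passing as low-risk.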

Governance lifecycle

  1. Map: context, stakeholders, risk identification.
  2. Measure: quantitative and qualitative metrics.
  3. Manage: mitigation, monitoring, incident response.
  4. Govern: policies, roles, accountability (a cross-cutting function in the NIST AI RMF).

Applications

  1. High-risk deployment: a healthcare diagnosis AI that must satisfy both FDA and EU AI Act conformity requirements.
  2. LLM production: prompt injection defense, PII redaction, and an output filter.
  3. Hiring algorithm: NYC Local Law 144 (bias audit) and EEOC compliance.
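
The PII redaction step in item 2 can be sketched with regular expressions. This is a minimal illustration under stated assumptions: `PII_PATTERNS` and `redact_pii` are hypothetical names, and the two patterns only cover simple email addresses and US-style phone numbers. A production system would more likely use a dedicated PII detection library.

```python
import re

# Illustrative patterns only (assumption): simple emails and
# phone numbers written as 3-3-4 digit groups.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched PII span with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Redaction of this kind would typically run on both the user prompt and the model output, so that PII neither reaches logs nor leaks back to other users.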

💻 Patterns

Bias measurement (group fairness)

from fairlearn.metrics import (
    MetricFrame, demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics={"accuracy": accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=df_test["gender"],
)
print(mf.by_group)

dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=df_test["gender"])
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=df_test["gender"])
print(f"DP diff: {dpd:.3f}, EO diff: {eod:.3f}")
# Rule of thumb: |DP diff| > 0.1 suggests possible disparate impact
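
To make the metric concrete, the demographic parity difference can be computed by hand as the gap between the highest and lowest per-group selection rates. The sketch below uses hypothetical helper names (`selection_rates`, `dp_difference`) and made-up toy data; it is meant to show what the number means, not to replace the library call.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions (label 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def dp_difference(y_pred, groups):
    """Largest gap in selection rate between any two groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy data (made up): group "a" is selected 3 times out of 4, group "b" once out of 4.
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_gap = dp_difference(toy_pred, toy_groups)
```

Note that this metric ignores the true labels entirely, which is why it is usually reported alongside an error-based metric such as equalized odds.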

LLM output guardrail (Llama Guard 3)

from transformers import AutoTokenizer, AutoModelForCausalLM

guard = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-Guard-3-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-8B")

def is_safe(user_msg: str, assistant_msg: str) -> tuple[bool, str]:
    # Render the conversation with the model's moderation chat template.
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": user_msg},
         {"role": "assistant", "content": assistant_msg}],
        tokenize=False,
    )
    # The rendered template is expected to carry its own BOS token, so do not add another.
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False)
    out = guard.generate(**inputs, max_new_tokens=20)
    # Decode only the newly generated tokens: "safe", or "unsafe" plus a line of category codes.
    new_tokens = out[0][inputs.input_ids.shape[-1]:]
    verdict = tok.decode(new_tokens, skip_special_tokens=True).strip()
    return verdict.startswith("safe"), verdict

Differential privacy (Opacus)

from opacus import PrivacyEngine
import torch

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=loader,
    target_epsilon=3.0, target_delta=1e-5, epochs=10, max_grad_norm=1.0,
)
# Targets ε = 3 at δ = 1e-5 over 10 epochs; a smaller ε gives a stronger privacy guarantee
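
Opacus implements DP-SGD, and the role of `max_grad_norm` is easier to see in a hand-written version of one gradient step: clip each per-sample gradient, sum, add Gaussian noise scaled to the clipping bound, then average. The sketch below is a simplified illustration with hypothetical names (`clip`, `dp_sgd_gradient`, `noise_multiplier`); it does no privacy accounting, so it yields no ε on its own.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD gradient estimate: clip per sample, sum, add noise, average."""
    clipped = [clip(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise is calibrated to the clipping bound, which caps any one sample's influence.
    sigma = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]


# Toy per-sample gradients (made up); the first has norm 5 and gets clipped to norm 1.
toy_grads = [[3.0, 4.0], [0.6, 0.8]]
noiseless = dp_sgd_gradient(toy_grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
```

Clipping is what bounds each individual's contribution; without it, no finite amount of noise would give a differential privacy guarantee.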

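Model card (Hugging Face)

# README.md frontmatter (illustrative; intended_use and the fields below it are custom keys, not standard Hugging Face metadata)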
language: en
license: apache-2.0
intended_use:
  primary: "English sentiment classification (product reviews)"
  out_of_scope: ["clinical text", "non-English", "financial advice"]
training_data:
  source: "Amazon reviews 2018-2024 (50M samples)"
  known_biases: ["English-skewed", "tech product overrepresented"]
metrics:
  accuracy: 0.92
  demographic_parity_diff: 0.04
limitations:
  - "Weak sarcasm detection (F1 0.61)"
  - "Long reviews (>1000 tokens) are truncated"
ethical_considerations:
  - "Not for use in hiring or loan decisions"

Explainability (SHAP for tabular)

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Individual explanation (assumes a single-output model, so shap_values is 2-D)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

# Global feature importance
shap.summary_plot(shap_values, X_test)
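
SHAP attributions are based on Shapley values, which can be computed exactly by brute force for a handful of features. The sketch below is a simplification under stated assumptions: `shapley_values` is a hypothetical helper, "missing" features are replaced by a single baseline point rather than averaged over background data, and the cost grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_linear_model(v):
    # Made-up linear model used only for illustration.
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2] + 3.0


toy_phis = shapley_values(toy_linear_model, x=[1.0, 2.0, 4.0], baseline=[0.0, 0.0, 0.0])
```

For a linear model each attribution reduces to weight times the feature's distance from the baseline, and the attributions sum to the difference between the prediction at `x` and at the baseline.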

Adversarial robustness test

from textattack.attack_recipes import TextFoolerJin2019
from textattack.attack_results import FailedAttackResult
from textattack.models.wrappers import HuggingFaceModelWrapper

wrapped = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(wrapped)
results = [attack.attack(text, label) for text, label in test_samples]
# A sample is robust only if it was classified correctly and the attack failed to flip it.
# Skipped results were already misclassified, so they count against robust accuracy.
robust_acc = sum(isinstance(r, FailedAttackResult) for r in results) / len(results)
print(f"Robust accuracy: {robust_acc:.2f}")
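
Robust accuracy is easiest to read next to clean accuracy and attack success rate. The sketch below derives all three from a list of per-sample outcomes; `robustness_summary` is a hypothetical helper, and the outcome labels "failed", "successful", and "skipped" are an assumed encoding of the three result types an attack run can produce.

```python
from collections import Counter


def robustness_summary(outcomes):
    """Summarize attack outcomes ('failed', 'successful', 'skipped') as rates."""
    counts = Counter(outcomes)
    total = len(outcomes)
    # Only correctly classified samples are attacked; skipped ones were already wrong.
    attacked = counts["failed"] + counts["successful"]
    return {
        "clean_accuracy": attacked / total,
        "robust_accuracy": counts["failed"] / total,
        "attack_success_rate": counts["successful"] / attacked if attacked else 0.0,
    }


# Made-up outcomes: 6 attacks failed, 2 succeeded, 2 samples were skipped.
toy_summary = robustness_summary(["failed"] * 6 + ["successful"] * 2 + ["skipped"] * 2)
```

Reporting the three numbers together avoids a common misreading: a low attack success rate says little if clean accuracy was poor to begin with.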

Decision criteria

| Situation | Approach |
|---|---|
| High-risk system on the EU market | AI Act conformity assessment required |
| Internal-only LLM | Model card plus a basic guardrail (Llama Guard) |
| Medical, hiring, credit | Full coverage of all NIST characteristics plus a third-party audit |
| Generative content | Watermark or provenance metadata (C2PA) plus deepfake disclosure |

Default: a production AI system needs at least a model card, bias measurement, and an output filter.
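
The default above can be enforced as a release gate. This is a minimal sketch with hypothetical names (`ReleaseEvidence`, `release_blockers`); the 0.1 threshold reuses the rule of thumb from the bias measurement pattern and is an assumption, not a regulatory limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseEvidence:
    has_model_card: bool
    dp_difference: Optional[float]  # None means bias was never measured
    has_output_filter: bool


def release_blockers(evidence: ReleaseEvidence, max_dp_difference: float = 0.1) -> list:
    """Return the reasons a release should be blocked; an empty list means pass."""
    blockers = []
    if not evidence.has_model_card:
        blockers.append("missing model card")
    if evidence.dp_difference is None:
        blockers.append("bias not measured")
    elif abs(evidence.dp_difference) > max_dp_difference:
        blockers.append("demographic parity difference above threshold")
    if not evidence.has_output_filter:
        blockers.append("missing output filter")
    return blockers
```

Returning a list of reasons, not a single boolean, makes the gate's output usable as an audit record.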

🔗 Graph

🤖 LLM usage

When: multi-layer defense against prompt injection, PII leaks, and biased output in a production LLM. When not: prototypes and internal demos, where it is over-engineering.

Anti-patterns

  • Checkbox compliance: writing a model card and stopping there, with no ongoing monitoring.
  • Single-axis focus: chasing fairness alone and sacrificing accuracy, leaving the trade-off unacknowledged.
  • Privacy theater: calling data "anonymized" while it remains vulnerable to re-identification.
  • Hallucinated explainability: an LLM-generated explanation that does not match the model's actual reasoning.
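
The privacy theater anti-pattern can be checked with a k-anonymity count: if any combination of quasi-identifiers is unique, removing names has not prevented re-identification. The sketch below is a minimal illustration; `k_anonymity` is a hypothetical helper and the records are made up.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    return min(Counter(keys).values())


# Made-up "anonymized" records: names removed, quasi-identifiers left intact.
toy_records = [
    {"zip": "12345", "age": 30, "diagnosis": "flu"},
    {"zip": "12345", "age": 30, "diagnosis": "asthma"},
    {"zip": "12345", "age": 31, "diagnosis": "diabetes"},
    {"zip": "67890", "age": 45, "diagnosis": "flu"},
]
toy_k = k_anonymity(toy_records, ["zip", "age"])
```

A result of 1 means at least one person is uniquely identifiable from ZIP code and age alone, so the "anonymized" label does not hold for this dataset.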

🧪 Verification / duplicates

  • Verified (NIST AI RMF 1.0, 2023; EU AI Act, OJ L 2024/1689; ISO/IEC 42001:2023).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added NIST RMF and EU AI Act 2026 enforcement, plus the Llama Guard 3 pattern |