---
id: wiki-2026-0508-trustworthy-ai
title: Trustworthy AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Responsible AI, Ethical AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [ai-ethics, governance, safety, regulation, NIST, EU-AI-Act]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: NIST-AI-RMF
---

# Trustworthy AI

## One-line summary
> **"An AI system that is reliable, safe, secure, fair, transparent, accountable, explainable, and privacy-preserving, all at the same time."** A unifying frame across the NIST AI RMF (2023) and the EU AI Act (in force since 2024, generally applicable from August 2026). It is not a single property but a multi-dimensional balance; the key is making the trade-offs explicit.

## Core concepts

### 7 trustworthiness characteristics (NIST AI RMF)
- **Valid & Reliable**: accurate on the intended task and stable in the deployment environment.
- **Safe**: avoids physical, psychological, and environmental harm.
- **Secure & Resilient**: defends against adversarial attacks, data poisoning, and prompt injection.
- **Accountable & Transparent**: states who is responsible and how decisions are made.
- **Explainable & Interpretable**: discloses reasoning at a level suited to each stakeholder.
- **Privacy-Enhanced**: data minimization, differential privacy (DP), federated learning.
- **Fair, with Harmful Bias Managed**: measures and mitigates disparate impact.

### EU AI Act risk tiers (generally applicable from August 2026)
- **Unacceptable**: social scoring, real-time biometric ID (largely banned).
- **High-risk**: medical, hiring, credit, education. Conformity assessment and CE marking are required.
- **Limited risk**: chatbots, deepfakes. Transparency obligation (must disclose that it is AI).
- **Minimal risk**: spam filters, video game AI. Voluntary codes of conduct.
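
The tiers above can be sketched as a lookup from use case to obligations. This is a minimal illustration, not a legal classifier: the use-case keys, `USE_CASE_TIERS`, and `obligations_for` are hypothetical names, and real classification depends on the Act's annexes and legal review.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# Illustrative mapping only (assumption): taken from the examples in the list above.
USE_CASE_TIERS = {
    "social_scoring": RiskTier.UNACCEPTABLE,
    "medical_diagnosis": RiskTier.HIGH,
    "hiring": RiskTier.HIGH,
    "credit_scoring": RiskTier.HIGH,
    "chatbot": RiskTier.LIMITED,
    "spam_filter": RiskTier.MINIMAL,
}

OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: ["do not deploy"],
    RiskTier.HIGH: ["conformity assessment", "CE marking"],
    RiskTier.LIMITED: ["disclose that the user is interacting with AI"],
    RiskTier.MINIMAL: ["voluntary code of conduct"],
}


def obligations_for(use_case: str) -> list[str]:
    """Return the obligations for a known use case; unknown cases need manual review."""
    tier = USE_CASE_TIERS.get(use_case)
    if tier is None:
        raise KeyError(f"unclassified use case: {use_case!r}; needs manual review")
    return OBLIGATIONS[tier]


print(obligations_for("hiring"))
```

Raising on unknown use cases, rather than defaulting to minimal risk, keeps an unclassified system from silently passing.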

### Governance lifecycle
1. **Map**: context, stakeholders, risk identification.
2. **Measure**: quantitative and qualitative metrics.
3. **Manage**: mitigation, monitoring, incident response.
4. **Govern**: policy, roles, accountability (cross-cutting in the NIST AI RMF: it informs the other three functions).
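
One way to make the four functions concrete is a small risk register in which each entry carries a field per function. The sketch below is an assumption about how such a register might look; `RiskEntry`, its fields, and the example values are hypothetical and not prescribed by the NIST AI RMF.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskEntry:
    risk: str                 # Map: the identified risk
    stakeholders: list[str]   # Map: who is affected
    metric: str               # Measure: how the risk is quantified
    threshold: float          # Measure: maximum acceptable value
    mitigation: str           # Manage: planned response
    owner: str                # Govern: accountable role
    observed: Optional[float] = None  # latest measurement, if any

    def needs_action(self) -> bool:
        # An unmeasured risk is treated as open, not as safe.
        return self.observed is None or self.observed > self.threshold


def open_risks(register: list[RiskEntry]) -> list[RiskEntry]:
    """Return entries that are unmeasured or above their threshold."""
    return [entry for entry in register if entry.needs_action()]


register = [
    RiskEntry("gender disparity in shortlisting", ["applicants"],
              "demographic_parity_diff", 0.1, "reweighting + audit",
              "ml-governance", observed=0.04),
    RiskEntry("prompt injection", ["end users"],
              "red-team success rate", 0.05, "input filter",
              "security"),  # not yet measured
]
for entry in open_risks(register):
    print(entry.risk, "->", entry.owner)
```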

### Applications
1. **High-risk deployment**: a healthcare diagnosis AI needs dual conformity with FDA and EU AI Act requirements.
2. **LLM production**: prompt injection defense, PII redaction, and an output filter.
3. **Hiring algorithm**: NYC Local Law 144 (bias audit) plus EEOC compliance.
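
For the LLM production case, the PII redaction layer could start as a pair of regular expressions applied before text is logged or returned. This is a minimal sketch under the assumption that emails and phone numbers are the only targets; the patterns and the `redact_pii` name are illustrative and would miss many real-world formats.

```python
import re

# Hypothetical, deliberately simple patterns; production systems typically
# combine rules like these with a trained PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_pii(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_pii("Contact jane.doe@example.com or +1 415-555-0100."))
```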

## 💻 Patterns

### Bias measurement (group fairness)
```python
from fairlearn.metrics import (
    MetricFrame, demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics={"accuracy": accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=df_test["gender"],
)
print(mf.by_group)

dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=df_test["gender"])
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=df_test["gender"])
print(f"DP diff: {dpd:.3f}, EO diff: {eod:.3f}")
# Rule of thumb: |DP diff| > 0.1 suggests possible disparate impact
```
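
To make the metric itself concrete, demographic parity difference is the gap between the highest and lowest per-group selection rates. The following dependency-free sketch recomputes it by hand on toy data; the function names and the data are illustrative, not part of Fairlearn.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions (label 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_diff(y_pred, groups):
    """Largest gap in selection rate between any two groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy data: group "a" is selected 3 of 4 times, group "b" 1 of 4 times.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(selection_rates(y_pred, groups))
print(demographic_parity_diff(y_pred, groups))
```

Note that the metric ignores `y_true` entirely, which is why it is usually reported alongside an error-based metric such as equalized odds.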

### LLM output guardrail (Llama Guard 3)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

guard = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-Guard-3-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-8B")

def is_safe(user_msg: str, assistant_msg: str) -> tuple[bool, str]:
    """Classify a user/assistant exchange; returns (is_safe, raw_verdict)."""
    inputs = tok.apply_chat_template(
        [{"role": "user", "content": user_msg},
         {"role": "assistant", "content": assistant_msg}],
        return_tensors="pt",
        return_dict=True,
    )
    prompt_len = inputs["input_ids"].shape[-1]
    out = guard.generate(**inputs, max_new_tokens=20)
    # Decode only the newly generated tokens: "safe", or "unsafe" plus category codes.
    verdict = tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
    return verdict.startswith("safe"), verdict
```

### Differential privacy (Opacus)
```python
from opacus import PrivacyEngine
import torch

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=loader,
    target_epsilon=3.0, target_delta=1e-5, epochs=10, max_grad_norm=1.0,
)
# Targets (ε=3, δ=1e-5). Smaller ε gives a stronger guarantee; whether ε=3 is enough depends on the threat model.
```
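
The mechanism behind `max_grad_norm` and the privacy budget is DP-SGD: clip each per-sample gradient, sum, add Gaussian noise scaled to the clipping norm, then average. The sketch below shows one such aggregation step in plain Python. It is a simplified illustration, not Opacus's implementation; `noise_multiplier` is passed in directly here, whereas Opacus derives it from the target ε and δ.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip per-sample gradients, sum, add Gaussian noise, and average."""
    clipped = [clip(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scales with the clip norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # toy per-sample gradients
print(dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can move the update, and that bound is what lets the added noise translate into an (ε, δ) guarantee.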

### Model card (Hugging Face)
```yaml
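# README.md frontmatter. `language` and `license` are standard Hub metadata;
# the remaining keys are illustrative custom fields (often written as prose sections).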
language: en
license: apache-2.0
intended_use:
  primary: "English sentiment classification (product reviews)"
  out_of_scope: ["clinical text", "non-English", "financial advice"]
training_data:
  source: "Amazon reviews 2018-2024 (50M samples)"
  known_biases: ["English-skewed", "tech product overrepresented"]
metrics:
  accuracy: 0.92
  demographic_parity_diff: 0.04
limitations:
  - "Weak sarcasm detection (F1 0.61)"
  - "Long reviews (>1000 tokens) are truncated"
ethical_considerations:
  - "Not for use in hiring or lending decisions"
```

### Explainability (SHAP for tabular)
```python
import shap

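# Assumes a fitted tree model with a single output (regression or binary margin);
# multi-class models return one set of SHAP values per class.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Per-instance explanation for the first test row
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

# Global feature importance
shap.summary_plot(shap_values, X_test)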
```
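
What SHAP attributes to each feature is its Shapley value: the average marginal contribution of that feature over all orderings in which features are switched from a baseline to the actual input. For a handful of features this can be computed exactly, as in the sketch below. The toy `model`, the baseline, and the function name are illustrative assumptions; `TreeExplainer` uses a far more efficient tree-specific algorithm.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature ordering (tiny inputs only)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to actual
            new = predict(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / len(orders) for p in phi]


def model(v):
    # Toy model with an interaction between features 0 and 2.
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)                                   # interaction credit is split between features 0 and 2
print(sum(phi), model(x) - model(baseline))  # attributions sum to the prediction gap
```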

### Adversarial robustness test
```python
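from textattack.attack_recipes import TextFoolerJin2019
from textattack.attack_results import FailedAttackResult
from textattack.models.wrappers import HuggingFaceModelWrapper

wrapped = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(wrapped)
results = [attack.attack(text, label) for text, label in test_samples]
# A FailedAttackResult means the model was correct and the attack could not flip it.
# Skipped results (already misclassified) and successful attacks count against robustness.
robust_acc = sum(isinstance(r, FailedAttackResult) for r in results) / len(results)
print(f"Robust accuracy: {robust_acc:.2f}")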
```

## Decision criteria
| Situation | Approach |
|---|---|
| High-risk system on the EU market | AI Act conformity assessment required |
| Internal-only LLM | Model card plus a basic guardrail (Llama Guard) |
| Medical, hiring, credit | All NIST characteristics plus a third-party audit |
| Generative content | Provenance marking or watermarking (e.g. C2PA) plus deepfake disclosure |

**Default**: a production AI system needs, at minimum, a model card, bias measurement, and an output filter.
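
The default above can be enforced as a simple release gate. This is a hypothetical sketch: the control names and both functions are illustrative and do not come from any framework.

```python
# Minimum controls named in the default above (identifiers are illustrative).
REQUIRED_CONTROLS = {"model_card", "bias_measurement", "output_filter"}


def missing_controls(implemented) -> set[str]:
    """Return the required controls that are not yet in place."""
    return REQUIRED_CONTROLS - set(implemented)


def ready_for_production(implemented) -> bool:
    return not missing_controls(implemented)


print(sorted(missing_controls({"model_card"})))
print(ready_for_production({"model_card", "bias_measurement", "output_filter"}))
```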

## 🔗 Graph
- Parent: [[AI Ethics]] · [[AI 거버넌스 정책(AI Usage Policy)|AI Governance]]
- Variants: [[Responsible AI]] · [[Ethical AI]]
- Applications: [[NIST AI RMF]] · [[EU AI Act]] · [[ISO 42001]]
- Adjacent: [[Model Card]] · [[Differential Privacy]] · [[Adversarial Robustness]]

## 🤖 LLM usage
**When**: multi-layer defense against prompt injection, PII leaks, and biased output in a production LLM.
**When not**: prototypes and internal demos, where this is over-engineering.

## ❌ Anti-patterns
- **Checkbox compliance**: writing a model card and stopping there, with no ongoing monitoring.
- **Single-axis focus**: chasing fairness alone at the cost of accuracy, leaving the trade-off unacknowledged.
- **Privacy theater**: calling data "anonymized" while it remains vulnerable to re-identification.
- **Hallucinated explainability**: LLM-generated explanations that do not match the model's actual reasoning.
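
The privacy-theater item can be tested with a quick k-anonymity check: group records by their quasi-identifiers and look at the smallest group. A result of k = 1 means at least one person is unique on those columns and may be re-identifiable even with names removed. The column names and records below are made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values (0 if empty)."""
    if not records:
        return 0
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


# Hypothetical "anonymized" dataset: names removed, quasi-identifiers kept.
records = [
    {"zip": "94110", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "94110", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "10001", "age_band": "60-69", "diagnosis": "diabetes"},
]
k = k_anonymity(records, ["zip", "age_band"])
print(k)  # the third record is unique on (zip, age_band)
```

k-anonymity is only a first screen; it does not protect against attribute disclosure when everyone in a group shares the same sensitive value.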

## 🧪 Verification / duplicates
- Verified (NIST AI RMF 1.0, 2023; EU AI Act, OJ L 2024/1689; ISO/IEC 42001:2023).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: NIST RMF + EU AI Act 2026 enforcement, added Llama Guard 3 pattern |