---
id: wiki-2026-0508-trustworthy-ai
title: Trustworthy AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Responsible AI, Ethical AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [ai-ethics, governance, safety, regulation, NIST, EU-AI-Act]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: NIST-AI-RMF
---

# Trustworthy AI

## One-liner

> **"An AI system that satisfies six axes at once: reliable, safe, fair, transparent, accountable, and privacy-preserving."**

The NIST AI RMF (2023) and the EU AI Act (in force since 2024, most obligations applicable from August 2026) provide the combined frame. NIST spells the axes out as seven trustworthiness characteristics, listed below. Trustworthiness is not a single property but a multi-dimensional balance; the key is to state the trade-offs explicitly.

## Core

### 7 pillars (NIST AI RMF trustworthiness characteristics)

- **Valid & Reliable**: accurate on the intended task and stable in the deployment environment.
- **Safe**: avoids physical, psychological, and environmental harm.
- **Secure & Resilient**: defends against adversarial attacks, data poisoning, and prompt injection.
- **Accountable & Transparent**: states who is responsible and how decisions are made.
- **Explainable & Interpretable**: discloses reasoning at a level suited to each stakeholder.
- **Privacy-Enhanced**: data minimization, differential privacy (DP), federated learning.
- **Fair, with Harmful Bias Managed**: measures disparate impact and mitigates it.

### EU AI Act risk tiers (most obligations apply from August 2026)

- **Unacceptable**: social scoring, real-time remote biometric identification in public spaces (banned, with narrow exceptions).
- **High-risk**: medical, hiring, credit, education. Conformity assessment and CE marking are required.
- **Limited risk**: chatbots, deepfakes. Transparency obligation (disclose that the content or counterpart is AI).
- **Minimal risk**: spam filters, video game AI. Voluntary codes of conduct.

### Governance lifecycle

1. **Map**: context, stakeholders, risk identification.
2. **Measure**: quantitative and qualitative metrics.
3. **Manage**: mitigation, monitoring, incident response.
4. **Govern**: policies, roles, accountability (cross-cutting in the NIST AI RMF Core).

### Applications

1. **High-risk deployment**: a healthcare diagnosis AI needs dual conformity with FDA requirements and the EU AI Act.
2. **LLM production**: prompt injection defense + PII redaction + output filter.
3. **Hiring algorithm**: NYC Local Law 144 (bias audit) + EEOC compliance.

## 💻 Patterns

### Bias measurement (group fairness)

```python
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    equalized_odds_difference,
)
from sklearn.metrics import accuracy_score

# y_test, y_pred, and df_test are assumed to come from an existing evaluation split.
mf = MetricFrame(
    metrics={"accuracy": accuracy_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=df_test["gender"],
)
print(mf.by_group)  # accuracy per group

dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=df_test["gender"])
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=df_test["gender"])
print(f"DP diff: {dpd:.3f}, EO diff: {eod:.3f}")
# Rule of thumb: a DP difference above 0.1 suggests possible disparate impact.
```

### LLM output guardrail (Llama Guard 3)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-Guard-3-8B"
guard = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

def is_safe(user_msg: str, assistant_msg: str) -> tuple[bool, str]:
    chat = [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]
    # Tokenize through the chat template directly to avoid duplicated special tokens.
    inputs = tok.apply_chat_template(chat, return_tensors="pt", return_dict=True)
    out = guard.generate(**inputs, max_new_tokens=20, do_sample=False)
    # Decode only the newly generated tokens: "safe", or "unsafe" plus a category line.
    prompt_len = inputs["input_ids"].shape[-1]
    verdict = tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
    return verdict.startswith("safe"), verdict
```

### Differential privacy (Opacus)

```python
from opacus import PrivacyEngine
import torch

model = MyModel()  # placeholder: any torch.nn.Module supported by Opacus
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
# loader is assumed to be an existing torch.utils.data.DataLoader over the training set.

engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=10,
    max_grad_norm=1.0,  # per-sample gradient clipping bound
)
# Targets (ε = 3, δ = 1e-5) over 10 epochs: a moderate budget; smaller ε means stronger privacy.
```

### Model card (Hugging Face)

```yaml
# README.md frontmatter
language: en
license: apache-2.0
intended_use:
  primary: "English sentiment classification (product reviews)"
  out_of_scope: ["clinical text", "non-English", "financial advice"]
training_data:
  source: "Amazon reviews 2018-2024 (50M samples)"
  known_biases: ["English-skewed", "tech product overrepresented"]
metrics:
  accuracy: 0.92
  demographic_parity_diff: 0.04
limitations:
  - "Weak sarcasm detection (F1 0.61)"
  - "Long reviews (>1000 tokens) are truncated"
ethical_considerations:
  - "Not for use in hiring or lending decisions"
```

### Explainability (SHAP for tabular)

```python
import shap

# model is assumed to be a fitted tree-based model and X_test a pandas DataFrame.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Individual explanation for the first row (single-output model assumed)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

# Global feature importance
shap.summary_plot(shap_values, X_test)
```

### Adversarial robustness test

```python
from textattack.attack_recipes import TextFoolerJin2019
from textattack.attack_results import FailedAttackResult
from textattack.models.wrappers import HuggingFaceModelWrapper

# model, tokenizer: a fine-tuned Hugging Face classifier; test_samples: (text, label) pairs.
wrapped = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(wrapped)

results = [attack.attack(text, label) for text, label in test_samples]

# A FailedAttackResult means the model was correct and the attack could not flip it.
# Skipped (already misclassified) and successful attacks both count against robustness.
robust_acc = sum(isinstance(r, FailedAttackResult) for r in results) / len(results)
print(f"Robust accuracy: {robust_acc:.2f}")
```

## Decision criteria

| Situation | Approach |
|---|---|
| EU market, high-risk | AI Act conformity assessment required |
| Internal-only LLM | Model card + basic guardrail (Llama Guard) |
| Medical, hiring, credit | All seven pillars + third-party audit |
| Generative content | Watermark or provenance metadata (C2PA) + deepfake disclosure |

**Default**: a production AI system needs at least a model card, bias measurement, and an output filter.
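
The decision table and default above can be read as a lookup from deployment context to minimum controls. The following is a minimal sketch under stated assumptions: `Deployment`, `required_controls`, `missing_controls`, and the control labels are hypothetical names invented for illustration, not part of the NIST AI RMF, the EU AI Act, or any library. The mapping mirrors this note's table only and is not legal guidance.

```python
from dataclasses import dataclass

# Baseline from the default rule: every production system gets these three.
BASELINE = frozenset({"model_card", "bias_measurement", "output_filter"})


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment description; field names are illustrative."""
    production: bool = True
    eu_high_risk: bool = False        # EU market, high-risk tier
    regulated_domain: bool = False    # medical, hiring, or credit
    generative_content: bool = False  # publishes generated media
    internal_llm: bool = False        # internal-only LLM


def required_controls(d: Deployment) -> set[str]:
    """Return the minimum controls implied by the decision table."""
    controls: set[str] = set()
    if d.production:
        controls |= BASELINE
    if d.internal_llm:
        controls |= {"model_card", "basic_guardrail"}
    if d.eu_high_risk:
        controls.add("ai_act_conformity_assessment")
    if d.regulated_domain:
        controls |= {"all_nist_characteristics", "third_party_audit"}
    if d.generative_content:
        controls |= {"c2pa_watermark", "deepfake_disclosure"}
    return controls


def missing_controls(d: Deployment, in_place: set[str]) -> list[str]:
    """Controls still missing, sorted for stable output."""
    return sorted(required_controls(d) - in_place)


hiring_tool = Deployment(eu_high_risk=True, regulated_domain=True)
print(missing_controls(hiring_tool, {"model_card", "output_filter"}))
```

Run in a release pipeline, a non-empty result from `missing_controls` could block deployment. Keeping the rules as plain set unions makes overlapping rows of the table (for example, a high-risk hiring tool sold in the EU) accumulate rather than override each other.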
## 🔗 Graph

- Parent: [[AI Ethics]] · [[AI 거버넌스 정책(AI Usage Policy)|AI Governance]]
- Variants: [[Responsible AI]] · [[Ethical AI]]
- Applications: [[NIST AI RMF]] · [[EU AI Act]] · [[ISO 42001]]
- Adjacent: [[Model Card]] · [[Differential Privacy]] · [[Adversarial Robustness]]

## 🤖 LLM usage

**When**: multi-layer defense against prompt injection, PII leaks, and biased output in a production LLM.

**When not**: prototypes and internal demos, where the full stack is over-engineering.

## ❌ Anti-patterns

- **Checkbox compliance**: writing the model card and stopping there, with no ongoing monitoring.
- **Single-axis focus**: chasing fairness alone at the cost of accuracy, with the trade-off left unacknowledged.
- **Privacy theater**: calling data "anonymized" while it remains vulnerable to re-identification.
- **Hallucinated explainability**: an LLM-generated explanation that does not match the model's actual reasoning.

## 🧪 Verification / duplicates

- Verified (NIST AI RMF 1.0, 2023; EU AI Act, OJ L 2024/1689; ISO/IEC 42001:2023).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: NIST RMF + EU AI Act 2026 enforcement, added Llama Guard 3 pattern |