---
id: wiki-2026-0508-explainable-ai-xai
title: Explainable AI (XAI)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [XAI, interpretability, SHAP, LIME, attention, mechanistic interpretability, saliency]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [ai, interpretability, xai, shap, lime, attention, mechanistic, transparency]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: SHAP / LIME / Captum / TransformerLens
---

# Explainable AI (XAI)

## In one line

> **"Understand the decisions of a black box."** SHAP / LIME (post-hoc), attention / saliency, and mechanistic interpretability (modern). The EU AI Act imposes transparency requirements on high-risk systems. Trade-off: accuracy ↔ interpretability (often a false dichotomy).

## Core concepts

### Dimensions

- **Local vs global**: explaining a single prediction vs the overall model.
- **Model-specific vs agnostic**: using model internals vs treating the model as a black box.
- **Post-hoc vs intrinsic**: explaining after training vs interpretable by design.

### Methods

- **Feature importance**: SHAP, LIME, permutation.
- **Saliency**: Grad-CAM, Integrated Gradients.
- **Attention**: transformer attention weights.
- **Counterfactual**: the minimal change that flips a prediction.
- **Mechanistic**: circuits, SAEs, attribution.
- **Concept-based**: TCAV.

### Modern (mechanistic)

- **TransformerLens** (open-source library by Neel Nanda).
- **Sparse Autoencoders** (SAE).
- **Activation Patching**.
- **Probing**.
- **Anthropic Circuits**: Towards Monosemanticity.

### Applications

1. **Healthcare**: explaining diagnoses.
2. **Finance**: credit decisions.
3. **Justice**: risk scores.
4. **Debugging**: diagnosing model failures.
5. **Compliance**: EU AI Act.
6. **AI safety**: alignment audits.

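### Worked example: local vs global attribution

The local vs global distinction and the feature-importance family listed above can be made concrete with a tiny exact Shapley computation. This is a minimal sketch, not how the `shap` library computes values (it relies on faster model-specific or sampling-based estimators). It assumes features outside a coalition are replaced by a fixed baseline value. `toy_model`, `rows`, and `baseline` are hypothetical names chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                # Features outside the coalition take their baseline value.
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(v):
    # Hypothetical model: two linear terms plus one interaction term.
    return 2.0 * v[0] + 1.0 * v[1] + 0.5 * v[0] * v[2]


baseline = [0.0, 0.0, 0.0]
rows = [[1.0, 2.0, 4.0], [2.0, 0.0, 1.0], [0.0, 3.0, 2.0]]

# Local: one attribution vector per prediction.
local = [shapley_values(toy_model, row, baseline) for row in rows]

# Global: mean absolute attribution per feature across all rows.
global_importance = [
    sum(abs(phi[i]) for phi in local) / len(local) for i in range(len(baseline))
]

# Efficiency check: attributions sum to f(x) - f(baseline) for the first row.
gap = toy_model(rows[0]) - toy_model(baseline)
assert abs(sum(local[0]) - gap) < 1e-9
```

For the first row the attributions come out to roughly `[3, 2, 1]`: the interaction term contributes 2 in total and is split evenly between features 0 and 2. Exact enumeration costs O(2^n) model calls per feature, which is why practical tools approximate it.
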
## 💻 Patterns

### SHAP (TreeExplainer)

```python
import shap
import xgboost as xgb

# X, y and X_test are assumed to be pandas objects prepared elsewhere.
model = xgb.XGBClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)  # global view
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])  # local view, one row
```

### LIME

```python
from lime.lime_tabular import LimeTabularExplainer

# X_train and X_test are assumed to be NumPy arrays; cols is the list of feature names.
explainer = LimeTabularExplainer(X_train, feature_names=cols, class_names=['neg', 'pos'])
exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=10)
exp.show_in_notebook()
```

### Integrated Gradients (Captum)

```python
from captum.attr import IntegratedGradients

# model is a torch.nn.Module; with no baseline given, Captum uses an all-zeros input.
ig = IntegratedGradients(model)
attributions = ig.attribute(input_tensor, target=label_idx, n_steps=50)
```

### Grad-CAM (vision)

```python
from captum.attr import LayerGradCam
import torchvision.models as models

# weights= replaces the deprecated pretrained=True (torchvision >= 0.13).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
gc = LayerGradCam(model, model.layer4[-1])
attribution = gc.attribute(input_image, target=class_idx)
```

### Attention visualization (transformer)

```python
import torch

def attention_rollout(attentions, layer):
    """Average heads, add the residual path, and multiply layers 0..layer."""
    rollout = None
    for A in attentions[: layer + 1]:        # each A: [B, H, T, T]
        A = A.mean(dim=1)                     # average heads -> [B, T, T]
        A = A + torch.eye(A.size(-1), device=A.device)  # account for the residual connection
        A = A / A.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = A if rollout is None else A @ rollout
    return rollout
```

### Counterfactual explanation

```python
import torch
import torch.nn.functional as F

def counterfactual(model, x, target_class, max_iter=100):
    """Search for a small change to x (shape [1, ...]) that flips the prediction."""
    x_cf = x.clone().detach().requires_grad_(True)
    optim = torch.optim.Adam([x_cf], lr=0.01)
    target = torch.tensor([target_class])
    for _ in range(max_iter):
        pred = model(x_cf)
        if pred.argmax() == target_class:
            return x_cf.detach()
        # Classification loss plus a distance penalty that keeps x_cf close to x.
        loss = F.cross_entropy(pred, target) + 0.1 * (x_cf - x).norm()
        optim.zero_grad()
        loss.backward()
        optim.step()
    return x_cf.detach()  # may not have flipped within max_iter
```

### Permutation importance

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
sorted_idx = result.importances_mean.argsort()[::-1]  # most important first
for i in sorted_idx[:10]:
    print(f'{cols[i]}: {result.importances_mean[i]:.4f}')
```

### TCAV (concept-based)

```python
import numpy as np

# Pseudocode: train_cav, compute_gradient_at_layer and layer.activations are
# placeholder helpers, not functions from a real library.
def tcav_score(model, layer, concept_examples, random_examples,
               target_class_examples, target_class):
    cav = train_cav(layer.activations(concept_examples), layer.activations(random_examples))
    sensitivities = []
    for x in target_class_examples:
        grad = compute_gradient_at_layer(model, x, target_class, layer)
        sensitivities.append((grad @ cav) > 0)  # positive directional derivative along the CAV
    return np.mean(sensitivities)  # fraction of examples sensitive to the concept
```

### Mechanistic: activation patching

```python
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained('gpt2')

def patched_forward(prompt_clean, prompt_corrupt, layer):
    # Both prompts must tokenize to the same length for the patch to align.
    _, clean_cache = model.run_with_cache(prompt_clean)

    def patch_hook(activation, hook):
        # Overwrite the corrupted run's activation with the clean one.
        return clean_cache[hook.name]

    return model.run_with_hooks(
        prompt_corrupt,
        fwd_hooks=[(f'blocks.{layer}.attn.hook_z', patch_hook)],
    )
```

### Sparse Autoencoder (SAE)

```python
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    def __init__(self, d_model, d_sae=32768, l1_coef=1e-3):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model, bias=False)
        self.l1_coef = l1_coef

    def forward(self, x):
        z = F.relu(self.W_enc(x))  # sparse, non-negative feature activations
        x_recon = self.W_dec(z)
        recon_loss = F.mse_loss(x_recon, x)
        sparsity_loss = self.l1_coef * z.abs().sum(-1).mean()  # L1 penalty on activations
        return x_recon, recon_loss + sparsity_loss
```

### Probing classifier

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe(activations, labels, layer):
    X = activations[layer].detach().cpu().numpy()
    X_tr, X_val, y_tr, y_val = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_val, y_val)  # held-out accuracy
```

### Model card explainability

```yaml
explainability:
  method: SHAP TreeExplainer
  global_top_features:
    - credit_score: 0.34
    - debt_to_income: 0.18
    - employment_years: 0.12
  per_decision_explanation: available_in_ui
  counterfactual_offered: yes
```

### LLM explanation prompt

```python
def llm_explain_decision(prediction, features):
    # llm is a placeholder for whatever client object is in use.
    prompt = f"""You are explaining an ML decision. Use ONLY the features given.

Prediction: {prediction}
Features: {features}

Output:
1. Top 3 reasons
2. What change would flip the decision
3. Limitations of this explanation"""
    return llm.generate(prompt)
```

## Decision criteria

| Situation | Method |
|---|---|
| Tabular ML | SHAP TreeExplainer |
| Black-box agnostic | LIME / SHAP KernelExplainer |
| Vision | Grad-CAM / Integrated Gradients |
| NLP | Attention + token attribution |
| LLM internals | TransformerLens / SAE |
| User-facing | Counterfactual + plain language |
| Compliance | Model card + global + local |

**Default**: SHAP global + local explanations, plus counterfactuals and a model card. For LLM internals: SAE + activation patching.

## 🔗 Graph

- Parent: [[AI]] · [[Interpretability]]
- Variants: [[SHAP]] · [[LIME]] · [[Mechanistic-Interpretability]]
- Applications: [[Model-Card]] · [[EU-AI-Act]] · [[AI-Safety]]
- Adjacent: [[Sparse-Autoencoder]] · [[Ethics & AI]]

## 🤖 LLM usage

**When**: high-risk systems under the EU AI Act; healthcare / finance; model debugging.

**When not**: low-stakes settings; cases where the explanation itself risks being misleading.

## ❌ Anti-patterns

- **Single number trust**: a misuse of SHAP values.
- **Saliency as causal**: saliency shows correlation only.
- **Attention = explanation**: not always (Jain & Wallace 2019).
- **Post-hoc only**: ignores interpretability by design.
- **Misleading explanations**: they create false confidence.

## 🧪 Verification / duplicates

- Verified (Lundberg SHAP 2017, Ribeiro LIME 2016, Olah Anthropic 2024).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: XAI methods + SHAP / LIME / IG / counterfactual / SAE / probing code |