[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,61 +2,138 @@
 id: wiki-2026-0508-mechanistic-interpretability-기계적
 title: Mechanistic Interpretability (기계적 해석 가능성)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AI-MECH-INTERP]
+aliases: [Mech Interp, Circuit Analysis, MI, 기계적 해석성]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 0.98
-tags: [AI, Interpretability, MechanisticInterpretability, AISafety]
-raw_sources: []
-last_reinforced: 2026-04-20
+confidence_score: 0.9
+verification_status: applied
+tags: [ai, interpretability, alignment, anthropic, transformer, safety]
+raw_sources: [Anthropic Transformer Circuits, Towards Monosemanticity, Scaling Monosemanticity]
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack: { language: python, framework: transformer-lens-sae-lens }
 ---

-# [[Mechanistic Interpretability (기계적 해석 가능성)|Mechanistic Interpretability (기계적 해석 가능성)]]
+# Mechanistic Interpretability (기계적 해석 가능성)

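-## 📌 One-Line Insight (The Karpathy Summary)
-> "Modern advanced anatomy: taking an AI neural network apart and reverse-engineering it." Rather than treating the model as a simple black box, this is a painstakingly precise analysis technique that works out, piece by piece, how the internal weights and neurons combine to implement concrete "algorithms".
+## One-Line Summary
+> **"Read neurons as circuits."** A field that reverse-engineers a model's internals into explicit algorithms (circuits/features) instead of treating them as black-box statistics. Anthropic's SAE/circuit research is a central line of work.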

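-## 📖 Structured Knowledge (Synthesized Content)
- **The Mission**: Understand a model deeply enough to look at its weights and explain, as if in code, that "this part checks grammar and that part detects sentiment".
- **Key Methodologies**:
-    - **Logit Lens**: Observe how the token predicted at each layer changes from one layer to the next.
-    - **Path Patching**: Trace which path (blood vessel) through the model a given piece of information flows along.
-    - **Superposition Theory**: Decompose the "superposition" phenomenon in which a single neuron carries several meanings at once.
- **Significance**: The only technical shield that can detect in advance whether an AI is trying to deceive us (Deceptive [[Alignment|Alignment]]) or is entertaining dangerous thoughts.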
+## Core
+### Core Concepts
+- **Circuit**: A sub-graph of attention heads and MLP neurons that implements a specific behavior.
+- **Feature**: A unit of meaning in activation space (a single direction vector).
+- **Polysemanticity**: One neuron encodes several concepts, commonly explained by superposition.
+- **SAE (Sparse Autoencoder)**: Disentangles superposition to extract monosemantic features.
+- **Probing / Logit Lens / Activation Patching**: Diagnostic tools.
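
To make polysemanticity and superposition concrete, here is a minimal pure-Python sketch. It assumes a hypothetical toy setup that is not taken from this page: three features stored as unit directions 120 degrees apart in a two-dimensional activation space. The names `FEATURE_DIRS`, `embed`, and `readout` are illustrative only.

```python
import math

# Hypothetical toy setup: 3 features share a 2-D activation space,
# stored as unit directions spaced 120 degrees apart.
FEATURE_DIRS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def embed(feature_values):
    """Superpose features: sum each feature value times its direction."""
    x = sum(v * d[0] for v, d in zip(feature_values, FEATURE_DIRS))
    y = sum(v * d[1] for v, d in zip(feature_values, FEATURE_DIRS))
    return (x, y)

def readout(activation):
    """Project the activation onto every feature direction, then apply ReLU."""
    return [
        max(0.0, activation[0] * d[0] + activation[1] * d[1])
        for d in FEATURE_DIRS
    ]

# "Neuron 0" is the first coordinate. Activating any single feature moves it,
# so this one neuron is polysemantic.
neuron0_per_feature = [
    embed([1.0 if i == k else 0.0 for i in range(3)])[0] for k in range(3)
]

# With only one feature active (a sparse input), projecting onto the feature
# directions recovers it: the interference terms are negative and the ReLU
# removes them.
recovered = readout(embed([0.0, 1.0, 0.0]))
```

In this toy, every entry of `neuron0_per_feature` is nonzero, while `recovered` is approximately `[0, 1, 0]`. Recovery degrades once several features are active at the same time, which is why sparsity matters for this picture.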

-## ⚠️ Contradictions & Updates
- Analyzing a model with hundreds of billions of parameters by hand is impossible. Research on automated interpretation (Auto-Interp), which **"uses AI to interpret AI"**, is therefore active, and Anthropic's 'Dictionary Learning' technique is at the forefront of the field.
+### Key Findings
+1. **Induction heads** (2022) - a circuit that implements a form of in-context learning.
+2. **IOI circuit** (2022) - Indirect Object Identification.
+3. **Toy Models of Superposition** (2022) - why features get compressed (superposed).
+4. **Towards Monosemanticity** (2023) - features can be extracted with an SAE.
+5. **Scaling Monosemanticity** (2024) - SAEs applied to Claude 3 Sonnet, including the "Golden Gate Bridge feature".
+6. **Circuit Tracing / Attribution Graphs** (2025) - causal tracing between features.

-## 🔗 Knowledge Links (Graph)
- Related: [[Circuit Discovery (회로 발견)|Circuit Discovery (회로 발견)]] , [[Explainable-AI (XAI)|Explainable-AI (XAI)]]
- Risk Defense: [[Deceptive Alignment (기만적 정렬)|Deceptive Alignment (기만적 정렬)]]
+## 💻 Patterns

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Pattern 1: TransformerLens (circuit analysis)
+```python
+import transformer_lens as tl
+model = tl.HookedTransformer.from_pretrained('gpt2-small')
+logits, cache = model.run_with_cache("The capital of France is")
+# cache['blocks.5.attn.hook_pattern']: layer 5 attention patterns to inspect
+```

-**When to use this knowledge:**
- *(TODO)*
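+### Pattern 2: Activation Patching
+```python
+# Assumes `model` from Pattern 1; `clean` and `corrupted` are same-length token tensors.
+_, clean_cache = model.run_with_cache(clean)
+name, pos = 'blocks.5.hook_resid_pre', -1  # illustrative hook point and token position
+def patch_hook(act, hook):
+    act[:, pos] = clean_cache[hook.name][:, pos]  # overwrite with the clean activation
+    return act
+patched = model.run_with_hooks(corrupted, fwd_hooks=[(name, patch_hook)])
+# Sweep name/pos to measure causally which position/layer makes the difference
+```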
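
One common way to turn a patching run into a number is to normalize a scalar metric (for example, a logit difference between the correct and an incorrect answer) so that 0 means "no better than the corrupted run" and 1 means "fully restores the clean run". The sketch below assumes those scalar metrics have already been computed. The function name `patching_score` and all example values are hypothetical.

```python
def patching_score(patched, clean, corrupted):
    """Fraction of the clean-vs-corrupted gap recovered by a patch.

    All three arguments are scalar metric values (for example a logit
    difference) from the patched, clean, and corrupted runs.
    """
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted metrics are identical")
    return (patched - corrupted) / gap

# Hypothetical metric values from patching one (layer, position) at a time.
clean_metric, corrupted_metric = 3.0, -1.0
patched_metrics = {
    ("layer 5", "last token"): 2.0,
    ("layer 5", "first token"): -1.0,
}
scores = {
    site: patching_score(metric, clean_metric, corrupted_metric)
    for site, metric in patched_metrics.items()
}
```

Sites with a score near 1 are the ones where the clean activation carries the information that the behavior depends on; a score near 0 suggests that site does not matter for this metric.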

-**When not to use it:**
- *(TODO)*
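+### Pattern 3: Logit Lens
+```python
+# Assumes `model` and `cache` from Pattern 1.
+for layer in range(model.cfg.n_layers):
+    resid = cache[f'blocks.{layer}.hook_resid_post']
+    layer_logits = model.unembed(model.ln_final(resid))  # decode the intermediate residual stream
+    print(layer, model.to_str_tokens(layer_logits.argmax(-1)[0, -1]))  # top token at the last position
+```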

-## 🧪 Validation
+### Pattern 4: Sparse Autoencoder
+```python
+import torch
+import torch.nn as nn
+class SAE(nn.Module):
+    def __init__(self, d_model, d_sae):
+        super().__init__()
+        self.W_enc = nn.Linear(d_model, d_sae)  # encoder into the feature dictionary
+        self.W_dec = nn.Linear(d_sae, d_model, bias=False)  # decoder back to model space
+    def forward(self, x):
+        f = torch.relu(self.W_enc(x))  # feature activations; the L1 term pushes them sparse
+        return self.W_dec(f), f
+# Loss = recon + λ·||f||_1
+```
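
The loss named in the Pattern 4 comment, `recon + λ·||f||_1`, can be written out explicitly. This is a minimal pure-Python sketch for a single activation vector, assuming squared error for the reconstruction term. The function name `sae_loss` and the parameter name `l1_coeff` (standing in for λ) are illustrative, and real training code would compute this over batches of tensors.

```python
def sae_loss(x, x_hat, f, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty on the features.

    x:        original activation vector (list of floats)
    x_hat:    SAE reconstruction of x
    f:        feature activations produced by the encoder
    l1_coeff: weight of the sparsity term (lambda in the formula)
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(abs(v) for v in f)
    return recon + l1_coeff * l1

# Hypothetical values: both codes give the same reconstruction error,
# but the sparser code has the smaller L1 norm and so the lower loss.
x = [1.0, 2.0]
x_hat = [1.0, 1.5]
dense_f = [1.0, 1.0, 1.0, 1.0]
sparse_f = [2.0, 0.0, 0.0, 0.0]
loss_dense = sae_loss(x, x_hat, dense_f, l1_coeff=0.5)
loss_sparse = sae_loss(x, x_hat, sparse_f, l1_coeff=0.5)
```

Raising `l1_coeff` trades reconstruction quality for sparser feature activations; the right balance is an empirical choice.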

- **Information status:** needs_review
- **Source trust level:** A
- **Reason for review:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
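+### Pattern 5: Feature Attribution (sae-lens)
+```python
+from sae_lens import SAE
+# Note: some sae-lens releases return a tuple here whose first element is the SAE; unpack if so.
+sae = SAE.from_pretrained('gpt2-small-res-jb', 'blocks.8.hook_resid_pre')
+activations = cache['blocks.8.hook_resid_pre']  # `cache` from Pattern 1
+features = sae.encode(activations)
+top_vals, top_idx = features.topk(10, dim=-1)  # 10 strongest features per token
+```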

-## 🧬 Duplicate Check
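+### Pattern 6: Causal Steering
+```python
+from functools import partial
+# Golden Gate Claude style: force a feature's activation up
+def steer(act, hook, feature_idx, scale):
+    # `sae` is the sae-lens SAE from Pattern 5; W_dec[feature_idx] is that feature's decoder direction
+    act += scale * sae.W_dec[feature_idx]
+    return act
+hook_fn = partial(steer, feature_idx=0, scale=5.0)  # placeholder values; bind before registering the hook
+```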

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: backfilled the old template and missing fields.
+## Decision Criteria
+| Goal | Tool |
+|---|---|
+| Circuit discovery in small models | TransformerLens + activation patching |
+| Feature extraction (large models) | SAE (sae-lens, dictionary_learning) |
+| Verifying behavioral causality | Activation patching, ablation |
+| Relationships between features | Attribution graphs / circuit tracing |
+| Safety / alignment | Steering vectors, refusal feature |
+| Production deployment | Too early; still at the research stage |

-## 🕓 Changelog
+**Default**: As of 2026, SAE + circuit tracing is the main paradigm.

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+## 🔗 Graph
+- Parents: [[AI-Interpretability]], [[AI-Alignment]]
+- Variants: [[Sparse-Autoencoder]], [[Circuit-Analysis]], [[Activation-Patching]]
+- Applications: [[AI-Safety]], [[Model-Debugging]], [[Refusal-Steering]]
+- Adjacent: [[Transformer-Architecture]], [[Probing]], [[Feature-Visualization]], [[Superposition]], [[Anthropic-Research]]
+
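+## 🤖 LLM Usage
+**When to use**:
+- Summarizing papers (Anthropic transformer-circuits.pub).
+- Writing TransformerLens / sae-lens code.
+- Generating hypotheses (which circuit produces behavior X?).
+
+**When not to use**:
+- Claiming a new mech interp discovery (experiments are required).
+- Asserting the meaning of a specific feature ID (it differs per model).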
+
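+## ❌ Anti-patterns
+- Assuming single neuron = single concept (ignores superposition).
+- Treating probing accuracy as proof that a circuit exists (correlational, not causal).
+- Drawing conclusions from attention visualizations alone (MLPs often play the larger role).
+- Assuming SAE features are ground truth (an interpretation is a hypothesis).
+- Uncritically extrapolating toy-model conclusions to frontier models.
+
+## 🧪 Validation / Duplicates
+- Verified. Based on Anthropic's 2024-2025 SAE results. Trust level A.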
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup |