[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,64 +1,298 @@
 ---
 id: wiki-2026-0508-circuit-discovery
-title: Circuit Discovery
+title: Circuit Discovery (Mechanistic Interpretability)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [CIRCUIT-001]
+aliases: [회로 발견, mechanistic interpretability, MI, induction head, activation patching, path patching, ACDC, sparse autoencoder]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai-Interpretability, mechanistic-interpretability, neural-networks, circuits]
+confidence_score: 0.88
+verification_status: applied
+tags: [interpretability, mechanistic, circuits, activation-patching, induction-heads, sparse-autoencoder, anthropic, transformer]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: Python
+  framework: TransformerLens / nnsight / SAE
 ---

-# [[Circuit Discovery (회로 발견)]]
+# Circuit Discovery

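-## 📌 One-Line Insight (The Karpathy Summary)
-> "Map the small algorithms that carry out specific functions inside a huge model": a core technique of mechanistic interpretability that identifies how particular neurons and heads inside a neural network connect to perform a logical function.
+## 📌 One-Line Insight
+> **"Reverse-engineer the algorithms inside an LLM."** Instead of treating the model as a black box, circuit discovery identifies the algorithmic role of specific neurons and attention heads and of the connections between them. It is a core capability in current alignment work at Anthropic and OpenAI, and sparse autoencoders (SAEs) were the 2024 breakthrough.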

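-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** Instead of treating the whole model as a black box, extract the minimal set of weights and paths that are active when the model performs a specific task (e.g. indirect object identification): the "circuit" identification pattern.
- **Details:**
-    - **Activation Patching:** Replace a specific neuron's activation with its value from a different input and measure the causal effect on the output.
-    - **Path Patching:** Trace specific connection paths between layers to map how information flows.
-    - **Induction Heads:** The discovery of specific attention-head structures specialized for copying earlier patterns or using context.
-    - **Automated Circuit Discovery (ACD):** Algorithmically search the large parameter space for meaningful connections.
+## 📖 Core Concepts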

-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** The field has moved beyond simple visualization (saliency maps) to the more precise stage of finding mathematically definable algorithms inside the model.
- **Policy change:** Growing use as a tool for model safety verification ([[Alignment]]), detecting whether potentially harmful logic circuits have formed.
+### Mechanistic interpretability (MI)
+- Identifies the specific algorithm a neural network implements.
+- Not a saliency map: the goal is the mechanism itself.
+- Hypothesis-driven.

-## 🔗 Knowledge Links (Graph)
- **Parent:** 10_Wiki/💡 Topics/AI
- **Related:** Mechanistic-Interpretability, Neuron-Attribution, Feature-Visualization
- **Raw Source:** 00_Raw/2026-04-20/Circuit Discovery.md
+### Core techniques

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+#### Activation Patching
+- Swap the activation at a specific position with the activation from a different run.
+- Measure the change in the output.
+- Identify the causal effect of that activation.

-**When to use this knowledge:**
- *(TODO)*
+#### Path Patching
+- Isolate a specific connection between components.
+- Trace layer-to-layer information flow.
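
The difference between patching an activation and patching a single path can be sketched on a toy graph. This is a minimal illustration in plain Python, not TransformerLens code: the three-node graph, its weights, and the `a_for_b` / `a_for_out` override parameters are all made up for the example.

```python
def forward(x, a_for_b=None, a_for_out=None):
    """Toy graph: a = 2x, b = a + 1, out = 3a + 5b.

    a_for_b / a_for_out override the value of `a` seen along that one edge.
    """
    a = 2 * x
    b = (a if a_for_b is None else a_for_b) + 1
    out = 3 * (a if a_for_out is None else a_for_out) + 5 * b
    return a, out


clean_x, corrupt_x = 1, 0
_, clean_out = forward(clean_x)
corrupt_a, corrupt_out = forward(corrupt_x)

# Activation patching: every edge leaving `a` sees the corrupted value
_, all_edges = forward(clean_x, a_for_b=corrupt_a, a_for_out=corrupt_a)
# Path patching: only one edge sees the corrupted value
_, only_a_to_b = forward(clean_x, a_for_b=corrupt_a)
_, only_a_to_out = forward(clean_x, a_for_out=corrupt_a)

effects = {
    "activation patch at a": clean_out - all_edges,
    "path patch a->b": clean_out - only_a_to_b,
    "path patch a->out": clean_out - only_a_to_out,
}
print(effects)
```

Because this toy graph is linear, the two per-path effects (10 and 6) add up to the activation-patching effect (16). In a real transformer the paths interact non-linearly, so they generally do not sum this cleanly.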

-**When not to use it:**
- *(TODO)*
+#### Logit Lens
+- Unembed the output of each layer.
+- Shows how the prediction evolves across layers.

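-## 🧪 Validation Status
+#### Induction Heads (Olsson et al. 2022)
+- Copy the pattern [A][B] ... [A] → [B].
+- A basis of in-context learning.
+- Need at least two attention layers (an earlier previous-token head composes with the later induction head), so they typically appear from the second layer onward.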

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+#### IOI (Indirect Object Identification)
+- "When Mary and John went to the store, John gave a drink to ___" → Mary.
+- Wang et al. (2022) mapped a 26-head circuit for this task in GPT-2 small.
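
Circuit work on a task like IOI needs a scalar score for "did the model pick the right name". A common choice is the logit difference between the correct and the distractor name. The sketch below is plain Python with made-up logit values; `logit_diff` and `normalized_effect` are illustrative helper names, not functions from any library.

```python
def logit_diff(logits, correct, incorrect):
    """Logit of the correct name minus logit of the distractor name."""
    return logits[correct] - logits[incorrect]


def normalized_effect(patched, clean, corrupt):
    """0.0 = the patch changed nothing; 1.0 = the patch fully reproduces the corrupted run."""
    return (clean - patched) / (clean - corrupt)


# Made-up final-position logits for the two candidate names
clean_logits = {" Mary": 5.0, " John": 2.0}
corrupt_logits = {" Mary": 2.0, " John": 4.0}
patched_logits = {" Mary": 3.5, " John": 3.0}

clean_ld = logit_diff(clean_logits, " Mary", " John")
corrupt_ld = logit_diff(corrupt_logits, " Mary", " John")
patched_ld = logit_diff(patched_logits, " Mary", " John")
effect = normalized_effect(patched_ld, clean_ld, corrupt_ld)
print(clean_ld, corrupt_ld, patched_ld, effect)
```

Normalizing against the clean and corrupted runs makes effects comparable across prompts whose raw logit scales differ.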

-## 🧬 Duplicate Check
+#### Automated Circuit Discovery (ACDC, Conmy et al. 2023)
+- Algorithmic search over the model's connections.
+- Prunes edges whose effect is insignificant.

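- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: updated the old template and filled in missing fields.
+#### Sparse Autoencoder (SAE, 2023-2024)
+- Decomposes activations into monosemantic features.
+- Splits polysemantic neurons into separate features.
+- Applied at large scale by Anthropic to Claude Sonnet models (e.g. Claude 3 Sonnet in "Scaling Monosemanticity").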

-## 🕓 Changelog
+### Anthropic milestones
+- "A Mathematical Framework for Transformer Circuits" (2021).
+- "In-context Learning and Induction Heads" (2022).
+- "Toy Models of Superposition" (2022).
+- "Towards Monosemanticity" (2023).
+- "Scaling Monosemanticity" (2024): SAEs with up to about 34M features on Claude 3 Sonnet.

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+### Applications
+1. **Alignment**: detect deceptive circuits.
+2. **Bias**: identify the specific features responsible.
+3. **Steering**: amplify or suppress features.
+4. **Debugging**: find the cause of hallucinations.
+5. **Trust**: capability evaluation.
+6. **Capability discovery**: surface emergent capabilities.
+
+### Limitations
+- **Scaling**: expensive on large models.
+- **Polysemanticity**: a single neuron responds to many concepts.
+- **Superposition**: more features than dimensions are squeezed into a lower-dimensional space.
+- **Generalization**: does a circuit found for a single task explain wider behavior?
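
Superposition is easier to see on a toy example. The sketch below, in plain Python, packs three features into a 2-dimensional activation using directions 120 degrees apart and reads them back with a projection plus ReLU. The layout and the `embed` / `read_out` helpers are illustrative assumptions, loosely in the spirit of "Toy Models of Superposition" rather than a reproduction of it.

```python
import math

# Three feature directions packed into 2 dimensions, 120 degrees apart
directions = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)]


def embed(features):
    """Write feature values into the 2-d activation as a weighted sum of directions."""
    return tuple(sum(f * d[i] for f, d in zip(features, directions)) for i in range(2))


def read_out(x):
    """Project onto each direction, then ReLU away negative interference."""
    return [max(0.0, x[0] * d[0] + x[1] * d[1]) for d in directions]


# One active feature: recovered exactly, interference is negative and clipped
sparse = [round(v, 6) for v in read_out(embed([0.0, 1.0, 0.0]))]
# Two active features: each is read back at only half strength
dense = [round(v, 6) for v in read_out(embed([1.0, 1.0, 0.0]))]
print(sparse, dense)
```

Recovery is exact while features are sparse and degrades once several fire together, which is why reading individual neurons or directions can mislead and why dictionary methods such as SAEs are used.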
+
+### Modern tools
+- **TransformerLens** (Neel Nanda).
+- **nnsight** (NDIF, Bau lab).
+- **SAELens**: SAE training.
+- **Neuronpedia**: feature catalog.
+- **Pyvene**: interventions.
+- **Inseq**: attribution.
+
+## 💻 Patterns
+
+### TransformerLens (basic)
+```python
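+import transformer_lens
+import torch
+
+model = transformer_lens.HookedTransformer.from_pretrained('gpt2-small')
+
+# Hook function: prints the name and shape of the activation it receives
+def hook_print(activation, hook):
+    print(f'{hook.name}: shape {activation.shape}')
+
+text = 'When Mary and John went to the store, John gave a drink to'
+
+# Forward pass with the hook attached to one activation
+model.run_with_hooks(text, fwd_hooks=[('blocks.5.attn.hook_z', hook_print)])
+
+# Forward pass that caches every activation
+logits, cache = model.run_with_cache(text, return_type='logits')
+print(cache['blocks.5.attn.hook_z'].shape)  # per-head outputs: [batch, pos, head, d_head]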
+```
+
+### Activation Patching
+```python
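+def activation_patching(model, clean_input, corrupt_input, position, layer, head):
+    """Patch one head's corrupted activation into the clean run.
+
+    Both inputs must tokenize to the same length so positions line up.
+    """
+    # 1. Cache activations from the corrupted run
+    _, corrupt_cache = model.run_with_cache(corrupt_input)
+
+    # 2. Re-run the clean input, overwriting one (position, head) slice of hook_z
+    def patch_hook(activation, hook):
+        activation[:, position, head] = corrupt_cache[hook.name][:, position, head]
+        return activation
+
+    patched_logits = model.run_with_hooks(
+        clean_input,
+        fwd_hooks=[(f'blocks.{layer}.attn.hook_z', patch_hook)],
+    )
+    return patched_logits
+
+# Example prompts (illustrative): the corrupted prompt swaps the repeated name
+clean_input = 'When Mary and John went to the store, John gave a drink to'
+corrupt_input = 'When Mary and John went to the store, Mary gave a drink to'
+target_token = model.to_single_token(' Mary')
+
+# Measure the effect on the target token's logit at the final position.
+# Position 10 is assumed to be the repeated name (BOS is prepended by default).
+clean_logits = model(clean_input)
+patched = activation_patching(model, clean_input, corrupt_input, position=10, layer=5, head=3)
+effect = (clean_logits - patched)[0, -1, target_token]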
+```
+
+### Induction head detection
+```python
+import torch
+
+def detect_induction_head(model, seq_len=50):
+    """Rank heads by how strongly they follow the [A][B]...[A] → [B] pattern."""
+    # Random token sequence repeated twice: seq_len tokens, then the same seq_len again
+    rep = torch.randint(0, model.cfg.d_vocab, (1, seq_len))
+    rep = torch.cat([rep, rep], dim=1)
+
+    _, cache = model.run_with_cache(rep)
+
+    induction_scores = {}
+    for layer in range(model.cfg.n_layers):
+        for head in range(model.cfg.n_heads):
+            # Attention pattern for this head: [query_pos, key_pos]
+            attn = cache[f'blocks.{layer}.attn.hook_pattern'][0, head]
+            # An induction head at query position i attends to key i - (seq_len - 1),
+            # the token right after the previous occurrence of the current token.
+            score = attn.diagonal(offset=-(seq_len - 1)).mean()
+            induction_scores[(layer, head)] = score.item()
+
+    # Top 5 heads by induction score
+    return sorted(induction_scores.items(), key=lambda x: -x[1])[:5]
+```
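
The diagonal offset used for the induction score can be sanity-checked without a model. The following sketch builds two hand-made attention patterns in plain Python, an idealized induction head and a uniform causal control, and scores both. The matrices and the `induction_score` helper are illustrative, and unlike a full-diagonal mean the helper averages over second-half queries only.

```python
def induction_score(attn, seq_len):
    """Mean attention from each second-half query i to key i - (seq_len - 1)."""
    offset = seq_len - 1
    queries = range(seq_len, 2 * seq_len)
    return sum(attn[i][i - offset] for i in queries) / seq_len


seq_len = 4
n = 2 * seq_len

# Idealized induction head: second-half queries put all attention on the
# token after the previous occurrence; first-half queries attend to position 0.
ideal = [[0.0] * n for _ in range(n)]
for i in range(n):
    ideal[i][i - (seq_len - 1) if i >= seq_len else 0] = 1.0

# Control: uniform causal attention over all positions up to the query
uniform = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

ideal_score = induction_score(ideal, seq_len)
uniform_score = induction_score(uniform, seq_len)
print(ideal_score, uniform_score)
```

The idealized head scores 1.0 and the uniform control stays well below it, so a high score on real heads reflects the specific offset rather than generic attention to earlier tokens.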
+
+### Logit Lens
+```python
+def logit_lens(model, text):
+    _, cache = model.run_with_cache(text)
+    
+    layer_predictions = []
+    for layer in range(model.cfg.n_layers):
+        residual = cache[f'blocks.{layer}.hook_resid_post']
+        # Apply the final LayerNorm, then unembed to vocabulary logits
+        logits = model.unembed(model.ln_final(residual))
+        top_token = logits[0, -1].argmax().item()
+        layer_predictions.append((layer, model.tokenizer.decode([top_token])))
+    
+    return layer_predictions
+```
+
+### ACDC (Automated Circuit Discovery)
+```python
+# Simplified, ACDC-inspired sketch (not the full algorithm from Conmy et al.).
+# list_all_edges, ablated_edge and evaluate are hypothetical helpers, not library functions.
+def acdc_prune(model, task_data, threshold=0.01):
+    """Keep the edges whose ablation raises the task loss by more than `threshold`."""
+    edges = list_all_edges(model)
+    kept = []
+
+    baseline_loss = evaluate(model, task_data)
+
+    for edge in edges:
+        # Zero-ablate this edge and re-evaluate
+        with ablated_edge(model, edge):
+            ablated_loss = evaluate(model, task_data)
+
+        effect = ablated_loss - baseline_loss
+        if effect > threshold:
+            kept.append((edge, effect))
+
+    # Most important edges first
+    return sorted(kept, key=lambda x: -x[1])
+```
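
A runnable toy version of threshold-based edge pruning, in plain Python, may make the idea more concrete. Everything here is an assumption for illustration: the five-edge linear graph, its weights, the visiting order, and the use of output drift with zero-ablation as the metric. It is a greedy sketch in the spirit of ACDC, not the published algorithm.

```python
def run(edges, x=1.0):
    """Evaluate a tiny linear graph: in -> {h1, h2} -> out, plus a direct in -> out edge."""
    h1 = edges.get(("in", "h1"), 0.0) * x
    h2 = edges.get(("in", "h2"), 0.0) * x
    return (edges.get(("h1", "out"), 0.0) * h1
            + edges.get(("h2", "out"), 0.0) * h2
            + edges.get(("in", "out"), 0.0) * x)


def greedy_prune(edges, order, threshold):
    """Visit edges in `order`; permanently drop an edge if removing it moves the
    output away from the full model's output by less than `threshold`."""
    full_output = run(edges)
    kept = dict(edges)
    for edge in order:
        trial = {e: w for e, w in kept.items() if e != edge}
        drift_before = abs(run(kept) - full_output)
        drift_after = abs(run(trial) - full_output)
        if drift_after - drift_before < threshold:
            kept = trial  # removal is cumulative: later edges are judged on the pruned graph
    return kept


edges = {
    ("in", "h1"): 2.0, ("h1", "out"): 1.5,    # strong path
    ("in", "h2"): 0.01, ("h2", "out"): 1.0,   # weak path
    ("in", "out"): 0.002,                     # negligible direct edge
}
# Edges nearest the output are visited first
order = [("h1", "out"), ("h2", "out"), ("in", "out"), ("in", "h1"), ("in", "h2")]
circuit = greedy_prune(edges, order, threshold=0.05)
print(sorted(circuit))
```

Only the strong `in -> h1 -> out` path survives. Unlike scoring each edge independently, the cumulative removal lets `in -> h2` be dropped for free once `h2 -> out` is gone.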
+
+### Sparse Autoencoder (SAE, 2024)
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class SAE(nn.Module):
+    """Dictionary learning on transformer activations."""
+    def __init__(self, d_in=4096, d_dict=131072, l1_coeff=1e-3):
+        super().__init__()
+        self.encoder = nn.Linear(d_in, d_dict)
+        self.decoder = nn.Linear(d_dict, d_in, bias=False)
+        self.l1_coeff = l1_coeff
+
+    def forward(self, x):
+        features = F.relu(self.encoder(x))
+        x_hat = self.decoder(features)
+
+        recon_loss = (x - x_hat).pow(2).mean()
+        l1_loss = features.abs().mean()  # sparsity penalty
+
+        return x_hat, features, recon_loss + self.l1_coeff * l1_loss
+
+# Train on cached model activations.
+# `activation_loader` is a placeholder: any iterable of [batch, d_in] tensors.
+sae = SAE(d_in=4096, d_dict=4096 * 32)
+optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
+for batch in activation_loader:
+    optimizer.zero_grad()
+    x_hat, features, loss = sae(batch)
+    loss.backward()
+    optimizer.step()
+
+# Then inspect the learned features for monosemanticity.
+```
+
+### Steering with SAE feature
+```python
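+import torch.nn.functional as F
+
+def steer(model, sae, text, feature_idx, scale=5.0):
+    """Amplify one SAE feature during the forward pass."""
+    def hook(activation, hook):
+        # SAE encode → modify one feature → decode.
+        # Note: returning the decoded tensor also drops the SAE's reconstruction error.
+        features = F.relu(sae.encoder(activation))
+        features[:, :, feature_idx] += scale
+        return sae.decoder(features)
+
+    # Layer 16 assumes a model with at least 17 layers (gpt2-small has 12)
+    return model.run_with_hooks(text, fwd_hooks=[('blocks.16.hook_resid_post', hook)])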
+```
+
+### Feature visualization (Neuronpedia-style)
+```python
+import torch
+import torch.nn.functional as F
+
+def find_max_activating(model, sae, feature_idx, dataset, top_k=10):
+    """Return the top_k inputs on which the feature activates most strongly."""
+    activations = []
+    for text in dataset:
+        with torch.no_grad():
+            _, cache = model.run_with_cache(text)
+            f = F.relu(sae.encoder(cache['blocks.16.hook_resid_post']))
+            max_act = f[0, :, feature_idx].max().item()
+            activations.append((text, max_act))
+
+    return sorted(activations, key=lambda x: -x[1])[:top_k]
+```
+
+## 🤔 Decision Criteria
+| Goal | Method |
+|---|---|
+| Behavior cause | Activation patching |
+| Information flow | Path patching |
+| Layer evolution | Logit lens |
+| In-context learning | Induction head analysis |
+| Feature discovery | SAE |
+| Algorithmic search | ACDC |
+| Steering | SAE feature manipulation |
+| Capability eval | Probe |
+
+**Default**: start with TransformerLens and activation patching as the baseline. For large models, use SAEs.
+
+## 🔗 Graph
+- Parent: [[Interpretability]] · [[AI-Safety]] · [[Mechanistic-Interpretability]]
+- Variants: [[Activation-Patching]] · [[Path-Patching]] · [[Logit-Lens]] · [[ACDC]] · [[SAE]]
+- Applications: [[Steering]] · [[Feature-Visualization]] · [[Induction-Head]] · [[IOI-Circuit]]
+- Adjacent: [[Anthropic]] · [[OpenAI]] · [[AI-Alignment]] · [[TransformerLens]] · [[Neuronpedia]]
+
+## 🤖 LLM Usage
+**When to use**: alignment research, model debugging, capability discovery, steering, trust evaluation.
+**When not to use**: fine-tuning for a specific task; production inference (this is a research technique).
+
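+## ❌ Anti-patterns
+- **Correlation only**: drawing conclusions without causal patching.
+- **Generalizing from a single circuit**: a circuit is a narrow finding.
+- **No baseline**: without a random baseline, a real effect cannot be told apart from noise.
+- **Ignoring polysemanticity**: a single neuron can encode many concepts.
+- **No SAE on large models**: features in superposition are missed.
+- **Manual only at scale**: models at 100B+ parameters need automated methods such as ACDC.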
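
One way to guard against the "no baseline" anti-pattern is to compare a candidate component's patching effect with the effects of randomly chosen control components. The sketch below uses only the Python standard library. The effect values are made up, `baseline_z_score` is an illustrative helper, and any cutoff applied to the resulting z-score is a judgment call, not an established standard.

```python
import statistics


def baseline_z_score(candidate_effect, control_effects):
    """How many control standard deviations the candidate sits above the control mean."""
    mean = statistics.mean(control_effects)
    spread = statistics.stdev(control_effects)
    return (candidate_effect - mean) / spread


# Made-up patching effects from randomly chosen control heads
control_effects = [0.02, -0.01, 0.03, 0.00, 0.01]

strong = baseline_z_score(0.90, control_effects)  # far outside the control spread
weak = baseline_z_score(0.02, control_effects)    # indistinguishable from controls
print(round(strong, 1), round(weak, 2))
```

An effect that looks large in isolation only counts as evidence for a circuit if it clearly exceeds what random components produce under the same metric.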
+
+## 🧪 Validation / Duplicates
+- Verified (Anthropic transformer-circuits.pub, Olsson induction heads, Wang IOI, ACDC paper).
+- Trust level: A.
+- Related: [[AI-Alignment]] · [[AI-Safety]] · [[Anthropic-Principle]] · [[Sparse-Autoencoder]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: MI techniques + SAE + TransformerLens / ACDC / steering code |