---
id: wiki-2026-0508-circuit-discovery
title: Circuit Discovery (Mechanistic Interpretability)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [회로 발견, mechanistic interpretability, MI, induction head, activation patching, path patching, ACDC, sparse autoencoder]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [interpretability, mechanistic, circuits, activation-patching, induction-heads, sparse-autoencoder, anthropic, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: TransformerLens / nnsight / SAE
---

# Circuit Discovery

## 📌 One-Line Insight

> **"Reverse-engineering the algorithms inside an LLM."** Instead of treating the model as a black box, circuit discovery identifies the algorithmic role of specific neurons and attention heads and of the connections between them. It is a core capability for alignment work at Anthropic and OpenAI, and sparse autoencoders (SAEs) were the breakthrough of 2024.

## 📖 Core Concepts

### Mechanistic interpretability (MI)

- Identifies the specific algorithm a neural network implements.
- Targets the mechanism itself, not a saliency map.
- Hypothesis-driven.

### Core techniques

#### Activation Patching

- Swap the activation at a specific position between two runs.
- Measure the change in the output.
- Identify the causal effect of that activation.

#### Path Patching

- Isolate a specific connection between components.
- Trace layer-to-layer information flow.

#### Logit Lens

- Unembed the output of each layer.
- Track how the prediction evolves across layers.

#### Induction Heads (Olsson et al. 2022)

- Copy the pattern [A][B] ... [A] → [B].
- Basis of in-context learning.
- Need at least two layers, because a previous-token head in an earlier layer feeds the induction head; typically reported around layers 2-3 in small models.

#### IOI (Indirect Object Identification)

- "When Mary and John went to the store, John gave a drink to ___" → Mary.
- Wang et al. 2022 mapped a 26-head circuit for this task.

#### Automated Circuit Discovery (ACDC, Conmy et al. 2023)

- Algorithmic search over the computational graph.
- Prunes insignificant edges.

#### Sparse Autoencoder (SAE, 2023-2024)

- Decomposes activations into monosemantic features.
- Splits polysemantic neurons into separate features.
- Applied at large scale to Anthropic's Claude 3 Sonnet.

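To make the polysemantic-versus-monosemantic distinction concrete, here is a minimal toy sketch in plain Python. It is not Anthropic's method and trains nothing: three hypothetical feature directions are packed into a 2-dimensional activation space (superposition), so each raw coordinate ("neuron") responds to several features, while a hand-set ReLU dictionary readout yields one active unit per feature. `DIRECTIONS`, `activation`, and `encode` are names invented for this illustration; a real SAE learns its dictionary from model activations.

```python
import math

# Assumption for illustration: 3 features packed into 2 dimensions (superposition).
# Feature k points at an angle of 120 * k degrees.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def activation(feature_idx, strength):
    """2-D activation produced when exactly one feature is active."""
    dx, dy = DIRECTIONS[feature_idx]
    return (strength * dx, strength * dy)

def encode(x):
    """Hand-set dictionary readout: ReLU of the projection onto each direction."""
    return [max(0.0, x[0] * dx + x[1] * dy) for dx, dy in DIRECTIONS]

# Raw neuron 0 (the first coordinate) moves for all three features: polysemantic.
neuron0 = [activation(k, 1.0)[0] for k in range(3)]

# Each dictionary unit fires for exactly one feature: monosemantic.
codes = [encode(activation(k, 2.0)) for k in range(3)]

print("neuron 0 response per feature:", [round(v, 2) for v in neuron0])
for k, code in enumerate(codes):
    print(f"feature {k} active -> dictionary code", [round(v, 2) for v in code])
```

The clean recovery depends on sparsity: with only one feature active at a time, the interference from the other two directions is negative and the ReLU removes it. With several features active at once, the readout would no longer be exact.
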
### Anthropic milestones

- "A Mathematical Framework for Transformer Circuits" (2021).
- "In-context Learning and Induction Heads" (2022).
- "Toy Models of Superposition" (2022).
- "Towards Monosemanticity" (2023).
- "Scaling Monosemanticity" (2024): SAEs with up to ~34M features on Claude 3 Sonnet.

### Applications

1. **Alignment**: detect deceptive circuits.
2. **Bias**: identify the specific features responsible.
3. **Steering**: amplify or suppress features.
4. **Debugging**: trace the cause of hallucinations.
5. **Trust**: capability evaluation.
6. **Capability discovery**: surface emergent capabilities.

### Limitations

- **Scaling**: expensive on large models.
- **Polysemanticity**: a single neuron encodes many concepts.
- **Superposition**: more features than dimensions, squeezed into a lower-dimensional space.
- **Generalization**: it is unclear whether a circuit found on a single task explains wider behavior.

### Modern tools

- **TransformerLens** (Neel Nanda).
- **nnsight** (NDIF, Bau lab).
- **SAELens**: SAE training.
- **Neuronpedia**: feature catalog.
- **pyvene**: intervention.
- **Inseq**: attribution.

## 💻 Patterns

### TransformerLens (basic)

```python
import torch
import transformer_lens

model = transformer_lens.HookedTransformer.from_pretrained('gpt2-small')

# A hook receives the activation tensor and the hook point it fired on.
def hook_print(activation, hook):
    print(f'{hook.name}: shape {activation.shape}')

# Forward pass that caches every intermediate activation.
text = 'When Mary and John went to the store, John gave a drink to'
logits, cache = model.run_with_cache(text, return_type='logits')
print(cache['blocks.5.attn.hook_z'].shape)  # per-head outputs: [batch, pos, head, d_head]

# Forward pass with the hook attached to one hook point.
model.run_with_hooks(text, fwd_hooks=[('blocks.5.attn.hook_z', hook_print)])
```

### Activation Patching

```python
def activation_patching(model, clean_input, corrupt_input, position, layer, head):
    """Patch one head's activation from the corrupted run into the clean run."""
    # 1. Cache activations from the corrupted run.
    _, corrupt_cache = model.run_with_cache(corrupt_input)

    # 2. Patch into the clean run at (position, head).
    def patch_hook(activation, hook):
        activation[:, position, head] = corrupt_cache[hook.name][:, position, head]
        return activation

    patched_logits = model.run_with_hooks(
        clean_input,
        fwd_hooks=[(f'blocks.{layer}.attn.hook_z', patch_hook)],
    )
    return patched_logits

# Illustrative inputs (an assumption, not from a paper): both prompts must tokenize
# to the same length. Position 10 is the second subject name if a BOS token is
# prepended (the TransformerLens default); check with model.to_str_tokens.
clean_input = 'When Mary and John went to the store, John gave a drink to'
corrupt_input = 'When Mary and John went to the store, Mary gave a drink to'
target_token = model.to_single_token(' Mary')

# Measure the effect on the target-token logit at the final position.
clean_logits = model(clean_input)
patched = activation_patching(model, clean_input, corrupt_input, position=10, layer=5, head=3)
effect = (clean_logits - patched)[0, -1, target_token]
```

### Induction head detection

```python
import torch

def detect_induction_head(model, seq_len=50):
    """Rank heads by how strongly they implement [A][B]...[A] → [B]."""
    # Random token sequence repeated once: seq_len tokens, then the same seq_len tokens.
    rep = torch.randint(0, model.cfg.d_vocab, (1, seq_len))
    rep = torch.cat([rep, rep], dim=1)
    _, cache = model.run_with_cache(rep)

    induction_scores = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Attention pattern for this head: [query_pos, key_pos].
            attn = cache[f'blocks.{layer}.attn.hook_pattern'][0, head]
            # An induction head at query i attends to key i - (seq_len - 1),
            # the token right after the previous occurrence of the current token.
            score = attn.diagonal(offset=-(seq_len - 1)).mean()
            induction_scores[(layer, head)] = score.item()
    return sorted(induction_scores.items(), key=lambda x: -x[1])[:5]
```

### Logit Lens

```python
def logit_lens(model, text):
    """Decode the top next-token prediction from the residual stream after each layer."""
    _, cache = model.run_with_cache(text)
    layer_predictions = []
    for layer in range(model.cfg.n_layers):
        residual = cache[f'blocks.{layer}.hook_resid_post']
        # Apply the final layer norm, then unembed into vocabulary logits.
        logits = model.unembed(model.ln_final(residual))
        top_token = logits[0, -1].argmax().item()
        layer_predictions.append((layer, model.tokenizer.decode([top_token])))
    return layer_predictions
```

### ACDC (Automated Circuit Discovery)

```python
# Simplified ACDC-style pruning sketch.
# `list_all_edges`, `ablated_edge`, and `evaluate` are placeholder helpers,
# not functions from any library.
def acdc_prune(model, task_data, threshold=0.01):
    """Prune edges whose ablation effect is at or below `threshold`."""
    edges = list_all_edges(model)
    kept = []
    baseline_loss = evaluate(model, task_data)
    for edge in edges:
        # Zero-ablate the edge and re-evaluate.
        with ablated_edge(model, edge):
            ablated_loss = evaluate(model, task_data)
        effect = ablated_loss - baseline_loss
        if effect > threshold:
            kept.append((edge, effect))
    return sorted(kept, key=lambda x: -x[1])
```

### Sparse Autoencoder (SAE, 2024)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    """Dictionary learning over transformer activations."""
    def __init__(self, d_in=4096, d_dict=131072, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_dict)
        self.decoder = nn.Linear(d_dict, d_in, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        features = F.relu(self.encoder(x))
        x_hat = self.decoder(features)
        recon_loss = (x - x_hat).pow(2).mean()
        l1_loss = features.abs().mean()  # sparsity penalty
        return x_hat, features, recon_loss + self.l1_coeff * l1_loss

# Train on cached model activations.
# `activation_loader` is a placeholder for an iterable of [batch, d_in] tensors.
sae = SAE(d_in=4096, d_dict=4096 * 32)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)  # lr is an illustrative choice
for batch in activation_loader:
    optimizer.zero_grad()
    x_hat, features, loss = sae(batch)
    loss.backward()
    optimizer.step()
# Then inspect the learned features for monosemanticity.
```

### Steering with SAE feature

```python
import torch.nn.functional as F

def steer(model, sae, text, feature_idx, scale=5.0):
    """Amplify a specific SAE feature during the forward pass."""
    def steer_hook(activation, hook):
        # SAE encode → modify → decode. Note: this replaces the residual with the
        # SAE reconstruction, so any reconstruction error is dropped.
        features = F.relu(sae.encoder(activation))
        features[:, :, feature_idx] += scale
        return sae.decoder(features)

    # Assumes a model with at least 17 layers and d_model equal to the SAE's d_in.
    return model.run_with_hooks(text, fwd_hooks=[('blocks.16.hook_resid_post', steer_hook)])
```

### Feature visualization (Neuronpedia-style)

```python
import torch
import torch.nn.functional as F

def find_max_activating(model, sae, feature_idx, dataset, top_k=10):
    """Return the inputs that most strongly activate a feature."""
    activations = []
    for text in dataset:
        with torch.no_grad():
            _, cache = model.run_with_cache(text)
            f = F.relu(sae.encoder(cache['blocks.16.hook_resid_post']))
            # Peak activation of the feature over all token positions.
            max_act = f[0, :, feature_idx].max().item()
        activations.append((text, max_act))
    return sorted(activations, key=lambda x: -x[1])[:top_k]
```

## 🤔 Decision Criteria

| Goal | Method |
|---|---|
| Behavior cause | Activation patching |
| Information flow | Path patching |
| Layer evolution | Logit lens |
| In-context learning | Induction head analysis |
| Feature discovery | SAE |
| Algorithmic search | ACDC |
| Steering | SAE feature manipulation |
| Capability eval | Probe |

**Default**: TransformerLens + activation patching as the baseline. For large models, use SAEs.

## 🔗 Graph

- Parent: [[Interpretability]] · [[AI Safety]] · [[Mechanistic-Interpretability]]
- Variants: [[Activation-Patching]] · [[Path-Patching]] · [[ACDC]]
- Applications: [[Steering]] · [[Induction-Head]]
- Adjacent: [[Anthropic]] · [[AI_Safety_and_Alignment|AI-Alignment]]

## 🤖 LLM Usage

**When**: alignment research, model debugging, capability discovery, steering, trust evaluation.

**When not**: task-specific fine-tuning; production inference (this is a research domain).

## ❌ Anti-patterns

- **Correlation only**: no causal patching to back the claim.
- **Generalizing from a single circuit**: the finding is narrow.
- **No baseline**: a real effect cannot be told apart from random variation.
- **Ignoring polysemanticity**: a single neuron encodes many concepts.
- **No SAE on large models**: superposition is missed.
- **Manual only at scale**: models at 100B+ parameters need automated methods such as ACDC.

## 🧪 Verification / Duplicates

- Verified (Anthropic transformer-circuits.pub, Olsson induction heads, Wang IOI, ACDC paper).
- Trust level: A.
- Related: [[AI_Safety_and_Alignment|AI-Alignment]] · [[AI Safety]] · [[Anthropic-Principle]] · [[Sparse-Autoencoder]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: MI techniques + SAE + TransformerLens / ACDC / steering code |