---
id: wiki-2026-0508-circuit-discovery
title: Circuit Discovery (Mechanistic Interpretability)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [회로 발견, mechanistic interpretability, MI, induction head, activation patching, path patching, ACDC, sparse autoencoder]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [interpretability, mechanistic, circuits, activation-patching, induction-heads, sparse-autoencoder, anthropic, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: TransformerLens / nnsight / SAE
---

# Circuit Discovery

## 📌 One-line insight

> **"Reverse-engineer the algorithms inside an LLM."** Instead of treating the model as a black box, identify the algorithmic role of specific neurons and attention heads and of the connections between them. A core capability for alignment work at labs such as Anthropic and OpenAI. Sparse autoencoders (SAEs) were the breakthrough of 2023-2024.

## 📖 Core

### Mechanistic interpretability (MI)

- Identifies the specific algorithm a neural network implements.
- Explains the mechanism, rather than producing a saliency map of which inputs mattered.
- Hypothesis-driven: propose a circuit, then test it with interventions.

### Core techniques

#### Activation Patching

- Swap the activation at a specific position (layer, head, token) with the activation from a different run.
- Measure how the output changes.
- Identifies which components causally affect the behavior.

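The three steps above can be sketched on a dependency-free toy model. Everything here is hypothetical: `run`, the components `h1` and `h2`, and their weights are made up for illustration. The patch direction is clean-into-corrupt; the reverse direction works the same way. The score is the fraction of the clean-versus-corrupt output gap that one patched component restores.

```python
def run(x, patch=None):
    """Tiny two-component 'model'. `patch` forces a component to a given value."""
    patch = patch or {}
    acts = {}
    acts["h1"] = patch.get("h1", 2.0 * x[0])     # reads input feature 0 only
    acts["h2"] = patch.get("h2", x[0] + x[1])    # reads both input features
    output = 3.0 * acts["h1"] + 0.5 * acts["h2"]
    return output, acts


clean_x, corrupt_x = (1.0, 0.0), (0.0, 0.0)
clean_out, clean_acts = run(clean_x)
corrupt_out, _ = run(corrupt_x)


def recovered(component):
    """Fraction of the clean-vs-corrupt gap restored by patching one component."""
    patched_out, _ = run(corrupt_x, patch={component: clean_acts[component]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


scores = {name: recovered(name) for name in ("h1", "h2")}
print(scores)  # h1 restores most of the gap, h2 very little
```

On this linear toy the two scores sum to 1. In a real transformer they generally do not, because components interact.
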
#### Path Patching

- Isolates a specific connection between components, rather than a whole activation.
- Traces layer-to-layer information flow.

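A minimal sketch of the difference from plain activation patching, assuming a made-up three-node graph (sender, mediator, output) with arbitrary weights. The `forward` function and its keyword arguments are hypothetical and only serve to override the sender value along one path at a time.

```python
def forward(x, direct_sender=None, mediated_sender=None):
    """Toy graph: sender -> output (direct) and sender -> mediator -> output.

    Each keyword overrides the sender value seen along one path only.
    """
    sender = 2.0 * x
    s_direct = sender if direct_sender is None else direct_sender
    s_mediated = sender if mediated_sender is None else mediated_sender
    mediator = 3.0 * s_mediated
    output = 1.0 * s_direct + 0.5 * mediator
    return output, sender


clean_out, _ = forward(1.0)
_, corrupt_sender = forward(0.0)

# Patch the corrupt sender value into one path at a time.
direct_only, _ = forward(1.0, direct_sender=corrupt_sender)
mediated_only, _ = forward(1.0, mediated_sender=corrupt_sender)
both, _ = forward(1.0, direct_sender=corrupt_sender, mediated_sender=corrupt_sender)

direct_effect = clean_out - direct_only      # carried by sender -> output
mediated_effect = clean_out - mediated_only  # carried through the mediator
total_effect = clean_out - both              # what plain activation patching measures
```

Plain activation patching only reports `total_effect`; patching one path at a time splits it into the direct and mediated parts.
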
#### Logit Lens

- Apply the unembedding to each layer's output (the residual stream).
- Shows how the prediction evolves from layer to layer.

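#### Induction Heads (Olsson et al. 2022)

- Copy the pattern `[A][B] ... [A] → [B]`.
- A basis for in-context learning.
- Need at least two attention layers: a previous-token head in an earlier layer composes with the induction head in a later one, so the mechanism does not occur in one-layer models.
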
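The `[A][B] ... [A] → [B]` rule can be written as a plain function. This is only a behavioral specification of what the head computes, not how a transformer implements it; the name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token."""
    current = tokens[-1]
    # Scan earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so nothing to copy


print(induction_predict(["A", "B", "C", "A"]))  # B
```
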
#### IOI (Indirect Object Identification)

- "When Mary and John went to the store, John gave a drink to ___" → Mary.
- A circuit of 26 attention heads was mapped in GPT-2 small (Wang et al. 2022).

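Patching experiments on this task need clean/corrupt prompt pairs. The sketch below builds them from the sentence above. The corruption used here (replacing the repeated subject with the other name, which flips the answer) is one simple choice made for illustration and is not claimed to be the paper's setup; `ioi_pairs` and the dictionary keys are hypothetical names.

```python
from itertools import permutations

TEMPLATE = "When {a} and {b} went to the store, {s} gave a drink to"


def ioi_pairs(names):
    """Build clean/corrupt prompt pairs for every ordered pair of names."""
    pairs = []
    for indirect_object, subject in permutations(names, 2):
        pairs.append({
            # Clean: the subject is repeated, so the answer is the indirect object.
            "clean": TEMPLATE.format(a=indirect_object, b=subject, s=subject),
            # Corrupt: the other name is repeated, so the answer flips.
            "corrupt": TEMPLATE.format(a=indirect_object, b=subject, s=indirect_object),
            # Leading space matches how GPT-2 style tokenizers encode a mid-sentence word.
            "answer": " " + indirect_object,
            "wrong": " " + subject,
        })
    return pairs


pairs = ioi_pairs(["Mary", "John"])
print(pairs[0]["clean"])
```

A common metric on such pairs is the logit difference between `answer` and `wrong` at the final position.
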
#### Automated Circuit Discovery (ACDC, Conmy et al. 2023)

- Algorithmic search over the model's computational graph.
- Prunes edges whose removal has an insignificant effect on the task metric.

#### Sparse Autoencoder (SAE, 2023-2024)

- Decomposes activations into monosemantic features.
- Splits polysemantic neurons into separate features.
- Applied at large scale by Anthropic to Claude 3 Sonnet.

### Anthropic milestones

- "A Mathematical Framework for Transformer Circuits" (2021).
- "In-context Learning and Induction Heads" (2022).
- "Toy Models of Superposition" (2022).
- "Towards Monosemanticity" (2023).
- "Scaling Monosemanticity" (2024): SAEs with up to about 34M features trained on Claude 3 Sonnet.

### Applications

1. **Alignment**: detect deceptive circuits.
2. **Bias**: identify the specific features responsible.
3. **Steering**: amplify or suppress a feature.
4. **Debugging**: find the cause of a hallucination.
5. **Trust**: capability evaluation.
6. **Capability discovery**: surface emergent capabilities.

### Limitations

- Scaling: expensive on large models.
- Polysemanticity: a single neuron encodes many concepts.
- Superposition: more features than dimensions, so features are squeezed into a lower-dimensional space.
- Generalization: does a circuit found for a single task explain wider behavior?

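Superposition and the interference it causes can be seen in a toy with three features packed into two dimensions. This is a hand-built illustration under stated assumptions (fixed directions 120 degrees apart, dot-product readout with ReLU), not a trained model; `DIRECTIONS`, `encode`, and `decode` are hypothetical names.

```python
import math

# Three feature directions packed into two dimensions, 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)
]


def encode(features):
    """Superpose feature activations into a 2-d activation vector."""
    return tuple(
        sum(f * d[axis] for f, d in zip(features, DIRECTIONS)) for axis in range(2)
    )


def decode(x):
    """Read each feature back with a dot product followed by ReLU."""
    return [max(0.0, x[0] * d[0] + x[1] * d[1]) for d in DIRECTIONS]


sparse = decode(encode([1.0, 0.0, 0.0]))  # one active feature
dense = decode(encode([1.0, 1.0, 0.0]))   # two active features
print(sparse, dense)
```

With one active feature the readout is exact. With two active features each is read back at about half strength, which is why superposition relies on features being sparse.
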
### Modern tools

- **TransformerLens** (Neel Nanda).
- **nnsight** (NDIF, Bau lab).
- **SAELens**: SAE training.
- **Neuronpedia**: feature catalog.
- **Pyvene**: interventions.
- **Inseq**: attribution.

## 💻 Patterns

### TransformerLens (basic)

```python
import torch
import transformer_lens

model = transformer_lens.HookedTransformer.from_pretrained('gpt2-small')

# Hook: print the name and shape of the activation passing through.
def hook_print(activation, hook):
    print(f'{hook.name}: shape {activation.shape}')

text = 'When Mary and John went to the store, John gave a drink to'

# Forward pass with the hook attached to one hook point.
model.run_with_hooks(text, fwd_hooks=[('blocks.5.attn.hook_z', hook_print)])

# Forward pass that caches every activation.
logits, cache = model.run_with_cache(text, return_type='logits')
print(cache['blocks.5.attn.hook_z'].shape)  # head outputs: [batch, pos, head, d_head]
```

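### Activation Patching

```python
def activation_patching(model, clean_input, corrupt_input, position, layer, head):
    """Patch one head's corrupt-run output into the clean run at one position."""
    # 1. Cache activations from the corrupt run.
    _, corrupt_cache = model.run_with_cache(corrupt_input)

    # 2. Overwrite that head's output in the clean run.
    #    hook_z has shape [batch, pos, head, d_head].
    def patch_hook(activation, hook):
        activation[:, position, head] = corrupt_cache[hook.name][:, position, head]
        return activation

    patched_logits = model.run_with_hooks(
        clean_input,
        fwd_hooks=[(f'blocks.{layer}.attn.hook_z', patch_hook)],
    )
    return patched_logits

# `model` is the HookedTransformer loaded in the basic example.
# Illustrative prompt pair (assumed for this example); both must tokenize to the same length.
clean_input = 'When Mary and John went to the store, John gave a drink to'
corrupt_input = 'When Mary and John went to the store, Mary gave a drink to'
target_token = model.to_single_token(' Mary')

# Measure the effect on the target logit at the final position.
clean_logits = model(clean_input)
patched = activation_patching(model, clean_input, corrupt_input, position=10, layer=5, head=3)
effect = (clean_logits - patched)[0, -1, target_token]
```
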
### Induction head detection

```python
def detect_induction_head(model, seq_len=50):
    """Rank heads by how strongly they follow the [A][B]...[A] → [B] pattern."""
    # Random token sequence repeated twice: positions i and i + seq_len hold the same token.
    rep = torch.randint(0, model.cfg.d_vocab, (1, seq_len), device=model.cfg.device)
    rep = torch.cat([rep, rep], dim=1)

    _, cache = model.run_with_cache(rep)

    induction_scores = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Attention pattern of this head: [query_pos, key_pos].
            attn = cache[f'blocks.{layer}.attn.hook_pattern'][0, head]
            # An induction head at query i attends to key i - (seq_len - 1),
            # the token right after the previous occurrence of token i.
            score = attn.diagonal(offset=-(seq_len - 1)).mean()
            induction_scores[(layer, head)] = score.item()

    # Top 5 heads by induction score.
    return sorted(induction_scores.items(), key=lambda x: -x[1])[:5]
```

### Logit Lens

```python
def logit_lens(model, text):
    """Decode the top next-token prediction from the residual stream after each layer."""
    _, cache = model.run_with_cache(text)

    layer_predictions = []
    for layer in range(model.cfg.n_layers):
        residual = cache[f'blocks.{layer}.hook_resid_post']
        # Apply the final LayerNorm and the unembedding to the intermediate residual.
        logits = model.unembed(model.ln_final(residual))
        top_token = logits[0, -1].argmax().item()
        layer_predictions.append((layer, model.tokenizer.decode([top_token])))

    return layer_predictions
```

### ACDC (Automated Circuit Discovery)

```python
# Simplified ACDC-style pruning. `list_all_edges`, `evaluate`, and `ablated_edge`
# are placeholder helpers that are not defined here.
def acdc_prune(model, task_data, threshold=0.01):
    """Keep only the edges whose ablation changes the loss by more than `threshold`."""
    edges = list_all_edges(model)
    kept = []

    baseline_loss = evaluate(model, task_data)

    for edge in edges:
        # Zero-ablate one edge and re-evaluate. (The ACDC paper instead patches in
        # activations from a corrupted input and removes edges cumulatively.)
        with ablated_edge(model, edge):
            ablated_loss = evaluate(model, task_data)

        effect = ablated_loss - baseline_loss
        if effect > threshold:
            kept.append((edge, effect))

    # Most important edges first.
    return sorted(kept, key=lambda x: -x[1])
```

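The pruning idea can be run end to end on a toy graph. In this sketch the "model" is just a sum of edge contributions and the metric is the absolute output change relative to the full graph; both are stand-ins chosen for illustration, and `acdc_toy` and the edge names are hypothetical. Unlike the per-edge loop above, removals here are cumulative: an edge, once dropped, stays dropped while later edges are tested.

```python
def acdc_toy(edge_weights, threshold):
    """Greedy pruning: drop an edge permanently if the metric barely moves."""
    def output(active_edges):
        return sum(edge_weights[e] for e in active_edges)

    full_out = output(edge_weights)
    kept = set(edge_weights)
    # Fixed alphabetical order for the toy; a real graph would be walked
    # from the output back toward the input.
    for edge in sorted(edge_weights):
        trial = kept - {edge}
        before = abs(output(kept) - full_out)
        after = abs(output(trial) - full_out)
        if after - before < threshold:
            kept = trial  # removal is cheap enough, make it permanent
    return kept


edge_weights = {"a->out": 2.0, "b->out": 0.004, "c->out": -1.5, "d->out": 0.003}
circuit = acdc_toy(edge_weights, threshold=0.01)
print(sorted(circuit))  # ['a->out', 'c->out']
```
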
### Sparse Autoencoder (SAE, 2024)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    """Dictionary learning on transformer activations."""
    def __init__(self, d_in=4096, d_dict=131072, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_dict)
        self.decoder = nn.Linear(d_dict, d_in, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        features = F.relu(self.encoder(x))
        x_hat = self.decoder(features)

        recon_loss = (x - x_hat).pow(2).mean()
        l1_loss = features.abs().mean()  # sparsity penalty

        return x_hat, features, recon_loss + self.l1_coeff * l1_loss

# Train on cached model activations. `activation_loader` is assumed to yield
# [batch, d_in] tensors and is not defined here; the learning rate is illustrative.
sae = SAE(d_in=4096, d_dict=4096 * 32)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
for batch in activation_loader:
    x_hat, features, loss = sae(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Then inspect the learned features for monosemanticity.
```

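### Steering with SAE feature

```python
import torch.nn.functional as F

def steer(model, sae, text, feature_idx, scale=5.0):
    """Amplify one SAE feature in the residual stream."""
    def hook(activation, hook):
        # Encode with the SAE, boost one feature, decode back.
        features = F.relu(sae.encoder(activation))
        # Keep the part of the activation the SAE fails to reconstruct,
        # so only the steered feature changes.
        error = activation - sae.decoder(features)
        features[:, :, feature_idx] += scale
        return sae.decoder(features) + error

    # Layer 16 assumes a model with more than 16 layers whose d_model
    # matches the SAE's d_in (so not gpt2-small).
    return model.run_with_hooks(text, fwd_hooks=[('blocks.16.hook_resid_post', hook)])
```
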
### Feature visualization (Neuronpedia-style)

```python
def find_max_activating(model, sae, feature_idx, dataset, top_k=10):
    """Return the top_k inputs on which the feature activates most strongly."""
    activations = []
    for text in dataset:
        with torch.no_grad():
            _, cache = model.run_with_cache(text)
            f = F.relu(sae.encoder(cache['blocks.16.hook_resid_post']))
        # Peak activation of this feature over all token positions.
        max_act = f[0, :, feature_idx].max().item()
        activations.append((text, max_act))

    return sorted(activations, key=lambda x: -x[1])[:top_k]
```

## 🤔 Decision criteria

| Goal | Method |
|---|---|
| Behavior cause | Activation patching |
| Information flow | Path patching |
| Layer evolution | Logit lens |
| In-context learning | Induction head analysis |
| Feature discovery | SAE |
| Algorithmic search | ACDC |
| Steering | SAE feature manipulation |
| Capability eval | Probe |

**Default**: TransformerLens + activation patching as the baseline. For large models, use SAEs.

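The table lists probing for capability evaluation. A minimal sketch of one simple probe, assuming activations are already extracted as plain lists of floats with binary labels: a difference-of-means direction with a midpoint threshold. The function names and the toy data are hypothetical; practical probes are usually trained with logistic regression on real activations.

```python
def fit_mean_diff_probe(acts, labels):
    """Linear probe whose direction is mean(positive) - mean(negative)."""
    dim = len(acts[0])

    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

    mu_pos = mean([a for a, y in zip(acts, labels) if y == 1])
    mu_neg = mean([a for a, y in zip(acts, labels) if y == 0])
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary halfway between the two class means.
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict(probe, act):
    direction, bias = probe
    return int(sum(d * a for d, a in zip(direction, act)) + bias > 0)


# Toy 2-d "activations": the label depends on the first coordinate only.
acts = [(1.0, 0.2), (0.9, -0.1), (-1.0, 0.1), (-0.8, 0.0)]
labels = [1, 1, 0, 0]
probe = fit_mean_diff_probe(acts, labels)
print(predict(probe, (0.7, 0.3)), predict(probe, (-0.5, 0.3)))  # 1 0
```

A probe shows that information is linearly readable from the activations, not that the model uses it; causal claims still need patching.
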
## 🔗 Graph

- Parent: [[Interpretability]] · [[AI Safety]] · [[Mechanistic-Interpretability]]
- Variants: [[Activation-Patching]] · [[Path-Patching]] · [[ACDC]]
- Applications: [[Steering]] · [[Induction-Head]]
- Adjacent: [[Anthropic]] · [[AI_Safety_and_Alignment|AI-Alignment]]

## 🤖 LLM usage

**When**: alignment research, model debugging, capability discovery, steering, trust evaluation.
**When not**: fine-tuning for a specific task, production inference (this is a research domain).

## ❌ Anti-patterns

- **Correlation only**: no causal patching to confirm the claim.
- **Generalizing from a single circuit**: the finding is narrow.
- **No baseline**: a real effect cannot be told apart from a random one.
- **Ignoring polysemanticity**: a single neuron encodes many concepts.
- **No SAE on a large model**: misses superposition.
- **Manual-only analysis at scale**: 100B+ models need automated search such as ACDC.

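For the "no baseline" item, one simple check is to compare a candidate component's patching effect against the effects of randomly chosen components. The sketch below uses a z-score; the numbers are invented for illustration and `effect_z_score` is a hypothetical helper.

```python
import statistics


def effect_z_score(effect, baseline_effects):
    """How many baseline standard deviations the effect sits from the baseline mean."""
    mean = statistics.mean(baseline_effects)
    std = statistics.stdev(baseline_effects)
    return (effect - mean) / std


# Invented numbers: metric change from patching five randomly chosen heads,
# then from patching a candidate head and a head that is probably noise.
random_head_effects = [0.10, -0.10, 0.05, -0.05, 0.00]
candidate = effect_z_score(0.79, random_head_effects)
noise = effect_z_score(0.05, random_head_effects)
print(round(candidate, 1), round(noise, 1))
```

An effect only a fraction of a standard deviation from the baseline mean is not evidence that the component matters.
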
## 🧪 Verification / duplicates

- Verified (Anthropic transformer-circuits.pub, Olsson induction heads, Wang IOI, ACDC paper).
- Trust level A.
- Related: [[AI_Safety_and_Alignment|AI-Alignment]] · [[AI Safety]] · [[Anthropic-Principle]] · [[Sparse-Autoencoder]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: MI techniques + SAE + TransformerLens / ACDC / steering code |