id: wiki-2026-0508-circuit-discovery
title: Circuit Discovery (Mechanistic Interpretability)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: 회로 발견, mechanistic interpretability, MI, induction head, activation patching, path patching, ACDC, sparse autoencoder
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: interpretability, mechanistic, circuits, activation-patching, induction-heads, sparse-autoencoder, anthropic, transformer
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: TransformerLens / nnsight / SAE

Circuit Discovery

📌 One-line insight

"Reverse-engineer the algorithms inside an LLM." Rather than treating the model as a black box, identify the algorithmic role played by networks of specific neurons and attention heads. This is a core capability for modern alignment work at Anthropic and OpenAI. Sparse autoencoders (SAEs) were the 2024 breakthrough.

📖 Core concepts

Mechanistic interpretability (MI)

  • Identifies the specific algorithm a neural network implements.
  • Explains the mechanism, not just a saliency map.
  • Hypothesis-driven.

Core techniques

Activation Patching

  • Swap the activation at a specific position with the one from a different run.
  • Measure how the output changes.
  • Identify the causal effect of that activation.
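
The three steps above can be sketched on a toy model with no dependencies. Everything here (the two hidden units, the weights, the `run` helper) is hypothetical and only stands in for a real transformer.

```python
# Toy "model": two hidden units feed one output. All weights are made up.
W_IN = [[1.0, 0.0], [0.0, 1.0]]   # hidden unit h reads input feature h
W_OUT = [2.0, 0.1]                # unit 0 matters a lot, unit 1 barely

def run(x, patch=None):
    """Forward pass; `patch` maps a hidden index to a value that overwrites it."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W_IN]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden

clean_x, corrupt_x = [1.0, 1.0], [0.0, 0.0]
clean_out, _ = run(clean_x)
_, corrupt_hidden = run(corrupt_x)

# Patch each hidden unit of the clean run with its value from the corrupted run
# and record how far the output moves.
effects = {}
for idx in range(len(W_OUT)):
    patched_out, _ = run(clean_x, patch={idx: corrupt_hidden[idx]})
    effects[idx] = clean_out - patched_out

print(effects)  # unit 0 shows the much larger causal effect
```

A large effect means the output depends causally on that activation; a near-zero effect means the unit is irrelevant to the behavior being probed.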

Path Patching

  • Isolates a single connection (edge) between two components, leaving every other path untouched.
  • Traces layer-to-layer information flow.
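
A minimal sketch of the difference from plain activation patching, assuming a made-up three-node graph in which node `a` reaches the output both directly and through node `b`. The weights and the `forward` helper are illustrative only.

```python
# Toy graph with two routes from node a to the output:
#   direct:  a -> out
#   via b:   a -> b -> out
W_A_TO_B, W_B_TO_OUT, W_A_TO_OUT = 3.0, 1.0, 0.5

def forward(a_for_b, a_for_out):
    """Each downstream reader of `a` can be shown a different value of it."""
    b = W_A_TO_B * a_for_b
    return W_B_TO_OUT * b + W_A_TO_OUT * a_for_out

a_clean, a_corrupt = 1.0, 0.0
clean = forward(a_clean, a_clean)

# Activation patching: every reader of a sees the corrupted value.
all_paths = clean - forward(a_corrupt, a_corrupt)
# Path patching: only one edge sees the corrupted value.
via_b = clean - forward(a_corrupt, a_clean)    # edge a -> b
direct = clean - forward(a_clean, a_corrupt)   # edge a -> out

print(all_paths, via_b, direct)
```

The two per-edge effects sum to the total here only because the toy graph is linear; in a real network the paths interact and need not add up.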

Logit Lens

  • Apply the unembedding to each layer's residual-stream output.
  • Shows how the prediction evolves layer by layer.
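
The idea can be illustrated without a model. The vocabulary, unembedding weights, and per-layer residual vectors below are invented for the example, and the final layer norm that a real logit lens usually applies is skipped.

```python
VOCAB = ["Paris", "London", "the"]
# Hypothetical unembedding matrix: one row of weights per vocabulary token.
W_U = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.8]]

# Hypothetical residual stream at the last position after each layer.
residual_by_layer = [
    [0.6, 0.6],   # after layer 0
    [0.2, 0.9],   # after layer 1
    [1.5, 0.1],   # after layer 2
]

def unembed(resid):
    """Project a residual vector onto every vocabulary token."""
    return [sum(w * r for w, r in zip(row, resid)) for row in W_U]

def logit_lens(residuals):
    """Return (layer, top token) for each intermediate residual state."""
    preds = []
    for layer, resid in enumerate(residuals):
        logits = unembed(resid)
        best = max(range(len(VOCAB)), key=lambda i: logits[i])
        preds.append((layer, VOCAB[best]))
    return preds

print(logit_lens(residual_by_layer))
# [(0, 'the'), (1, 'London'), (2, 'Paris')]
```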

Induction Heads (Olsson et al. 2022)

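  • Copy the pattern [A][B] ... [A] → [B]: on seeing [A] again, predict the [B] that followed it earlier.
  • A basis for in-context learning.
  • Typically reported in early layers (around layers 2-3); at least two layers are needed, because the head composes with an earlier previous-token head.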
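
Written out as an ordinary function, the rule an induction head approximates looks like the sketch below. This is the behavior only, not how attention computes it, and `induction_predict` is a name chosen for illustration.

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule, or None if nothing matches.

    Find the most recent earlier occurrence of the current (last) token
    and copy whatever token followed it.
    """
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["Mr", "Dursley", "was", "proud", "Mr"]))  # Dursley
print(induction_predict(["a", "b", "c"]))                          # None
```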

IOI (Indirect Object Identification)

  • "When Mary and John went to the store, John gave a drink to ___" → Mary.
  • A 26-head circuit was mapped in GPT-2 small (Wang et al. 2022).
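
IOI behavior is commonly scored with a logit difference between the indirect object and the repeated subject. The sketch below builds the prompt from the example above and computes that difference; the helper names and the logit values are made up.

```python
TEMPLATE = "When {a} and {b} went to the store, {s} gave a drink to"

def make_ioi_example(io_name, subject_name):
    """Return the prompt plus the correct (IO) and incorrect (subject) answers."""
    prompt = TEMPLATE.format(a=io_name, b=subject_name, s=subject_name)
    return {"prompt": prompt, "correct": io_name, "incorrect": subject_name}

def logit_diff(logits, example):
    """logits: mapping from candidate next token to its logit."""
    return logits[example["correct"]] - logits[example["incorrect"]]

example = make_ioi_example("Mary", "John")
fake_logits = {"Mary": 4.0, "John": 1.5}   # invented values, not model output

print(example["prompt"])
print(logit_diff(fake_logits, example))    # 2.5
```

A positive difference means the model prefers the indirect object; patching experiments then track how this number moves.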

Automated Circuit Discovery (ACDC, Conmy et al. 2023)

  • Algorithmic search over the model's computational graph.
  • Prunes edges whose removal has an insignificant effect.

Sparse Autoencoder (SAE, 2023-2024)

  • Decomposes activations into monosemantic features.
  • Splits polysemantic neurons into separate features.
  • Applied at large scale by Anthropic to Claude 3 Sonnet.

Anthropic milestones

  • "A Mathematical Framework for Transformer Circuits" (2021).
  • "In-context Learning and Induction Heads" (2022).
  • "Toy Models of Superposition" (2022).
  • "Towards Monosemanticity" (2023).
  • "Scaling Monosemanticity" (2024): up to roughly 34M features on Claude 3 Sonnet.

Applications

  1. Alignment: detect deceptive circuits.
  2. Bias: identify the specific features responsible.
  3. Steering: amplify or suppress a feature.
  4. Debugging: find the cause of a hallucination.
  5. Trust: capability evaluation.
  6. Capability discovery: surface emergent capabilities.

Limitations

  • Scaling: expensive on large models.
  • Polysemanticity: a single neuron responds to many concepts.
  • Superposition: more features than dimensions are squeezed into a lower-dimensional activation space.
  • Generalization: does a circuit found on a single task explain wider behavior?
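
Superposition can be seen in a toy setting: three features stored in a two-dimensional activation space. The geometry (three directions 120° apart) and the `encode`/`decode` helpers are assumptions made for illustration.

```python
import math

# Three feature directions squeezed into a 2-D activation space, 120° apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(features):
    """Sum each feature's direction, weighted by the feature's value."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)

def decode(activation):
    """Read each feature back with a dot product, clipping negatives (ReLU)."""
    return [
        max(0.0, activation[0] * d[0] + activation[1] * d[1])
        for d in DIRECTIONS
    ]

sparse = decode(encode([1.0, 0.0, 0.0]))  # one feature active
dense = decode(encode([1.0, 1.0, 0.0]))   # two features active at once

print([round(v, 3) for v in sparse])  # [1.0, 0.0, 0.0]
print([round(v, 3) for v in dense])   # [0.5, 0.5, 0.0]
```

With one active feature the readout is exact; with two, each reads back at half strength because the directions interfere. This is why superposition works only when features are sparse, and why individual dimensions end up looking polysemantic.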

Modern tools

  • TransformerLens (Neel Nanda).
  • nnsight (NDIF, Bau lab).
  • SAELens: SAE training.
  • Neuronpedia: feature catalog.
  • pyvene: interventions.
  • Inseq: attribution.

💻 Patterns

TransformerLens (basic)

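import torch
import transformer_lens

model = transformer_lens.HookedTransformer.from_pretrained('gpt2-small')

# Hook function: called with the activation at the hook point during the forward pass.
# Returning None leaves the activation unchanged.
def hook_print(activation, hook):
    print(f'{hook.name}: shape {activation.shape}')

text = 'When Mary and John went to the store, John gave a drink to'

# Forward pass with the hook attached to layer 5's per-head attention output
model.run_with_hooks(text, fwd_hooks=[('blocks.5.attn.hook_z', hook_print)])

# Forward pass that caches every intermediate activation
logits, cache = model.run_with_cache(text, return_type='logits')
print(cache['blocks.5.attn.hook_z'].shape)  # per-head outputs: [batch, pos, head, d_head]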

Activation Patching

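def activation_patching(model, clean_input, corrupt_input, position, layer, head):
    """Patch one head's corrupted activation into the clean run."""
    # 1. Cache activations from the corrupted run
    _, corrupt_cache = model.run_with_cache(corrupt_input)

    # 2. Overwrite that head's output at `position` during the clean run
    def patch_hook(activation, hook):
        # hook_z has shape [batch, pos, head, d_head]
        activation[:, position, head] = corrupt_cache[hook.name][:, position, head]
        return activation

    patched_logits = model.run_with_hooks(
        clean_input,
        fwd_hooks=[(f'blocks.{layer}.attn.hook_z', patch_hook)],
    )
    return patched_logits

# Illustrative inputs: the two prompts must tokenize to the same length
clean_input = 'When Mary and John went to the store, John gave a drink to'
corrupt_input = 'When Mary and John went to the store, Mary gave a drink to'
target_token = model.to_single_token(' Mary')

# Measure the effect on the target token's logit at the final position
clean_logits = model(clean_input)
patched = activation_patching(model, clean_input, corrupt_input, position=10, layer=5, head=3)
effect = (clean_logits - patched)[0, -1, target_token]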
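
The raw difference above depends on the scale of the logits, which makes heads hard to compare. One common convention is to normalize by the gap between the clean and corrupted runs; the helper below is a sketch of that convention, with `normalized_effect` as a hypothetical name.

```python
def normalized_effect(clean_metric, corrupt_metric, patched_metric):
    """Fraction of the clean-to-corrupt gap that the patch accounts for.

    0.0 means the patch changed nothing; 1.0 means the patched run
    behaves like the fully corrupted run.
    """
    gap = clean_metric - corrupt_metric
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same metric")
    return (clean_metric - patched_metric) / gap

# Invented metric values for illustration
print(normalized_effect(4.0, 0.0, 3.0))  # 0.25
```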

Induction head detection

def detect_induction_head(model, seq_len=50):
    """Score every head for the [A][B]...[A] → [B] attention pattern."""
    # Random token sequence followed by an exact repeat of itself
    rep = torch.randint(0, model.cfg.d_vocab, (1, seq_len))
    rep = torch.cat([rep, rep], dim=1)

    _, cache = model.run_with_cache(rep)

    induction_scores = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Attention pattern for this head: [query_pos, key_pos]
            attn = cache[f'blocks.{layer}.attn.hook_pattern'][0, head]
            # An induction head at query position q attends to key q - (seq_len - 1),
            # the token right after the previous occurrence of the current token
            score = attn.diagonal(offset=-(seq_len - 1)).mean()
            induction_scores[(layer, head)] = score.item()

    # Top 5 heads by induction score
    return sorted(induction_scores.items(), key=lambda x: -x[1])[:5]

Logit Lens

def logit_lens(model, text):
    _, cache = model.run_with_cache(text)
    
    layer_predictions = []
    for layer in range(model.cfg.n_layers):
        residual = cache[f'blocks.{layer}.hook_resid_post']
        # Apply the final layer norm, then unembed into vocabulary logits
        logits = model.unembed(model.ln_final(residual))
        top_token = logits[0, -1].argmax().item()
        layer_predictions.append((layer, model.tokenizer.decode([top_token])))
    
    return layer_predictions

ACDC (Automated Circuit Discovery)

# Simplified ACDC-style pruning. list_all_edges, ablated_edge, and evaluate are
# placeholder helpers that are not defined here.
def acdc_prune(model, task_data, threshold=0.01):
    """Keep only edges whose ablation raises the loss by more than threshold."""
    edges = list_all_edges(model)
    kept = []

    baseline_loss = evaluate(model, task_data)

    for edge in edges:
        # Zero-ablate this edge and re-evaluate
        with ablated_edge(model, edge):
            ablated_loss = evaluate(model, task_data)

        effect = ablated_loss - baseline_loss
        if effect > threshold:
            kept.append((edge, effect))

    # Most important edges first
    return sorted(kept, key=lambda x: -x[1])
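
Because the helpers above are placeholders, here is a runnable toy version of the same pruning idea. It assumes a made-up linear "model" in which each input weight is one edge, and it uses a greedy, cumulative rule (once removed, an edge stays removed). It is a sketch of threshold-based edge pruning, not the published ACDC algorithm.

```python
def output(weights, x, removed):
    """Linear toy model; edges listed in `removed` contribute nothing."""
    return sum(w * xi for i, (w, xi) in enumerate(zip(weights, x))
               if i not in removed)

def divergence(weights, data, removed):
    """Mean squared gap between the full model and the pruned model."""
    return sum(
        (output(weights, x, set()) - output(weights, x, removed)) ** 2
        for x in data
    ) / len(data)

def prune_edges(weights, data, threshold):
    """Greedily drop edges whose removal adds less than `threshold` divergence."""
    removed = set()
    for edge in range(len(weights)):
        before = divergence(weights, data, removed)
        after = divergence(weights, data, removed | {edge})
        if after - before < threshold:
            removed.add(edge)
    return [e for e in range(len(weights)) if e not in removed]

weights = [2.0, 0.01, -1.5, 0.02]                  # made-up edge weights
data = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]  # made-up task inputs

print(prune_edges(weights, data, threshold=0.01))  # [0, 2]
```

Only the two large-weight edges survive, which is the "circuit" for this toy task. The threshold trades circuit size against faithfulness to the full model.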

Sparse Autoencoder (SAE, 2024)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SAE(nn.Module):
    """Dictionary learning on transformer activations."""
    def __init__(self, d_in=4096, d_dict=131072, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_dict)
        self.decoder = nn.Linear(d_dict, d_in, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        features = F.relu(self.encoder(x))
        x_hat = self.decoder(features)

        recon_loss = (x - x_hat).pow(2).mean()
        l1_loss = features.abs().mean()  # sparsity penalty

        return x_hat, features, recon_loss + self.l1_coeff * l1_loss

# Train on cached model activations; d_in must match the model's residual width.
# activation_loader is a placeholder for an iterable of [batch, d_in] tensors,
# and the learning rate is an illustrative choice.
sae = SAE(d_in=4096, d_dict=4096 * 32)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
for batch in activation_loader:
    x_hat, features, loss = sae(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Then inspect the learned features for monosemanticity.

Steering with SAE feature

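def steer(model, sae, text, feature_idx, scale=5.0, layer=16):
    """Amplify one SAE feature in the residual stream."""
    def hook(activation, hook):
        # SAE encode → modify → decode
        features = F.relu(sae.encoder(activation))
        # Keep what the SAE fails to reconstruct so only the chosen feature changes
        error = activation - sae.decoder(features)
        features[:, :, feature_idx] += scale
        return sae.decoder(features) + error

    # `layer` must exist in the model (gpt2-small only has layers 0-11)
    return model.run_with_hooks(text, fwd_hooks=[(f'blocks.{layer}.hook_resid_post', hook)])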

Feature visualization (Neuronpedia-style)

def find_max_activating(model, sae, feature_idx, dataset, top_k=10, layer=16):
    """Return the top_k inputs that most strongly activate a feature."""
    activations = []
    for text in dataset:
        with torch.no_grad():
            _, cache = model.run_with_cache(text)
            f = F.relu(sae.encoder(cache[f'blocks.{layer}.hook_resid_post']))
            max_act = f[0, :, feature_idx].max().item()
            activations.append((text, max_act))

    return sorted(activations, key=lambda x: -x[1])[:top_k]

🤔 Decision criteria

| Purpose | Method |
|---|---|
| Behavior cause | Activation patching |
| Information flow | Path patching |
| Layer evolution | Logit lens |
| In-context learning | Induction head analysis |
| Feature discovery | SAE |
| Algorithmic search | ACDC |
| Steering | SAE feature manipulation |
| Capability eval | Probe |

Default: start with TransformerLens + activation patching as the baseline. For large models, use SAEs.

🔗 Graph

🤖 LLM usage

When to use: alignment research, model debugging, capability discovery, steering, trust evaluation.
When not to use: fine-tuning for a specific task, production inference (this is a research-domain technique).

Anti-patterns

  • Correlation only: no causal patching to back the claim.
  • Generalizing from a single circuit: the finding is narrow.
  • No baseline: the effect cannot be told apart from that of a random intervention.
  • Ignoring polysemanticity: a single neuron responds to many concepts.
  • No SAE on a large model: features in superposition are missed.
  • Manual analysis only at scale: 100B+ models need automated search such as ACDC.

🧪 Verification / Duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: MI techniques, SAE, and TransformerLens / ACDC / steering code |