---
id: wiki-2026-0508-mechanistic-interpretability-기계적
title: Mechanistic Interpretability (기계적 해석 가능성)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Mech Interp, Circuit Analysis, MI, 기계적 해석성]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai, interpretability, alignment, anthropic, transformer, safety]
raw_sources: [Anthropic Transformer Circuits, Towards Monosemanticity, Scaling Monosemanticity]
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: transformer-lens-sae-lens }
---

# Mechanistic Interpretability (기계적 해석 가능성)

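## One-line summary
> **"Read the neurons as circuits."** The field that reverse-engineers a model's internals into explicit algorithms (circuits/features) instead of treating them as black-box statistics. Anthropic's SAE and circuit research is a main pillar of the field.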

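## Core
### Core concepts
- **Circuit**: a sub-graph of attention heads and MLP neurons that implements a specific behavior.
- **Feature**: a unit of meaning in activation space (a single direction vector).
- **Polysemanticity**: one neuron encodes several unrelated concepts; usually explained by superposition (more features than dimensions).
- **SAE (Sparse Autoencoder)**: untangles superposition to extract (more) monosemantic features.
- **Probing / Logit Lens / Activation Patching**: diagnostic tools.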
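
To make the Feature / Polysemanticity / superposition vocabulary concrete, here is a minimal stdlib-only sketch (not taken from any of the cited papers). It packs more random feature directions than there are dimensions, then checks how much the directions overlap and how many of them load onto a single coordinate axis ("neuron 0"). The sizes `D_MODEL` and `N_FEATURES`, the 0.2 threshold, and the seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)
D_MODEL, N_FEATURES = 8, 32  # hypothetical sizes: more features than dimensions

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Each feature is a unit direction in activation space.
features = [unit([random.gauss(0, 1) for _ in range(D_MODEL)]) for _ in range(N_FEATURES)]

# Interference: overlap between distinct feature directions (0 would mean orthogonal).
max_interference = max(
    abs(dot(features[i], features[j]))
    for i in range(N_FEATURES)
    for j in range(i + 1, N_FEATURES)
)

# Polysemanticity: a single coordinate axis ("neuron 0") carries weight from many features.
features_on_neuron0 = sum(1 for f in features if abs(f[0]) > 0.2)

print(f"max interference between features: {max_interference:.2f}")
print(f"features with |weight| > 0.2 on neuron 0: {features_on_neuron0} of {N_FEATURES}")
```

Because 32 directions cannot all be orthogonal in 8 dimensions, some interference is unavoidable; reading a single neuron therefore mixes several features, which is the situation SAEs try to undo.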

### Key findings
1. **Induction heads** (2022) - a circuit implicated in in-context learning.
2. **IOI circuit** (2022) - the Indirect Object Identification circuit in GPT-2 small.
3. **Toy Models of Superposition** (2022) - why features get compressed into fewer dimensions.
4. **Towards Monosemanticity** (2023) - showed that SAEs can extract features.
5. **Scaling Monosemanticity** (2024) - applied SAEs to Claude 3 Sonnet; found the "Golden Gate Bridge feature", among others.
6. **Circuit Tracing / Attribution Graphs** (2025) - causal tracing between features.
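
The behavior attributed to induction heads can be caricatured in a few lines: on seeing token A again, look back for the previous A and predict whatever followed it (`[A][B] ... [A] -> [B]`). The sketch below is a plain-Python illustration of that rule only, not of the attention mechanics; the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token; None if there is none."""
    current = tokens[-1]
    # Scan backwards over earlier positions (the last position is excluded).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"]))  # Dursley
print(induction_predict(["a", "b", "c"]))  # None
```

In the published account, a transformer implements this with two composed attention heads (a previous-token head feeding a later induction head) rather than an explicit loop.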

## 💻 Patterns

### Pattern 1 — TransformerLens (circuit analysis)
```python
import transformer_lens as tl
model = tl.HookedTransformer.from_pretrained('gpt2-small')
logits, cache = model.run_with_cache("The capital of France is")
# Inspect attention patterns: cache['blocks.5.attn.hook_pattern'] has shape [batch, head, query_pos, key_pos]
```

### Pattern 2 — Activation Patching
```python
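from functools import partial

def patch_hook(act, hook, pos, clean_cache):
    # Overwrite the corrupted run's activation at `pos` with the clean run's value.
    act[:, pos] = clean_cache[hook.name][:, pos]
    return act

# Assumes `model`, `clean_cache` (run_with_cache on the clean prompt), `corrupted` tokens of the
# same length as the clean prompt, a hook `name`, and a token index `pos` are already defined.
hook_fn = partial(patch_hook, pos=pos, clean_cache=clean_cache)
patched = model.run_with_hooks(corrupted, fwd_hooks=[(name, hook_fn)])
# Sweep over positions/layers to measure causally which ones make the difference.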
```
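
The causal logic of patching can be seen without a real model. The following is a toy sketch under stated assumptions: `run` is a hypothetical two-activation "model", and the site names `a` and `b` are made up. It runs a clean and a corrupted input, copies one clean activation at a time into the corrupted run, and reports the fraction of the clean output that is recovered.

```python
def run(x, patch=None):
    """Tiny hypothetical 'model' with two named intermediate activations.
    `patch` maps an activation name to a value that overrides it."""
    patch = patch or {}
    acts = {}
    acts["a"] = patch.get("a", 2.0 * x["subject"])
    acts["b"] = patch.get("b", x["distractor"] + 1.0)
    acts["out"] = 3.0 * acts["a"]  # the output reads only from "a"
    return acts

clean = run({"subject": 1.0, "distractor": 2.0})
corrupted_input = {"subject": 0.0, "distractor": 7.0}
corrupted = run(corrupted_input)

def recovery(site):
    """Fraction of the clean-vs-corrupted output gap restored by patching `site`."""
    patched = run(corrupted_input, patch={site: clean[site]})
    return (patched["out"] - corrupted["out"]) / (clean["out"] - corrupted["out"])

results = {site: recovery(site) for site in ("a", "b")}
print(results)  # {'a': 1.0, 'b': 0.0}
```

Site `b` differs between the two runs but recovers nothing, which is the point of patching: it separates activations that merely vary with the input from those the output causally depends on.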

### Pattern 3 — Logit Lens
```python
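# Uses `model` and `cache` from Pattern 1: decode each layer's residual stream
# with the final LayerNorm and the unembedding, as if it were the last layer.
for layer in range(model.cfg.n_layers):
    resid = cache[f'blocks.{layer}.hook_resid_post']
    layer_logits = model.unembed(model.ln_final(resid))  # [batch, pos, d_vocab]
    print(layer, model.to_str_tokens(layer_logits.argmax(-1)[0, -1]))  # top token at the last position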
```
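
The idea behind the logit lens fits in a toy example: take the residual stream after each layer and push it through the same unembedding used at the output, then see which token wins. The sketch below uses made-up numbers (a 2-dimensional residual stream, a 3-token vocabulary, a hypothetical `W_U`) and skips the final LayerNorm for brevity.

```python
VOCAB = ["Paris", "London", "the"]

# Hypothetical unembedding matrix W_U with shape [d_model=2][vocab=3].
W_U = [[2.0, 0.0, 0.5],
       [0.0, 1.0, 1.0]]

# Hypothetical residual stream at the final position after each of three layers.
resid_by_layer = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]

def logit_lens(resid):
    """Project a residual vector through W_U and return the top token."""
    logits = [
        sum(resid[d] * W_U[d][v] for d in range(len(resid)))
        for v in range(len(VOCAB))
    ]
    best = max(range(len(VOCAB)), key=lambda v: logits[v])
    return VOCAB[best]

top_by_layer = [logit_lens(r) for r in resid_by_layer]
print(top_by_layer)  # ['the', 'Paris', 'Paris']
```

Reading the list from left to right shows at which depth the eventual answer starts to dominate the prediction.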

### Pattern 4 — Sparse Autoencoder
```python
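import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)              # encoder: activations -> feature space (typically d_sae > d_model)
        self.W_dec = nn.Linear(d_sae, d_model, bias=False)  # decoder: each weight column is one feature direction
    def forward(self, x):
        f = torch.relu(self.W_enc(x))  # non-negative feature activations; the L1 term pushes them to be sparse
        return self.W_dec(f), f        # (reconstruction, feature activations)
# Loss = reconstruction error + λ·||f||_1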
```
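
The loss in the comment above can be spelled out numerically. This is a minimal sketch in plain Python so it runs without torch; the function name `sae_loss` and the example vectors are hypothetical, and details such as mean versus sum or decoder-norm weighting differ between implementations.

```python
def sae_loss(x, x_hat, f, lam):
    """Mean squared reconstruction error plus an L1 sparsity penalty on f."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in f)
    return recon + lam * sparsity

x = [1.0, 0.0]
x_hat = [1.0, 0.0]  # assume both codes below reconstruct x perfectly

sparse_loss = sae_loss(x, x_hat, f=[1.0, 0.0, 0.0, 0.0], lam=0.1)
dense_loss = sae_loss(x, x_hat, f=[0.5, 0.5, 0.5, 0.5], lam=0.1)

print(sparse_loss, dense_loss)
```

With equal reconstruction quality, the code that uses fewer and smaller activations gets the lower loss, which is what drives features toward sparsity. Raising `lam` trades reconstruction accuracy for sparser codes.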

### Pattern 5 — Feature Attribution (sae-lens)
```python
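from sae_lens import SAE
# Depending on the sae-lens version, from_pretrained returns either the SAE alone
# or a (sae, cfg_dict, sparsity) tuple; handle both.
loaded = SAE.from_pretrained('gpt2-small-res-jb', 'blocks.8.hook_resid_pre')
sae = loaded[0] if isinstance(loaded, tuple) else loaded
activations = cache['blocks.8.hook_resid_pre']  # from Pattern 1; keep SAE and activations on the same device
features = sae.encode(activations)              # [batch, pos, d_sae]
top_features = features.topk(10, dim=-1)        # .values and .indices of the 10 strongest features per token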
```

### Pattern 6 — Causal Steering
```python
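from functools import partial
# Golden Gate Claude style: force a feature on by adding its decoder direction.
def steer(act, hook, feature_idx, scale):
    # In sae-lens, sae.W_dec is [d_sae, d_model], so row `feature_idx` is that feature's direction.
    return act + scale * sae.W_dec[feature_idx].detach()
# Bind the extra arguments before registering the hook (feature_idx and scale must be chosen by the user):
# hook_fn = partial(steer, feature_idx=feature_idx, scale=scale)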
```
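
What steering does to an activation can be checked by hand. In this toy sketch (plain Python, with a made-up 2-dimensional unit `direction` standing in for an SAE decoder row), adding `scale * direction` raises the activation's projection onto that direction by exactly `scale`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(act, direction, scale):
    """Return a new activation shifted along `direction` by `scale`."""
    return [a + scale * d for a, d in zip(act, direction)]

direction = [0.6, 0.8]  # hypothetical unit-norm feature direction
act = [1.0, -1.0]       # hypothetical residual-stream activation

before = dot(act, direction)
steered = steer(act, direction, scale=4.0)
after = dot(steered, direction)

print(f"projection before: {before:.2f}, after: {after:.2f}")
```

In a real model the downstream layers then read this shifted activation, so the appropriate `scale` is an empirical choice: too small has no visible effect, too large degrades the output.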

## Decision criteria
| Goal | Tool |
|---|---|
| Circuit discovery in small models | TransformerLens + activation patching |
| Feature extraction (large models) | SAE (sae-lens, dictionary_learning) |
| Verifying that a behavior is causal | Activation patching, ablation |
| Relationships between features | Attribution graphs / circuit tracing |
| Safety / alignment | Steering vectors, refusal feature |
| Production deployment | Too early; still at the research stage |

**Default**: as of 2026, SAE + circuit tracing is the main paradigm.

## 🔗 Graph
- Parent: [[AI-Interpretability]], [[AI-Alignment]]
- Variants: [[Sparse-Autoencoder]], [[Circuit-Analysis]], [[Activation-Patching]]
- Applications: [[AI-Safety]], [[Model-Debugging]], [[Refusal-Steering]]
- Adjacent: [[Transformer-Architecture]], [[Probing]], [[Feature-Visualization]], [[Superposition]], [[Anthropic-Research]]

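## 🤖 LLM usage
**When to use**:
- Summarizing papers (Anthropic's transformer-circuits.pub).
- Writing TransformerLens / sae-lens code.
- Generating hypotheses (which circuit produces behavior X?).

**When not to use**:
- Claiming new mech interp findings (experiments are required).
- Asserting what a specific feature ID means (it differs per model).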

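## ❌ Anti-patterns
- Assuming single neuron = single concept (ignores superposition).
- Treating probing accuracy as proof that a circuit exists (probing is correlational, not causal).
- Drawing conclusions from attention visualizations alone (MLPs often play the larger role).
- Assuming SAE features are ground truth (an interpretation is a hypothesis).
- Uncritically extrapolating toy-model conclusions to frontier models.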

## 🧪 Verification / duplicates
- Verified. Based on Anthropic's 2024-2025 SAE results. Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |