---
id: CIRCUIT-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai-interpretability, mechanistic-interpretability, neural-networks, circuits]
last_reinforced: 2026-04-26
---

# [[Circuit Discovery (회로 발견)|Circuit Discovery]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Draw a map of the small algorithms that perform concrete functions inside a giant model."

Circuit discovery is a core technique of mechanistic interpretability: it identifies how specific neurons and attention heads inside a neural network connect to carry out a logical function.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** Instead of treating the whole model as a black box, identify the "circuit": the minimal set of weights and paths that are active when the model performs a specific task (e.g., indirect object identification).
- **Details:**
  - **Activation Patching:** Replace the activation of a specific component (such as a neuron) with the value it takes on a different input, then measure the causal effect on the output.
  - **Path Patching:** Trace specific connection paths between layers to map how information flows through the model.
  - **Induction Heads:** The discovery of a particular attention-head structure specialized in copying earlier patterns and making use of context.
  - **Automated Circuit Discovery (ACD):** Algorithmically search a very large parameter space for the connections that matter.

## ⚠️ Contradictions & RL Update

- **Conflict with earlier data:** The field has moved beyond simple visualization (saliency maps) to a more rigorous stage: finding algorithms inside the model that can be defined mathematically.
- **Policy shift:** Circuit discovery is increasingly used for model safety verification (alignment), as a tool for detecting whether potentially harmful logic circuits have formed.

## 🔗 Knowledge Links (Graph)

- **Parent:** 10_Wiki/💡 Topics/AI
- **Related:** Mechanistic-Interpretability, Neuron-Attribution, Feature-Visualization
- **Raw Source:** 00_Raw/2026-04-20/Circuit Discovery.md
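
## 🧪 Worked Sketch (Activation Patching)

The Activation Patching entry in this note can be made concrete with a toy example. What follows is a minimal sketch, not a reference implementation: the two-layer network, its hand-picked weights `W1` and `W2`, and the helper names `forward` and `activation_patching` are all hypothetical and chosen only so the arithmetic is easy to check by hand. The idea it illustrates is the one described in the note: run the model on a "clean" input and cache its hidden activations, run it on a "corrupted" input, then copy one clean activation at a time into the corrupted run and record how much the output moves.

```python
# Toy activation patching on a hand-built 3-3-1 ReLU network (pure Python).
# All weights and names here are illustrative assumptions, not a real model.

W1 = [            # input -> hidden weights (one row per hidden unit)
    [1.0, -1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
W2 = [2.0, 0.0, 1.0]  # hidden -> scalar output weights


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(x, patch=None):
    """Run the network; `patch` maps hidden-unit index -> activation to force."""
    hidden = relu(matvec(W1, x))
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


def activation_patching(clean_x, corrupted_x):
    """Patch each clean hidden activation into the corrupted run, one at a time."""
    clean_out, clean_hidden = forward(clean_x)
    corrupted_out, _ = forward(corrupted_x)
    effects = {}
    for index, clean_value in enumerate(clean_hidden):
        patched_out, _ = forward(corrupted_x, patch={index: clean_value})
        # Causal effect of this unit: change in output relative to the corrupted run.
        effects[index] = patched_out - corrupted_out
    return clean_out, corrupted_out, effects


clean_out, corrupted_out, effects = activation_patching(
    clean_x=[2.0, 1.0, 1.0],
    corrupted_x=[0.0, 1.0, 0.0],
)
print(clean_out, corrupted_out, effects)
```

In this toy setup, units 0 and 2 carry the entire difference between the clean and corrupted outputs, while unit 1 has no effect because its output weight is zero. In circuit terms, units 0 and 2 would be kept as part of the circuit and unit 1 pruned. In a real transformer the same loop would typically run over attention heads or residual-stream positions through a hooking mechanism rather than over a hand-edited list.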