---
id: P-REINFORCE-AI-CIRCUIT-DISCOVERY
category: "10_Wiki/💡 Topics/AI"
confidence_score: 0.92
tags: [Interpretability, MechanisticInterpretability, NeuralNetworks]
last_reinforced: 2026-04-20
---

# [[Circuit Discovery (회로 발견)|Circuit Discovery]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Archaeology that digs up the 'small parts' performing specific functions inside a giant neural network."

Circuit discovery is the process of working out how the neurons and weights inside a deep learning model combine to implement a specific algorithm (e.g., indirect object identification).

## 📖 Structured Knowledge (Synthesized Content)

- **Methodology**:
  - **Ablation**: Disable a specific neuron, attention head, or layer and observe how performance changes, to measure how important that component is.
  - **Activation Patching**: Take an intermediate activation computed on one input and inject it into the forward pass of a different input, to trace which components carry the relevant information.
- **Found Components**:
  - **Induction Heads**: A small circuit that recalls an earlier pattern in the context and repeats it. Central to in-context learning.
  - **Indirect Object Identification (IOI) Circuit**: A group of 20-odd attention heads in GPT-2 small that picks out the indirect object of a sentence.
- **Significance**: Turns a black-box AI model into an interpretable system, which supports safety and controllability.

## ⚠️ Contradictions and Updates (RL Update)

- Circuit discovery has so far succeeded mainly on small models (GPT-2 and similar). In large models with hundreds of billions of parameters, superposition and the sheer complexity of the circuits make manual analysis impractical, which is why Automated Circuit Discovery techniques are under active research.

## 🔗 Knowledge Links (Graph)

- Related: [[Mechanistic Interpretability (기계적 해석 가능성)|Mechanistic Interpretability]], Monosemanticity
- Concepts: Superposition
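
The Ablation and Activation Patching entries under Methodology can be illustrated with a minimal sketch. It assumes a hypothetical toy network (3 inputs, 3 ReLU hidden units, 1 output) with hand-picked weights; `W1`, `W2`, `forward`, and the two inputs are all illustrative names, not taken from any real model or library.

```python
# Hypothetical toy network: 3 inputs -> 3 ReLU hidden units -> 1 output.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # each unit copies one input
W2 = [2.0, 0.0, 0.5]                                       # unit 0 dominates the output


def forward(x, patch=None):
    """Return (output, hidden). `patch` maps a hidden index to a forced value."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    for i, v in (patch or {}).items():
        h[i] = v  # overwrite the activation before it reaches the output layer
    out = sum(w * hi for w, hi in zip(W2, h))
    return out, h


x_clean = [1.0, 1.0, 1.0]
x_corrupt = [0.0, 1.0, 1.0]  # differs from the clean input only in input 0

out_clean, h_clean = forward(x_clean)
out_corrupt, _ = forward(x_corrupt)

# Ablation: zero one hidden unit on the clean input and measure the output drop.
ablation_effects = [out_clean - forward(x_clean, {i: 0.0})[0] for i in range(3)]

# Activation patching: copy one clean activation into the corrupted run and
# measure what fraction of the clean-vs-corrupt output gap it restores.
gap = out_clean - out_corrupt
patching_restored = [
    (forward(x_corrupt, {i: h_clean[i]})[0] - out_corrupt) / gap
    for i in range(3)
]

print("ablation effects: ", ablation_effects)
print("patching restored:", patching_restored)
```

In this toy setup, ablation reports that units 0 and 2 both contribute to the output, while patching attributes the entire clean-versus-corrupted difference to unit 0. The two methods answer different questions. Work on real transformers applies the same idea to attention heads and residual-stream positions, typically through framework hooks rather than a hand-written `forward`.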