---
id: P-REINFORCE-AI-CIRCUIT
category: "[[10_Wiki/💡 Topics/AI]]"
confidence_score: 0.98
tags: [Interpretability, Neural Networks, Circuit Discovery, Mechanistic Interpretability]
last_reinforced: 2026-04-20
---

# [[Circuit-Discovery]] (Circuit Discovery)

## 📌 One-Line Insight (The Karpathy Summary)

> "Artificial neural networks are not black boxes." Circuit discovery is the demanding craft of locating, among the hundreds of millions of parameters inside a network, the distinct "neural circuits" that carry out a specific piece of logic (e.g., addition, grammar parsing) and analyzing them the way a geologist surveys strata.

## 📖 Structured Knowledge (Synthesized Content)

- **Mechanistic Interpretability**:
  - Perturbs the model's inputs a little at a time and analyzes how specific neurons activate (for example with Activation Patching, which swaps activations between a clean run and a perturbed run) in order to reverse-engineer the algorithm hidden in the weights.
- **Induction Heads**:
  - Attention heads that recall a previously seen pattern and activate when it repeats. They have been identified as one of the key drivers of an LLM's ability to use context.
- **Reverse Engineering**:
  - The process of "reading" a trained model to explain, in human language, which mathematical strategy it uses to solve a problem.

## ⚠️ Contradictions and Updates (RL Update)

- As models grow larger (Llama-3, GPT-4), their circuits become so complex that analyzing them one by one is close to impossible. Recent work on automated interpretability has a separate "small AI" analyze the circuits of the large AI.

## 🔗 Knowledge Links (Graph)

- Related: [[Automated-Reasoning]], [[Complexity-Theory]]
- Foundation: [[Information Theory]]
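
## 🧪 Sketch (Activation Patching)

The Activation Patching idea mentioned under Mechanistic Interpretability can be illustrated on a toy model. The sketch below is only a minimal illustration under stated assumptions: the two-layer ReLU network, its hand-set weights, and the helper names `forward` and `patching_effects` are all hypothetical and do not come from any real interpretability library. It runs a "clean" and a "corrupted" input, copies one hidden activation at a time from the clean run into the corrupted run, and reports how much of the output gap each unit restores.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Run a tiny 2-layer ReLU network (hypothetical toy model).

    patch: optional dict {hidden_index: value} that overwrites hidden
    activations before the output layer (the "patching" step).
    Returns (scalar_output, hidden_activations).
    """
    h = np.maximum(0.0, W1 @ x)
    if patch:
        h = h.copy()
        for idx, value in patch.items():
            h[idx] = value
    return float(W2 @ h), h

def patching_effects(x_clean, x_corrupt, W1, W2):
    """For each hidden unit, copy its clean activation into the corrupted
    run and return the fraction of the clean-vs-corrupt gap it restores."""
    y_clean, h_clean = forward(x_clean, W1, W2)
    y_corrupt, _ = forward(x_corrupt, W1, W2)
    gap = y_clean - y_corrupt
    if gap == 0:
        raise ValueError("clean and corrupted outputs are identical")
    effects = []
    for i in range(len(h_clean)):
        y_patched, _ = forward(x_corrupt, W1, W2, patch={i: h_clean[i]})
        effects.append((y_patched - y_corrupt) / gap)
    return np.array(effects)

# Hand-set weights (assumption for illustration): each hidden unit reads
# one input feature, and unit 0 has the largest weight on the output.
W1 = np.eye(3)
W2 = np.array([2.0, 0.0, 0.5])

x_clean = np.array([1.0, 1.0, 1.0])    # input where the behavior appears
x_corrupt = np.array([0.0, 1.0, 0.0])  # input where it is disrupted

effects = patching_effects(x_clean, x_corrupt, W1, W2)
print(effects)
```

In this toy setup, patching hidden unit 0 restores 80% of the output gap, unit 2 restores 20%, and unit 1 restores nothing, so unit 0 would be the first candidate for the "circuit". Real models need the same comparison repeated across layers, heads, and token positions, which is part of why the manual approach scales poorly.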