---
id: CLAMP-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai-interpretability, mechanistic-interpretability, steering, neural-networks]
last_reinforced: 2026-04-26
---

# Feature Clamping

## 📌 One-Line Insight (The Karpathy Summary)

> "Steer the output by forcibly pinning a specific concept inside the model." A technique that controls a model's behavior or style by artificially fixing (clamping) specific activation values inside a neural network.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** A "steering" pattern. First, find the set of internal neurons the model uses to process a particular concept (e.g., "politeness" or "German"). Then clamp their activation to its maximum so that the property is forced to appear in every output, regardless of the prompt.
- **Details:**
  - **Activation Extraction:** Identify the key vector direction that becomes active during a specific task.
  - **Constant Injection:** During inference, replace a given layer's activations with a predefined fixed value instead of the computed one.
  - **Model Steering:** Adjust the model's tone, topic, language, and so on at inference time, without any fine-tuning.
  - **Ablation Study:** Conversely, clamping a value to 0 shows what role that feature plays in the model.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with earlier data:** This marks a shift away from indirectly nudging the model through prompts. Instead, it directly controls the model's "brain" (its activation layers), a lower-level intervention on internal state rather than on the input.
- **Policy change:** Researchers are exploring its use as a safety guardrail that suppresses specific "negative features" (negative clamping) to reduce bias or harmful behavior in the model.

## 🔗 Knowledge Links (Graph)

- **Parent:** 10_Wiki/💡 Topics/AI
- **Related:** Mechanistic-Interpretability, Circuit-Discovery, Activation-Patching
- **Raw Source:** 10_Wiki/Topics/AI/Feature Clamping (피처 고정).md
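
## 🧪 Illustrative Sketch

A toy network can show what Constant Injection, Model Steering, and Ablation Study do. This is a minimal, hypothetical sketch in plain Python, not a reference implementation. The weights, the input, the choice of hidden unit 1 as "the feature", and the clamp values are all made up. A real model would be intervened on through a forward hook on a transformer layer, not the hand-written `hook` argument used here.

- `clamp_units` pins individual neurons. Pinning one to 0 is an ablation.
- `clamp_direction` instead fixes the component of the hidden state along a unit vector `d`, using `h' = h + (c - <h, d>) * d`. This is assumed here as one common way to clamp a feature direction rather than a single neuron.

```python
"""Toy feature clamping on a hand-written two-layer network.

All weights, inputs, feature indices, and clamp values are hypothetical.
"""
from typing import Callable, Dict, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (one row per output) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def forward(
    x: Vector,
    w1: Matrix,
    w2: Matrix,
    hook: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Run the toy network; `hook` may overwrite the hidden activations."""
    hidden = relu(matvec(w1, x))
    if hook is not None:
        hidden = hook(hidden)  # the clamping intervention is applied here
    return matvec(w2, hidden)


def clamp_units(values: Dict[int, float]) -> Callable[[Vector], Vector]:
    """Hook that replaces selected hidden units with fixed values.

    Clamping a unit to 0.0 ablates it.
    """
    def hook(hidden: Vector) -> Vector:
        out = list(hidden)
        for idx, val in values.items():
            out[idx] = val
        return out
    return hook


def clamp_direction(direction: Vector, value: float) -> Callable[[Vector], Vector]:
    """Hook that fixes the hidden state's component along `direction`.

    With d = direction / |direction|, h' = h + (value - <h, d>) * d, so
    <h', d> == value while components orthogonal to d stay unchanged.
    """
    norm = sum(a * a for a in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = [a / norm for a in direction]

    def hook(hidden: Vector) -> Vector:
        proj = sum(h * di for h, di in zip(hidden, d))
        return [h + (value - proj) * di for h, di in zip(hidden, d)]
    return hook


# Hypothetical toy weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2: Matrix = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.5]]
x: Vector = [1.0, 0.5]

baseline = forward(x, W1, W2)
steered = forward(x, W1, W2, clamp_units({1: 5.0}))  # pin unit 1 high
ablated = forward(x, W1, W2, clamp_units({1: 0.0}))  # pin unit 1 to zero
pinned = forward(x, W1, W2, clamp_direction([0.0, 1.0, 1.0], 0.0))

print("baseline:", baseline)
print("steered: ", steered)
print("ablated: ", ablated)
print("pinned:  ", pinned)
```

Clamping unit 1 high changes only the second output, the one that reads from that unit. That is the steering effect in miniature. Clamping the same unit to zero removes its contribution entirely, which is what an ablation measures. Clamping the direction `[0, 1, 1]` to 0 is a directional version of that ablation: it removes the hidden state's component along that direction while leaving the orthogonal components intact.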