---
id: [[P-Reinforce|P-Reinforce]]-AUTO-FTAL-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, fine-tuning, alignment, sft, rlhf, dpo, llm-training]
last_reinforced: 2026-05-04
---

# [[Fine-Tuning & Alignment|Fine-Tuning & Alignment]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Turning a wild model into a gentleman: the careful sculpting that takes a pre-trained model full of raw knowledge, teaches it the norms of human language and how to follow instructions, and aligns its values so that it becomes a genuinely 'usable' tool."

## 📖 Synthesized Content

Fine-tuning and alignment are the post-training steps needed to get the most out of a large language model (LLM) and to adapt it to a specific purpose.

1. **SFT (Supervised Fine-Tuning)**:
    * **Definition**: The stage in which the model learns to follow instructions from high-quality [question, answer] pairs.
    * **Role**: Gives the model a "voice" for surfacing the knowledge it already has, and teaches it a particular style or output format.
2. **RLHF (Reinforcement Learning from Human Feedback)**:
    * **Definition**: A technique that uses human preferences to align the model toward being more helpful and safer.
    * **Process**: [SFT] $\rightarrow$ [Reward Model training] $\rightarrow$ [policy optimization with a reinforcement learning algorithm such as PPO].
3. **DPO (Direct Preference Optimization)**:
    * **Definition**: An efficient alternative that optimizes the model directly on preference data, without a separate reward model or a reinforcement learning loop.
    * **Advantages**: The pipeline is simpler and training tends to be more stable than PPO-based RLHF. DPO has been adopted in the alignment stage of major models, including recent Llama releases.
4. **Grokking**:
    * The phenomenon in which a model that has already memorized (overfit) its training data suddenly, with continued training, picks up the actual rule (algorithm) behind the data, and its generalization performance jumps sharply.
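
To make the DPO item above more concrete, here is a minimal sketch of the per-example DPO loss. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy being trained and a frozen reference model (typically the SFT checkpoint). The names `dpo_loss` and `log_sigmoid` and the default `beta=0.1` are illustrative choices, not taken from this note or from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls how far the policy may drift
) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference triple.

    Each argument is the log-probability of a full response given the prompt.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen response gains on the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere in the computation: the preference pair and the reference log-probabilities are enough, which is why the pipeline is shorter than [SFT] $\rightarrow$ [Reward Model] $\rightarrow$ [PPO]. In practice the loss would be averaged over a batch and backpropagated through the policy log-probabilities only.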
## ⚖️ Trade-offs & Caveats

* **Catastrophic Forgetting**: If a model is fine-tuned too aggressively on one task, it can lose general knowledge or other capabilities it originally had.
* **Alignment Tax**: If a model is aligned too heavily toward safety (over-alignment), it can refuse legitimate questions with "I can't answer that" and become less creative.
* **Smiling Facade**: Some critics argue that RLHF does not fix a model's internal flaws but only puts a "mask" on it, so that its answers look plausible on the surface.

## 🔗 Knowledge Links (Graph)

* **Parent concept**: [[LLM Training Pipeline|LLM Training Pipeline]]
* **Detailed techniques**: [[PEFT & LoRA|PEFT & LoRA]], [[RLHF & DPO|RLHF & DPO]], [[Constitutional AI|Constitutional AI]]
* **Related models**: [[DeepSeek-R1|DeepSeek-R1]], [[Claude|Claude]], [[Llama|Llama]]

---
*Last updated: 2026-05-04*