--- id: [[P-Reinforce|P-Reinforce]]-AI-IMITATION category: Dev confidence_score: 0.95 tags: [AI, ReinforcementLearning, ImitationLearning, [[Robotics|Robotics]]] last_reinforced: 2026-04-20 --- # [[Imitation-Learning|Imitation-Learning]] (λͺ¨λ°© ν•™μŠ΅) ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "맨땅에 ν—€λ”©ν•˜μ§€ 말고, 슀승의 μ‹œλ²”μ„ 보고 λ°°μ›ŒλΌ." 보상 ν•¨μˆ˜κ°€ μ—†κ±°λ‚˜ μ •μ˜ν•˜κΈ° μ–΄λ €μšΈ λ•Œ, μ „λ¬Έκ°€(인간 λ“±)의 μ‹œμ—° 데이터λ₯Ό λͺ¨λ°©ν•˜μ—¬ 정책을 ν•™μŠ΅μ‹œν‚€λŠ” 방식이닀. ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) - **Why Imitation?**: κ°•ν™”ν•™μŠ΅μ—μ„œ ν¬μ†Œν•œ 보상(Sparse Reward) λ¬Έμ œλŠ” ν•™μŠ΅μ„ λΆˆκ°€λŠ₯ν•˜κ²Œ ν•œλ‹€. μ „λ¬Έκ°€μ˜ 자취λ₯Ό λ”°λΌκ°€λŠ” 것은 훨씬 λΉ λ₯Έ 경둜λ₯Ό μ œκ³΅ν•œλ‹€. - **Methods**: - **[[Behavior|Behavior]]al Cloning (BC)**: μ‹œμ—° 데이터λ₯Ό λ‹¨μˆœν•œ 지도 ν•™μŠ΅(Supervised Learning)으둜 ν•™μŠ΅. (데이터 λ°–μ˜ 상황에 μ·¨μ•½) - **Inverse Reinforcement Learning (IRL)**: μ „λ¬Έκ°€μ˜ ν–‰λ™μœΌλ‘œλΆ€ν„° κ·Έκ°€ μΆ”κ΅¬ν•˜λŠ” '보상 ν•¨μˆ˜'λ₯Ό μ—­μœΌλ‘œ 좔둠함. - **GAIL (Generative Adversarial Imitation Learning)**: GAN ꡬ쑰λ₯Ό ν™œμš©ν•΄ μ‹œμ—°μžμ™€ ꡬ뢄이 μ•ˆ λ˜λŠ” 행동을 ν•˜λ„λ‘ ν•™μŠ΅. - **Domain**: μžμœ¨μ£Όν–‰, λ‘œλ΄‡ νŒ” μ œμ–΄, κ°œμΈν™”λœ μ—μ΄μ „νŠΈ. ## ⚠️ λͺ¨μˆœ 및 μ—…λ°μ΄νŠΈ (RL Update) - λͺ¨λ°© ν•™μŠ΅μ˜ 치λͺ…적 ν•œκ³„λŠ” 'μŠ€μŠΉλ³΄λ‹€ μž˜ν•  수 μ—†λ‹€'λŠ” 것과 μ‹œμ—° 데이터에 μ—†λŠ” 상황(Out-of-distribution)을 λ§Œλ‚˜λ©΄ λ¬΄λ„ˆμ§„λ‹€λŠ” 것이닀. 이λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ λͺ¨λ°© ν•™μŠ΅μœΌλ‘œ 초기 정책을 작고, 이후 κ°•ν™”ν•™μŠ΅(RL)으둜 슀슀둜 νƒν—˜ν•˜λ©° ν•œκ³„λ₯Ό λŒνŒŒν•˜λŠ” ν•˜μ΄λΈŒλ¦¬λ“œ μ „λž΅μ΄ μ£Όλ₯˜λ‹€. ## πŸ”— 지식 μ—°κ²° (Graph) - Related: [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]] , [[Inverse-Reinforcement-Learning|Inverse-Reinforcement-Learning]] - Comparison: RLHF (인간 ν”Όλ“œλ°± 기반 κ°•ν™”ν•™μŠ΅)