---
id: RL-POLICY-DIFF-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, reinforcement-learning, on-policy, off-policy, q-learning, sarsa]
last_reinforced: 2026-04-26
---

# Off-policy vs On-policy Learning

## 📌 One-Line Insight (The Karpathy Summary)

> "Do I learn directly from the path I am walking right now (On), or do I mine the truth from other people's footprints and old diaries (Off)?" A taxonomy of reinforcement learning algorithms based on whether the policy the agent is learning (Target Policy) is the same as the policy it actually acts with (Behavior Policy).

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Direct Experience vs Decoupled Learning". A strategic choice between two patterns: in one, the agent's current behavior, including its random exploratory actions, feeds directly into the next update, which favors stability (On-policy); in the other, past experience data is reused (Experience Replay) to maximize data efficiency (Off-policy).
- **Key differences:**
  - **On-policy (e.g., SARSA, PPO):** Updates value estimates from the actions the agent actually took. Learning is stable, but exploration data is used wastefully, because samples collected under an older policy generally cannot be reused.
  - **Off-policy (e.g., Q-learning, DQN, SAC):** Updates as if the optimal action were taken, and can also learn from another agent's recorded experience. Data efficiency is far higher, but learning can be unstable.
- **Significance:** In real-world robot control and game AI design, this is a core design criterion for deciding whether to prioritize data collection cost or training stability.

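The on-policy/off-policy distinction above can be seen most directly in the tabular update rules of SARSA and Q-learning. The following is a minimal sketch, not a full training loop: the nested-list Q-table layout, the function names, and the `alpha`/`gamma` values are illustrative assumptions, and terminal-state handling is omitted. The only difference between the two updates is which next-state value they bootstrap from.

```python
from typing import List

# Hypothetical tabular layout for illustration: Q[state][action] -> value.
QTable = List[List[float]]


def sarsa_update(Q: QTable, s: int, a: int, r: float, s_next: int,
                 a_next: int, alpha: float = 0.1, gamma: float = 0.99) -> None:
    """On-policy: bootstrap from the action the behavior policy actually took."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q: QTable, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Off-policy: bootstrap from the greedy action, whatever was actually taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def demo() -> tuple:
    """Apply both updates to the same transition and return the updated Q[0][0]."""
    q_on = [[0.0, 0.0], [1.0, 5.0]]
    q_off = [row[:] for row in q_on]  # independent copy of the same table
    # Transition: state 0, action 0, reward 1.0, next state 1.
    # The behavior policy explored and picked the worse action 0 in state 1.
    sarsa_update(q_on, 0, 0, 1.0, 1, a_next=0, alpha=0.5, gamma=0.9)
    q_learning_update(q_off, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
    return q_on[0][0], q_off[0][0]
```

In `demo`, SARSA's estimate reflects the exploratory action that was actually taken, while Q-learning's estimate assumes the greedy action would be taken next. Because the Q-learning target does not depend on `a_next`, the same update can be applied to transitions drawn from a replay buffer or from another agent's logs.
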
## ⚠️ Contradictions & RL Update

- **Conflict with earlier notes:** Off-policy learning is not unconditionally efficient. When the distribution of the training data drifts too far from that of the current policy, training can fail to converge. This is the "Deadly Triad" problem, which arises when off-policy learning is combined with bootstrapping and function approximation. More careful sampling techniques (such as Prioritized Experience Replay) have become standard modern practice for mitigating it.
- **Policy change:** When teaching an agent new tool-use skills, the Antigravity project first builds up capability from large volumes of off-policy simulation data, then applies on-policy methods in the live stage to fine-tune the policy safely and precisely.

## 🔗 Knowledge Links (Graph)

- [[Reinforcement-Learning|Reinforcement-Learning]], [[Markov-Decision-Process-MDP|Markov-Decision-Process-MDP]], Experience-Replay-Strategies, Proximal-Policy-Optimization-PPO
- **Raw Source:** 10_Wiki/Topics/AI/Off-policy-vs-On-policy-Learning.md