--- id: P-REINFORCE-AI-DQN category: "10_Wiki/๐Ÿ’ก Topics/AI" confidence_score: 0.97 tags: [ReinforcementLearning, DQN, DeepMind, QLearning] last_reinforced: 2026-04-20 --- # [[Deep Q-Networks (DQN)|Deep Q-Networks (DQN)]] ## ๐Ÿ“Œ ํ•œ ์ค„ ํ†ต์ฐฐ (The Karpathy Summary) > "๊ณ ์ „ ๊ฒŒ์ž„๊ธฐ๋ฅผ ์ •๋ณตํ•œ ๋”ฅ๋Ÿฌ๋‹๊ณผ ๊ฐ•ํ™”ํ•™์Šต์˜ ์‚ฌ์ƒ ์ฒซ ๋ฒˆ์งธ ๊ฒฐํ•ฉ." ์ƒํƒœ ๊ฐ€์น˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ณ ์ „์ ์ธ Q-Learning์— ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์„ ๋„์ž…ํ•˜์—ฌ ํ”ฝ์…€ ์ •๋ณด๋งŒ์œผ๋กœ ์ธ๊ฐ„ ์ด์ƒ์˜ ๊ฒŒ์ž„ ์‹ค๋ ฅ์„ ๋‹ฌ์„ฑํ•œ ๊ธฐ๋…๋น„์  ๋…ผ๋ฌธ์ด๋‹ค. ## ๐Ÿ“– ๊ตฌ์กฐํ™”๋œ ์ง€์‹ (Synthesized Content) - **Key Innovations**: - **Deep Neural Network as Q-Function**: ๋ณต์žกํ•˜๊ณ  ๊ณ ์ฐจ์›์ ์ธ ์ƒํƒœ(์˜ˆ: ํ™”๋ฉด ํ”ฝ์…€)๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๊ฐ ํ–‰๋™์˜ ๊ฐ€์น˜๋ฅผ ๊ณ„์‚ฐํ•˜๋„๋ก CNN์„ ์‚ฌ์šฉํ•จ. - **Experience Replay**: ๊ฒฝํ—˜ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ•ด๋‘๊ณ  ๋ฌด์ž‘์œ„๋กœ ์ถ”์ถœํ•˜์—ฌ ํ•™์Šตํ•จ์œผ๋กœ์จ ๋ฐ์ดํ„ฐ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„(Correlation)๋ฅผ ๋Š๊ณ  ์•ˆ์ •์„ฑ์„ ํ™•๋ณดํ•จ. - **Target Network**: ๊ฐ€์น˜ ์˜ˆ์ธก๊ฐ’๊ณผ ๋ชฉํ‘œ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋Š” ๋„คํŠธ์›Œํฌ๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ํ•™์Šต ์ค‘ ๋ชฉํ‘œ๊ฐ’์ด ์š”๋™์น˜๋Š” ํ˜„์ƒ์„ ๋ฐฉ์ง€ํ•จ. - **Legacy**: ์•„ํƒ€๋ฆฌ(Atari) ๊ฒŒ์ž„ ์ •๋ณต์„ ํ†ตํ•ด ํ˜„๋Œ€ ์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต(Deep RL) ์‹œ๋Œ€๋ฅผ ์—ด์—ˆ๋‹ค. ## โš ๏ธ ๋ชจ์ˆœ ๋ฐ ์—…๋ฐ์ดํŠธ (RL Update) - DQN์€ ๊ฐ€์น˜ ๊ธฐ๋ฐ˜(Value-based) ๋ฐฉ์‹์ด๊ธฐ์— ํ–‰๋™ ๊ณต๊ฐ„์ด ์—ฐ์†์ ์ธ(Continuous) ๋ฌธ์ œ์—๋Š” ์ ์šฉํ•˜๊ธฐ ์–ด๋ ต๋‹ค. ๋˜ํ•œ ๊ฐ€์น˜ ๊ฐ’์„ ๊ณผ๋Œ€ํ‰๊ฐ€(Overestimation)ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์–ด, ์ด๋ฅผ ๋ณด์™„ํ•œ Double DQN, Dueling DQN ๋“ฑ์œผ๋กœ ์ง„ํ™”ํ•˜์˜€๋‹ค. ## ๐Ÿ”— ์ง€์‹ ์—ฐ๊ฒฐ (Graph) - Related: [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]] , [[Bellman-Equation|Bellman-Equation]] - Contrast: Policy Gradient Methods