---
id: RL-MAB-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, reinforcement-learning, multi-armed-bandit, exploration-exploitation, optimization]
last_reinforced: 2026-04-26
---

# Multi-armed Bandit Problem (Multiple Slot-Machine Problem)

## 📌 One-Line Insight (The Karpathy Summary)
> "To find the slot machine with the best payout, balance pulling the machine you already know (Exploit) against trying a new one (Explore)." The most basic sequential decision-making model for resolving the exploration-exploitation dilemma so as to maximize return under limited resources.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Dynamic Allocation under Uncertainty": without knowing which option is best, collect data and progressively concentrate resources on the more promising options in order to minimize regret.
- **Key algorithms:**
  - **$\epsilon$-Greedy:** Choose the best-looking option most of the time, but with a small probability ($\epsilon$) try something else.
  - **UCB (Upper Confidence Bound):** Add a bonus to options whose reward is highly uncertain (variance), which steers the policy toward exploring them.
  - **Thompson Sampling:** Sample from a (Bayesian) probability distribution over each option's reward and choose according to the draw.
- **Significance:** A core tool for business decisions where real-time feedback matters, such as optimizing A/B tests in recommender systems, clinical trials for new drugs, and controlling online ad exposure.

## ⚠️ Contradictions & Updates (Contradictions & RL Update)
- **Conflict with past data:** The field has moved beyond simply finding the "highest mean". It now also addresses non-stationary environments, where reward probabilities change over time, and makes use of context information (Contextual Bandit).
- **Policy change:** In the Antigravity project, when an agent chooses which of several tools best fits the current problem, it applies Thompson sampling over each tool's past success rate to build its tool-usage strategy.

## 🔗 Knowledge Links (Graph)
- [[Reinforcement-Learning]], [[Monte-Carlo-Tree-Search-MCTS]], Expected-Utility-Theory, A-B-Testing-Optimization
- **Raw Source:** 10_Wiki/Topics/AI/Multi-armed-Bandit-Problem.md
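
To make the three selection rules named in this note ($\epsilon$-Greedy, UCB, Thompson Sampling) concrete, here is a minimal Python sketch. It rests on assumptions that are not in the note: rewards are Bernoulli (success or failure), UCB is shown in its common UCB1 form with the bonus $\sqrt{2 \ln t / n}$, Thompson Sampling uses a uniform Beta(1, 1) prior, and every function name and parameter (`epsilon_greedy`, `ucb1`, `thompson`, `run`, `true_rates`) is hypothetical. It is an illustration of the ideas, not the Antigravity project's actual implementation.

```python
import math
import random


def epsilon_greedy(successes, failures, rng, epsilon=0.1):
    """With probability epsilon pick a random arm, else the best empirical mean."""
    n_arms = len(successes)
    if rng.random() < epsilon:
        return rng.randrange(n_arms)
    means = [s / (s + f) if s + f else 0.0 for s, f in zip(successes, failures)]
    return max(range(n_arms), key=means.__getitem__)


def ucb1(successes, failures, rng):
    """Empirical mean plus a bonus that shrinks as an arm is pulled more often."""
    pulls = [s + f for s, f in zip(successes, failures)]
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # pull every arm once before scoring
    total = sum(pulls)
    scores = [s / n + math.sqrt(2 * math.log(total) / n)
              for s, n in zip(successes, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)


def thompson(successes, failures, rng):
    """Draw a success rate per arm from its Beta posterior; pick the largest draw."""
    # Assumption: Beta(1, 1) prior, so the posterior is Beta(s + 1, f + 1).
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run(policy, true_rates, steps=2000, seed=0):
    """Simulate a Bernoulli bandit; return pulls per arm and expected regret."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(steps):
        arm = policy(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    pulls = [s + f for s, f in zip(successes, failures)]
    best = max(true_rates)
    # Expected regret: reward given up by not always pulling the best arm.
    regret = sum((best - p) * n for p, n in zip(true_rates, pulls))
    return pulls, regret


if __name__ == "__main__":
    rates = [0.2, 0.5, 0.8]  # hypothetical per-arm success probabilities
    for policy in (epsilon_greedy, ucb1, thompson):
        pulls, regret = run(policy, rates)
        print(f"{policy.__name__:15s} pulls={pulls} expected regret={regret:.1f}")
```

For the tool-selection case described under Policy change, one plausible mapping is to treat each tool as an arm, keep a success and failure count per tool, and call `thompson` on those counts before each invocation. A non-stationary setting would additionally need some way to discount or window old counts, which this sketch does not attempt.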