---
id: [[P-Reinforce|P-Reinforce]]-AI-SELF-PLAY
category: "10_Wiki/💡 Topics/AI"
confidence_score: 0.98
tags: [ReinforcementLearning, SelfPlay, AlphaGo, Scale]
last_reinforced: 2026-04-20
---

# [[Self-Play (자기 대결 기반 강화학습)|Self-Play (Self-Play-Based Reinforcement Learning)]]

## 📌 One-Line Insight (The Karpathy Summary)

> "An algorithm that keeps evolving by beating yesterday's version of itself." Self-play is a reinforcement learning technique in which a model plays against itself, discovering new strategies and continuing to improve without any external data.

## 📖 Structured Knowledge (Synthesized Content)

- **Core Mechanism**:
  - In symmetric game environments such as Go and chess, the agent plays tens of millions of games against a copy of itself (Current vs Best-so-far).
  - Because it is not confined to human game records (Data), it finds creative, strong strategies on its own that humans had not considered.
- **Breakthrough Examples**:
  - **AlphaZero**: Given only the game rules and no human game records, it reached superhuman strength in Go, chess, and shogi through self-play alone.
  - **OpenAI Five**: Learned teamwork and high-level strategy through self-play in Dota 2.
- **Requirement**: Self-play depends on an exact reward signal (Winning/Losing) and on simulation that is fast enough to generate games at scale.

## ⚠️ Contradictions and Updates (RL Update)

- Self-play works very well where the rules are clear, as in games. In language-model (Chat) domains, where there is no single correct answer, a model that trains on copies of itself can homogenize its data and lose diversity. To counter this, the approach is being extended to setups in which multiple agents debate one another (Multi-agent debate).

## 🔗 Knowledge Links (Graph)

- Related: AlphaGo-Zero, [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]]
- [[Strategy|Strategy]]: Multi-Agent Debate (a strategy of debate among agents)
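
## 🧪 Example Sketch (Current vs Best-so-far)

The "Current vs Best-so-far" loop from the Core Mechanism can be illustrated with a toy sketch. This is not AlphaZero's or OpenAI Five's actual training procedure. It assumes a tiny game of Nim (take 1 to 3 stones, and whoever takes the last stone wins), a tabular policy, and random mutation standing in for gradient-based learning. Every name here (`play_game`, `mutate`, `challenger_wins`, `self_play_train`, `MAX_PILE`, `MAX_TAKE`) is hypothetical and chosen only for illustration.

```python
import random

MAX_PILE = 12  # assumption: largest starting pile in this toy game
MAX_TAKE = 3   # assumption: a player removes 1..MAX_TAKE stones per turn


def legal_moves(stones):
    """Moves available with `stones` left on the pile."""
    return list(range(1, min(MAX_TAKE, stones) + 1))


def play_game(first, second, stones):
    """Play one game of Nim; whoever takes the last stone wins.

    `first` and `second` are tabular policies (dict: stones -> move).
    Returns 0 if `first` wins, 1 if `second` wins.
    """
    players = (first, second)
    turn = 0
    while True:
        stones -= players[turn][stones]
        if stones == 0:
            return turn  # exact win/loss reward signal
        turn = 1 - turn


def random_policy(rng):
    """A policy that picks an arbitrary legal move for every pile size."""
    return {s: rng.choice(legal_moves(s)) for s in range(1, MAX_PILE + 1)}


def mutate(policy, rng):
    """Copy `policy` and re-draw the move for one randomly chosen pile size."""
    child = dict(policy)
    s = rng.randint(1, MAX_PILE)
    child[s] = rng.choice(legal_moves(s))
    return child


def challenger_wins(challenger, best):
    """Count challenger wins over every starting pile, in both seats."""
    wins = 0
    for start in range(1, MAX_PILE + 1):
        wins += (play_game(challenger, best, start) == 0)
        wins += (play_game(best, challenger, start) == 1)
    return wins


def self_play_train(generations=500, seed=0):
    """Keep a best-so-far policy; promote a challenger only on a majority win."""
    rng = random.Random(seed)
    best = random_policy(rng)
    promotions = 0
    total_games = 2 * MAX_PILE
    for _ in range(generations):
        challenger = mutate(best, rng)  # the "current" candidate
        if challenger_wins(challenger, best) > total_games // 2:
            best = challenger
            promotions += 1
    return best, promotions


if __name__ == "__main__":
    best, promotions = self_play_train()
    print("promotions:", promotions)
    print("best policy:", best)
```

No human game records enter the loop: the only training signal is the win/loss outcome of games between the challenger and the best-so-far policy. Both policies are deterministic, so each match is evaluated once per starting pile and seat. A stochastic policy would need many sampled games per evaluation, which is where simulation speed becomes the bottleneck.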