---
id: RL-ENV-001
category: "[[10_Wiki/💡 Topics/AI]]"
confidence_score: 1.0
tags: [reinforcement-learning, ai, environment-design, mdp, simulation]
last_reinforced: 2026-04-26
---

# [[Environment Design in RL]]

## 📌 One-Line Insight (The Karpathy Summary)

> "What an agent learns is determined by the environment it is placed in and by the structure of its rewards." Environment design is the process of modeling the state space, action space, transition probabilities, and reward function rigorously, in both mathematical and engineering terms, so that a reinforcement learning model can effectively learn the target behavior.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** Abstract the complex real world into a Markov decision process (MDP), then tune the frequency and magnitude of rewards so that the agent is steered in the desired direction (the reward engineering pattern).
- **Key elements:**
  - **State Space (S):** Include only the information needed for learning, and design it to avoid the curse of dimensionality.
  - **Action Space (A):** Decide whether actions are continuous or discrete.
  - **Reward Function (R):** Introduce reward shaping to address the sparse reward problem, where rewards occur only rarely.
  - **Simulator Fidelity:** Balance the precision of the simulated environment against its computation speed.
- **Significance:** The environment an agent is trained in determines the model's final performance and safety as much as the algorithm does.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with earlier data:** Practice has evolved from granting a large reward only when the final goal is reached to giving hints about intermediate progress (shaping) in order to control learning difficulty.
- **Policy change:** When training the fleet combat AI for the Skybound project, the environment variables were designed from multiple angles, covering not only enemy kills but also protection of allied units and fuel efficiency, to induce a balanced strategy.
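
The shift from goal-only rewards to shaped rewards described above can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than something taken from the note: the `Corridor` environment, the discount factor `GAMMA`, and the choice of potential-based shaping, which adds `gamma * phi(s') - phi(s)` to the environment reward, with `phi` set to the negative distance to the goal. Potential-based shaping is one well-known form of reward shaping; the note itself does not say which form it has in mind.

```python
from dataclasses import dataclass

GAMMA = 0.99  # discount factor (assumed value, for illustration only)


@dataclass
class Corridor:
    """Hypothetical 1-D corridor MDP: states 0..length-1, goal at the right end."""
    length: int = 6
    state: int = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        """Action 0 moves left, action 1 moves right.

        Returns (next_state, reward, done).
        """
        move = 1 if action == 1 else -1
        # Clamp to the corridor so the agent cannot leave the state space.
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # sparse: only reaching the goal pays out
        return self.state, reward, done


def potential(state: int, length: int) -> float:
    """Assumed potential function: negative distance to the goal state."""
    return -float(length - 1 - state)


def shaped_reward(reward, state, next_state, length, gamma=GAMMA):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state, length) - potential(state, length)


def rollout(actions, length=6):
    """Run a fixed action sequence and collect sparse and shaped rewards."""
    env = Corridor(length=length)
    state = env.reset()
    sparse, shaped = [], []
    for action in actions:
        next_state, reward, done = env.step(action)
        sparse.append(reward)
        shaped.append(shaped_reward(reward, state, next_state, length))
        state = next_state
        if done:
            break
    return sparse, shaped


if __name__ == "__main__":
    sparse, shaped = rollout([1, 1, 1, 1, 1])
    print("sparse:", sparse)
    print("shaped:", [round(r, 2) for r in shaped])
```

With the sparse reward, every step before the goal returns zero, so the agent gets no signal about whether it is making progress. With the shaped reward, each step toward the goal is rewarded immediately and a step away from it is penalized, while the goal itself stays the highest-paying transition.
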
## 🔗 Knowledge Links (Graph)

- [[Reinforcement-Learning]], [[Markov-Decision-Process-MDP]], [[Reward-Shaping]], [[Simulation-Principles]]
- **Raw Source:** [[10_Wiki/Topics/AI/Environment-Design-in-RL.md]]