---
id: P-REINFORCE-AUTO-EXEX-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 0.96
tags: [auto-reinforced, exploration, exploitation, reinforcement-learning, multi-armed-bandit, strategy]
last_reinforced: 2026-04-20
---

# [[Exploration vs Exploitation|Exploration vs Exploitation]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Weighing adventure against staying put: the perpetual strategic dilemma between taking the best option you already know to lock in a sure gain (Exploitation) and venturing into new territory that may hold a larger reward (Exploration)."

## 📖 Structured Knowledge (Synthesized Content)

Exploration vs. exploitation is a core trade-off in reinforcement learning and decision theory.

1. **The two concepts**:
    * **Exploitation**: Repeat the action that has paid off best in past experience. Optimizes short-term return.
    * **Exploration**: Try new actions about which little is known. Opens the possibility of finding a better optimum in the long run.
2. **Solution strategies**:
    * **Epsilon-Greedy**: Exploit most of the time (with probability $1-\epsilon$) and explore at random with probability $\epsilon$.
    * **UCB (Upper Confidence Bound)**: Add a bonus for uncertainty (actions that have rarely been tried) so that under-sampled actions get explored.
    * **Thompson Sampling**: Sample from a probability distribution over each action's reward and choose according to the samples, so the choice adapts as evidence accumulates.

## ⚠️ Contradictions & Updates (Contradictions & RL Update)

- **Conflict with earlier data**: It was once considered efficient to settle into a "stay-put" (purely exploitative) policy as quickly as possible. The current view is that the more complex the environment, the more it pays to build a "curiosity" policy into the system so that it keeps exploring to the end, and that this is what produces the most capable intelligence (RL Update). (Linked to Reinforcement Learning)
- **Policy change (RL Update)**: In business strategy, this trade-off is the theoretical basis for "ambidextrous management": balancing reliance on the existing revenue model (Exploitation) against the search for new lines of business (Exploration). (Linked to Strategic-Planning)

## 🔗 Knowledge Links (Graph)

- [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Multi-Armed Bandit (MAB), [[Decision Theory|Decision Theory]], [[Strategic-Planning|Strategic-Planning]], [[Optimization|Optimization]]
- **Modern Tech/Tools**: Recommender systems (exploration balance), A/B testing algorithms.

---
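
The three strategies named in the note (Epsilon-Greedy, UCB, Thompson Sampling) can be compared on a toy multi-armed bandit. The following is a minimal sketch under stated assumptions, not a reference implementation: the Bernoulli reward model, the arm probabilities, the `run_bandit` helper, all function names, the UCB1 bonus $\sqrt{2 \ln t / n}$, and the Beta(1, 1) prior for Thompson Sampling are illustrative choices that do not come from the note itself.

```python
import math
import random


def epsilon_greedy(counts, sums, rng, epsilon=0.1):
    """With probability epsilon explore a random arm, otherwise exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    # Untried arms are given a mean of 0.0 (an assumption of this sketch).
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=means.__getitem__)


def ucb1(counts, sums, rng):
    """Pick the arm with the highest mean plus an uncertainty bonus (UCB1 variant)."""
    for arm, c in enumerate(counts):
        if c == 0:  # play every arm once so the bonus is defined
            return arm
    total = sum(counts)
    scores = [
        s / c + math.sqrt(2.0 * math.log(total) / c)  # rarely tried arms get a larger bonus
        for s, c in zip(sums, counts)
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def thompson(counts, sums, rng):
    """Sample each arm's success rate from a Beta posterior and play the largest sample."""
    samples = [
        rng.betavariate(1.0 + s, 1.0 + (c - s))  # Beta(1 + successes, 1 + failures)
        for s, c in zip(sums, counts)
    ]
    return max(range(len(counts)), key=samples.__getitem__)


def run_bandit(select, probs, steps=5000, seed=0):
    """Hypothetical helper: play `steps` rounds and return (pull counts, total reward)."""
    rng = random.Random(seed)
    counts = [0] * len(probs)   # how often each arm was pulled
    sums = [0.0] * len(probs)   # total reward collected per arm
    for _ in range(steps):
        arm = select(counts, sums, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
    return counts, sum(sums)


if __name__ == "__main__":
    arm_probs = [0.2, 0.5, 0.8]  # illustrative success rates; the last arm is best
    for strategy in (epsilon_greedy, ucb1, thompson):
        pulls, total_reward = run_bandit(strategy, arm_probs)
        print(f"{strategy.__name__:15s} pulls={pulls} reward={total_reward:.0f}")
```

All three selectors share the same signature, so they can be swapped inside `run_bandit` without other changes. The pull counts show how each one divides its budget between the best-looking arm (exploitation) and the others (exploration).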