--- id: RL-EX-BAL-001 category: "10_Wiki/πŸ’‘ Topics/AI" confidence_score: 1.0 tags: [reinforcement-learning, ai, decision-making, exploration, exploitation] last_reinforced: 2026-04-26 --- # Exploration vs Exploitation (탐색과 ν™œμš©μ˜ κ· ν˜•) ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "μ•ˆμ „ν•œ ν˜„μž¬μ˜ 수읡과 λΆˆν™•μ‹€ν•œ 미래의 κ°€λŠ₯μ„± μ‚¬μ΄μ—μ„œ 졜적의 λ°°νŒ… 지점을 찾아라" β€” κ°•ν™”ν•™μŠ΅μ˜ 핡심 λ”œλ ˆλ§ˆλ‘œ, 이미 μ•Œκ³  μžˆλŠ” μ΅œμ„ μ˜ 행동을 λ°˜λ³΅ν•˜μ—¬ 보상을 μ–»λŠ” 것(Exploitation)κ³Ό 더 λ‚˜μ€ 행동을 μ°ΎκΈ° μœ„ν•΄ μƒˆλ‘œμš΄ μ‹œλ„λ₯Ό ν•˜λŠ” 것(Exploration) μ‚¬μ΄μ˜ νŠΈλ ˆμ΄λ“œμ˜€ν”„. ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) - **μΆ”μΆœλœ νŒ¨ν„΄:** μ œν•œλœ μžμ›(μ‹œκ°„, μ—λ„ˆμ§€) λ‚΄μ—μ„œ λˆ„μ  보상을 κ·ΉλŒ€ν™”ν•˜κΈ° μœ„ν•΄ μ΄ˆκΈ°μ—λŠ” κ΄‘λ²”μœ„ν•˜κ²Œ νƒμƒ‰ν•˜κ³ , 정보가 μŒ“μΌμˆ˜λ‘ μ΅œμ„ μ˜ 선택에 μ§‘μ€‘ν•˜λŠ” μ μ‘ν˜• μ˜μ‚¬κ²°μ • νŒ¨ν„΄. - **μ£Όμš” μ „λž΅:** - **$\epsilon$-greedy:** μ•„μ£Ό μž‘μ€ ν™•λ₯ ($\epsilon$)둜 λ¬΄μž‘μœ„ 행동을 ν•˜κ³ , λ‚˜λ¨Έμ§€ ν™•λ₯ λ‘œ μ΅œμ„ μ˜ 행동 μˆ˜ν–‰. - **Softmax:** 보상 κ°€μΉ˜μ— λΉ„λ‘€ν•œ ν™•λ₯ λ‘œ 행동 선택. - **Upper Confidence Bound (UCB):** λΆˆν™•μ‹€μ„±μ΄ 큰 행동에 가산점을 μ£Όμ–΄ μš°μ„ μ μœΌλ‘œ 탐색. - **Thompson Sampling:** ν™•λ₯  뢄포λ₯Ό λͺ¨λΈλ§ν•˜μ—¬ μƒ˜ν”Œλ§ 기반으둜 탐색 κ²°μ •. - **의의:** λ„ˆλ¬΄ 빨리 ν™œμš©μ—λ§Œ μ§‘μ€‘ν•˜λ©΄ μ§€μ—­ μ΅œμ ν•΄(Local Optima)에 κ°‡νžˆκ³ , λ„ˆλ¬΄ νƒμƒ‰λ§Œ ν•˜λ©΄ 보상을 μΆ©λΆ„νžˆ μ–»μ§€ λͺ»ν•¨. ## ⚠️ λͺ¨μˆœ 및 μ—…λ°μ΄νŠΈ (Contradictions & RL Update) - **κ³Όκ±° λ°μ΄ν„°μ™€μ˜ 좩돌:** λ‹¨μˆœνžˆ '운'에 맑기던 λ¬΄μž‘μœ„ νƒμƒ‰μ—μ„œ, μˆ˜ν•™μ  κ·Όκ±°(UCB λ“±)λ₯Ό λ°”νƒ•μœΌλ‘œ 'λ˜‘λ˜‘ν•˜κ²Œ' νƒμƒ‰ν•˜λŠ” λ°©μ‹μœΌλ‘œ μ§„ν™”. - **μ •μ±… λ³€ν™”:** Antigravity ν”„λ‘œμ νŠΈμ˜ 지식 검색 μ—μ΄μ „νŠΈλŠ” μ‚¬μš©μžμ˜ μ§ˆλ¬Έμ— λŒ€ν•΄ κ°€μž₯ κ΄€λ ¨μ„± 높은 λ¬Έμ„œλ§Œ λ³΄μ—¬μ£ΌλŠ” 것(Exploitation)을 λ„˜μ–΄, 가끔은 μ˜μ™Έμ˜ μ—°κ²° 고리λ₯Ό κ°€μ§„ λ¬Έμ„œλ₯Ό μ œμ•ˆ(Exploration)ν•˜μ—¬ 창의적 톡찰을 돕도둝 섀계됨. ## πŸ”— 지식 μ—°κ²° (Graph) - [[Reinforcement-Learning]], Q-Learning-Foundations, Multi-Armed-Bandit-MAB, Decision-Making - **Raw Source:** 10_Wiki/Topics/AI/Exploration-vs-Exploitation.md