---
category: Unified
tags: [auto-consolidated, technical-documentation]
title: [[Exploration vs Exploitation|Exploration vs Exploitation]]
last_updated: 2026-05-02
---

# [[Exploration vs Exploitation|Exploration vs Exploitation]]

## 📌 Brief Summary

> "Weighing adventure against staying put: the perennial strategic dilemma between choosing the best option already known to lock in a sure gain (Exploitation) and exploring new territory that may hold a larger reward (Exploration)."

---

> "Find the optimal betting point between safe present returns and uncertain future possibilities." A core dilemma of reinforcement learning: the trade-off between repeating the best known action to collect reward (Exploitation) and trying something new to find a better action (Exploration).

## 📖 Core Content

Exploration vs Exploitation is a central trade-off problem in reinforcement learning and decision theory.

1. **Two concepts**:
    * **Exploitation**: Repeat the action that has yielded the highest reward in past experience. Optimizes short-term return.
    * **Exploration**: Try new actions about which little information is available. Opens the possibility of finding a better optimum in the long run.
2. **Strategies**:
    * **Epsilon-Greedy**: Exploit most of the time (probability $1-\epsilon$) and explore at random with probability $\epsilon$.
    * **UCB (Upper Confidence Bound)**: Give extra weight to uncertainty (options that have rarely been tried) to steer exploration toward them.
    * **Thompson Sampling**: Choose flexibly based on a probability distribution over each action's reward.

---

- **Extracted pattern:** An adaptive decision-making pattern for maximizing cumulative reward under limited resources (time, energy): explore broadly at first, then concentrate on the best choice as information accumulates.
- **Key strategies:**
    - **$\epsilon$-greedy:** Take a random action with a small probability $\epsilon$ and the best known action with the remaining probability $1-\epsilon$.
    - **Softmax:** Select each action with a probability that grows with its estimated value (proportional to the exponentiated value, scaled by a temperature).
    - **Upper Confidence Bound (UCB):** Add a bonus to actions with high uncertainty so they are explored first.
    - **Thompson Sampling:** Model a probability distribution over each action's reward and decide what to explore by sampling from it.
- **Significance:** Committing to exploitation too early traps the agent in a local optimum, while exploring too much fails to collect enough reward.

## ⚖️ Trade-offs & Caveats

- **Conflict with past data**: Earlier thinking held that settling into a fixed, exploit-only policy as quickly as possible was the efficient choice. The current view (RL Update) is that the more complex the environment, the more it pays to build a curiosity-driven policy into the system so that it keeps exploring to the end, which is believed to yield the most capable intelligence. (Linked to Reinforcement Learning)
- **Policy change (RL Update)**: In business strategy, this trade-off is the theoretical basis for "ambidextrous management": balancing reliance on the existing revenue model (Exploitation) against developing new businesses (Exploration). (Linked to [[Strategic-Planning|Strategic-Planning]])

---

- **Conflict with past data:** Exploration has evolved from random search that was simply left to luck into "smart" exploration grounded in mathematics (UCB and similar methods).
- **Policy change:** The knowledge retrieval agent in the Antigravity project is designed to go beyond showing only the documents most relevant to a user's question (Exploitation): it occasionally suggests documents with unexpected connections (Exploration) to support creative insight.

## 🔗 Knowledge Connections

- [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Multi-Armed Bandit (MAB), [[Decision Theory|Decision Theory]], [[Strategic-Planning|Strategic-Planning]], [[Optimization|Optimization]]
- **Modern Tech/Tools**: Recommender [[_system|system]]s (Exploration balance), A/B [[Testing|Testing]] algorithms.

---

- [[Reinforcement-Learning|Reinforcement-Learning]], Q-Learning-Foundations, Multi-Armed-Bandit-MAB, Decision-Making
- **Raw Source:** 10_Wiki/Topics/AI/Exploration-vs-Exploitation.md
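
As a rough illustration of the strategies listed under Core Content, the sketch below runs $\epsilon$-greedy, UCB, and Thompson Sampling on a toy multi-armed bandit. Everything specific here is an assumption made for the example and does not come from the source note: the Bernoulli (0/1) rewards, the arm probabilities in `PROBS`, $\epsilon = 0.1$, the UCB1 form of the confidence bonus, the uniform Beta(1, 1) prior, and the helper names `run_bandit`, `epsilon_greedy`, `ucb1`, and `thompson`.

```python
import math
import random

# Hypothetical success probability of each arm (illustrative values only).
PROBS = [0.1, 0.5, 0.9]


def _mean(sums, counts, arm):
    """Average observed reward of an arm, or 0.0 if it was never pulled."""
    return sums[arm] / counts[arm] if counts[arm] else 0.0


def epsilon_greedy(counts, sums, t, rng, epsilon=0.1):
    """Explore a random arm with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: _mean(sums, counts, a))


def ucb1(counts, sums, t, rng):
    """Pick the arm with the highest mean plus an uncertainty bonus (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:  # pull every arm once before the bonus is defined
            return arm
    return max(
        range(len(counts)),
        key=lambda a: _mean(sums, counts, a)
        + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def thompson(counts, sums, t, rng):
    """Sample each arm's Beta posterior and pick the largest sample."""
    samples = [
        rng.betavariate(sums[a] + 1.0, counts[a] - sums[a] + 1.0)
        for a in range(len(counts))
    ]
    return max(range(len(samples)), key=lambda a: samples[a])


def run_bandit(select, probs, steps=5000, seed=0):
    """Play `steps` rounds; return per-arm pull counts and total reward."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    sums = [0.0] * len(probs)
    total = 0.0
    for t in range(1, steps + 1):
        arm = select(counts, sums, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


if __name__ == "__main__":
    for strategy in (epsilon_greedy, ucb1, thompson):
        counts, total = run_bandit(strategy, PROBS)
        print(f"{strategy.__name__:>15}: pulls={counts} reward={total:.0f}")
```

In this setup all three strategies should end up pulling the highest-probability arm most often, but they get there differently: $\epsilon$-greedy keeps spending a fixed share of pulls on random arms, while UCB1 and Thompson Sampling reduce exploration of an arm as evidence about it accumulates.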