---
id: wiki-2026-0508-model-free-rl-vs-model-based-rl
title: Model-Free RL vs Model-Based RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [MFRL vs MBRL, Model-Based Reinforcement Learning, World Model]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reinforcement-learning, machine-learning, planning, world-model]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/JAX
---
# Model-Free RL vs Model-Based RL
## One-line summary
> **"Learn the environment dynamics, or don't: a trade between sample efficiency and simplicity."** Model-free methods (Q-learning, PPO) update the policy from the reward signal alone, which is simple but brittle. Model-based methods (Dreamer, MuZero) learn a world model and then train on imagined rollouts. As of 2026, DreamerV3, EfficientZero, and DayDreamer (robotics deployment) report sample-efficiency gains of roughly 1-2 orders of magnitude.
## Key points
### The dichotomy
- **Model-free**: learns $\pi(a|s)$ or $Q(s,a)$ directly. It has no access to the transition function $p(s'|s,a)$.
- **Model-based**: learns $\hat{p}(s'|s,a)$ and $\hat{r}(s,a)$, then uses them for planning, imagined rollouts, or Dyna-style updates.
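
To make the dichotomy concrete, here is a minimal tabular sketch. The names `q_learning_update`, `CountModel`, and `planned_q` are hypothetical and not from any library; the count-based estimator is just one simple way to learn $\hat{p}$ and $\hat{r}$, assuming small discrete state and action sets.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (illustrative value)

def q_learning_update(Q, s, a, r, s2, alpha=0.5):
    """Model-free: move Q(s, a) toward the sampled one-step target."""
    target = r + GAMMA * max(Q[s2].values(), default=0.0)
    Q[s][a] += alpha * (target - Q[s][a])

class CountModel:
    """Model-based: estimate p(s'|s,a) and r(s,a) from transition counts."""
    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def observe(self, s, a, r, s2):
        self.next_counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def p_hat(self, s, a):
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.next_counts[(s, a)].items()}

    def r_hat(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

def planned_q(model, V, s, a):
    """One Bellman backup through the learned model; no environment call."""
    return model.r_hat(s, a) + GAMMA * sum(
        p * V.get(s2, 0.0) for s2, p in model.p_hat(s, a).items())

# Feed the same two transitions to both learners
Q = defaultdict(lambda: defaultdict(float))
model = CountModel()
for s, a, r, s2 in [("A", "go", 0.0, "B"), ("A", "go", 1.0, "C")]:
    q_learning_update(Q, s, a, r, s2)
    model.observe(s, a, r, s2)
```

The model-free learner only ever stores `Q`, while the model-based learner can be queried with any value table `V` after the fact, which is where the reuse and planning benefits come from.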
### Trade-off table
| Axis | Model-Free | Model-Based |
|---|---|---|
| Sample efficiency | Low | **High** (10-100×) |
| Compute per update | Low | **High** |
| Asymptotic perf | **Often higher** | Bounded by model error |
| Stability | **Stable** | Compounding model error |
| Transfer | Poor | **Better** (the model can be reused) |
| Implementation | **Simple** | Complex |
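
The "compounding model error" entry can be illustrated with a toy one-dimensional system. Both `true_step` and `model_step` are made-up dynamics (the 2% gain error is an arbitrary assumption), so this is only a sketch of the effect, not a measurement from any real model.

```python
def rollout(step_fn, s0, horizon):
    """Apply a one-step dynamics function repeatedly; return all visited states."""
    states = [s0]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states

def true_step(s):
    # Hypothetical ground-truth dynamics
    return 1.00 * s + 0.1

def model_step(s):
    # Hypothetical learned model with a 2% gain error
    return 1.02 * s + 0.1

def rollout_errors(s0=1.0, horizon=30):
    """Absolute gap between the real and the imagined state at every step."""
    real = rollout(true_step, s0, horizon)
    imagined = rollout(model_step, s0, horizon)
    return [abs(x - y) for x, y in zip(real, imagined)]

errors = rollout_errors()
```

The gap starts at zero and widens faster with every step, because each prediction is fed back in as the next input. This is the usual argument for keeping imagined rollouts short.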
### Modern flavors
- **Model-free**: PPO, SAC, DQN family, TD3.
- **Model-based**: DreamerV3 (RSSM), MuZero (planning + value tree), TD-MPC2, PILCO (Gaussian process).
- **Hybrid**: MBPO (model-generated rollouts → SAC), Dyna-Q.
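
The hybrid idea behind MBPO, short model rollouts branched from real states and handed to an off-policy learner, can be sketched as follows. The function name and the `model_step(s, a) -> (s_next, r)` and `policy(s) -> a` signatures are assumptions for illustration, and the toy model and policy are stand-ins so the block runs on its own.

```python
import random

def branched_rollouts(model_step, policy, real_states, k=5, n_branches=4, rng=None):
    """Start short imagined rollouts from states sampled out of real data.

    model_step(s, a) -> (s_next, r) is the learned model (hypothetical signature).
    policy(s) -> a is the current policy (hypothetical signature).
    Returns synthetic (s, a, r, s_next) tuples for an off-policy learner such as SAC.
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_branches):
        s = rng.choice(real_states)      # branch point comes from real experience
        for _ in range(k):               # keep k small to limit model error
            a = policy(s)
            s_next, r = model_step(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic

# Toy stand-ins so the sketch runs end to end
toy_model = lambda s, a: (s + a, -abs(s + a))
toy_policy = lambda s: -1 if s > 0 else 1
batch = branched_rollouts(toy_model, toy_policy, real_states=[-3, 0, 4],
                          k=5, n_branches=4, rng=random.Random(0))
```

In a full pipeline the synthetic tuples would be mixed into the replay buffer alongside real transitions; the mixing ratio and rollout length `k` are tuning choices not covered here.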
### Applications
1. Robotics (sample-efficient sim-to-real).
2. Atari/board game (MuZero).
3. Drug design (sample-efficient exploration).
4. Game NPC behavior (PPO is still the default).
## 💻 Patterns
### PPO — model-free policy gradient (gymnasium)
```python
import torch, torch.nn as nn, torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared-trunk actor-critic for discrete action spaces."""
    def __init__(self, obs_dim, n_act):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_act)   # policy logits
        self.v = nn.Linear(64, 1)        # state value

    def forward(self, x):
        h = self.shared(x)
        return Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

def ppo_step(ac, opt, batch, clip=0.2, vf_c=0.5, ent_c=0.01):
    # batch must provide obs, act, logp_old, adv, ret as tensors from the rollout
    dist, v = ac(batch.obs)
    logp = dist.log_prob(batch.act)
    ratio = torch.exp(logp - batch.logp_old)                 # pi_new / pi_old
    surr1 = ratio * batch.adv
    surr2 = torch.clamp(ratio, 1-clip, 1+clip) * batch.adv
    pi_loss = -torch.min(surr1, surr2).mean()                # clipped surrogate
    v_loss = F.mse_loss(v, batch.ret)
    ent = dist.entropy().mean()                              # entropy bonus
    loss = pi_loss + vf_c * v_loss - ent_c * ent
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```
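
`ppo_step` expects `batch.adv` and `batch.ret` to exist already, but the note does not say how they are computed. One common choice is Generalized Advantage Estimation (GAE); the sketch below assumes that choice and uses plain Python lists so it runs without PyTorch. The function name `gae` and its argument layout are hypothetical.

```python
def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the value
    estimate for the state reached after the final step.
    Returns (advantages, returns), both lists of length len(rewards).
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0      # cut bootstrapping at episode end
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    returns = [a + v for a, v in zip(adv, values)]
    return adv, returns

# With gamma = lam = 1 and zero values, advantages reduce to reward-to-go
adv, ret = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [False, False, True],
               last_value=5.0, gamma=1.0, lam=1.0)
```

The lists would be converted to tensors (and the advantages usually normalized) before being placed on the batch.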
### Dreamer-style world model (RSSM skeleton)
```python
import torch, torch.nn as nn

class RSSM(nn.Module):
    def __init__(self, obs_dim, act_dim, h=200, z=32):
        super().__init__()
        self.gru = nn.GRUCell(z + act_dim, h)        # deterministic recurrent path
        self.prior = nn.Linear(h, 2 * z)             # mean, log-std
        self.post = nn.Linear(h + obs_dim, 2 * z)    # mean, log-std given obs
        self.dec_obs = nn.Linear(h + z, obs_dim)     # observation decoder
        self.dec_rew = nn.Linear(h + z, 1)           # reward decoder

    def step(self, h, z, a, obs=None):
        h = self.gru(torch.cat([z, a], -1), h)
        pri_mu, pri_log = self.prior(h).chunk(2, -1)
        if obs is not None:
            # Posterior: sample z conditioned on the real observation
            po_mu, po_log = self.post(torch.cat([h, obs], -1)).chunk(2, -1)
            z = po_mu + torch.exp(po_log) * torch.randn_like(po_mu)
        else:
            # Prior: sample z from the model alone (imagination)
            z = pri_mu + torch.exp(pri_log) * torch.randn_like(pri_mu)
        return h, z, (pri_mu, pri_log)

    def imagine(self, h, z, policy, T=15):
        # Roll the model forward T steps without touching the environment
        states = []
        for _ in range(T):
            a = policy(torch.cat([h, z], -1))
            h, z, _ = self.step(h, z, a)
            states.append((h, z))
        return states
```
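
The skeleton returns the prior statistics from `step`, which a training loop would typically compare against the posterior with a KL term. A minimal sketch of that term for diagonal Gaussians parameterized by mean and log standard deviation (the parameterization the skeleton uses) is shown below. It is written with plain lists so it runs without PyTorch; the function name is hypothetical, and the same expression applies elementwise to tensors.

```python
import math

def diag_gaussian_kl(mu_q, log_std_q, mu_p, log_std_p):
    """KL(q || p) for diagonal Gaussians given means and log standard deviations."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_std_q, mu_p, log_std_p):
        var_q = math.exp(2.0 * lq)
        var_p = math.exp(2.0 * lp)
        # Per-dimension closed form: log(sp/sq) + (sq^2 + (mq-mp)^2) / (2 sp^2) - 1/2
        total += lp - lq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total

kl_same = diag_gaussian_kl([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])
kl_shifted = diag_gaussian_kl([1.0], [0.0], [0.0], [0.0])
```

Here `q` would be the posterior and `p` the prior. Exposing the posterior statistics from `step` would be a small change to the skeleton, which is left as is.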
### Dyna-Q (hybrid — tabular)
```python
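import random
from collections import defaultdict
import numpy as np

def dyna_q(env, n_planning=10, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # Assumes a gymnasium env with discrete (hashable) observations and actions
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    model = {}  # (s, a) -> (r, s', terminated); deterministic one-step model
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = np.random.randint(env.action_space.n) if np.random.random() < eps \
                else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Real step: do not bootstrap from a terminal state
            Q[s][a] += alpha * (r + gamma * Q[s2].max() * (not terminated) - Q[s][a])
            model[(s, a)] = (r, s2, terminated)
            for _ in range(n_planning):  # imagined steps replayed from the model
                (sp, ap), (rp, sp2, tp) = random.choice(list(model.items()))
                Q[sp][ap] += alpha * (rp + gamma * Q[sp2].max() * (not tp) - Q[sp][ap])
            s = s2
    return Q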
```
### MuZero (planning sketch — value/policy net + MCTS)
```python
# The environment is a black box; the network has learned representation, dynamics, and prediction heads.
# MCTS searches over imagined trajectories; training replays (search policy, search value, n-step return) targets.
# (Full implementation: see the muzero_general repo.)
```
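
Of the three training targets named in the sketch above, the n-step return is the easiest to show in isolation. The following is a minimal version under stated assumptions: the function name, the default `n` and `gamma`, and the choice to treat the bootstrap value as 0 past the end of the episode are all illustrative, not taken from the MuZero code.

```python
def n_step_targets(rewards, search_values, n=5, gamma=0.99):
    """Value targets z_t = sum_{k<n} gamma^k * r_{t+k} + gamma^n * v_{t+n}.

    rewards[t] is the reward received after acting at step t;
    search_values[t] is the search (MCTS root) value at step t.
    Past the end of the episode the bootstrap value is treated as 0.
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        z = 0.0
        for k in range(n):
            if t + k >= T:
                break
            z += (gamma ** k) * rewards[t + k]
        if t + n < T:
            z += (gamma ** n) * search_values[t + n]   # bootstrap from search value
        targets.append(z)
    return targets

targets = n_step_targets([1.0, 1.0, 1.0, 1.0], [10.0, 20.0, 30.0, 40.0],
                         n=2, gamma=0.5)
```

The first two targets bootstrap from a later search value, while the last two fall off the end of the episode and use rewards only.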
## Decision criteria
| Situation | Approach |
|---|---|
| Lots of cheap simulation | **Model-free** (PPO/SAC): simpler |
| Real robot, expensive samples | **Model-based** (DreamerV3, TD-MPC2) |
| Discrete board game | **MuZero**: planning wins |
| Continuous control benchmark | SAC or DreamerV3 |
| Fast prototype | PPO: most stable, easiest to tune |
| Long-horizon planning | Model-based + planning |

**Default**: prototype with PPO. When samples are expensive, use DreamerV3 / TD-MPC2.
## 🔗 Graph
- Parent: [[Reinforcement Learning]]
- Variant: [[PPO]]
- Adjacent: [[World Model]] · [[Planning]]
## 🤖 LLM usage
**When to use**: trade-off explanations, algorithm choice, pseudocode skeletons.
**When not to**: hyperparameters. Cross-check them against the specific paper (for DreamerV3, read the paper's sensitivity discussion carefully).
## ❌ Anti-patterns
- **Reaching for MBRL by default**: in cheap-simulation environments PPO wins because it is simpler.
- **Imagined rollouts with too long a horizon**: model error compounds; 5-15 steps is typical.
- **Hoping model-free RL copes with sparse rewards**: add an exploration bonus (RND, ICM) or switch to a model-based method.
- **Using MuZero on a small problem**: overkill; tabular Q-learning is enough.
- **Single-seed reports**: RL variance is huge; report IQM over 5+ seeds (Agarwal et al. 2021).
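
For the last point, the interquartile mean (IQM) is the mean of the middle 50% of scores. A minimal sketch follows; the helper name is hypothetical, and the trimming convention (drop `floor(0.25 * n)` values from each end) is an assumption that may differ slightly from what a given evaluation library does.

```python
import math

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    if not scores:
        raise ValueError("iqm needs at least one score")
    xs = sorted(scores)
    cut = math.floor(0.25 * len(xs))      # number of values dropped from each end
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)

# Eight seeds, one of which is an outlier
seed_scores = [1, 2, 3, 4, 5, 6, 7, 100]
plain_mean = sum(seed_scores) / len(seed_scores)
robust = iqm(seed_scores)
```

The single outlier seed pulls the plain mean far above the typical run, while the IQM stays close to the bulk of the scores.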
## 🧪 Verification / duplicates
- Verified (Sutton & Barto 2nd ed. 2018; Hafner _DreamerV3_ 2023; Schrittwieser _MuZero_ Nature 2020).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: MFRL/MBRL trade-off, DreamerV3/MuZero tidy-up |