---
id: wiki-2026-0508-model-free-rl-vs-model-based-rl
title: Model-Free RL vs Model-Based RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [MFRL vs MBRL, Model-Based Reinforcement Learning, World Model]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reinforcement-learning, machine-learning, planning, world-model]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/JAX
---
# Model-Free RL vs Model-Based RL
## One-line summary
> **"Learn the environment dynamics or not: a trade between sample efficiency and simplicity."** Model-free methods (Q-learning, PPO) update the policy directly from sampled experience and its reward signal: simple but brittle. Model-based methods (Dreamer, MuZero) learn a world model and then train on imagined rollouts. As of 2026, Dreamer V3, EfficientZero, and DayDreamer (robotics deployment) report sample-efficiency gains of roughly 1-2 orders of magnitude over model-free baselines.
## Key points
### The dichotomy
- **Model-free**: learns $\pi(a|s)$ or $Q(s,a)$ directly, with no access to the transition function $p(s'|s,a)$.
- **Model-based**: learns $\hat{p}(s'|s,a)$ and $\hat{r}(s,a)$, then uses them for planning, imagined rollouts, or Dyna-style updates.
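
In update-rule form, the split can be sketched as follows. This uses standard tabular notation; the step size $\alpha$ and discount $\gamma$ are not defined in the text above and are introduced here as assumptions. The first line is the Q-learning update from a real transition $(s, a, r, s')$; the second is a planning backup that queries only the learned model.

$$
\begin{aligned}
\text{Model-free:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha\left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right] \\
\text{Model-based:}\quad & Q(s,a) \leftarrow \hat{r}(s,a) + \gamma \sum_{s'} \hat{p}(s'|s,a)\, \max_{a'} Q(s',a')
\end{aligned}
$$

Any error in $\hat{p}$ or $\hat{r}$ enters the second backup directly, which is why model-based performance is bounded by model quality.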
### Trade-off table
| Axis | Model-Free | Model-Based |
|---|---|---|
| Sample efficiency | Low | **High** (10-100×) |
| Compute per update | Low | **High** |
| Asymptotic perf | **Often higher** | Bounded by model error |
| Stability | **Stable** | Compounding model error |
| Transfer | Poor | **Better** (the model can be reused) |
| Implementation | **Simple** | Complex |
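
The "compounding model error" entry can be illustrated with a toy example. The sketch below assumes a scalar linear system $s' = a \cdot s$ and a learned model whose coefficient is off by 2%; the function names and all numbers are hypothetical and chosen only for illustration.

```python
def rollout(a, s0, horizon):
    """Iterate the scalar dynamics s' = a * s for `horizon` steps."""
    states = [s0]
    for _ in range(horizon):
        states.append(a * states[-1])
    return states

def rollout_error(a_true=1.00, a_model=1.02, s0=1.0, horizon=50):
    """Absolute gap between the model rollout and the true rollout at each step."""
    true_traj = rollout(a_true, s0, horizon)
    model_traj = rollout(a_model, s0, horizon)
    return [abs(m - t) for m, t in zip(model_traj, true_traj)]

errors = rollout_error()
for t in (1, 5, 15, 50):
    print(f"step {t:2d}: |model - true| = {errors[t]:.3f}")
```

A one-step error of 0.02 grows with every imagined step because each prediction is fed back in as the next input, which is the usual argument for keeping imagined rollouts short.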
### Modern flavors
- **Model-free**: PPO, SAC, DQN family, TD3.
- **Model-based**: Dreamer V3 (RSSM), MuZero (planning + value tree), TD-MPC2, PILCO (Gaussian process).
- **Hybrid**: MBPO (model-generated rollouts → SAC), Dyna-Q.
### Applications
1. Robotics (sample-efficient sim-to-real).
2. Atari/board game (MuZero).
3. Drug design (sample-efficient exploration).
4. Game NPC behavior (PPO is still the default).
## 💻 Patterns
### PPO — model-free policy gradient (gymnasium)
```python
import torch, torch.nn as nn, torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared-trunk actor-critic for discrete action spaces."""
    def __init__(self, obs_dim, n_act):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_act)  # policy logits
        self.v = nn.Linear(64, 1)       # state-value head

    def forward(self, x):
        h = self.shared(x)
        return Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

def ppo_step(ac, opt, batch, clip=0.2, vf_c=0.5, ent_c=0.01):
    # batch carries tensors obs, act, logp_old, adv, ret collected under the old policy
    dist, v = ac(batch.obs)
    logp = dist.log_prob(batch.act)
    ratio = torch.exp(logp - batch.logp_old)  # pi_new / pi_old
    surr1 = ratio * batch.adv
    surr2 = torch.clamp(ratio, 1 - clip, 1 + clip) * batch.adv
    pi_loss = -torch.min(surr1, surr2).mean()  # clipped surrogate objective
    v_loss = F.mse_loss(v, batch.ret)
    ent = dist.entropy().mean()                # entropy bonus encourages exploration
    loss = pi_loss + vf_c * v_loss - ent_c * ent
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```
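
The `ppo_step` function above consumes `batch.adv` and `batch.ret` without showing where they come from. One common choice is Generalized Advantage Estimation (GAE); the sketch below is a plain-Python version under that assumption. The function name `gae`, its argument layout, and the default `lam=0.95` are hypothetical and not taken from the note.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    dones[t] is True when the episode terminated after step t.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # one-step TD error; no bootstrap across an episode boundary
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # exponentially weighted sum of TD errors, reset at episode boundaries
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    returns = [a + v for a, v in zip(adv, values)]  # value-function targets
    return adv, returns
```

In a PyTorch pipeline the two lists would be converted to tensors (and the advantages often normalized) before being placed on the batch.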
### Dreamer-style world model (RSSM skeleton)
```python
import torch, torch.nn as nn

class RSSM(nn.Module):
    """Recurrent state-space model: deterministic state h plus stochastic latent z."""
    def __init__(self, obs_dim, act_dim, h=200, z=32):
        super().__init__()
        self.gru = nn.GRUCell(z + act_dim, h)
        self.prior = nn.Linear(h, 2 * z)           # prior mean and log-std
        self.post = nn.Linear(h + obs_dim, 2 * z)  # posterior mean and log-std
        self.dec_obs = nn.Linear(h + z, obs_dim)   # observation decoder
        self.dec_rew = nn.Linear(h + z, 1)         # reward decoder

    def step(self, h, z, a, obs=None):
        h = self.gru(torch.cat([z, a], -1), h)
        pri_mu, pri_log = self.prior(h).chunk(2, -1)
        if obs is not None:
            # observation available: sample z from the posterior
            po_mu, po_log = self.post(torch.cat([h, obs], -1)).chunk(2, -1)
            z = po_mu + torch.exp(po_log) * torch.randn_like(po_mu)
        else:
            # imagination: sample z from the prior
            z = pri_mu + torch.exp(pri_log) * torch.randn_like(pri_mu)
        return h, z, (pri_mu, pri_log)

    def imagine(self, h, z, policy, T=15):
        # roll the model forward T steps without touching the environment
        states = []
        for _ in range(T):
            a = policy(torch.cat([h, z], -1))
            h, z, _ = self.step(h, z, a)
            states.append((h, z))
        return states
```
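
The skeleton's `step` returns the prior parameters but stops short of a training loss. In Gaussian-latent RSSMs, a KL term between posterior and prior is typically part of the world-model objective. The sketch below computes that term in closed form in plain Python, assuming (as the `torch.exp` calls suggest) that the second half of each chunk is a log standard deviation. The function name is hypothetical, and Dreamer V3 itself uses categorical latents, so this covers the Gaussian case only.

```python
import math

def kl_diag_gaussian(mu_q, log_std_q, mu_p, log_std_p):
    """KL(q || p) between diagonal Gaussians given as lists of means and log-stds."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_std_q, mu_p, log_std_p):
        var_q = math.exp(2.0 * lq)
        var_p = math.exp(2.0 * lp)
        # per-dimension closed form: log(sp/sq) + (vq + (mq - mp)^2) / (2 vp) - 1/2
        total += (lp - lq) + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total

# posterior shifted by one unit from a unit-variance prior
print(kl_diag_gaussian([1.0], [0.0], [0.0], [0.0]))  # 0.5
```

With tensors, the same quantity is available through `torch.distributions.kl_divergence` on two `Normal` objects, summed over the latent dimension.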
### Dyna-Q (hybrid — tabular)
```python
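import random
from collections import defaultdict
import numpy as np

def dyna_q(env, n_planning=10, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Dyna-Q; assumes a gymnasium-style env with discrete, hashable states."""
    n_act = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_act))
    model = {}  # (s, a) -> (r, s', terminated); stores the last observed outcome
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(n_act) if np.random.random() < eps \
                else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # direct RL update from the real transition (no bootstrap past a terminal state)
            Q[s][a] += alpha * (r + gamma * Q[s2].max() * (not terminated) - Q[s][a])
            model[(s, a)] = (r, s2, terminated)
            for _ in range(n_planning):  # planning updates from replayed model transitions
                (sp, ap), (rp, sp2, termp) = random.choice(list(model.items()))
                Q[sp][ap] += alpha * (rp + gamma * Q[sp2].max() * (not termp) - Q[sp][ap])
            s = s2
    return Q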
```
### MuZero (planning sketch — value/policy net + MCTS)
```python
# The environment is treated as a black box; the agent learns representation,
# dynamics, and prediction heads.
# Search runs MCTS over imagined trajectories; training uses replayed
# (search policy, search value, n-step return) targets.
# (For a full implementation, see the muzero_general repo.)
```
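
Of the three training targets named in the sketch above, the n-step return is the easiest to write down. The following is a minimal plain-Python version, assuming a stored episode with one reward and one value estimate per step; the function name, argument layout, and default `gamma` are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """Discounted n-step return from step t, bootstrapped with values[t + n].

    rewards[k] is the reward received after acting at step k; values[k] is the
    value estimate at step k. Past the end of the episode the bootstrap is zero.
    """
    T = len(rewards)
    end = min(t + n, T)
    # discounted sum of the rewards actually observed inside the window
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        # window ends before the episode does: bootstrap from the value estimate
        ret += gamma ** n * values[t + n]
    return ret

# 1 + 0.5 * 1 + 0.25 * 10
print(n_step_return([1, 1, 1, 1], [10, 10, 10, 10], t=0, n=2, gamma=0.5))  # 4.0
```

In a MuZero-style setup the `values` entries would be the search values recorded during self-play rather than raw network outputs.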
## Decision criteria
| Situation | Approach |
|---|---|
| Lots of cheap simulation | **Model-free** (PPO/SAC) — simpler |
| Real-robot, expensive samples | **Model-based** (Dreamer V3, TD-MPC2) |
| Discrete board game | **MuZero** (planning wins) |
| Continuous control benchmark | SAC or DreamerV3 |
| Fast prototype | PPO — most stable, easiest to tune |
| Long-horizon planning | Model-based + planning |
**Default**: prototype with PPO. When samples are expensive, use Dreamer V3 or TD-MPC2.
## 🔗 Graph
- Parent: [[Reinforcement Learning]] · [[Markov Decision Process]]
- Variants: [[PPO]] · [[SAC]] · [[Dreamer V3]] · [[MuZero]] · [[TD-MPC2]]
- Applications: [[Sim-to-Real]] · [[Robotics RL]] · [[AlphaZero]]
- Adjacent: [[World Model]] · [[Planning]] · [[Dyna-Q]]
## 🤖 LLM usage
**When to use**: trade-off explanation, algorithm choice, pseudocode skeleton.
**When not to use**: hyperparameters. Cross-check them against the specific paper (for Dreamer V3, read the paper's sensitivity discussion carefully).
## ❌ Anti-patterns
- **Reaching for MBRL by default**: in cheap-simulation environments PPO wins because it is simpler.
- **Imagined rollouts with too long a horizon**: model error compounds; 5-15 steps is typical.
- **Hoping MFRL copes with sparse rewards**: add an exploration bonus (RND, ICM) or switch to a model-based method.
- **Using MuZero on a small problem**: overkill; tabular Q-learning is enough.
- **Single-seed reports**: RL variance is huge; report the IQM over 5+ seeds (Agarwal et al. 2021).
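
For the last item, the interquartile mean (IQM) can be sketched as a 25% trimmed mean over per-seed scores. This is a simplified illustration with a hypothetical function name and made-up scores; it omits the stratified bootstrap confidence intervals that usually accompany the metric.

```python
def iqm(scores):
    """Interquartile mean: average of the scores left after trimming 25% from each end."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of runs dropped from each tail
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)

seed_returns = [1, 2, 3, 4, 5, 6, 7, 100]  # made-up final returns for 8 seeds
print(iqm(seed_returns))                      # 4.5
print(sum(seed_returns) / len(seed_returns))  # 16.0
```

The single outlier seed drags the plain mean to 16.0 while the IQM stays at 4.5, which is why one lucky run can make a single-seed or mean-only report misleading.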
## 🧪 Verification / duplicates
- Verified (Sutton & Barto 2nd ed. 2018; Hafner _DreamerV3_ 2023; Schrittwieser _MuZero_ Nature 2020).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: MFRL/MBRL trade-off and DreamerV3/MuZero tidy-up |