---
id: wiki-2026-0508-positive-reinforcement
title: Positive Reinforcement
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Operant Conditioning Reinforcement, Reward-based Learning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [behaviorism, psychology, reinforcement-learning, skinner, conditioning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: gymnasium, stable-baselines3
---

# Positive Reinforcement

## One-line summary

> **"Adding a desirable stimulus immediately after a behavior increases the frequency of that behavior."** This is the core mechanism of Skinner's operant conditioning (1938 onward). In modern AI it maps directly onto the reward signal in RL, and it is the conceptual root of RLHF, Constitutional AI, and DPO.

## Core concepts

### The four quadrants (operant conditioning)

| | Stimulus added (positive) | Stimulus removed (negative) |
|---|---|---|
| Behavior increases (reinforcement) | **Positive Reinforcement** (praise, reward) | Negative Reinforcement (a loud noise stops) |
| Behavior decreases (punishment) | Positive Punishment (scolding) | Negative Punishment (loss of privileges) |

"Positive" means a stimulus is added and "negative" means one is removed. Neither word means good or bad.

### Reinforcement schedules

- **Continuous (CRF)**: every response is rewarded. Fast acquisition, fast extinction.
- **Fixed Ratio (FR)**: reward after every N responses, as in piecework pay.
- **Variable Ratio (VR)**: reward after N responses on average, at unpredictable points, as in gambling or social media notifications. It produces the strongest responding and the most resistance to extinction.
- **Fixed Interval (FI)**: the first response after N seconds is rewarded.
- **Variable Interval (VI)**: the first response after a random interval averaging N seconds is rewarded. It produces a steady response rate.

### Connection to RL

- A positive reward signal r_t is the mathematical formalization of positive reinforcement.
- Policy gradient raises the probability of actions that were followed by reward, which closely mirrors positive reinforcement.
- RLHF turns human preferences into a reward model and then into policy updates: positive reinforcement at large scale.

### Applications

1. Education (token economy, gamification).
2. Animal training (clicker training).
3. ABA therapy for autism.
4. Workplace incentive design.
5. App engagement (variable reward, Hooked Model).
6. RL agent training (games, robotics, LLMs).

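### Sketch: rewarded actions become more probable

The RL connection above says that policy gradient raises the probability of rewarded actions. The following is a minimal sketch of that claim, not a reference implementation. It assumes a hypothetical two-action task in which only action 1 is followed by a reward of 1.0, and it uses a plain softmax policy with the REINFORCE update. The names `run_bandit`, `prefs`, and `lr`, and the reward rule itself, are illustrative assumptions.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_bandit(steps=200, lr=0.5, seed=0):
    """Toy two-action task (assumed): only action 1 earns a reward of 1.0."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]                 # start indifferent: p(action 1) = 0.5
    history = [softmax(prefs)[1]]      # p(action 1) before each update
    for _ in range(steps):
        probs = softmax(prefs)
        action = 1 if rng.random() < probs[1] else 0
        reward = 1.0 if action == 1 else 0.0   # hypothetical reward rule
        # REINFORCE: d log pi(action) / d prefs[k] = 1[k == action] - probs[k]
        for k in range(2):
            grad = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * reward * grad
        history.append(softmax(prefs)[1])
    return history


history = run_bandit()
print(f"p(action 1): start={history[0]:.2f}, end={history[-1]:.2f}")
```

Because unrewarded steps have zero reward, they leave the policy unchanged. Only the rewarded action shifts the preferences, so the probability of action 1 never decreases in this setup. With negative rewards or a baseline the same update can also lower probabilities, which is where the analogy to pure positive reinforcement stops.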
## 💻 Patterns

### Policy gradient (REINFORCE): positive reinforcement formalized

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy, optim, states, actions, rewards, gamma=0.99):
    # Discounted return, accumulated backwards from the last step.
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalize to reduce variance. Skipped for one-step episodes,
    # where std() would be NaN.
    if returns.numel() > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    logits = policy(torch.stack(states))
    logp = F.log_softmax(logits, dim=-1)
    chosen = logp.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()  # reward-weighted log-likelihood
    optim.zero_grad()
    loss.backward()
    optim.step()
```

### Reward shaping (sparse → dense)

```python
def shaped_reward(state, next_state, goal):
    # Progress = how much this step reduced the distance to the goal.
    progress = abs(state - goal) - abs(next_state - goal)
    # Sparse goal reward plus a small dense term: positive when the step
    # moves closer to the goal, negative when it moves away.
    return 1.0 if next_state == goal else 0.1 * progress
```

### Variable ratio schedule simulator

```python
import random

class VariableRatio:
    def __init__(self, mean_n=5):
        self.mean = mean_n  # integer mean number of responses per reward
        self.count = 0
        self.target = self._draw()

    def _draw(self):
        # Uniform on 1..2N-1, so the required response count averages N.
        return random.randint(1, 2 * self.mean - 1)

    def step(self):
        self.count += 1
        if self.count >= self.target:
            self.count = 0
            self.target = self._draw()
            return True  # reward
        return False
```

### Token economy (educational app)

```python
class TokenEconomy:
    def __init__(self):
        self.tokens = 0

    def reinforce(self, behavior, weight=1):
        # Add tokens right after the desired behavior (positive reinforcement).
        # `behavior` is a label only; it does not change the amount awarded.
        self.tokens += weight

    def redeem(self, cost, item):
        # Exchange tokens for a reward; returns None if the balance is too low.
        if self.tokens >= cost:
            self.tokens -= cost
            return item
        return None
```

### RLHF reward model (modern LLM positive reinforcement at scale)

```python
# Pseudocode: preference -> reward model -> PPO.
# `policy.sample`, `kl`, and `ppo_step` are placeholders, not a real library API.
def train_reward_model(prefs):
    # prefs: (chosen, rejected) pairs
    # Pairwise log-sigmoid loss: -log sigmoid(r(chosen) - r(rejected))
    return ...

def ppo_update(policy, ref, rm, prompts):
    completions = policy.sample(prompts)
    # Reward model score minus a KL penalty that keeps the policy near the reference.
    rewards = rm(prompts, completions) - kl(policy, ref)
    # Updating the policy toward higher reward is positive reinforcement at scale.
    return ppo_step(policy, prompts, completions, rewards)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Fast acquisition of a behavior | Continuous reinforcement (CRF) |
| Maintaining a behavior with resistance to extinction | Variable Ratio (VR) |
| Time-based task | Fixed/Variable Interval |
| RL agent | Reward shaping + sparse goal reward |
| LLM alignment | RLHF / DPO (preference-based) |
| Education / habit | Token economy + variable bonus |

**Default**: use CRF during the learning phase and VR during the maintenance phase. Prefer reinforcement over punishment.

## 🔗 Graph

- Parent: [[Operant_Conditioning]]
- Applications: [[Reinforcement_Learning]] · [[RLHF]] · [[Gamification]]
- Adjacent: [[Policy_Gradient]]

## 🤖 LLM usage

**When to use**: RL agent reward design, LLM RLHF/DPO pipeline design, gamification UX, behavior change apps.

**When not to use**: in areas driven by intrinsic motivation (creative work), over-reinforcement can cause motivation crowding-out.

## ❌ Anti-patterns

- **Reward hacking**: the agent exploits the reward signal and ignores the actual task (Goodhart's law). Design reward shaping carefully.
- **Confusing positive with "good"**: positive means a stimulus is added, not that it is good. Punishment can be positive too.
- **Continuous reinforcement only**: the behavior extinguishes quickly once rewards stop. Switch to VR for maintenance.
- **Punishment as default**: it induces fear and avoidance and lowers learning quality. Prefer reinforcement.
- **Delayed reward without bridging stimulus**: the association between behavior and reward is weak. Use a marker such as a clicker.

## 🧪 Verification / duplicates

- Verified (Skinner 1938 'The Behavior of Organisms', APA Dictionary, Sutton & Barto RL textbook, OpenAI RLHF papers).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: operant conditioning quadrants + RL/RLHF connection |