"매 행동 직후 desirable stimulus 추가 → 그 행동 빈도 증가.". Skinner의 operant conditioning 핵심 mechanism (1938~). Modern AI에서 매 RL의 reward signal과 직접 연결되며, RLHF / Constitutional AI / DPO의 conceptual root.
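Core
The four quadrants (operant conditioning)

| | Stimulus added (positive) | Stimulus removed (negative) |
|---|---|---|
| Behavior increases (reinforcement) | Positive Reinforcement (praise, reward) | Negative Reinforcement (a loud noise stops) |
| Behavior decreases (punishment) | Positive Punishment (scolding) | Negative Punishment (loss of privileges) |

"Positive" means a stimulus is added and "negative" means one is removed. Neither word means good or bad.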
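The two axes of the quadrant can be sketched as a small lookup. This is only an illustration; the function name `classify_consequence` and its boolean parameters are hypothetical and do not come from the original note.

```python
def classify_consequence(stimulus_added: bool, behavior_increases: bool) -> str:
    """Name the operant-conditioning quadrant for a consequence."""
    # "Positive"/"Negative" describes whether a stimulus is added or removed.
    sign = "Positive" if stimulus_added else "Negative"
    # "Reinforcement"/"Punishment" describes the effect on behavior frequency.
    effect = "Reinforcement" if behavior_increases else "Punishment"
    return f"{sign} {effect}"


# Praise is added and the behavior becomes more frequent.
print(classify_consequence(stimulus_added=True, behavior_increases=True))
# A privilege is removed and the behavior becomes less frequent.
print(classify_consequence(stimulus_added=False, behavior_increases=False))
```

Keeping the two axes as separate booleans makes it hard to conflate "positive" with "good": the sign depends only on whether a stimulus is added.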
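Reinforcement schedules
Continuous (CRF): reward after every response. Fast acquisition, fast extinction.
Fixed Ratio (FR): reward after every N responses, e.g., piecework pay.
Variable Ratio (VR): reward after an average of N responses, unpredictably, e.g., gambling, social media notifications. The strongest schedule and the most resistant to extinction.
Fixed Interval (FI): reward for the first response after N seconds have elapsed.
Variable Interval (VI): reward for the first response after a random interval averaging N seconds. Produces a steady response rate.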
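The schedules above can be sketched as small simulators that decide whether a given response is reinforced. This is a minimal sketch under stated assumptions: the class names `RatioSchedule` and `IntervalSchedule` are hypothetical, the variable-ratio requirement is drawn uniformly from 1 to 2N-1 (one simple distribution with mean N), and the variable interval is drawn from an exponential distribution with mean N seconds.

```python
import random


class RatioSchedule:
    """Reinforce after a required number of responses.

    variable=False gives Fixed Ratio (FR-n); n=1 is continuous reinforcement (CRF).
    variable=True gives a Variable Ratio schedule whose requirement averages n.
    """

    def __init__(self, n: int, variable: bool = False, seed: int = 0):
        self.n = n
        self.variable = variable
        self.rng = random.Random(seed)
        self.count = 0
        self.required = self._next_requirement()

    def _next_requirement(self) -> int:
        if not self.variable:
            return self.n
        # Assumption: uniform on 1..2n-1, which has mean n.
        return self.rng.randint(1, 2 * self.n - 1)

    def respond(self) -> bool:
        """Register one response; return True if it is reinforced."""
        self.count += 1
        if self.count >= self.required:
            self.count = 0
            self.required = self._next_requirement()
            return True
        return False


class IntervalSchedule:
    """Reinforce the first response made after the current interval has elapsed."""

    def __init__(self, seconds: float, variable: bool = False, seed: int = 0):
        self.seconds = seconds
        self.variable = variable
        self.rng = random.Random(seed)
        self.available_at = self._next_time(0.0)

    def _next_time(self, now: float) -> float:
        if not self.variable:
            return now + self.seconds
        # Assumption: exponential waiting time with mean `seconds`.
        return now + self.rng.expovariate(1.0 / self.seconds)

    def respond(self, now: float) -> bool:
        """Register a response at time `now`; return True if it is reinforced."""
        if now >= self.available_at:
            self.available_at = self._next_time(now)
            return True
        return False


fr3 = RatioSchedule(n=3)
print([fr3.respond() for _ in range(6)])  # every third response is reinforced

fi10 = IntervalSchedule(seconds=10.0)
print([fi10.respond(t) for t in (2.0, 9.0, 11.0, 12.0, 21.5)])
```

Under the ratio schedules the learner controls the reward rate by responding faster, while under the interval schedules extra responses before the interval elapses earn nothing.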
RL connection
Reward signal r_t: the mathematical formalization of positive reinforcement.
Policy gradient: raises the probability of actions that were followed by reward, which is exactly positive reinforcement.
RLHF: human preference → reward model → policy update, i.e., positive reinforcement at large scale.
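The policy-gradient statement above can be written out as follows. This is the REINFORCE estimator in the form it is commonly implemented; the baseline symbol b and the return notation G_t are introduced here for illustration and do not appear in the original note.

```latex
\[
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} \bigl(G_t - b\bigr)\,
      \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k .
\]
```

When G_t - b is positive, the update raises the log-probability of a_t in s_t, which is the analogue of positive reinforcement. When it is negative, the update lowers that probability.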
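```python
import torch
import torch.nn.functional as F


def reinforce_step(policy, optim, states, actions, rewards, gamma=0.99):
    # REINFORCE update for one episode.
    # Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards.
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)
    # Normalize to reduce variance; skip one-step episodes, where std() is NaN.
    if returns.numel() > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    logits = policy(torch.stack(states))
    logp = F.log_softmax(logits, dim=-1)
    chosen = logp.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    # Reward-weighted log-likelihood: well-rewarded actions become more likely.
    loss = -(chosen * returns).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
```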
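Reward shaping (sparse → dense)

```python
def shaped_reward(state, next_state, goal):
    # Progress = reduction in distance to the goal (negative when moving away).
    progress = abs(state - goal) - abs(next_state - goal)
    # Full reward at the goal; otherwise a small dense signal proportional to progress.
    return 1.0 if next_state == goal else 0.1 * progress
```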
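One commonly cited way to add a dense signal with less risk of changing what the agent ultimately optimizes is potential-based shaping (Ng, Harada and Russell, 1999), where the bonus has the form gamma * phi(s') - phi(s). The sketch below assumes a 1-D state and a hypothetical potential phi(s) = -|s - goal|. The names `potential`, `shaping_term`, and `total_reward` and the `scale` parameter are illustrative and not part of the original note.

```python
def potential(state: float, goal: float) -> float:
    """Hypothetical potential: less negative the closer the state is to the goal."""
    return -abs(state - goal)


def shaping_term(state: float, next_state: float, goal: float, gamma: float = 1.0) -> float:
    """Potential-based shaping bonus F = gamma * phi(s') - phi(s)."""
    return gamma * potential(next_state, goal) - potential(state, goal)


def total_reward(state: float, next_state: float, goal: float,
                 gamma: float = 1.0, scale: float = 0.1) -> float:
    """Sparse goal reward plus a scaled shaping bonus."""
    sparse = 1.0 if next_state == goal else 0.0
    return sparse + scale * shaping_term(state, next_state, goal, gamma)


# Walk 0 -> 1 -> 2 -> 1 -> 2 -> 3 toward goal = 3 on a 1-D line.
path = [0, 1, 2, 1, 2, 3]
rewards = [total_reward(s, s2, goal=3) for s, s2 in zip(path, path[1:])]
print(rewards)
```

With gamma = 1 the bonuses along any loop that returns to its starting state sum to zero, so the agent cannot accumulate shaping reward by moving back and forth. That is one guard against the reward hacking this note warns about.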
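Token economy (education / habit)

```python
class TokenEconomy:
    def __init__(self):
        self.tokens = 0

    def reinforce(self, behavior, weight=1):
        # Add tokens right after a desired behavior (positive reinforcement).
        # `behavior` is only a label here; it does not affect the amount.
        self.tokens += weight

    def redeem(self, cost, item):
        # Exchange tokens for an item; returns None if the balance is too low.
        if self.tokens >= cost:
            self.tokens -= cost
            return item
        return None
```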
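RLHF reward model (positive reinforcement at scale in modern LLMs)

```python
# Pseudocode: preference -> reward model -> PPO.
# policy.sample, kl, and ppo_step are placeholders, not a real library API.

def train_reward_model(prefs):
    # prefs: (chosen, rejected) pairs
    # trained with a log-sigmoid pairwise loss
    return ...


def ppo_update(policy, ref, rm, prompts):
    completions = policy.sample(prompts)
    # Reward-model score minus a KL penalty toward the reference policy.
    rewards = rm(prompts, completions) - kl(policy, ref)
    # Update the policy with the reward: positive reinforcement at scale.
    return ppo_step(policy, prompts, completions, rewards)
```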
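The "log-sigmoid pairwise loss" mentioned in the pseudocode can be sketched for a single preference pair in plain Python, without a deep learning framework. This is a minimal illustration: the function names are hypothetical, scalar floats stand in for model outputs, and `beta=0.1` is an arbitrary example value. The DPO function applies the same pairwise form to beta-scaled log-probability ratios against a reference policy, which is the standard form of the DPO objective for one pair.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, using beta-scaled log-ratios as implicit rewards."""
    implicit_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (logp_rejected - ref_logp_rejected)
    return pairwise_rm_loss(implicit_chosen, implicit_rejected)


print(pairwise_rm_loss(2.0, 0.0))  # smaller: the chosen answer already scores higher
print(pairwise_rm_loss(0.0, 2.0))  # larger: the ranking is wrong
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In both cases the loss falls as the preferred completion is scored higher than the rejected one, so training pushes up whatever the human labelers preferred.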
Decision criteria

| Situation | Approach |
|---|---|
| Fast acquisition of a behavior | Continuous reinforcement (CRF) |
| Maintaining a behavior, resistance to extinction | Variable Ratio (VR) |
| Time-based tasks | Fixed/Variable Interval |
| RL agent | Reward shaping + sparse goal reward |
| LLM alignment | RLHF / DPO (preference-based) |
| Education / habit | Token economy + variable bonus |
Default: CRF during the acquisition phase, VR during the maintenance phase. Prefer reinforcement over punishment.
When to use: RL agent reward design, LLM RLHF/DPO pipeline design, gamification UX, behavior change apps.
When not to use: in domains driven by intrinsic motivation (e.g., creative work), over-reinforcement can crowd out that motivation.
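❌ Anti-patterns
Reward hacking: the agent exploits the reward signal while ignoring the actual task (Goodhart's law). Design reward shaping carefully.
Confusing positive with "good": positive means a stimulus is added, not that the outcome is good. Punishment can be positive too.
Continuous reinforcement only: the behavior extinguishes quickly once rewards stop; switch to VR for maintenance.
Punishment as default: it induces fear and avoidance and degrades learning quality; prefer reinforcement.
Delayed reward without a bridging stimulus: the association between behavior and reward is weak; use a marker such as a clicker.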
🧪 Verification / duplicates
Verified (Skinner 1938 "The Behavior of Organisms", APA Dictionary, Sutton & Barto RL textbook, OpenAI RLHF papers).