Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

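Major cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: about 2,400 broken-link fixes, direct redirect links, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections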

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00



Positive Reinforcement

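In one line

"Adding a desirable stimulus immediately after a behavior increases the frequency of that behavior." This is the core mechanism of Skinner's operant conditioning (1938 onward). In modern AI it connects directly to the reward signal in RL, and it is the conceptual root of RLHF / Constitutional AI / DPO.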

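Key points

The four quadrants (Operant Conditioning)

	Stimulus added (positive)	Stimulus removed (negative)
Behavior increases (reinforcement)	Positive Reinforcement (praise, reward)	Negative Reinforcement (a loud noise stops)
Behavior decreases (punishment)	Positive Punishment (scolding)	Negative Punishment (loss of privileges)

Here "positive" means a stimulus is added and "negative" means a stimulus is removed. Neither word is a judgment of good or bad.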
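The quadrant table can be read as a two-key lookup. A minimal sketch follows; the function name `classify_consequence` and its two boolean parameters are hypothetical, introduced only to illustrate the table.

```python
def classify_consequence(stimulus_added: bool, behavior_increases: bool) -> str:
    """Name the operant-conditioning quadrant for one consequence."""
    # "Positive"/"Negative" only say whether a stimulus is added or removed
    sign = "Positive" if stimulus_added else "Negative"
    # "Reinforcement"/"Punishment" only say what happens to the behavior
    effect = "Reinforcement" if behavior_increases else "Punishment"
    return f"{sign} {effect}"


praise = classify_consequence(stimulus_added=True, behavior_increases=True)
noise_stops = classify_consequence(stimulus_added=False, behavior_increases=True)
scolding = classify_consequence(stimulus_added=True, behavior_increases=False)
privilege_lost = classify_consequence(stimulus_added=False, behavior_increases=False)
```

Keeping the two axes as separate booleans makes the common confusion visible: scolding is "positive" because something is added, even though it is a punishment.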

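Schedules of reinforcement

Continuous (CRF): reward after every response. Fast acquisition, fast extinction.
Fixed Ratio (FR): reward after every N responses, as in piecework pay.
Variable Ratio (VR): reward after an average of N responses, at unpredictable points, as in gambling or social media notifications. Produces the highest response rates and the strongest resistance to extinction.
Fixed Interval (FI): reward for the first response after N seconds have elapsed.
Variable Interval (VI): reward for the first response after a random interval averaging N seconds. Produces a steady response rate.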
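The two fixed schedules can be sketched as small state machines. The class names `FixedRatio` and `FixedInterval` and the `step` method are assumptions made for illustration; time is passed in explicitly as `now` so the example stays deterministic. CRF is the special case `FixedRatio(1)`.

```python
class FixedRatio:
    """FR-N: reward every n-th response."""

    def __init__(self, n: int):
        self.n = n
        self.count = 0

    def step(self) -> bool:
        self.count += 1
        if self.count >= self.n:
            self.count = 0
            return True   # reward
        return False


class FixedInterval:
    """FI-T: reward the first response made at least `interval` seconds after the last reward."""

    def __init__(self, interval: float):
        self.interval = interval
        self.last_reward_time = 0.0

    def step(self, now: float) -> bool:
        if now - self.last_reward_time >= self.interval:
            self.last_reward_time = now
            return True   # reward
        return False


fr = FixedRatio(3)
fr_pattern = [fr.step() for _ in range(6)]

fi = FixedInterval(10.0)
fi_pattern = [fi.step(t) for t in (2.0, 9.0, 11.0, 12.0, 21.0)]
```

On the fixed-interval schedule, responses made before the interval has elapsed earn nothing, so only the timing of the first late response matters.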

Connection to RL

The reward signal r_t is the mathematical formalization of positive reinforcement.
Policy gradient: increases the probability of actions that were followed by reward, which is exactly positive reinforcement.
RLHF: human preference → reward model → policy update, i.e. positive reinforcement at large scale.
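The claim that a rewarded action becomes more probable can be checked on the smallest possible case. The sketch below applies one REINFORCE-style update to softmax action preferences for a two-action problem (the gradient-bandit setting from Sutton & Barto). The function names, the learning rate of 0.5, and the zero baseline are illustrative assumptions.

```python
import math


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(prefs, action, reward, baseline=0.0, lr=0.5):
    """One policy-gradient step on softmax preferences."""
    probs = softmax(prefs)
    advantage = reward - baseline
    # d log pi(action) / d prefs[a] = 1[a == action] - pi(a)
    return [
        p + lr * advantage * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]


prefs = [0.0, 0.0]
before = softmax(prefs)[0]
prefs = reinforce_update(prefs, action=0, reward=1.0)
after = softmax(prefs)[0]
```

With equal starting preferences, `before` is 0.5 and a single rewarded choice of action 0 raises its probability, at the expense of the unchosen action.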

매 응용

Education (token economy, gamification).
Animal training (clicker training).
ABA therapy for autism.
Workplace incentive design.
App engagement (variable reward — Hooked Model).
RL agent training (game, robotics, LLM).

💻 Patterns

Policy gradient (REINFORCE) — positive reinforcement formalized

import torch
import torch.nn.functional as F

def reinforce_step(policy, optim, states, actions, rewards, gamma=0.99):
    # Discounted return G_t, accumulated backwards over the episode
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)
    # Normalize returns to reduce variance (needs an episode of at least 2 steps)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    logits = policy(torch.stack(states))
    logp = F.log_softmax(logits, dim=-1)
    # Log-probability of the action actually taken at each step
    chosen = logp.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()  # reward-weighted log-likelihood
    optim.zero_grad()
    loss.backward()
    optim.step()

Reward shaping (sparse → dense)

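def shaped_reward(state, next_state, goal):
    # progress = reduction in distance to the goal (positive when moving closer)
    progress = abs(state - goal) - abs(next_state - goal)
    # sparse goal reward of 1.0; otherwise a small dense signal:
    # a bonus for moving closer, a penalty for moving away
    return 1.0 if next_state == goal else 0.1 * progress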
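The progress term above is the difference of a potential Φ(s) = -|s - goal|. A closely related form is potential-based shaping (Ng, Harada and Russell, 1999), which adds γΦ(s') - Φ(s) to the environment reward and is known to leave the optimal policy unchanged. The sketch below assumes scalar states; the names `potential` and `potential_based_bonus` are hypothetical.

```python
def potential(state: float, goal: float) -> float:
    """Potential is higher (less negative) closer to the goal."""
    return -abs(state - goal)


def potential_based_bonus(state, next_state, goal, gamma=0.99):
    """Shaping term gamma * phi(s') - phi(s), added on top of the task reward."""
    return gamma * potential(next_state, goal) - potential(state, goal)


# With gamma = 1.0 the bonus is exactly the change in distance to the goal
toward = potential_based_bonus(3, 4, goal=5, gamma=1.0)
away = potential_based_bonus(3, 2, goal=5, gamma=1.0)
# Stepping forward and back again nets zero, so loops cannot farm the bonus
round_trip = (potential_based_bonus(3, 4, goal=5, gamma=1.0)
              + potential_based_bonus(4, 3, goal=5, gamma=1.0))
```

Because the bonuses cancel around any cycle, this form is one hedge against the reward hacking listed under anti-patterns.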

Variable ratio schedule simulator

import random

class VariableRatio:
    """Reward after a random number of responses that averages mean_n."""
    def __init__(self, mean_n=5):
        self.mean = mean_n
        self.count = 0
        self.target = self._draw()

    def _draw(self):
        # Uniform on 1..2*mean_n-1, so the average requirement is exactly mean_n
        # (assumes an integer mean_n >= 1)
        return random.randint(1, 2 * self.mean - 1)

    def step(self):
        self.count += 1
        if self.count >= self.target:
            self.count = 0
            self.target = self._draw()
            return True   # reward
        return False

Token economy (educational app)

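class TokenEconomy:
    def __init__(self):
        self.tokens = 0

    def reinforce(self, behavior, weight=1):
        # Add tokens immediately after the desired behavior (positive reinforcement);
        # `behavior` is accepted as a label only and is not stored
        self.tokens += weight

    def redeem(self, cost, item):
        # Exchange tokens for an item; returns None if the balance is too low
        if self.tokens >= cost:
            self.tokens -= cost
            return item
        return None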

RLHF reward model (modern LLM positive reinforcement at scale)

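# Pseudocode: preference -> reward model -> PPO (helper names are placeholders, not a real API)
def train_reward_model(prefs):     # prefs: (chosen, rejected) pairs
    # minimize -log sigmoid(r(chosen) - r(rejected)) over the pairs
    return ...
def ppo_update(policy, ref, rm, prompts, beta=0.1):  # beta: illustrative KL coefficient
    completions = policy.sample(prompts)
    # reward-model score minus a KL penalty that keeps the policy close to the reference
    rewards = rm(prompts, completions) - beta * kl(policy, ref)
    # update the policy with these rewards: positive reinforcement at scale
    return ppo_step(policy, prompts, completions, rewards)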
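The log-sigmoid pairwise loss named in the reward-model comment can be written out for a single preference pair. This is a sketch on plain floats rather than a training loop; the function name `pairwise_preference_loss` is hypothetical, and the two arguments stand for the scalar scores a reward model would assign to the chosen and rejected completions.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


tie = pairwise_preference_loss(0.0, 0.0)        # model cannot tell the pair apart
correct = pairwise_preference_loss(2.0, 0.0)    # chosen scored higher
inverted = pairwise_preference_loss(0.0, 2.0)   # rejected scored higher
```

The loss shrinks as the chosen completion is scored further above the rejected one, which is how human preferences are turned into a scalar reward signal.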

Decision criteria

Situation	Approach
Fast acquisition of a behavior	Continuous reinforcement (CRF)
Maintaining a behavior + resistance to extinction	Variable Ratio (VR)
Time-based tasks	Fixed/Variable Interval
RL agent	Reward shaping + sparse goal reward
LLM alignment	RLHF / DPO (preference-based)
Education / habit	Token economy + variable bonus

Default: CRF during the learning phase, VR during the maintenance phase. Prefer reinforcement over punishment.
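The default of CRF first, then VR, can be sketched as one schedule object that switches phase after a fixed number of responses. The class name, the `acquisition_responses` cutoff, and the uniform draw for the ratio are all assumptions for illustration; in practice the switch would depend on how reliably the behavior is already occurring.

```python
import random


class AcquisitionThenMaintenance:
    """Reward every response at first (CRF), then switch to a variable-ratio schedule."""

    def __init__(self, acquisition_responses=20, mean_ratio=5, rng=None):
        self.remaining_crf = acquisition_responses
        self.mean_ratio = mean_ratio
        self.rng = rng or random.Random()
        self.count = 0
        self.target = self._draw()

    def _draw(self) -> int:
        # Uniform on 1..2*mean_ratio-1, so the average requirement equals mean_ratio
        return self.rng.randint(1, 2 * self.mean_ratio - 1)

    def step(self) -> bool:
        if self.remaining_crf > 0:       # learning phase: continuous reinforcement
            self.remaining_crf -= 1
            return True
        self.count += 1                  # maintenance phase: variable ratio
        if self.count >= self.target:
            self.count = 0
            self.target = self._draw()
            return True
        return False


schedule = AcquisitionThenMaintenance(acquisition_responses=10, mean_ratio=5,
                                      rng=random.Random(0))
outcomes = [schedule.step() for _ in range(2010)]
crf_rate = sum(outcomes[:10]) / 10
vr_rate = sum(outcomes[10:]) / 2000
```

During the first phase every response is rewarded; afterwards the long-run reward rate settles near one in `mean_ratio` responses while individual rewards stay unpredictable.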

🔗 Graph

Parent: Operant_Conditioning
Applications: Reinforcement_Learning · RLHF · Gamification
Adjacent: Policy_Gradient

🤖 LLM usage

When to use: RL agent reward design, LLM RLHF/DPO pipeline design, gamification UX, behavior-change apps. When not to use: in domains driven by intrinsic motivation (such as creative work), over-reinforcement can crowd out that motivation.

❌ Anti-patterns

Reward hacking: the agent exploits the reward signal while ignoring the actual task (Goodhart's law). Design reward shaping carefully.
Confusing positive with "good": positive means a stimulus is added, not that the outcome is good. Punishment can also be positive.
Continuous reinforcement only: the behavior extinguishes quickly once rewards stop, so a transition to VR is needed.
Punishment as default: induces fear and avoidance and degrades learning quality. Prefer reinforcement.
Delayed reward without bridging stimulus: the association is weak. Use a marker such as a clicker.

🧪 Verification / duplicates

Verified against Skinner (1938), 'The Behavior of Organisms'; the APA Dictionary; the Sutton & Barto RL textbook; and OpenAI RLHF papers.
Reliability: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — operant conditioning quadrants + RL/RLHF connection
