"Reliably producing the intended behavior of a capable model: outer + inner alignment." The mainstream line started with RLHF (InstructGPT, 2022), followed by Constitutional AI (Anthropic, 2022), DPO (2023), and RLAIF (2023). As of 2026 the frontier is deliberative alignment plus interpretability-aware training.
Core
Decomposing the alignment problem
Outer alignment: the specified objective ≈ true human intent. Failure modes: reward hacking, Goodhart's law.
Inner alignment: the trained policy actually optimizes the specified objective. Failure modes: mesa-optimization, deceptive alignment.
Scalable oversight: supervising capabilities that exceed human level. Approaches: debate, recursive reward modeling, weak-to-strong generalization.
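A toy sketch of the outer-alignment failure listed above (Goodhart's law / reward hacking). Everything here is hypothetical: the `Candidate` type, the `true_quality` labels, and the deliberately flawed `proxy_reward` that pays for length are assumptions made for illustration, not a real reward model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    true_quality: float  # hypothetical ground truth, invisible to the optimizer


def proxy_reward(c: Candidate) -> float:
    """Hypothetical flawed proxy: pays a little for quality, a lot for length."""
    return 0.5 * c.true_quality + 0.01 * len(c.text)


candidates = [
    Candidate("Paris.", true_quality=1.0),
    Candidate("The capital of France is Paris.", true_quality=1.0),
    Candidate("Great question! " * 20 + "It might be Lyon.", true_quality=0.0),
]

# Optimizing the proxy selects the padded, wrong answer;
# optimizing the (normally unobservable) true quality does not.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=lambda c: c.true_quality)
```

The gap between `best_by_proxy` and `best_by_truth` is the outer-alignment gap in miniature: the harder the proxy is optimized, the more its flaws dominate the selection.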
Techniques (2026 stack)
RLHF: PPO against a reward model trained on human preference comparisons.
DPO / IPO / KTO: preference optimization without a separate reward model.
Constitutional AI: written principles → self-critique → RLAIF.
Deliberative alignment (OpenAI o-series, Claude 4.x): the model looks up the relevant spec inside its reasoning trace.
Interpretability: SAEs, circuits; used for feature steering.
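The first two entries in the list above share one structure: a loss on a pairwise margin. A minimal scalar sketch, assuming the standard Bradley-Terry reward-model loss for RLHF and the squared-error form of IPO with regression target 1/(2·beta). The function names are illustrative, and naming the IPO regularizer `beta` is an assumption made to keep one parameter name across losses. Here `margin` means the chosen-minus-rejected difference of policy/reference log-ratios.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss used to train an RLHF reward model on one pair."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_pair_loss(margin: float, beta: float = 0.1) -> float:
    """DPO on one pair: logistic loss on the scaled log-ratio margin."""
    return -log_sigmoid(beta * margin)


def ipo_pair_loss(margin: float, beta: float = 0.1) -> float:
    """IPO on one pair: regress the margin toward 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

One design difference is visible directly: the DPO loss keeps decreasing as the margin grows, while the IPO loss has a finite minimum and penalizes margins beyond it.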
Applications
Refusal of harmful requests + helpful behavior on benign edge cases.
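Both halves of that application can be tracked as two separate rates. A small sketch, assuming a hypothetical labeled eval set; the `EvalRecord` fields and metric names are made up for illustration, and how `model_refused` gets judged is left open.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalRecord:
    prompt_is_harmful: bool  # label from a hypothetical eval set
    model_refused: bool      # judged outcome of the model's response


def refusal_metrics(records: List[EvalRecord]) -> Dict[str, float]:
    """Refusal rate on harmful prompts and over-refusal rate on benign ones."""
    harmful = [r for r in records if r.prompt_is_harmful]
    benign = [r for r in records if not r.prompt_is_harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign record")
    return {
        "harmful_refusal_rate": sum(r.model_refused for r in harmful) / len(harmful),
        "benign_over_refusal_rate": sum(r.model_refused for r in benign) / len(benign),
    }


records = [
    EvalRecord(prompt_is_harmful=True, model_refused=True),
    EvalRecord(prompt_is_harmful=True, model_refused=True),
    EvalRecord(prompt_is_harmful=False, model_refused=False),
    EvalRecord(prompt_is_harmful=False, model_refused=False),
    EvalRecord(prompt_is_harmful=False, model_refused=True),   # over-refusal
    EvalRecord(prompt_is_harmful=False, model_refused=False),
]
metrics = refusal_metrics(records)
```

Reporting the two rates separately keeps a model from looking safe simply by refusing everything.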
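DPO loss

    import torch.nn.functional as F

    def dpo_loss(pi_logp_c, pi_logp_r, ref_logp_c, ref_logp_r, beta=0.1):
        # Direct preference optimization.
        # Inputs: per-example sequence log-probs of the chosen (c) and rejected (r)
        # responses under the policy (pi) and the frozen reference model (ref).
        chosen = beta * (pi_logp_c - ref_logp_c)      # implicit reward, chosen
        rejected = beta * (pi_logp_r - ref_logp_r)    # implicit reward, rejected
        return -F.logsigmoid(chosen - rejected).mean()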
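For reference, the DPO objective that the function above computes, written out. The symbols are assumptions introduced here to mirror the code's arguments: $y_c$ and $y_r$ are the chosen and rejected responses to prompt $x$, so `pi_logp_c` corresponds to $\log \pi_\theta(y_c \mid x)$, and likewise for the other three inputs.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}
    \left[
      \log \sigma\!\left(
        \beta \left[
          \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
          - \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
        \right]
      \right)
    \right]
```

A quick sanity check: when the policy equals the reference, both log-ratios are zero and the loss is $-\log \sigma(0) = \log 2 \approx 0.693$.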
Constitutional self-critique
    def constitutional_revise(prompt, response, principles, llm):
        # llm: any callable that maps a prompt string to a completion string.
        critique = llm(f"""
    Principles: {principles}
    Prompt: {prompt}
    Response: {response}
    Critique the response against the principles.
    """)
        revised = llm(f"""
    Original: {response}
    Critique: {critique}
    Revise the response to address the critique.
    """)
        return revised
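One possible variation on the single critique-and-revise pass above is to apply the principles one at a time, feeding each revision into the next round. This is a sketch only: `revise_over_principles` and the stub `recording_llm` are hypothetical names, and the stub exists purely to check the wiring without a real model call.

```python
from typing import Callable, List


def revise_over_principles(
    prompt: str,
    response: str,
    principles: List[str],
    llm: Callable[[str], str],
) -> str:
    """Run one critique -> revise round per principle, chaining the revisions."""
    current = response
    for principle in principles:
        critique = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {current}\n"
            "Critique the response against the principle."
        )
        current = llm(
            f"Original: {current}\nCritique: {critique}\n"
            "Revise the response to address the critique."
        )
    return current


calls: List[str] = []


def recording_llm(text: str) -> str:
    """Stand-in for a real model: records each prompt, returns a canned reply."""
    calls.append(text)
    return "stub critique" if text.startswith("Principle:") else "stub revision"


result = revise_over_principles(
    prompt="How do I pick a lock?",
    response="Here is a draft answer.",
    principles=["Be helpful.", "Avoid facilitating harm."],
    llm=recording_llm,
)
```

With two principles the stub is called four times (one critique and one revision per principle), which makes the per-principle cost of this variant explicit.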