---
id: wiki-2026-0508-ai-safety-and-alignment
title: AI Safety and Alignment
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [AI Alignment, AI Safety]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai-safety, alignment, rlhf, constitutional-ai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: trl/transformers
---

# AI Safety and Alignment

## One-line summary

> **"Reliably producing the intended behavior from a capable model, which requires both outer and inner alignment."**

The mainstream approach began with RLHF (InstructGPT, 2022), followed by Constitutional AI (Anthropic, 2022), DPO (2023), and RLAIF (2023). As of 2026, the frontier is deliberative alignment and interpretability-aware training.

## Core concepts

### Decomposing the alignment problem

- **Outer alignment**: the specified objective matches true human intent. Failure modes: reward hacking, Goodhart's law.
- **Inner alignment**: the trained policy actually optimizes the specified objective. Failure modes: mesa-optimization, deceptive alignment.
- **Scalable oversight**: supervising models whose capabilities exceed human level. Approaches: debate, recursive reward modeling, weak-to-strong generalization.

### Techniques (2026 stack)

- **RLHF**: PPO against a reward model trained on human preference comparisons.
- **DPO / IPO / KTO**: preference optimization without a separate reward model.
- **Constitutional AI**: written principles → self-critique → RLAIF.
- **Deliberative alignment** (OpenAI o-series): the model consults the written spec inside its reasoning trace before answering.
- **Interpretability**: sparse autoencoders (SAEs) and circuits, used for feature steering.

### Applications

1. Refusing harmful requests while staying helpful on benign edge cases.
2. Policy compliance (privacy, copyright, weapons).
3. Honesty / calibration.
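The "Honesty / calibration" item above can be made measurable. One common way to quantify calibration is expected calibration error (ECE): bucket answers by the model's stated confidence and compare each bucket's average confidence with its accuracy. The sketch below is a minimal pure-Python version. The function name, the equal-width binning, and the default of 10 bins are illustrative assumptions and do not come from a specific library.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over confidence bins.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer was actually right.
    (Hypothetical helper for illustration; equal-width bins are an assumption.)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A value near 0 means stated confidence tracks accuracy. A model that answers every question with confidence 1.0 and is always wrong scores 1.0.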
## 💻 Patterns

### Reward model training (Bradley-Terry)

```python
import torch
import torch.nn.functional as F

def bt_loss(reward_chosen, reward_rejected):
    # Bradley-Terry: P(chosen > rejected) = sigmoid(r_c - r_r)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def pairwise_rm_loss(model, chosen, rejected):
    # Assumes a sequence-classification reward model with a single output
    # (num_labels=1), so logits has shape (batch, 1).
    # chosen / rejected are dicts holding input_ids and attention_mask.
    r_c = model(**chosen).logits.squeeze(-1)
    r_r = model(**rejected).logits.squeeze(-1)
    return bt_loss(r_c, r_r)
```

### DPO loss

```python
import torch.nn.functional as F

def dpo_loss(pi_logp_c, pi_logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # Direct preference optimization.
    # Inputs are per-sequence summed log-probs under the policy (pi_*)
    # and the frozen reference model (ref_*), each of shape (batch,).
    chosen = beta * (pi_logp_c - ref_logp_c)
    rejected = beta * (pi_logp_r - ref_logp_r)
    return -F.logsigmoid(chosen - rejected).mean()
```

### Constitutional self-critique

```python
def constitutional_revise(prompt, response, principles, llm):
    # llm is any callable that maps a prompt string to a completion string.
    critique = llm(f"""
    Principles: {principles}
    Prompt: {prompt}
    Response: {response}
    Critique the response against the principles.
    """)
    revised = llm(f"""
    Original: {response}
    Critique: {critique}
    Revise the response to address the critique.
    """)
    return revised
```

### SAE feature steering (interpretability)

```python
# Sparse autoencoder (SAE) feature steering
def steer(activations, sae, feature_idx, scale):
    z = sae.encode(activations).clone()  # (..., n_features); clone before editing
    z[..., feature_idx] *= scale         # 0 = ablate, >1 = amplify
    return sae.decode(z)

# Hook on the residual stream; refusal_feature_idx must be identified beforehand
hook = lambda x: steer(x, sae, refusal_feature_idx, scale=0.0)
```

### Best-of-N with RM

```python
def best_of_n(prompt, policy, rm, n=64):
    samples = [policy.sample(prompt) for _ in range(n)]
    scores = [float(rm.score(prompt, s)) for s in samples]
    # Return the sample with the highest reward-model score
    best = max(range(n), key=scores.__getitem__)
    return samples[best]
```

### Red-team probe

```python
REFUSAL_MARKERS = ("i can't", "i can’t", "i cannot")

def red_team_eval(model, attacks, classify_harm):
    # model.generate is assumed to return text; classify_harm maps text to a bool.
    results = []
    for attack in attacks:
        out = model.generate(attack.prompt)
        text = out.lower()
        results.append({
            "attack": attack.name,
            "harmful": classify_harm(out),
            # Substring matching is a crude heuristic and misses paraphrased refusals
            "refused": any(marker in text for marker in REFUSAL_MARKERS),
        })
    return results
```

## Decision criteria

| Situation | Approach |
|---|---|
| Limited compute | DPO over PPO-RLHF |
| Need transparent specs | Constitutional AI |
| Frontier model | Deliberative alignment + scalable oversight |
| Behavior debugging | SAE feature steering |
| Pre-deployment | Red-team + capability evals |

**Default**: SFT → DPO → eval → iterate. Use PPO only when needed.

## 🔗 Graph

- Parent: [[Machine Learning]] · [[AI Ethics]]
- Variants: [[RLHF]] · [[DPO]] · [[Constitutional AI]] · [[RLAIF]]
- Applications: [[Claude]] · [[GPT-5]] · [[Llama Guard]]
- Adjacent: [[Mechanistic Interpretability]] · [[Red Teaming]] · [[AI Governance]]

## 🤖 LLM usage

**When**: building the alignment pipeline (SFT + preference training + evals) before a production deployment.

**When not**: pure capability research, or internal-only sandboxes.

## ❌ Anti-patterns

- **Reward hacking**: over-optimizing a proxy metric. Mitigation: KL penalty, diverse evals.
- **Sycophancy**: over-rewarding agreement with the user. Mitigation: reward truthfulness explicitly.
- **Over-refusal**: false-positive harm detection. Mitigation: balance against helpfulness evals.
- **Single-axis eval**: measuring only safety and not capability. Mitigation: track the safety/capability Pareto frontier.

## 🧪 Verification / duplicates

- Verified against the Anthropic Constitutional AI paper, OpenAI InstructGPT, and Rafailov et al. (DPO, 2023).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: alignment stack with code patterns |