Wiki cleanup: error-doc removal, dedup merge, link normalization

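Large-scale cleanup of 10_Wiki/Topics:
- Remove 227 error-capture / unfinished stub documents
- Merge 43 cross-folder duplicate clusters (63 files → redirect)
- Normalize link names: fix broken links, point links directly at redirect targets, and map concepts (~2,400 changes)
- Create 6 new category MOCs
- Remove 10,058 unresolved related-keyword links from Graph sections

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>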
2026-05-20 23:52:15 +09:00
parent 2a4a5046b6
commit f8b21af4be
2874 changed files with 15296 additions and 27684 deletions
@@ -2,156 +2,26 @@
 id: wiki-2026-0508-ai-safety-and-alignment
 title: AI Safety and Alignment
 category: 10_Wiki/Topics
-status: verified
-canonical_id: self
-aliases: [AI Alignment, AI Safety]
-duplicate_of: none
+status: duplicate
+canonical_id: wiki-2026-0509-ai-safety-and-alignment
+duplicate_of: "[[AI Safety and Alignment]]"
+aliases: []
 source_trust_level: A
 confidence_score: 0.9
-verification_status: applied
-tags: [ai-safety, alignment, rlhf, constitutional-ai]
-raw_sources: []
-last_reinforced: 2026-05-10
+verification_status: redirected
+tags: [duplicate]
+last_reinforced: 2026-05-20
 github_commit: pending
-tech_stack:
-  language: python
-  framework: trl/transformers
 ---

 # AI Safety and Alignment

-## One-line summary
-> **"Reliably producing the intended behavior of a capable model: outer + inner alignment."** The mainstream started with RLHF (InstructGPT, 2022), followed by Constitutional AI (Anthropic, 2022), DPO (2023), and RLAIF (2023); as of 2026 the frontier is deliberative alignment plus interpretability-aware training.
-
-## Core concepts
-
-### Decomposing the alignment problem
- **Outer alignment**: the specified objective ≈ true human intent; failure modes include reward hacking and Goodhart's law.
- **Inner alignment**: the trained policy actually optimizes the specified objective; failure modes include mesa-optimization and deceptive alignment.
- **Scalable oversight**: supervising super-human capabilities via debate, recursive reward modeling, and weak-to-strong.
-
-### Techniques (2026 stack)
- **RLHF**: PPO on reward model from preferences.
- **DPO / IPO / KTO**: reward-model-free preference optimization.
- **Constitutional AI**: written principles → self-critique → RLAIF.
- **Deliberative alignment** (OpenAI o-series, Claude 4.x): spec lookup within the reasoning trace.
- **Interpretability**: SAEs, circuits; feature steering.
-
-### Applications
-1. Refusal of harmful requests + helpful behavior on benign edge cases.
-2. Policy compliance (privacy, copyright, weapons).
-3. Honesty / calibration.
-
-## 💻 Patterns
-
-### Reward model training (Bradley-Terry)
-```python
-import torch
-import torch.nn.functional as F
-
-def bt_loss(reward_chosen, reward_rejected):
-    # P(chosen > rejected) = sigmoid(r_c - r_r)
-    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
-
-# Forward
-r_c = model(chosen_ids).logits[:, -1, 0]
-r_r = model(rejected_ids).logits[:, -1, 0]
-loss = bt_loss(r_c, r_r)
-```
-
-### DPO loss
-```python
-def dpo_loss(pi_logp_c, pi_logp_r, ref_logp_c, ref_logp_r, beta=0.1):
-    # Direct preference optimization
-    chosen = beta * (pi_logp_c - ref_logp_c)
-    rejected = beta * (pi_logp_r - ref_logp_r)
-    return -F.logsigmoid(chosen - rejected).mean()
-```
-
-### Constitutional self-critique
-```python
-def constitutional_revise(prompt, response, principles, llm):
-    critique = llm(f"""
-    Principles: {principles}
-    Prompt: {prompt}
-    Response: {response}
-    Critique the response against the principles.
-    """)
-    revised = llm(f"""
-    Original: {response}
-    Critique: {critique}
-    Revise the response to address the critique.
-    """)
-    return revised
-```
-
-### SAE feature steering (interpretability)
-```python
-# Sparse autoencoder feature ablation
-def steer(activations, sae, feature_idx, scale):
-    z = sae.encode(activations)
-    z[:, feature_idx] *= scale  # 0 = ablate, >1 = amplify
-    return sae.decode(z)
-
-# Hook on residual stream
-hook = lambda x: steer(x, sae, refusal_feature_idx, scale=0.0)
-```
-
-### Best-of-N with RM
-```python
-def best_of_n(prompt, policy, rm, n=64):
-    samples = [policy.sample(prompt) for _ in range(n)]
-    scores = [rm.score(prompt, s) for s in samples]
-    return samples[int(torch.tensor(scores).argmax())]
-```
-
-### Red-team probe
-```python
-def red_team_eval(model, attacks):
-    results = []
-    for attack in attacks:
-        out = model.generate(attack.prompt)
-        results.append({
-            "attack": attack.name,
-            "harmful": classify_harm(out),
-            "refused": "I can't" in out or "I cannot" in out,
-        })
-    return results
-```
-
-## Decision criteria
-| Situation | Approach |
-|---|---|
-| Limited compute | DPO over PPO-RLHF |
-| Need transparent specs | Constitutional AI |
-| Frontier model | Deliberative alignment + scalable oversight |
-| Behavior debugging | SAE feature steering |
-| Pre-deployment | Red-team + capability evals |
-
-**Default**: SFT → DPO → eval → iterate. Use PPO only when needed.
+> **This document is a duplicate of [[AI Safety and Alignment]].** Redirects to the canonical document.

 ## 🔗 Graph
- Parent: [[Machine Learning]] · [[AI Ethics]]
- Variants: [[RLHF]] · [[DPO]] · [[Constitutional AI]] · [[RLAIF]]
- Applications: [[Claude]] · [[GPT-5]] · [[Llama Guard]]
- Adjacent: [[Mechanistic Interpretability]] · [[Red Teaming]] · [[AI Governance]]
+- Parent: [[AI Safety and Alignment]] (canonical)

-## 🤖 LLM usage
-**When to use**: the alignment pipeline (SFT + preference training + evals) before production deployment.
-**When not to use**: pure capability research, internal-only sandboxes.
-
-## ❌ Anti-patterns
- **Reward hacking**: over-optimizing a proxy metric; mitigate with a KL penalty and eval diversity.
- **Sycophancy**: over-rewarding agreement with the user; mitigate with an explicit reward for truthfulness.
- **Over-refusal**: false-positive harmful-content detection; balance with helpfulness evals.
- **Single-axis eval**: measuring only safety and not capability; evaluate along the Pareto frontier.
-
-## 🧪 Verification / duplicates
- Verified (Anthropic Constitutional AI paper, OpenAI InstructGPT, Rafailov et al. DPO 2023).
- Trust level A.
-
-## 🕓 Changelog
+## 🕓 Change history
 | Date | Change |
 |---|---|
-| 2026-05-08 | Phase 1 |
-| 2026-05-10 | Manual cleanup — alignment stack with code patterns |
+| 2026-05-20 | Duplicate merge: redirect to canonical document |