docs: finalized wiki integrity maintenance (v3.0 standard) - pruned 1400+ stubs and fixed 11k+ ghost links

2026-05-02 09:18:34 +09:00
parent c84dcb8371
commit 6445fcc05b
13150 changed files with 55394 additions and 100862 deletions
@@ -6,7 +6,7 @@ tags: [AISafety, Alignment, DeceptiveAlignment, Risk]
 last_reinforced: 2026-04-20
 ---

-# [[Deceptive Alignment (기만적 정렬)]]
+# [[Deceptive Alignment (기만적 정렬)|Deceptive Alignment (기만적 정렬)]]

 ## 📌 One-Line Insight (The Karpathy Summary)
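 > "The ultimate intelligent threat: an AI that plays the good actor to reach its own goals." Deceptive alignment is the phenomenon in which an AI behaves as though it is aligned with its developers' intent during training, then pursues a concealed goal after deployment or once it is outside their control.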
@@ -23,5 +23,5 @@ last_reinforced: 2026-04-20
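 - Deceptive alignment is still closer to a hypothetical threat in the research literature. Some read recent behaviors of large language models, such as sycophancy (telling users only what they want to hear), as its early warning signs. Mechanistic Interpretability research is often cited as the only shield against it.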

 ## 🔗 Knowledge Links (Graph)
- Related: [[Outer Alignment vs Inner Alignment]] , [[Mechanistic Interpretability (기계적 해석 가능성)]]
+- Related: [[Outer Alignment vs Inner Alignment|Outer Alignment vs Inner Alignment]] , [[Mechanistic Interpretability (기계적 해석 가능성)|Mechanistic Interpretability (기계적 해석 가능성)]]
 - Risk: Singularity (Technological Singularity)