--- id: ALIGN-001 category: Unified confidence_score: 1.0 tags: [ai-safety, [[Alignment|Alignment]], rlhf, ai-ethics, [[Trustworthy-AI|Trustworthy-AI]]] last_reinforced: 2026-04-26 --- # AI Alignment (AI μ •λ ¬) ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "AI의 λͺ©ν‘œμ™€ 인λ₯˜μ˜ κ°€μΉ˜λ₯Ό ν•œ λ°©ν–₯으둜 μΌμΉ˜μ‹œμΌœλΌ" β€” κ³ λ„λ‘œ λ°œλ‹¬ν•œ AI μ‹œμŠ€ν…œμ΄ μΈκ°„μ˜ μ˜λ„μ™€ μ•ˆμ „, 윀리적 기쀀을 λ²—μ–΄λ‚˜μ§€ μ•Šκ³  μΈκ°„μ—κ²Œ μœ μ΅ν•œ λ°©ν–₯으둜 ν–‰λ™ν•˜λ„λ‘ 보μž₯ν•˜λŠ” 기술적 연ꡬ λΆ„μ•Ό. ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) - **μΆ”μΆœλœ νŒ¨ν„΄:** λͺ¨λΈμ΄ μˆ˜ν–‰ν•˜λŠ” μ΅œμ ν™” λͺ©ν‘œ(Objective Function)κ°€ 인간이 μ‹€μ œλ‘œ λ°”λΌλŠ” 결과와 μΌμΉ˜ν•˜λ„λ‘ 보상 ν•¨μˆ˜μ™€ ν•™μŠ΅ 데이터λ₯Ό μ„Έλ°€ν•˜κ²Œ μ‘°μ •ν•˜λŠ” μ •λ ¬ νŒ¨ν„΄. - **핡심 과제:** - **Outer Alignment:** 보상 ν•¨μˆ˜ 자체λ₯Ό μΈκ°„μ˜ μ˜λ„μ— 맞게 μ •ν™•νžˆ μ„€κ³„ν•˜λŠ” 문제. - **Inner Alignment:** λͺ¨λΈμ΄ ν•™μŠ΅ κ³Όμ •μ—μ„œ κ°œλ°œμžλ„ μ˜ˆμƒμΉ˜ λͺ»ν•œ 잘λͺ»λœ λ‚΄λΆ€ λͺ©ν‘œ(예: 전원 꺼짐 νšŒν”Ό)λ₯Ό κ°–μ§€ μ•Šλ„λ‘ μ œμ–΄ν•˜λŠ” 문제. - **Scalable Oversight:** 인간이 직접 ν‰κ°€ν•˜κΈ° μ–΄λ €μš΄ λ³΅μž‘ν•œ νƒœμŠ€ν¬λ₯Ό AIκ°€ μˆ˜ν–‰ν•  λ•Œ μ–΄λ–»κ²Œ μ •λ ¬ μƒνƒœλ₯Ό κ°μ‹œν•  것인가. - **μ£Όμš” 기법:** RLHF, RLAIF (AI ν”Όλ“œλ°±μ„ ν†΅ν•œ μ •λ ¬), ν—Œλ²•μ  AI (Constitutional AI). ## ⚠️ λͺ¨μˆœ 및 μ—…λ°μ΄νŠΈ (Contradictions & RL Update) - **κ³Όκ±° λ°μ΄ν„°μ™€μ˜ 좩돌:** λ‹¨μˆœνžˆ 'λ‚˜μœ 말 μ•ˆ ν•˜κΈ°' μˆ˜μ€€μ˜ ν•„ν„°λ§μ—μ„œ, μ΄ˆμ§€λŠ₯(Superintelligence) λ‹¨κ³„μ—μ„œμ˜ ν†΅μ œ κ°€λŠ₯μ„±κ³Ό 인λ₯˜ 생쑴 문제둜 λ…Όμ˜κ°€ 심화됨. - **μ •μ±… λ³€ν™”:** Antigravity ν”„λ‘œμ νŠΈλŠ” λͺ¨λ“  μ—μ΄μ „νŠΈμ˜ μŠ€ν‚¬ 섀계 μ‹œ '인간 쀑심적 κ°€μΉ˜'λ₯Ό μ΅œμš°μ„  μˆœμœ„λ‘œ 두며, 정기적인 Alignment Audit(μ •λ ¬ 감사)을 톡해 μ—μ΄μ „νŠΈμ˜ 거동을 점검함. ## πŸ”— 지식 μ—°κ²° (Graph) - [[Reinforcement-Learning-from-Human-Feedback-RLHF|Reinforcement-Learning-from-Human-Feedback-RLHF]], [[Trustworthy-AI|Trustworthy-AI]], AI-Safety, [[AGI|AGI]] - **Raw Source:** 10_Wiki/Topics/AI/AI-Alignment.md