--- id: [[P-Reinforce|P-Reinforce]]-AI-SAFETY category: Unified confidence_score: 1.0 tags: [[AI Safety|[AI Safety]], [[Alignment|Alignment]], Risk [[Management|Management]], AI Ethics] last_reinforced: 2026-04-20 --- # AI-Safety (AI μ•ˆμ „) ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "브레이크 μ—†λŠ” κΈ°μ°¨λŠ” μž¬μ•™μ΄λ‹€." 인간보닀 κ°•λ ₯ν•œ μ§€λŠ₯이 νƒ„μƒν–ˆμ„ λ•Œ, κ·Έ μ§€λŠ₯이 μΈκ°„μ˜ λͺ©ν‘œμ™€ λ¬Έλͺ…을 νŒŒκ΄΄ν•˜μ§€ μ•Šλ„λ‘ 기술적/방어적 λ³΄ν˜Έλ§‰μ„ κ΅¬μΆ•ν•˜λŠ” κ°€μž₯ μ‹œκΈ‰ν•œ 연ꡬ λΆ„μ•Όλ‹€. ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) - **[[Robustness|Robustness]]**: - μ λŒ€μ  곡격(Adversarial Attack)μ΄λ‚˜ 처음 λ³΄λŠ” 돌발 μƒν™©μ—μ„œλ„ AIκ°€ μ˜€μž‘λ™ν•˜μ§€ μ•Šκ³  μ•ˆμ „ν•˜κ²Œ κ΄€λ¦¬λ˜λŠ” μ„±μ§ˆ. - **[[Interpretability|Interpretability]]**: - μ‹ κ²½λ§μ΄λΌλŠ” λΈ”λž™λ°•μŠ€ λ‚΄λΆ€μ—μ„œ μ–΄λ–€ 논리 ꡬ쑰둜 νŒλ‹¨μ„ λ‚΄λ¦¬λŠ”μ§€ 인간이 읽을 수 있게 μ‹œκ°ν™”ν•˜κ³  λΆ„μ„ν•˜λŠ” 기술(Mechanistic Interpretability). - **Scalable Oversight**: - 인간이 μ΄ν•΄ν•˜κΈ° νž˜λ“  λ³΅μž‘ν•œ μ§€λŠ₯을 κ°€μ§„ AIλ₯Ό λ‹€λ₯Έ AIκ°€ κ°μ‹œν•˜κ²Œ ν•˜μ—¬, μΈκ°„μ˜ ν†΅μ œλ ₯을 μžƒμ§€ μ•Šκ²Œ ν•˜λŠ” κ°μ‹œ 체계. ## ⚠️ λͺ¨μˆœ 및 μ—…λ°μ΄νŠΈ (RL Update) - AI μ•ˆμ „μ€ μ’…μ’… λͺ¨λΈμ˜ μ„±λŠ₯ λ°œμ „μ„ λŠ¦μΆ˜λ‹€λŠ” λΉ„νŒμ„ λ°›λŠ”λ‹€. κ·ΈλŸ¬λ‚˜ 졜근 연ꡬ에 λ”°λ₯΄λ©΄, μ•ˆμ „ν•˜κ²Œ μ„€κ³„λœ λͺ¨λΈ(Aligned model)이 μ •μ œλœ 사고 λŠ₯λ ₯ 덕뢄에 μ‹€μ œ 싀무 μ„±λŠ₯도 더 λ†’κ²Œ λ‚˜νƒ€λ‚˜λŠ” 'λ³΄μ•ˆ-μ„±λŠ₯ μ‹œλ„ˆμ§€'κ°€ ν™•μΈλ˜κ³  μžˆλ‹€. ## πŸ”— 지식 μ—°κ²° (Graph) - Related: [[AI-Alignment|AI-Alignment]] , AI-Governance - [[Strategy|Strategy]]: [[Reliability_Safety_First|Reliability_Safety_First]]