--- id: [[P-Reinforce|P-Reinforce]]-AUTO-SAFE-001 category: Unified confidence_score: 0.98 tags: [auto-reinforced, ai-safety, constitutional-ai, alignment, anthropic, ethics] last_reinforced: 2026-05-04 --- # [[AI Safety & Constitutional AI|AI Safety & Constitutional AI]] ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "양심을 κ°€μ§„ 기계: μΈκ°„μ˜ 일일이 κ°œμž…ν•˜λŠ” μž”μ†Œλ¦¬ λŒ€μ‹ , 'ν—Œλ²•'이라 λΆˆλ¦¬λŠ” 핡심 원칙듀을 λͺ¨λΈ 슀슀둜 λ‚΄λ©΄ν™”ν•˜κ²Œ ν•˜μ—¬ μœ ν•΄μ„±μ„ κ±ΈλŸ¬λ‚΄κ³  인λ₯˜μ˜ κ°€μΉ˜μ— μ •λ ¬μ‹œν‚€λŠ” μ‹œμŠ€ν…œμ  윀리 κ°€λ“œλ ˆμΌ." ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) AI μ•ˆμ „(Safety)은 λͺ¨λΈμ΄ 인λ₯˜μ—κ²Œ ν•΄λ₯Ό λΌμΉ˜μ§€ μ•Šλ„λ‘ ν†΅μ œν•˜λŠ” 기술이며, Constitutional AI(ν—Œλ²•μ  AI)λŠ” 이λ₯Ό μ‹€ν˜„ν•˜λŠ” κ°€μž₯ μ§„λ³΄λœ 방법둠 쀑 ν•˜λ‚˜μž…λ‹ˆλ‹€. 1. **Constitutional AI (μ•€μŠ€λ‘œν”½)**: * **원리**: 인간이 λͺ¨λ“  닡변을 ν‰κ°€ν•˜λŠ” λŒ€μ‹ , λͺ…λ¬Έν™”λœ 'ν—Œλ²•(원칙)'을 μ œμ‹œν•˜κ³  λͺ¨λΈμ΄ 슀슀둜 μžμ‹ μ˜ 닡변을 ν‰κ°€ν•˜κ³  μˆ˜μ •(Self-critique)ν•˜κ²Œ ν•©λ‹ˆλ‹€. * **단계**: [AI ν”Όλ“œλ°± 생성] $\rightarrow$ [μˆ˜μ •λœ λ‹΅λ³€μœΌλ‘œ ν•™μŠ΅(RLAIF)]. * **효과**: λ§Ήλͺ©μ μœΌλ‘œ 닡변을 κ±°λΆ€ν•˜λŠ” 것이 μ•„λ‹ˆλΌ, λ§₯락을 μ΄ν•΄ν•˜λ©° μœ μ—°ν•˜κ²Œ μœ„ν—˜μ„ νšŒν”Όν•˜κ³  ν™˜κ° λŒ€μ‹  λΆˆν™•μ‹€μ„±μ„ μΈμ •ν•˜κ²Œ ν•©λ‹ˆλ‹€. 2. **핡심 μ•ˆμ „ 과제**: * **CBRN λ°©μ–΄**: ν™”ν•™(C), 생물(B), 방사λŠ₯(R), ν•΅(N)κ³Ό κ΄€λ ¨λœ μœ„ν—˜ 정보λ₯Ό μƒμ„±ν•˜μ§€ μ•Šλ„λ‘ μ •λ ¬ν•©λ‹ˆλ‹€. * **νƒˆμ˜₯(Jailbreak) λ°©μ§€**: μ•…μ˜μ μΈ ν”„λ‘¬ν”„νŠΈ μ£Όμž…μ„ 톡해 μ•ˆμ „ κ°€μ΄λ“œλΌμΈμ„ 무λ ₯ν™”ν•˜λ €λŠ” μ‹œλ„λ₯Ό μ°¨λ‹¨ν•©λ‹ˆλ‹€. * **Over-refusal μ™„ν™”**: λ„ˆλ¬΄ μ‘°μ‹¬μŠ€λŸ¬μ›Œμ„œ λ¬΄ν•΄ν•œ μ§ˆλ¬Έμ—λ„ 닡변을 κ±°λΆ€ν•˜λŠ” ν˜„μƒμ„ μ€„μ΄λŠ” 것이 ν˜„λŒ€ μ•ˆμ „ 기술의 μˆ™μ œμž…λ‹ˆλ‹€. 3. **RLAIF (RL from AI Feedback)**: * 인간 λŒ€μ‹  λ‹€λ₯Έ κ°•λ ₯ν•œ λͺ¨λΈ(Teacher model)의 ν”Όλ“œλ°±μ„ μ‚¬μš©ν•˜μ—¬ 효율적으둜 λŒ€κ·œλͺ¨ λͺ¨λΈμ„ μ •λ ¬ν•˜λŠ” κΈ°μˆ μž…λ‹ˆλ‹€. ## βš–οΈ Trade-offs & Caveats * **μ§€λŠ₯κ³Ό μ•ˆμ „μ˜ κ· ν˜•**: μ•ˆμ „ κ°€λ“œλ ˆμΌμ΄ λ„ˆλ¬΄ κ°•ν•˜λ©΄ λͺ¨λΈμ˜ μ°½μ˜μ„±μ΄λ‚˜ 문제 ν•΄κ²° λŠ₯λ ₯이 μ €ν•˜λ  수 μžˆμŠ΅λ‹ˆλ‹€. * **κ°€μΉ˜ 편ν–₯**: 'ν—Œλ²•'을 λˆ„κ°€, μ–΄λ–»κ²Œ μ •μ˜ν•˜λŠλƒμ— 따라 νŠΉμ • λ¬Έν™”λ‚˜ μ •μΉ˜μ  κ°€μΉ˜κ΄€μ΄ λͺ¨λΈμ— μ£Όμž…λ  μœ„ν—˜μ΄ μžˆμŠ΅λ‹ˆλ‹€. ## πŸ”— 지식 μ—°κ²° (Graph) * **μƒμœ„ κ°œλ…**: [[AI Governance|AI Governance]], [[Alignment|Alignment]] * **κ΄€λ ¨ λͺ¨λΈ**: [[Claude|Claude]] (ν—Œλ²•μ  AI의 μ„ κ΅¬μž) * **μ—°κ΄€ 기술**: [[RLHF & DPO|RLHF & DPO]], [[Prompt Injection|Prompt Injection]] --- *Last updated: 2026-05-04*