---
id: [[P-Reinforce|P-Reinforce]]-AUTO-EVBM-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, ai-evaluation, benchmarks, niah, ruler, mmlu, lmsys, evaluation-metrics]
last_reinforced: 2026-05-04
---

# [[AI Evaluation & Benchmarks|AI Evaluation & Benchmarks]]

## 📌 One-Line Insight (The Karpathy Summary)

> "A yardstick for intelligence: instead of simply calling a model 'good', these are standardized exams that measure a model's real weight class through quantitative metrics such as math, coding, common sense, and recall across a million tokens."

## 📖 Structured Knowledge (Synthesized Content)

These are standardized evaluation metrics for comparing the capabilities of AI models objectively and identifying their limits.

1. **Traditional benchmarks**:
    * **MMLU (Massive Multitask Language Understanding)**: A standard exam that measures knowledge across 57 subjects, including the humanities, social sciences, and mathematics.
    * **HumanEval / MBPP**: Evaluate a model's ability to generate Python code.
    * **GSM8K**: Measures the ability to solve grade-school-level, multi-step math word problems.
2. **Long-context benchmarks**:
    * **Needle In A Haystack (NIAH)**: Tests whether a model can retrieve a specific piece of information from a very large context, with the results shown as a visual chart.
    * **RULER**: Goes beyond simple retrieval to evaluate complex long-context use, such as summarization and reasoning, in a comprehensive way.
3. **Real-world and agent evaluation**:
    * **LMSYS Chatbot Arena**: An Elo rating system built on blind tests by real users.
    * **MCP-Atlas**: Measures tool integration and orchestration performance using [[Model Context Protocol (MCP)|MCP]].
    * **SWE-bench**: Measures whether a model can resolve real open-source GitHub issues on its own.

## ⚖️ Trade-offs & Caveats

* **Data contamination**: When evaluation data is included in a model's training data, scores come out higher than the model's actual ability. This "memorized score" problem is serious.
* **Goodhart's Law**: The moment a metric becomes a target, it stops being a good metric. (Shortcut training aimed only at raising scores is widespread.)

## 🔗 Knowledge Links (Graph)

* **Performance**: [[LLM Capabilities|LLM Capabilities]], [[Reasoning Models|Reasoning Models]]
* **Technology**: [[Context Window & Long-Context LLMs|Context Window]], [[Tool Use & Function Calling|Tool Use]]

---
*Last updated: 2026-05-04*
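
The Elo rating mentioned under LMSYS Chatbot Arena can be illustrated with a minimal sketch of the classic online Elo update. The function names, the K-factor of 32, the starting rating of 1000, and the sample votes are all assumptions made for illustration; the Arena's actual rating pipeline is not described in this note and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one blind comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_battles(battles, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Process (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.get(model_a, initial)  # unseen models start at `initial`
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: (model shown as A, model shown as B, score for A).
    votes = [
        ("model-x", "model-y", 1.0),
        ("model-y", "model-z", 0.5),
        ("model-x", "model-z", 1.0),
    ]
    ranked = sorted(rate_battles(votes).items(), key=lambda kv: -kv[1])
    for name, rating in ranked:
        print(f"{name}: {rating:.1f}")
```

Because each vote updates ratings in sequence, the result of this online form depends on the order of the votes; that is one reason a leaderboard might prefer a batch fit over all votes instead.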