---
id: wiki-2026-0508-sota
title: SOTA
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [State of the Art, SOTA Benchmark, Leaderboard]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ml, benchmark, evaluation, research, leaderboard]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: agnostic
  framework: ml-research
---

# SOTA

## One-liner
> **"SOTA = the best known result on a task"**. Short for State of the Art: in ML research, it refers to the highest-scoring approach on a specific benchmark. In the LLM/diffusion era of 2026, SOTA is a fast-moving target that can be invalidated within weeks.

## Core

### Definition / scope
- **Task-bound**: SOTA is always tied to a specific benchmark plus a metric. "GPT-5 is SOTA" is vague. "GPT-5 is SOTA on GSM8K (98.4%)" is a valid claim.
- **Dataset split**: even for the same model, rankings can change depending on the train/eval split.
- **Reproducibility crisis**: SOTA claims frequently suffer from cherry-picking, hyperparameter tuning, and test-set leakage.
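
The task-bound rule above can be made mechanical. A minimal sketch, assuming a hypothetical `SotaClaim` record (not part of any library) that refuses to build a claim unless the model, benchmark, metric, and split are all named:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SotaClaim:
    """Hypothetical record: a SOTA claim is only meaningful with every field set."""
    model: str
    benchmark: str
    metric: str
    split: str
    score: float

    def __post_init__(self) -> None:
        # Reject vague claims such as "model X is SOTA" with no benchmark or metric.
        for name in ("model", "benchmark", "metric", "split"):
            if not getattr(self, name).strip():
                raise ValueError(f"SOTA claim is missing '{name}'")

    def describe(self) -> str:
        return (f"{self.model}: {self.score:g} {self.metric} "
                f"on {self.benchmark} ({self.split} split)")


# The score is the illustrative figure quoted in the Task-bound bullet.
claim = SotaClaim("GPT-5", "GSM8K", "accuracy %", "test", 98.4)
print(claim.describe())
```

Storing the split next to the score keeps two results on different splits from being compared as if they were the same number.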

### Modern (2026) landscape
- **LLM benchmarks**: MMLU-Pro, GPQA Diamond, SWE-Bench Verified, ARC-AGI-2, HLE (Humanity's Last Exam).
- **Coding**: SWE-Bench Verified, LiveCodeBench, Aider polyglot.
- **Vision**: ImageNetV2 (re-collected test set), COCO 2025, GenAI-Bench (text-to-image).
- **2026 SOTA dynamics**: Claude Opus 4.7, GPT-5, and Gemini 3.5 Ultra leapfrog one another; results are invalidated on a monthly basis.

### Applications
1. **Research papers**: SOTA results are the main currency of publication.
2. **Industry adoption**: SOTA models serve as the base for production fine-tunes.
3. **Investment signal**: a lab reaching SOTA acts as a funding signal.

## 💻 Patterns

### Pattern 1: SOTA evaluation harness
```python
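# Pin the model, tasks, and few-shot setting so the evaluation is reproducible.
# Task names depend on the installed lm-evaluation-harness version; the GPQA and
# MATH names below are assumed for recent releases. Check `lm_eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.3-70B-Instruct",
    tasks=["mmlu_pro", "gpqa_diamond_zeroshot", "leaderboard_math_hard"],
    batch_size=8,
    num_fewshot=0,
)
print(results["results"])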
```

### Pattern 2: SOTA leaderboard scrape
```python
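# Papers With Code API. The endpoint path below is kept as given and is unverified;
# confirm that the service and route are still available before relying on it.
import requests

r = requests.get(
    "https://paperswithcode.com/api/v1/sota/sota-on-mmlu/",
    headers={"Accept": "application/json"},
    timeout=30,
)
r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
results = r.json().get("results", [])
if results:
    top = results[0]
    print(f"SOTA: {top['paper']['title']}, metrics: {top['metrics']}")
else:
    print("No leaderboard entries returned.")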
```

### Pattern 3: Statistical-significance check
```python
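# Separate noise from real improvement in a SOTA claim (paired bootstrap).
from scipy.stats import bootstrap
import numpy as np

# Placeholder per-sample scores (1 = correct, 0 = wrong); replace with real results.
# Both arrays must cover the same test items in the same order.
rng = np.random.default_rng(0)
per_sample_scores_baseline = rng.integers(0, 2, size=500)
per_sample_scores_candidate = rng.integers(0, 2, size=500)

baseline = np.array(per_sample_scores_baseline)
candidate = np.array(per_sample_scores_candidate)

diff = candidate - baseline
ci = bootstrap((diff,), np.mean, n_resamples=10_000, confidence_level=0.95)
print(f"Δ mean = {diff.mean():.4f}, 95% CI = {ci.confidence_interval}")
# If the CI contains 0, the difference is not significant at the 5% level.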
```

### Pattern 4: Test-set contamination probe
```python
# 매 model 의 매 train 시 test set 의 leak 확인.
from datasets import load_dataset

test_set = load_dataset("hendrycks/test", split="test")
sample = test_set[0]["question"]

# 매 perplexity test — 매 model 가 매 test sample 의 매 unusually low ppl ?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("model-name")
mod = AutoModelForCausalLM.from_pretrained("model-name").cuda()
inputs = tok(sample, return_tensors="pt").to("cuda")
with torch.no_grad():
    loss = mod(**inputs, labels=inputs.input_ids).loss
print(f"PPL = {torch.exp(loss).item()}")
```
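
The perplexity probe needs only model access. When the training corpus (or a sample of it) is also available, n-gram overlap is a common complementary check. A minimal sketch, assuming whitespace tokenization and an 8-token window; both choices and the function names are illustrative assumptions, not a standard:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy strings, invented for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked = "quick brown fox jumps over the lazy dog near the river"
clean = "a completely different question about prime numbers and modular arithmetic today"
print(overlap_ratio(leaked, corpus), overlap_ratio(clean, corpus))
```

A ratio near 1.0 suggests the item appears close to verbatim in the corpus. Paraphrased leaks will not be caught by exact n-gram matching.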

## Decision criteria
| Situation | Approach |
|---|---|
| SOTA claim in a new paper | Run a significance test and check for a reproduction |
| Choosing a production model | Do not pick on SOTA alone; weigh latency, cost, and license as well |
| Choosing a research direction | Do not chase 1-2% gaps to SOTA; pursue architectural insight |
| Benchmark saturated (≥99%) | Move to a new benchmark to avoid Goodhart effects |

**Default**: treat SOTA as a starting reference, not an endpoint. Evaluate along multiple axes (latency, cost, robustness).
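
One way to evaluate along several axes at once is to keep only the candidates that no other candidate beats on every axis (a Pareto front). A minimal sketch, assuming three axes and a hypothetical `Candidate` record; the model names and numbers are placeholders invented for illustration:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float      # higher is better
    latency_ms: float    # lower is better
    cost_per_1k: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
                and a.cost_per_1k <= b.cost_per_1k)
    better = (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
              or a.cost_per_1k < b.cost_per_1k)
    return no_worse and better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Placeholder numbers, invented for illustration only.
models = [
    Candidate("model-a", accuracy=0.90, latency_ms=900, cost_per_1k=0.030),
    Candidate("model-b", accuracy=0.88, latency_ms=200, cost_per_1k=0.004),
    Candidate("model-c", accuracy=0.85, latency_ms=400, cost_per_1k=0.010),
]
print([c.name for c in pareto_front(models)])
```

Here `model-c` drops out because `model-b` is better on all three axes, while the top-scoring `model-a` and the cheaper `model-b` both remain as defensible choices. Non-numeric constraints such as license still need a separate filter.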

## 🔗 Graph
- Applications: [[Leaderboard]]
- Adjacent: [[Goodharts Law]]

## 🤖 LLM usage
**When**: quick lookup of the current SOTA for a task; benchmark recommendations.
**When not**: SOTA results from after the LLM's training cutoff. That information is stale, so use web search instead.

## ❌ Anti-patterns
- **SOTA chasing**: pursuing nothing but 0.1% improvements, with diminishing returns.
- **Single-metric tunnel vision**: looking only at accuracy while ignoring fairness, latency, and robustness.
- **Benchmark hacking**: tuning on the test set, or prompt engineering that only helps on the benchmark.
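
Whether a 0.1% gain is worth chasing depends partly on whether the benchmark can resolve it at all. A rough sketch using the binomial standard error sqrt(p(1-p)/n), assuming independent test items and two unpaired accuracy estimates (a paired comparison is usually tighter); the function names are hypothetical:

```python
import math


def accuracy_std_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def resolvable(gain: float, accuracy: float, n_items: int, z: float = 1.96) -> bool:
    """Rough check: does `gain` exceed z standard errors of the difference between
    two independent accuracy estimates at about the same accuracy level?"""
    se_diff = math.sqrt(2.0) * accuracy_std_error(accuracy, n_items)
    return gain > z * se_diff


# A 0.1-point gain versus a 5-point gain at 90% accuracy on a 1,000-item test set.
print(accuracy_std_error(0.90, 1000))
print(resolvable(0.001, 0.90, 1000))
print(resolvable(0.05, 0.90, 1000))
```

With 1,000 items the standard error alone is close to one accuracy point, so a 0.1-point difference cannot be distinguished from sampling noise under these assumptions.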

## 🧪 Verification / duplicates
- Verified (Papers With Code; Open LLM Leaderboard; lm-evaluation-harness docs).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tidied the SOTA definition, landscape, eval harness, and contamination probe |