---
id: wiki-2026-0508-assessment
title: Assessment (Educational + ML Evaluation)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [평가, evaluation, formative, summative, validity, reliability, rubric, ml-evaluation]
duplicate_of: none
source_trust_level: B
confidence_score: 0.88
verification_status: applied
tags: [assessment, evaluation, education, validity, reliability, fairness, rubric, ml-eval, llm-judge]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: education / ML
  applicable_to: [Educational Tech, ML Evaluation, Performance Review]
---
# Assessment
## 📌 One-line insight
> **"A mirror of growth."** Assessment measures the current state, exposes the gap, and points to a direction. Its purpose is to support growth, not to select. Modern ML evaluation rests on the same principles (validity / reliability / fairness).
## 📖 Core concepts
### Classification by timing
1. **Diagnostic**: the learner's level before instruction starts.
2. **Formative**: feedback while learning is in progress.
3. **Summative**: achievement at the end.
4. **Authentic**: performance on a real-world task.
### Quality criteria
- **Validity**: does the assessment measure the right thing?
  - **Construct**: captures the intended construct.
  - **Content**: covers the target domain.
  - **Predictive**: predicts future performance.
  - **Face**: appears, on its surface, to measure what it claims.
- **Reliability**: are the results consistent?
  - **Test-retest**: stable over time.
  - **Inter-rater**: raters agree with each other.
  - **Internal consistency**: items agree with one another (Cronbach's α).
- **Fairness**: equal opportunity for every group.
- **Authenticity**: closeness to the real-world task.
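
Internal consistency is the one criterion above with a standard closed-form statistic. The following is a minimal sketch of Cronbach's α, assuming scores arrive as one row per respondent and one column per item; the function name `cronbach_alpha` and the input layout are illustrative choices, not something the source prescribes.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: list of rows, one row per respondent, one column per item
    (hypothetical layout chosen for this sketch).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    columns = list(zip(*item_scores))
    # Sum of the per-item variances.
    item_var_sum = sum(pvariance(col) for col in columns)
    # Variance of each respondent's total score.
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)
```

Population variance is used for both terms; the ratio is the same with sample variance as long as both terms use the same convention.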
### Educational paradigms
#### Behaviorist (traditional)
- Multiple choice.
- Right/wrong scoring.
#### Cognitivist
- Tests understanding.
- Short answer / explanation.
#### Constructivist
- Portfolio.
- Project.
- Self and peer reflection.
### The ML evaluation parallel
| Education | ML |
|---|---|
| Validity | Measures the intended construct |
| Reliability | Consistent across runs |
| Fairness | Equity across groups |
| Diagnostic | Capability profiling |
| Formative | Dev set |
| Summative | Test set |
| Authentic | Real-world deployment |
### Modern issues
#### LLM-as-judge
- Fast and cheap.
- Self-preference bias (e.g. GPT-4 favoring GPT-4 outputs).
- Needs calibration.
#### Multi-dimensional
- No single metric is enough.
- Quality + safety + cost + latency.
#### Adaptive
- IRT (Item Response Theory).
- Item difficulty adapts to the test taker.
- Examples: the GRE, personalized education.
#### Continuous
- Portfolio.
- Logging-based.
- Longitudinal.
### A good rubric
- Specific criteria.
- 4 to 6 levels.
- Anchored examples.
- Actionable feedback.
## 💻 Patterns
### Rubric (educational)
```yaml
# Essay rubric
criteria:
  - name: Argument
    levels:
      4: "Sophisticated argument with nuance and counter-evidence"
      3: "Clear argument with relevant support"
      2: "Argument present but weakly supported"
      1: "No clear argument or off-topic"
  - name: Evidence
    levels:
      4: "Multiple high-quality sources, integrated"
      3: "Adequate sources cited"
      2: "Few or weak sources"
      1: "No evidence or invented"
  - name: Writing
    levels:
      4: "Polished, varied, error-free"
      3: "Clear, mostly correct"
      2: "Comprehensible but error-laden"
      1: "Incomprehensible"
scoring: weighted_sum  # sum over criteria of awarded level * weight
```
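
The `weighted_sum` scoring rule can be sketched as follows. The rubric does not list any weights, so the values here (0.5 / 0.3 / 0.2) are purely hypothetical, as are the function name and the choice to normalize by the maximum achievable score.

```python
def weighted_rubric_score(levels, weights, max_level=4):
    """Combine per-criterion levels into one normalized score.

    levels:  {criterion: awarded level}
    weights: {criterion: weight}  (hypothetical values, not from the rubric)
    """
    missing = set(weights) - set(levels)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    raw = sum(levels[name] * w for name, w in weights.items())
    # Divide by the best possible weighted sum so the result is at most 1.
    return raw / (total_weight * max_level)

weights = {"Argument": 0.5, "Evidence": 0.3, "Writing": 0.2}
score = weighted_rubric_score({"Argument": 3, "Evidence": 4, "Writing": 2}, weights)
```

Normalizing keeps scores comparable if the weights or the number of levels change later.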
### LLM-as-judge (educational)
```python
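import json

def judge_essay(essay, rubric, llm):
    """Score an essay against a rubric with an LLM judge.

    `llm` is any client object exposing generate(prompt) -> str
    (placeholder interface; swap in your own client).
    """
    prompt = f"""Score this essay against the rubric. Return JSON.
Rubric: {rubric}
Essay:
{essay}
Format:
{{
  "argument": {{ "score": 1-4, "evidence": "..." }},
  "evidence": {{ "score": 1-4, "evidence": "..." }},
  "writing": {{ "score": 1-4, "evidence": "..." }},
  "feedback": "actionable feedback in 3 sentences"
}}"""
    response = llm.generate(prompt)
    return json.loads(response)  # raises json.JSONDecodeError on malformed output

# Calibration: run N=3 judges and average their scores;
# route disagreements to human review.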
```
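
The calibration note (average N judges, escalate disagreement) could look roughly like this. It assumes each judgment has the JSON shape requested in the prompt above; the function name and the `max_spread=1` disagreement threshold are illustrative assumptions, not values from the source.

```python
from statistics import mean

def aggregate_judgments(judgments,
                        criteria=("argument", "evidence", "writing"),
                        max_spread=1):
    """Average per-criterion scores and flag disagreement between judges.

    judgments: list of dicts shaped like the JSON the judge prompt asks for.
    max_spread: largest tolerated (max - min) score gap (assumed threshold).
    """
    result = {}
    needs_review = False
    for name in criteria:
        scores = [j[name]["score"] for j in judgments]
        spread = max(scores) - min(scores)
        result[name] = {"mean": mean(scores), "spread": spread}
        if spread > max_spread:
            needs_review = True
    result["needs_human_review"] = needs_review
    return result
```

Using the score range instead of the variance keeps the escalation rule easy to explain to human reviewers.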
### Inter-rater agreement (Cohen's kappa)
```python
from sklearn.metrics import cohen_kappa_score

def measure_reliability(rater1_scores, rater2_scores):
    """Label inter-rater agreement using rule-of-thumb kappa cutoffs."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)
    if kappa < 0.4:
        return 'poor'
    if kappa < 0.6:
        return 'fair'
    if kappa < 0.8:
        return 'good'
    return 'excellent'
```
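
To show what the statistic computes, here is a dependency-free sketch of unweighted Cohen's kappa: observed agreement corrected for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e). The function name `cohen_kappa` is illustrative; in practice the scikit-learn call above does the same job.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater1)
    # Observed agreement: share of items given the same label.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Rubric levels are ordinal, so a weighted kappa that penalizes a 1-vs-4 disagreement more than a 3-vs-4 one may be a better fit than the unweighted version.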
### IRT (adaptive testing)
```python
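import numpy as np

def irt_3pl(theta, a, b, c):
    """3-parameter logistic model.
    theta: ability, a: discrimination, b: difficulty, c: guessing."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, item):
    """Fisher information of a 3PL item at ability theta."""
    p = irt_3pl(theta, item.a, item.b, item.c)
    # Reduces to a^2 * p * (1 - p) when the guessing parameter c is 0.
    return item.a**2 * ((p - item.c) / (1 - item.c))**2 * (1 - p) / p

def adaptive_next_item(theta_estimate, item_pool, answered_ids):
    # Pick the unanswered item with maximum information at the current estimate.
    candidates = [item for item in item_pool if item.id not in answered_ids]
    if not candidates:
        return None  # item pool exhausted
    return max(candidates, key=lambda item: item_information(theta_estimate, item))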
```
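
Item selection needs an ability estimate to start from. One simple way to obtain it, sketched here under stated assumptions, is a grid-search maximum-likelihood estimate over the responses collected so far. The grid bounds (-4 to 4), the step count, and the `responses` layout are illustrative choices; operational systems often use Newton-Raphson or Bayesian (EAP) estimation instead.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def estimate_theta(responses, grid_min=-4.0, grid_max=4.0, steps=161):
    """Grid-search MLE of ability.

    responses: list of ((a, b, c), correct) pairs (hypothetical layout).
    """
    best_theta, best_ll = grid_min, -math.inf
    for i in range(steps):
        theta = grid_min + i * (grid_max - grid_min) / (steps - 1)
        ll = 0.0
        for (a, b, c), correct in responses:
            p = p_correct(theta, a, b, c)
            p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
            ll += math.log(p if correct else 1 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```

When every answer is correct (or every answer is wrong) the likelihood has no interior maximum, so the estimate lands on a grid boundary; that is one reason adaptive tests usually begin with a prior or a fixed starting item.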
### Fairness check (group)
```python
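import collections
import numpy as np

def fairness_check(scores, group_labels, threshold=0.8):
    """Flag large gaps in mean score between groups (four-fifths-style ratio)."""
    by_group = collections.defaultdict(list)
    for score, group in zip(scores, group_labels):
        by_group[group].append(score)
    means = {g: np.mean(s) for g, s in by_group.items()}
    # Disparate impact: ratio of lowest to highest group mean.
    # Assumes non-negative scores.
    max_mean = max(means.values())
    min_mean = min(means.values())
    if max_mean <= 0:
        return 'OK'  # ratio undefined when every group mean is zero
    ratio = min_mean / max_mean
    if ratio < threshold:
        return f'WARN: disparate impact: {ratio:.2f} < {threshold}'
    return 'OK'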
```
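
The 0.8 cutoff comes from the four-fifths rule, which is conventionally stated on selection (pass) rates, not on mean scores. A sketch of that variant follows, assuming a boolean pass/fail outcome per person; the function name and return shape are illustrative.

```python
from collections import defaultdict

def selection_rate_ratio(passed, group_labels):
    """Ratio of the lowest to the highest group pass rate.

    passed: iterable of booleans (pass/fail per person).
    Returns (ratio, rates); ratio is None when no group has any passes.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_passed, n_total]
    for ok, group in zip(passed, group_labels):
        counts[group][0] += int(ok)
        counts[group][1] += 1
    if not counts:
        raise ValueError("no observations")
    rates = {g: p / t for g, (p, t) in counts.items()}
    highest = max(rates.values())
    if highest == 0:
        return None, rates
    return min(rates.values()) / highest, rates
```

A ratio below 0.8 is a screening signal that warrants a closer look, not proof of bias; small groups can produce low ratios by chance.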
### Portfolio assessment
```python
class Portfolio:
    def __init__(self, student_id):
        self.student_id = student_id
        self.artifacts = []

    def add(self, artifact):
        self.artifacts.append({
            'id': artifact.id,
            'date': artifact.date,
            'type': artifact.type,  # essay, code, image
            'score': artifact.score,  # stored so progression() can read it
            'reflection': artifact.reflection,
        })

    def progression(self):
        # (date, score) pairs in chronological order, ready to plot growth over time.
        return sorted((a['date'], a['score']) for a in self.artifacts)
```
### ML evaluation suite (multi-dim)
```python
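# The metric helpers (accuracy, group_fairness, safety_score, ece, latency,
# cost, pairwise_human) are placeholders for project-specific implementations.
def evaluate_model(model, eval_set, harm_set, baseline):
    return {
        'accuracy': accuracy(model, eval_set),
        # Model-level group check; not the score-based fairness_check above.
        'fairness': group_fairness(model, eval_set, sensitive='gender'),
        'safety': safety_score(model, harm_set),
        'calibration': ece(model, eval_set),
        'latency_p95': latency(model),
        'cost_per_1k': cost(model),
        'human_pref': pairwise_human(model, baseline, n=100),
    }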
```
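
Of the metrics in the suite, calibration is the least self-explanatory. A minimal sketch of expected calibration error (ECE) with equal-width confidence bins follows. Unlike the `ece(model, eval_set)` placeholder above, it takes precomputed confidences and correctness flags; that interface and the default of 10 bins are assumptions made for this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: booleans, True when the prediction was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(avg_conf - acc)
    return ece
```

A model that reports 90% confidence but is right 75% of the time in that bin contributes a gap of 0.15, weighted by the share of samples in the bin.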
## 🤔 Decision criteria
| Situation | Approach |
|---|---|
| Standardized test | Summative + IRT |
| Personalized learning | Diagnostic + adaptive |
| Skill development | Formative + portfolio |
| LLM evaluation | Multi-metric + LLM-judge + human |
| Hiring | Authentic + rubric + structured |
| Performance review | 360° + portfolio |
**Default**: Multi-method + rubric + inter-rater check + fairness audit.
## 🔗 Graph
- Parent: [[Evaluation]]
- Application: [[Rubric]]
- ML parallel: [[ML-Evaluation]] · [[Benchmarks]] · [[LLM-as-Judge]] · [[Bias-Correction-Algorithm]]
- Adjacent: [[Algorithmic Fairness]] · [[Validity]] · [[Reliability]]
## 🤖 LLM usage
**When**: educational system design, ML evaluation suites, performance review frameworks, rubric writing.
**When not**: relying on a single high-stakes metric (Goodhart's law); any setup that ignores fairness.
## ❌ Anti-patterns
- **Single metric**: saturates or gets gamed.
- **No rubric**: raters disagree.
- **Stale benchmark**: contamination.
- **No fairness check**: disparate impact goes unnoticed.
- **Diagnostic stigma**: students get labeled.
- **Single LLM judge**: self-preference bias.
- **No construct validation**: the wrong thing gets measured.
## 🧪 Verification / duplicates
- Verified (educational psychology + ML evaluation literature).
- Trust level B.
- Related: [[Benchmarks]] · [[Bias-Correction-Algorithm]] · [[Algorithmic Fairness]] · [[LLM-as-Judge]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — type + criteria + ML parallel + rubric / IRT / fairness code |