---
id: wiki-2026-0508-assessment
title: Assessment (Educational + ML Evaluation)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [평가, evaluation, formative, summative, validity, reliability, rubric, ml-evaluation]
duplicate_of: none
source_trust_level: B
confidence_score: 0.88
verification_status: applied
tags: [assessment, evaluation, education, validity, reliability, fairness, rubric, ml-eval, llm-judge]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: education / ML
  applicable_to: [Educational Tech, ML Evaluation, Performance Review]
---

# Assessment

## 📌 One-Line Insight

> **"A mirror for growth."** Measure the current state, find the gap, set the direction. Assessment exists to support growth, not to select. ML evaluation in modern AI follows the same principles (validity / reliability / fairness).

## 📖 Core Concepts

### Types by Timing

1. **Diagnostic**: the learner's level before instruction starts.
2. **Formative**: feedback while learning is in progress.
3. **Summative**: final achievement.
4. **Authentic**: performance on a real-world task.

### Quality Criteria

- **Validity**: does it measure the right thing?
  - **Construct**: captures the intended construct.
  - **Content**: covers the domain.
  - **Predictive**: predicts future performance.
  - **Face**: appears, on its surface, to measure what it claims.
- **Reliability**: is it consistent?
  - **Test-retest**: stable over time.
  - **Inter-rater**: raters agree.
  - **Internal consistency** (Cronbach's α).
- **Fairness**: equal opportunity.
- **Authenticity**: resembles real-world tasks.

### Educational Paradigms

#### Behaviorist (traditional)

- Multiple choice.
- Right/wrong scoring.

#### Cognitivist

- Understanding.
- Short answer / explanation.

#### Constructivist

- Portfolio.
- Project.
- Self/peer reflection.
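
### Reliability Example: Cronbach's α

The internal-consistency criterion listed under Reliability can be made concrete with a short calculation. The sketch below is a minimal illustration, assuming item scores arrive as a respondents-by-items matrix. The function name `cronbach_alpha` and the sample responses are hypothetical and not part of the original note.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    item_scores: 2-D array-like, rows = respondents, columns = test items.
    Assumes at least two items and non-zero variance in total scores.
    """
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)      # sample variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 4 respondents answering 3 rubric-scored items.
responses = [[3, 4, 3], [2, 2, 3], [4, 4, 4], [1, 2, 1]]
print(round(cronbach_alpha(responses), 3))
```

Values near 1 suggest the items move together. When every item ranks respondents identically with the same spread, α comes out at approximately 1.0.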

### ML Evaluation Parallel

| Education | ML |
|---|---|
| Validity | Metric measures the intended construct |
| Reliability | Consistent across runs |
| Fairness | Group equity |
| Diagnostic | Capability profiling |
| Formative | Dev set |
| Summative | Test set |
| Authentic | Real-world deployment |

### Modern Issues

#### LLM-as-judge

- Fast and cheap.
- Self-bias (GPT-4 tends to favor GPT-4 outputs).
- Needs calibration.

#### Multi-dimensional

- No single metric is enough.
- Quality + safety + cost + latency.

#### Adaptive

- IRT (Item Response Theory).
- Item difficulty adapts to the test taker.
- Examples: GRE, personalized education.

#### Continuous

- Portfolio.
- Logging-based.
- Longitudinal.

### What Makes a Good Rubric

- Specific criteria.
- 4-6 levels.
- Anchored examples.
- Actionable feedback.

## 💻 Patterns

### Rubric (educational)

```yaml
# Essay rubric
criteria:
  - name: Argument
    levels:
      4: "Sophisticated argument with nuance and counter-evidence"
      3: "Clear argument with relevant support"
      2: "Argument present but weakly supported"
      1: "No clear argument or off-topic"
  - name: Evidence
    levels:
      4: "Multiple high-quality sources, integrated"
      3: "Adequate sources cited"
      2: "Few or weak sources"
      1: "No evidence or invented"
  - name: Writing
    levels:
      4: "Polished, varied, error-free"
      3: "Clear, mostly correct"
      2: "Comprehensible but error-laden"
      1: "Incomprehensible"
# Sum over criteria of score * weight (weights are not defined here; equal weights assumed)
scoring: weighted_sum
```

### LLM-as-judge (educational)

```python
import json

def judge_essay(essay, rubric):
    # `llm` is assumed to be a preconfigured client exposing generate(prompt) -> str.
    prompt = f"""Score this essay against the rubric. Return JSON.

Rubric:
{rubric}

Essay:
{essay}

Format:
{{
  "argument": {{ "score": 1-4, "evidence": "..." }},
  "evidence": {{ "score": 1-4, "evidence": "..." }},
  "writing": {{ "score": 1-4, "evidence": "..." }},
  "feedback": "actionable feedback in 3 sentences"
}}"""
    response = llm.generate(prompt)
    return json.loads(response)

# Calibration: run N=3 judges and average the scores;
# route disagreements to human review.
```

### Inter-rater agreement (Cohen's kappa)

```python
from sklearn.metrics import cohen_kappa_score

def measure_reliability(rater1_scores, rater2_scores):
    # Unweighted kappa; for ordinal rubric levels, weights='quadratic' is an option.
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)
    # Cutoffs are a rule of thumb, not a standard.
    if kappa < 0.4:
        return 'poor'
    if kappa < 0.6:
        return 'fair'
    if kappa < 0.8:
        return 'good'
    return 'excellent'
```

### IRT (adaptive testing)

```python
import numpy as np

def irt_3pl(theta, a, b, c):
    """3-parameter logistic model.
    theta: ability, a: discrimination, b: difficulty, c: guessing."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item; reduces to a^2 * P * (1 - P) when c = 0."""
    p = irt_3pl(theta, a, b, c)
    return a**2 * ((p - c) / (1 - c))**2 * (1 - p) / p

def adaptive_next_item(theta_estimate, item_pool, answered_ids):
    # Pick the unanswered item with maximum information at the current ability estimate.
    candidates = [item for item in item_pool if item.id not in answered_ids]
    return max(candidates,
               key=lambda item: item_information(theta_estimate, item.a, item.b, item.c))
```

### Fairness check (group)

```python
import collections
import numpy as np

def fairness_check(scores, group_labels):
    by_group = collections.defaultdict(list)
    for score, group in zip(scores, group_labels):
        by_group[group].append(score)
    means = {g: np.mean(s) for g, s in by_group.items()}
    # Disparate-impact style check: ratio of lowest to highest group mean.
    # The 0.8 threshold borrows the four-fifths rule, which is defined on
    # selection rates; on mean scores it is only a rough proxy.
    max_mean = max(means.values())
    min_mean = min(means.values())
    if max_mean <= 0:
        return 'WARN: ratio undefined (non-positive group mean)'
    ratio = min_mean / max_mean
    if ratio < 0.8:
        return f'WARN: disparate impact: {ratio:.2f} < 0.8'
    return 'OK'
```

### Portfolio assessment

```python
class Portfolio:
    def __init__(self, student_id):
        self.student_id = student_id
        self.artifacts = []

    def add(self, artifact):
        self.artifacts.append({
            'id': artifact.id,
            'date': artifact.date,
            'type': artifact.type,    # essay, code, image
            'score': artifact.score,  # needed by progression(); assumes artifacts are scored
            'reflection': artifact.reflection,
        })

    def progression(self):
        # (date, score) pairs in chronological order, ready to plot growth over time.
        pairs = [(a['date'], a['score']) for a in self.artifacts]
        return sorted(pairs, key=lambda pair: pair[0])
```

### ML evaluation suite (multi-dim)

```python
def evaluate_model(model, eval_set, harm_set, baseline):
    # All metric helpers below are placeholders to be supplied by the caller's
    # codebase. Here fairness_check stands for a model-level wrapper, not the
    # (scores, group_labels) function shown earlier.
    return {
        'accuracy': accuracy(model, eval_set),
        'fairness': fairness_check(model, eval_set, sensitive='gender'),
        'safety':
            safety_score(model, harm_set),
        'calibration': ece(model, eval_set),
        'latency_p95': latency(model),
        'cost_per_1k': cost(model),
        'human_pref': pairwise_human(model, baseline, n=100),
    }
```

## 🤔 Decision Criteria

| Situation | Approach |
|---|---|
| Standardized test | Summative + IRT |
| Personalized learning | Diagnostic + adaptive |
| Skill development | Formative + portfolio |
| LLM evaluation | Multi-metric + LLM-judge + human |
| Hiring | Authentic + rubric + structured |
| Performance review | 360° + portfolio |

**Default**: Multi-method + rubric + inter-rater check + fairness audit.

## 🔗 Graph

- Parent: [[Evaluation]]
- Applications: [[Rubric]]
- ML parallel: [[ML-Evaluation]] · [[Benchmarks]] · [[LLM-as-Judge]] · [[Bias-Correction-Algorithm]]
- Adjacent: [[Algorithmic-Fairness]] · [[Validity]] · [[Reliability]]

## 🤖 LLM Usage

**When**: educational system design, ML evaluation suites, performance review frameworks, rubric writing.

**When not**: as a single high-stakes metric (Goodhart's law), or when fairness would be ignored.

## ❌ Anti-Patterns

- **Single metric**: saturates or gets gamed.
- **No rubric**: inter-rater disagreement.
- **Stale benchmark**: contamination.
- **No fairness check**: disparate impact.
- **Diagnostic stigma**: labeling students.
- **Single LLM judge**: self-bias.
- **No construct validation**: the wrong thing gets measured.

## 🧪 Verification / Duplicates

- Verified (educational psychology + ML evaluation literature).
- Trust level: B.
- Related: [[Benchmarks]] · [[Bias-Correction-Algorithm]] · [[Algorithmic-Fairness]] · [[LLM-as-Judge]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: types + criteria + ML parallel + rubric / IRT / fairness code |