---
id: wiki-2026-0508-turing-test
title: Turing Test
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Imitation Game, Turing's Test]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai-history, philosophy, evaluation, agi]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: na
  framework: ai-philosophy
---

# Turing Test

## In one line

> **"If human judges misclassify a machine as human in at least 30% of conversations, it is judged indistinguishable from a thinking being."** This is the popular reading of the imitation game from Alan Turing's 1950 paper "Computing Machinery and Intelligence"; the 30% figure comes from Turing's prediction for the year 2000, not from a pass criterion he defined. In 2024-25, controlled studies reported human-level passes for GPT-4 and Claude (e.g. Jones & Bergen 2024, UCSD, for GPT-4). As of 2026 the Turing Test is obsolete as a capability measurement tool: the Chinese Room critique challenged its validity, and behavioral benchmarks and capability evaluations have replaced it.

## Core

### Original imitation game (Turing 1950)

- Three players: a man (A), a woman (B), and an interrogator (C).
- C asks questions in writing and must determine which is which.
- A's task is to deceive C; B's task is to help C.
- Turing's substitution: replace A with a machine. Does C's error rate stay the same?

### Misconception (common pop interpretation)

- Pop version: "a machine fools a human into thinking it's human."
- Original: a comparison of the machine's deception rate against the rate at which a man deceives C while posing as a woman.
- Turing's prediction: by 2000, machines would be misidentified about 30% of the time after 5 minutes of questioning.

### Critiques

1. **Chinese Room (Searle 1980)**: passing the test is not evidence of understanding. Symbol manipulation ≠ semantics.
2. **Imitation ≠ intelligence**: deceiving a human is a narrow task. It does not measure mathematical reasoning, embodiment, or learning.
3. **Anthropocentric**: assumes human-likeness is the sole criterion of intelligence.
4. **Gameable**: tricks (typos, refusing to answer, emotion mimicry) can produce a pass.
5. **Judge calibration**: results differ wildly between naive and expert judges.

### Modern empirical results

- **2014 "Eugene Goostman"**: 33% pass at the Royal Society.
  The 13-year-old Ukrainian persona lowered judges' expectations, which made the pass controversial.
- **2023 Jannai et al.** (AI21): in 2-minute chats, players correctly identified AI partners only about 60% of the time, so the bots were mistaken for humans in roughly 40% of games.
- **2024 Jones & Bergen** (UCSD): GPT-4 was judged human 54% of the time (vs. 67% for humans and 22% for ELIZA). First rigorously controlled pass.
- **2025 multiple replications**: Claude and GPT-5 reportedly show routine human-level performance.

### Alternatives (post-Turing era)

1. **Capability benchmarks**: MMLU, HumanEval, GPQA, ARC-AGI, SWE-bench.
2. **Coffee test** (Wozniak): make coffee in an unfamiliar kitchen → embodiment.
3. **Robot college student** (Goertzel): take college courses and earn a degree.
4. **Lovelace Test 2.0** (Riedl): create an artifact under human-specified constraints that a human evaluator can verify.
5. **Winograd Schema** (Levesque 2011): commonsense reasoning, originally designed to resist the tricks that game the Turing Test.

### Applications

1. AI history teaching.
2. Philosophy of mind discussion (consciousness, understanding).
3. Public communication of AI capability ("does AI think?").
4. Capability evaluation before 2020 (now obsolete).

## 💻 Patterns (eval design lessons)

### Pattern 1: Modern adversarial Turing protocol

```text
1. Recruit N judges (calibrate by demographic, expertise).
2. Each judge: 5-min interrogation; the partner is human or AI, assigned at random (50/50).
3. Force binary verdict (no "unsure").
4. Pass criterion: AI verdict = "human" at rate ≥ control human rate − ε.
5. Pre-register hypotheses, blind judges to study purpose.
```

### Pattern 2: Why public Turing demos mislead

```text
- Cherry-picked transcripts.
- Naive judges (not interrogating adversarially).
- Persona tricks (child, non-native speaker, tired, distracted).
- Self-selection bias (only impressive runs shown).
```

### Pattern 3: Capability-first eval (modern replacement)

```text
benchmarks = [
    "MMLU",       # broad knowledge
    "HumanEval",  # code generation
    "GPQA",       # graduate-level science
    "ARC-AGI",    # abstract reasoning
    "SWE-bench",  # real software engineering
    "HLE",        # Humanity's Last Exam (2025)
]
# Pass = top-percentile human expert performance per task.
```

### Pattern 4: Behavioral safety eval (orthogonal to Turing)

```text
- Refusal rate on harmful prompts.
- Calibration (uncertainty matches accuracy).
- Sycophancy (agree-with-user metric).
- Honesty (TruthfulQA, FactScore).
```

### Pattern 5: Lovelace 2.0 framework

```text
1. Specify class C of artifacts (e.g., novel valid mathematical proof).
2. AI produces artifact a ∈ C.
3. Human expert verifies a is valid AND novel.
4. AI architect cannot explain how a was produced
   (criterion from the original Lovelace Test, Bringsjord et al. 2001).
→ Tests creativity, not imitation.
```

## Decision criteria

| Purpose | Eval |
|---|---|
| Historical / philosophical context | Turing Test |
| Capability measurement | MMLU, GPQA, HumanEval, ARC-AGI |
| Reasoning / novelty | Lovelace 2.0, ARC-AGI |
| Embodiment / general intelligence | Coffee test, robot college |
| Safety / alignment | RealToxicityPrompts, MLCommons AILuminate |

**Default**: a multi-benchmark suite covering capability and safety. The Turing Test is a historical reference only.

## 🔗 Graph

- Parent: [[Philosophy of AI]]
- Variant: [[Imitation Game]]

## 🤖 LLM usage

**When**: AI history, philosophy of mind discussions, public communication.
**When not**: actual capability measurement (use modern benchmarks).

## ❌ Anti-patterns

- **"GPT passed Turing → AGI"**: imitation ≠ general intelligence. Capability gaps remain.
- **Naive judge eval**: verdicts from untrained users carry systematic bias.
- **Single-conversation pass**: a 5-min snapshot; long-horizon coherence is not measured.
- **Persona escape hatch**: excusing weaknesses with "I'm a tired teenager."
- **Conflating with consciousness**: the Turing Test measures behavior; it is not evidence of consciousness.

## 🧪 Verification / duplicates

- Verified (Turing 1950, "Computing Machinery and Intelligence", Mind 59; Searle 1980, "Minds, Brains, and Programs"; Jones & Bergen 2024, arXiv 2405.08007; Riedl 2014, Lovelace 2.0).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Turing Test history + 2024 Jones-Bergen pass + modern alternatives |
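
## 💻 Sketch: Pattern 1 pass criterion

The pass criterion in step 4 of Pattern 1 (AI judged "human" at a rate ≥ the human control rate − ε) can be sketched as below. This is a minimal illustration, not a study design: `VerdictCounts`, `passes_protocol`, and the `epsilon` values are hypothetical names and choices, and the counts are made up to mirror the 54% and 67% rates quoted above. A real analysis would also need a significance test on the rate difference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerdictCounts:
    """Binary verdicts for one witness type (hypothetical container)."""
    judged_human: int  # conversations where the judge answered "human"
    total: int         # all conversations with this witness type

    @property
    def rate(self) -> float:
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.judged_human / self.total


def passes_protocol(ai: VerdictCounts, human: VerdictCounts, epsilon: float) -> bool:
    """Step 4: the AI passes if its 'judged human' rate is within epsilon of the human control rate."""
    return ai.rate >= human.rate - epsilon


# Illustrative counts only (assumption): 54/100 for the AI, 67/100 for human controls.
ai = VerdictCounts(judged_human=54, total=100)
human = VerdictCounts(judged_human=67, total=100)

print(passes_protocol(ai, human, epsilon=0.05))  # False: 0.54 < 0.62
print(passes_protocol(ai, human, epsilon=0.15))  # True: 0.54 >= 0.52
```

The choice of ε decides the outcome here, which is why Pattern 1 asks for pre-registration: the tolerance should be fixed before verdicts are collected.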