"If human judges misclassify a machine as human in 30% or more of conversations, it is deemed indistinguishable from a thinking being." This is the popular reading of the imitation game from Alan Turing's "Computing Machinery and Intelligence" (1950). In 2024-25, controlled studies of models such as GPT-4 and Claude reported human-level pass rates (e.g., Jones & Bergen 2024, UCSD). As of 2026, the Turing Test is obsolete as a capability-measurement tool; the Chinese Room critique, behavioral benchmarks, and capability evaluations have taken its place.
Key points
Original imitation game (Turing 1950)
3 players: man (A), woman (B), interrogator (C).
C asks questions in writing, must determine which is which.
A's task: deceive C. B's task: help C.
Turing's substitution: replace A with a machine. Does C's error rate stay the same?
Misconception (common pop interpretation)
Pop version: "machine fools human into thinking it's human."
Original: comparison of machine deception rate vs man-deceiving-as-woman rate.
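Turing's prediction: by about 2000, an average interrogator would have no more than a 70% chance of making the right identification after 5 minutes of questioning (a ~30% misidentification rate). This was a forecast, not a pass threshold.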
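The two readings can be told apart with a little arithmetic. The sketch below is illustrative only: the function names and the ten-game outcome lists are hypothetical, not data from any study. It computes the interrogator's error rate, checks the pop criterion (at least 30% misidentification), and computes the original comparison (error rate with a machine as A minus error rate with a man as A).

```python
from typing import Sequence


def error_rate(correct: Sequence[bool]) -> float:
    """Fraction of games in which the interrogator's identification was wrong."""
    if not correct:
        raise ValueError("need at least one game")
    return sum(1 for ok in correct if not ok) / len(correct)


def meets_pop_criterion(machine_games: Sequence[bool], threshold: float = 0.30) -> bool:
    """Pop reading: the machine is misidentified in at least `threshold` of games."""
    return error_rate(machine_games) >= threshold


def substitution_gap(man_games: Sequence[bool], machine_games: Sequence[bool]) -> float:
    """Original reading: change in C's error rate when a machine replaces A."""
    return error_rate(machine_games) - error_rate(man_games)


# Hypothetical outcomes; True means C identified the players correctly.
man_as_a = [True, True, False, True, True, False, True, True, True, False]     # 3 of 10 wrong
machine_as_a = [True, False, True, True, True, True, False, True, True, True]  # 2 of 10 wrong

print(meets_pop_criterion(machine_as_a))                   # compares 0.2 against 0.30
print(round(substitution_gap(man_as_a, machine_as_a), 2))  # negative: C errs less against the machine
```

A gap near zero is what Turing's substitution asks about; the fixed 30% threshold is a later simplification.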
Critiques
Chinese Room (Searle 1980): passing the test is not evidence of understanding; symbol manipulation ≠ semantics.
Imitation ≠ intelligence: deceiving a human is a narrow task. It leaves mathematical reasoning, embodiment, and learning unmeasured.
Anthropocentric: assumes human-likeness is the sole criterion of intelligence.
Gameable: tricks (typos, refusing to answer, emotion mimicry) can produce a pass.
Judge calibration: results differ widely between naive and expert judges.
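Modern empirical results
2014 "Eugene Goostman": 33% pass at the Royal Society. The persona of a 13-year-old Ukrainian boy lowered judges' expectations, which made the pass controversial.
2023 Jannai et al. (AI21): in 2-min chats with LLM bots (GPT-4 among them), participants correctly identified AI partners only about 60% of the time, so the bots fooled them in roughly 40% of games.
2024 Jones & Bergen (UCSD): GPT-4 was judged human 54% of the time (vs. 67% for human witnesses, 22% for ELIZA). Reported as the first rigorously controlled pass.
2025 multiple replications: Claude / GPT-5 reported to perform routinely at human level.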
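Whether a rate such as 54% is distinguishable from a 50% coin flip depends on how many verdicts were collected, which these notes do not record. The sketch below is a minimal exact binomial test in pure Python; the function name and the sample size of 100 verdicts are assumptions for illustration only, not the studies' actual figures.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    probs = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Hypothetical sample: 100 verdicts per witness type (NOT the real sample sizes).
n_verdicts = 100
for label, judged_human in [("54% judged human", 54), ("22% judged human", 22)]:
    p_value = binomial_two_sided_p(judged_human, n_verdicts)
    verdict = "differs from chance" if p_value < 0.05 else "not distinguishable from chance"
    print(f"{label}: p = {p_value:.4f} ({verdict})")
```

Under this reading, a "pass" means the judges could not do better than guessing, which is a statement about the judges' accuracy as much as about the model.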
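Alternative tests
Coffee test (Wozniak): make coffee in an unfamiliar kitchen; tests embodiment.
Robot college student (Goertzel): enroll in college courses and earn a degree.
Lovelace Test 2.0 (Riedl 2014): create an artifact of a requested type that satisfies constraints chosen by a human evaluator, who then verifies it.
Winograd Schema (Levesque 2011): commonsense reasoning via pronoun disambiguation; designed to resist the tricks that game the Turing Test.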
Applications
AI history teaching.
Philosophy of mind discussion (consciousness, understanding).
Public communication of AI capability ("does AI think?").
Capability evaluation pre-2020 (now obsolete).
💻 Patterns (eval design lessons)
Pattern 1: Modern adversarial Turing protocol
1. Recruit N judges (calibrate by demographic, expertise).
2. Each judge: 5-min interrogation of one witness, randomly assigned as human or AI (50% / 50%).
3. Force a binary verdict (no "unsure").
4. Pass criterion: the AI is judged "human" at a rate ≥ the rate at which control humans are judged "human", minus ε.
5. Pre-register hypotheses, blind judges to study purpose.
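The scoring side of this protocol can be sketched as follows. All names here (`Trial`, `assign_witnesses`, `judged_human_rate`, `passes`) and the default ε of 0.05 are assumptions made for illustration; judge recruitment, blinding, and pre-registration are outside what code can show.

```python
import random
from dataclasses import dataclass

LABELS = ("human", "ai")


@dataclass(frozen=True)
class Trial:
    judge_id: int
    witness: str   # ground truth: "human" or "ai"
    verdict: str   # forced binary verdict: "human" or "ai"

    def __post_init__(self) -> None:
        if self.witness not in LABELS or self.verdict not in LABELS:
            raise ValueError("witness and verdict must be 'human' or 'ai'")


def assign_witnesses(n_judges: int, seed: int = 0) -> list[str]:
    """Balanced random assignment: exactly half the judges face a human, half an AI."""
    if n_judges % 2:
        raise ValueError("use an even number of judges for a 50/50 split")
    slots = list(LABELS) * (n_judges // 2)
    random.Random(seed).shuffle(slots)
    return slots


def judged_human_rate(trials: list[Trial], witness: str) -> float:
    """Share of trials with the given witness type that ended in a 'human' verdict."""
    subset = [t for t in trials if t.witness == witness]
    if not subset:
        raise ValueError(f"no trials with witness={witness!r}")
    return sum(t.verdict == "human" for t in subset) / len(subset)


def passes(trials: list[Trial], epsilon: float = 0.05) -> bool:
    """Step 4: AI 'human' rate must reach the control human rate minus epsilon."""
    return judged_human_rate(trials, "ai") >= judged_human_rate(trials, "human") - epsilon


# Hypothetical verdicts for eight judges.
verdicts = ["human", "human", "ai", "human", "human", "ai", "human", "human"]
trials = [Trial(j, w, v) for j, (w, v) in enumerate(zip(assign_witnesses(8), verdicts))]
print(judged_human_rate(trials, "ai"), judged_human_rate(trials, "human"), passes(trials))
```

Comparing against the control human rate, rather than a fixed 30% or 50%, keeps the criterion meaningful when judges are skeptical and label many real humans as AI.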
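Pattern 2: Lovelace-style creativity protocol
1. Specify a class C of artifacts (e.g., novel valid mathematical proofs).
2. The AI produces an artifact a ∈ C.
3. A human expert verifies that a is valid AND novel.
4. The AI's architect cannot explain how a was produced (the criterion of the original Lovelace Test, Bringsjord et al. 2001).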
→ Tests creativity, not imitation.
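As a minimal sketch, the four conditions above reduce to a single conjunction. The names (`ExpertReview`, `lovelace_pass`) and the toy class-membership predicate are hypothetical; in practice steps 3 and 4 are human judgments that code can only record, not perform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExpertReview:
    valid: bool   # expert confirms the artifact is correct
    novel: bool   # expert confirms the artifact is new


def lovelace_pass(
    artifact: str,
    in_class: Callable[[str], bool],
    review: ExpertReview,
    architect_can_explain: bool,
) -> bool:
    """True only if all four conditions hold: membership, validity, novelty, unexplainability."""
    return (
        in_class(artifact)
        and review.valid
        and review.novel
        and not architect_can_explain
    )


def is_proof(text: str) -> bool:
    """Toy stand-in for class C: any text that is laid out as a proof."""
    return text.startswith("Proof:")


artifact = "Proof: (hypothetical artifact text)"
print(lovelace_pass(artifact, is_proof, ExpertReview(valid=True, novel=True), architect_can_explain=False))
print(lovelace_pass(artifact, is_proof, ExpertReview(valid=True, novel=False), architect_can_explain=False))
```

Because every condition is required, a valid but derivative artifact fails just as an invalid one does.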
Decision criteria
| Purpose | Eval |
| --- | --- |
| Historical / philosophical context | Turing Test |
| Capability measurement | MMLU, GPQA, HumanEval, ARC-AGI |
| Reasoning / novelty | Lovelace 2.0, ARC-AGI |
| Embodiment / general intelligence | Coffee test, robot college |
| Safety / alignment | RealToxicityPrompts, MLCommons AILuminate |
Default: a multi-benchmark suite covering capability and safety. The Turing Test serves as a historical reference only.
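The purpose-to-eval pairs and the default can be written down as a small lookup. This is a sketch only: the short purpose keys and the `recommend` helper are hypothetical names, and the benchmark lists simply mirror the pairs given in these notes.

```python
from typing import Iterable

# Purpose -> evals, mirroring the decision criteria in these notes.
EVALS_BY_PURPOSE: dict[str, list[str]] = {
    "historical": ["Turing Test"],
    "capability": ["MMLU", "GPQA", "HumanEval", "ARC-AGI"],
    "reasoning": ["Lovelace 2.0", "ARC-AGI"],
    "embodiment": ["Coffee test", "Robot college"],
    "safety": ["RealToxicityPrompts", "MLCommons AILuminate"],
}

# Default from the notes: capability plus safety.
DEFAULT_PURPOSES = ("capability", "safety")


def recommend(purposes: Iterable[str] = DEFAULT_PURPOSES) -> list[str]:
    """Return the evals for the given purposes, de-duplicated, in first-seen order."""
    chosen: list[str] = []
    for purpose in purposes:
        for name in EVALS_BY_PURPOSE[purpose]:  # raises KeyError for an unknown purpose
            if name not in chosen:
                chosen.append(name)
    return chosen


print(recommend())                             # default suite
print(recommend(["capability", "reasoning"]))  # ARC-AGI is listed once
```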