---
id: wiki-2026-0508-turing-test
title: Turing Test
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Imitation Game, Turing's Test]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai-history, philosophy, evaluation, agi]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: na
  framework: ai-philosophy
---

# Turing Test

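## One-line summary
> **"If human judges misidentify a machine as human about as often as they misidentify a real person, its conversational behavior is indistinguishable from thinking."** This is the imitation game from Alan Turing's 1950 paper "Computing Machinery and Intelligence". The often-quoted 30% threshold comes from Turing's prediction for the year 2000, not from a pass criterion he defined. Controlled studies in 2024-25 reported frontier LLMs such as GPT-4 being judged human at or above chance (Jones & Bergen 2024, UCSD). As of 2026 the Turing Test is obsolete as a capability-measurement tool: the Chinese Room critique undercuts its philosophical force, and behavioral benchmarks and capability evaluations have replaced it in practice.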

## Core concepts

### Original imitation game (Turing 1950)
- Three players: a man (A), a woman (B), and an interrogator (C).
- C puts written questions to A and B and must determine which is the man and which is the woman.
- A's task is to deceive C; B's task is to help C.
- Turing's substitution: replace A with a machine. Does C decide wrongly as often as when A is a man?

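### Common misconception (pop interpretation)
- Pop version: "a machine fools a human into thinking it is human."
- Original: compare how often the interrogator errs when a machine plays A with how often the interrogator errs when a man plays A (pretending to be the woman).
- Turing's prediction: by about 2000, an average interrogator would have no more than a 70% chance of making the right identification after five minutes of questioning, i.e. machines would fool judges roughly 30% of the time.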
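
The gap between the two readings can be made concrete with a small sketch. The function names, the `tolerance` parameter, and all counts below are illustrative assumptions, not anything Turing specified; the 0.30 default only mirrors the prediction quoted above.

```python
def pop_criterion(judged_human: int, trials: int, threshold: float = 0.30) -> bool:
    """Pop reading: the machine 'passes' if it is judged human often enough."""
    return judged_human / trials >= threshold


def original_criterion(
    wrong_with_machine: int,
    trials_machine: int,
    wrong_with_man: int,
    trials_man: int,
    tolerance: float = 0.05,  # assumed slack; the 1950 paper gives no number
) -> bool:
    """Original reading: does the interrogator err about as often when a
    machine plays A as when a man plays A?"""
    machine_error = wrong_with_machine / trials_machine
    man_error = wrong_with_man / trials_man
    return machine_error >= man_error - tolerance


# Hypothetical counts: the interrogator errs in 33 of 100 games against the
# machine, but in 45 of 100 games against a man playing A.
passes_pop = pop_criterion(33, 100)
passes_original = original_criterion(33, 100, 45, 100)
print(passes_pop, passes_original)
```

With these made-up numbers the machine clears the 30% bar yet still falls short of the human baseline, which is the distinction the pop version loses.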

### Critiques
1. **Chinese Room (Searle 1980)**: passing the test is not evidence of understanding; symbol manipulation ≠ semantics.
2. **Imitation ≠ intelligence**: deceiving a human is a narrow task; mathematical reasoning, embodiment, and learning go unmeasured.
3. **Anthropocentric**: assumes human-likeness is the sole criterion of intelligence.
4. **Gameable**: can be passed with tricks (typos, refusing to answer, emotion mimicry).
5. **Judge calibration**: results differ wildly between naive and expert judges.

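### Modern empirical results
- **2014 "Eugene Goostman"**: convinced 33% of judges at a Royal Society event. The 13-year-old Ukrainian persona lowered judges' expectations, which made the claimed pass controversial.
- **2023 Jannai et al.** (AI21): in 2-minute chats, participants paired with an AI bot identified it correctly only about 60% of the time, i.e. the bots fooled them in roughly 40% of games.
- **2024 Jones & Bergen** (UCSD): GPT-4 was judged human 54% of the time (vs. 67% for real humans and 22% for ELIZA). Reported as the first rigorously controlled pass.
- **2025 multiple replications**: Claude and GPT-5 are reported to perform routinely at human level.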

### Alternatives (post-Turing era)
1. **Capability benchmarks**: MMLU, HumanEval, GPQA, ARC-AGI, SWE-bench.
2. **Coffee test** (Wozniak): make coffee in an unfamiliar kitchen; tests embodiment.
3. **Robot college student** (Goertzel): enroll in college courses and earn a degree.
4. **Lovelace Test 2.0** (Riedl): create an artifact of a requested type that meets constraints chosen by a human evaluator, who then judges the result.
5. **Winograd Schema** (Levesque 2011): commonsense reasoning via pronoun disambiguation, designed to resist the conversational tricks that game the Turing Test.

### Applications
1. AI history teaching.
2. Philosophy of mind discussion (consciousness, understanding).
3. Public communication of AI capability ("does AI think?").
4. Capability evaluation pre-2020 (now obsolete).

## 💻 Patterns (eval design lessons)

### Pattern 1: Modern adversarial Turing protocol
```text
1. Recruit N judges (calibrate by demographic, expertise).
2. Each judge: 5-min interrogation, 50% human / 50% AI random.
3. Force binary verdict (no "unsure").
4. Pass criterion: AI verdict = "human" at rate ≥ control human rate − ε.
5. Pre-register hypotheses, blind judges to study purpose.
```
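
Step 4 can be read as a non-inferiority check. The sketch below is one possible way to operationalize it, assuming a one-sided Wald interval on the difference of two proportions; the margin `epsilon`, the critical value `z`, the function name, and the counts are all assumptions made for illustration rather than part of any published protocol.

```python
import math


def noninferiority_pass(
    ai_judged_human: int,
    ai_trials: int,
    human_judged_human: int,
    human_trials: int,
    epsilon: float = 0.10,  # assumed non-inferiority margin
    z: float = 1.645,       # one-sided 95% normal critical value
) -> tuple[bool, float, float]:
    """Return (passed, observed difference, lower confidence bound)."""
    p_ai = ai_judged_human / ai_trials
    p_human = human_judged_human / human_trials
    diff = p_ai - p_human
    # Wald standard error for the difference of two independent proportions.
    se = math.sqrt(
        p_ai * (1 - p_ai) / ai_trials
        + p_human * (1 - p_human) / human_trials
    )
    lower = diff - z * se
    # Pass only if even the pessimistic estimate stays within the margin.
    return lower >= -epsilon, diff, lower


# Hypothetical counts, not taken from any study.
small_study = noninferiority_pass(120, 200, 140, 200)
large_study = noninferiority_pass(690, 1000, 700, 1000)
print(small_study[0], large_study[0])
```

Requiring the lower bound, rather than the point estimate, to clear the margin keeps a small or noisy study from counting as a pass by luck.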

### Pattern 2: Why public Turing demos mislead
```text
- Cherry-picked transcripts.
- Naive judges (not interrogating adversarially).
- Persona tricks (child, non-native speaker, tired, distracted).
- Self-selection bias (only impressive runs shown).
```

### Pattern 3: Capability-first eval (modern replacement)
```text
benchmarks = [
    "MMLU",        # broad knowledge
    "HumanEval",   # code generation
    "GPQA",        # graduate-level science
    "ARC-AGI",     # abstract reasoning
    "SWE-bench",   # real software engineering
    "HLE",         # Humanity's Last Exam (2025)
]
# Pass = top-percentile human expert performance per task.
```
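
A minimal sketch of how the "pass" rule in the comment above might be applied per benchmark. The `capability_report` helper, the expert baselines, and the model scores are hypothetical placeholders, not published results.

```python
def capability_report(
    model_scores: dict[str, float],
    expert_baselines: dict[str, float],
) -> dict[str, bool]:
    """Mark each benchmark as passed if the model meets the expert baseline.

    A benchmark with no reported model score counts as not passed.
    """
    report = {}
    for name, baseline in expert_baselines.items():
        score = model_scores.get(name)
        report[name] = score is not None and score >= baseline
    return report


# Hypothetical numbers on a 0-100 scale, for illustration only.
expert_baselines = {"MMLU": 90.0, "GPQA": 70.0, "SWE-bench": 80.0}
model_scores = {"MMLU": 91.5, "GPQA": 64.0}

report = capability_report(model_scores, expert_baselines)
overall_pass = all(report.values())
print(report, overall_pass)
```

Reporting per-benchmark results instead of one averaged number keeps a strong score on one task from hiding a gap on another.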

### Pattern 4: Behavioral safety eval (orthogonal to Turing)
```text
- Refusal rate on harmful prompts.
- Calibration (uncertainty matches accuracy).
- Sycophancy (agree-with-user metric).
- Honesty (TruthfulQA, FactScore).
```
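
For the calibration item, one common summary statistic is binned expected calibration error (ECE). The sketch below is a plain-Python version under the assumption that each answer comes with a stated confidence in [0, 1]; the function name, bin count, and sample data are illustrative.

```python
def expected_calibration_error(
    confidences: list[float],
    correct: list[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model claims 90% confidence but is right 3 times in 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

A value near zero means stated confidence tracks accuracy; here the overconfident answers produce a gap of about 0.15.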

### Pattern 5: Lovelace 2.0 framework
```text
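1. Human evaluator h picks an artifact type t (e.g., story, poem, image) and a set of constraints C stated in natural language.
2. AI agent produces an artifact o of type t.
3. h judges whether o is a valid instance of t and satisfies every constraint in C.
4. A human referee confirms that the (t, C) combination is not unrealistic for an average human.
→ Tests creativity under constraints, not imitation.
   (The "architect cannot explain how the artifact was produced" condition belongs to the original 2001 Lovelace Test.)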
```

## Decision criteria
| Purpose | Eval |
|---|---|
| Historical / philosophical context | Turing Test |
| Capability measurement | MMLU, GPQA, HumanEval, ARC-AGI |
| Reasoning / novelty | Lovelace 2.0, ARC-AGI |
| Embodiment / general intelligence | Coffee test, robot college |
| Safety / alignment | RealToxicityPrompts, MLCommons AILuminate |

**Default**: a multi-benchmark suite covering capability and safety. Use the Turing Test as a historical reference only.

## 🔗 Graph
- Parent: [[Philosophy of AI]]
- Variant: [[Imitation Game]]

## 🤖 LLM usage
**When to use**: AI history, philosophy-of-mind discussion, public communication.
**When not to use**: actual capability measurement (use modern benchmarks instead).

## ❌ Anti-patterns
- **"GPT passed Turing → AGI"**: imitation ≠ general intelligence; capability gaps remain.
- **Naive judge eval**: verdicts from untrained users carry systematic bias.
- **Single-conversation pass**: a 5-minute snapshot does not measure long-horizon coherence.
- **Persona escape hatch**: using a persona such as "I'm a tired teenager" to excuse weaknesses.
- **Conflating with consciousness**: the Turing Test measures behavior; it is not evidence of consciousness.

## 🧪 Verification / duplicates
- Verified (Turing 1950, "Computing Machinery and Intelligence", Mind 59; Searle 1980, "Minds, Brains, and Programs"; Jones & Bergen 2024, arXiv 2405.08007; Riedl 2014, Lovelace 2.0).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Turing Test history + 2024 Jones-Bergen pass + modern alternatives |