---
id: wiki-2026-0508-google-code-jam-dataset
title: Google Code Jam Dataset
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GCJ Dataset, Code Jam Solutions Corpus, GCJ-297]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [dataset, code-llm, benchmark, programming-competition, deduplication]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: huggingface-datasets
---

# Google Code Jam Dataset

## One-line summary
> **"The historical archive of Google Code Jam: a standard corpus for code clone detection and code LLM evaluation."** Google's annual programming competition (2003-2023) has been retired, but its solution corpus remains rich material for academic work. Many solutions per problem, written in a variety of languages, serve as raw material for code clone, code translation, and code-LM benchmarks. The most frequently cited variant is GCJ-297 (Bui et al.), with 297 problems across multiple languages.

## Key points

### What makes the dataset distinctive
- **Same-intent, varied implementations**: a single problem can have thousands of correct solutions, so semantic equivalence is available as ground truth.
- **Multi-language**: C++, Java, Python, Go, Kotlin, …
- **Difficulty stratification**: Qualification → Round 1/2/3 → World Finals.
- **Test cases**: official input/output is only partially public (samples only); the full test sets are hidden.

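### Main variants
1. **GCJ-297** (Bui et al. 2017): 297 problems, ~120k solutions; a code clone benchmark.
2. **CodeNet** (IBM 2021): ~14M solutions, 4,053 problems, 55 languages. The published corpus is drawn from AIZU Online Judge and AtCoder, so confirm that the copy you load actually contains GCJ-origin rows before treating it as a superset.
3. **MBXP / HumanEval-X**: not GCJ-derived, but benchmarks that GCJ is commonly compared against.
4. **APPS**: problems collected from open-access competitive-programming sites such as Codeforces, Kattis, and AtCoder; an LLM coding benchmark.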

### Use cases
- **Code clone detection**: ground truth for Type-1/2/3/4 clones.
- **Code LLM eval**: contamination risk is high, because Code Jam solutions are publicly indexed on GitHub.
- **Translation**: Java solution → Python solution for the same problem.
- **Style transfer**: verbose vs. idiomatic solutions to the same problem.
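
The clone types mentioned above can be made concrete with a small token-level check. The following is a minimal sketch for Python sources only, assuming the usual taxonomy in which Type-1 clones differ only in comments and layout and Type-2 clones additionally differ in identifier names. The helper names `normalize` and `clone_type` are hypothetical, and literal differences (which some definitions also allow for Type-2) are not handled.

```python
import io
import keyword
import tokenize

# Token types whose text is layout-dependent; they are replaced by a fixed marker.
_STRUCTURAL = {
    tokenize.NEWLINE: "<NEWLINE>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
}
# Token types that carry no program meaning and are dropped entirely.
_DROPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def normalize(code: str, blind_identifiers: bool = False) -> list[str]:
    """Return the token stream with comments and layout differences removed.

    With blind_identifiers=True, every non-keyword name becomes "ID".
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in _DROPPED:
            continue
        if tok.type in _STRUCTURAL:
            out.append(_STRUCTURAL[tok.type])
        elif (blind_identifiers and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            out.append("ID")
        else:
            out.append(tok.string)
    return out


def clone_type(a: str, b: str) -> str:
    """Classify a pair as type-1, type-2, or anything weaker."""
    if normalize(a) == normalize(b):
        return "type-1"
    if normalize(a, True) == normalize(b, True):
        return "type-2"
    return "type-3/4 or unrelated"
```

Type-3 and Type-4 clones (edited or purely semantic equivalents) cannot be detected this way, which is why same-problem accepted solutions are useful as labels for them.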

## 💻 Patterns

### Loading via Hugging Face
```python
from datasets import load_dataset

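# CodeNet-style corpus. The Hub id and the column names used below are as
# recorded in this note; confirm them on the Hugging Face Hub before use.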
ds = load_dataset("Project-CodeNet/codenet", split="train", streaming=True)
for ex in ds.take(3):
    print(ex["problem_id"], ex["language"], ex["status"], len(ex["code"]))
```

### Filter for GCJ subset only
```python
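# Assumes a "dataset_origin" column with a "google_code_jam" value;
# check the schema of the copy you loaded, since it may not carry this field.
gcj = ds.filter(lambda x: x["dataset_origin"] == "google_code_jam")
print(gcj.info.splits)  # split metadata, if the Hub provides it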
```

### Group solutions by problem_id (clone-detection setup)
```python
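import itertools
from collections import defaultdict

buckets = defaultdict(list)
for ex in gcj:
    # Keep only accepted solutions so every bucket is semantically equivalent
    if ex["status"] == "Accepted":
        buckets[ex["problem_id"]].append(ex)

# Pair within bucket = positive (clone), across bucket = negative.
# Pair counts grow quadratically per problem; cap bucket size if needed.
positive_pairs = [(a, b) for sols in buckets.values()
                  for a, b in itertools.combinations(sols, 2)]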
```
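
The grouping code mentions cross-bucket negatives but does not build them. Enumerating every cross-problem pair is quadratic in the corpus size, so one option is to sample them. The sketch below assumes `buckets` maps a problem id to a list of solution records, as in the grouping code; the function name `sample_negative_pairs` and its parameters are hypothetical.

```python
import random


def sample_negative_pairs(buckets, n, seed=0):
    """Sample n (solution, solution) pairs drawn from two different problems."""
    rng = random.Random(seed)  # local RNG keeps the sample reproducible
    pids = [pid for pid, sols in buckets.items() if sols]
    if len(pids) < 2:
        return []  # negatives need at least two non-empty problems
    pairs = []
    for _ in range(n):
        # rng.sample returns two distinct problem ids
        pid_a, pid_b = rng.sample(pids, 2)
        pairs.append((rng.choice(buckets[pid_a]), rng.choice(buckets[pid_b])))
    return pairs
```

Sampling roughly as many negatives as positives keeps the classification task balanced; the right ratio depends on the benchmark being reproduced.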

### Decontamination check (LLM training data)
```python
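import hashlib

def near_dup_hash(code: str, k: int = 5) -> set[int]:
    """Hash every k-token shingle. hashlib keeps values stable across runs,
    unlike the built-in hash(), which is salted per process for strings."""
    tokens = code.split()
    if not tokens:
        return set()
    # "+ 1" includes the final shingle; max(1, ...) covers snippets shorter than k
    return {
        int.from_bytes(hashlib.blake2b(" ".join(tokens[i:i + k]).encode(), digest_size=8).digest(), "big")
        for i in range(max(1, len(tokens) - k + 1))
    }

# train_corpus and gcj_eval are placeholders: lists of records with a "code" field
train_hashes = set()
for ex in train_corpus:
    train_hashes |= near_dup_hash(ex["code"])

def overlap(code: str) -> float:
    hashes = near_dup_hash(code)  # computed once per example
    return len(hashes & train_hashes) / max(1, len(hashes))

contaminated = [ex for ex in gcj_eval if overlap(ex["code"]) > 0.5]
print(f"contamination ratio: {len(contaminated) / max(1, len(gcj_eval)):.2%}")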
```

### Compile + run sandbox (judging on test cases)
```python
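import subprocess, tempfile, pathlib

# NOTE: this executes untrusted code directly; run it inside a container or VM.
def judge(code: str, lang: str, stdin: str, expected: str, timeout=5):
    with tempfile.TemporaryDirectory() as d:
        p = pathlib.Path(d) / ("sol." + {"python": "py", "cpp": "cpp"}[lang])
        p.write_text(code)
        try:
            if lang == "cpp":
                # A compile error or compile timeout counts as a failed submission
                subprocess.run(["g++", "-O2", "-std=c++20", str(p), "-o", f"{d}/a"],
                               check=True, capture_output=True, timeout=60)
                cmd = [f"{d}/a"]
            else:
                cmd = ["python3", str(p)]
            r = subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False
        # A non-zero exit code fails even if the partial output happens to match
        return r.returncode == 0 and r.stdout.strip() == expected.strip()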
```
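
The exact string comparison in the judge rejects answers that differ only in whitespace or in the trailing digits of a floating-point value. A more tolerant comparison is sketched below. The function name `outputs_match` and the default tolerance of `1e-6` are assumptions; use the tolerance stated in each problem's own statement.

```python
import math


def outputs_match(actual: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare outputs token by token, allowing small numeric differences."""
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        if a == e:
            continue  # identical text, numeric or not
        try:
            fa, fe = float(a), float(e)
        except ValueError:
            return False  # differing non-numeric tokens never match
        # Accept if either the relative or the absolute error is within tol
        if not math.isclose(fa, fe, rel_tol=tol, abs_tol=tol):
            return False
    return True
```

Swapping this in for the final equality check keeps integer and text answers strict while accepting float answers within the tolerance.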

### Train/eval split for code translation
```python
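import itertools
import random

random.seed(0)
problems = list(buckets.keys())
random.shuffle(problems)
train_pids = set(problems[:int(0.9 * len(problems))])

# Split by problem, not by pair, so no problem leaks across train and eval
train_pairs, eval_pairs = [], []  # avoids shadowing the built-in eval()
for pid, sols in buckets.items():
    # Language labels are assumed to be lowercase; adjust to the corpus metadata
    java = [s for s in sols if s["language"] == "java"]
    py   = [s for s in sols if s["language"] == "python"]
    pairs = itertools.product(java, py)
    (train_pairs if pid in train_pids else eval_pairs).extend(
        {"src": j["code"], "tgt": p["code"]} for j, p in pairs
    )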
```

## Decision criteria
| Situation | Approach |
|---|---|
| Code clone benchmark | GCJ-297 (Bui et al.) |
| LLM coding eval | APPS or HumanEval (less contaminated) |
| Code translation | CodeNet pair-wise |
| Style benchmark | GCJ multi-solution per problem |
| Live evaluation | NEVER use GCJ alone (contamination) |

**Default**: for LLM evaluation, use APPS/HumanEval as the main benchmark and GCJ as a supplement.

## 🔗 Graph
- Variant: [[HumanEval]]

## 🤖 LLM usage
**When**: writing dataset filter pipelines, designing contamination checks, and building problem-grouping logic.
**When not**: evaluating the LLM itself, since GCJ is very likely already in its training data (contamination).

## ❌ Anti-patterns
- **GCJ for SOTA LLM eval without dedup**: contamination inflates scores.
- **Using only sample I/O**: wrong answers can still pass the sample cases.
- **No timeout in judging**: an infinite loop causes a hang or OOM.
- **Mixing accepted + WA**: degrades the accuracy of the ground truth.
- **Ignoring problem difficulty**: stratified evaluation is required.
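
For the last point, a stratified report can be as simple as a pass rate per round. The sketch below assumes each evaluation result is available as a `(round_name, passed)` tuple; the function name `stratified_pass_rate` and the round labels are hypothetical.

```python
from collections import defaultdict


def stratified_pass_rate(results):
    """Return {round_name: fraction passed} from (round_name, passed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # round -> [passed count, total count]
    for round_name, passed in results:
        totals[round_name][0] += int(passed)
        totals[round_name][1] += 1
    return {name: p / n for name, (p, n) in totals.items()}


if __name__ == "__main__":
    demo = [
        ("qualification", True),
        ("qualification", True),
        ("round1", True),
        ("round1", False),
        ("finals", False),
    ]
    for name, rate in stratified_pass_rate(demo).items():
        print(f"{name}: {rate:.0%}")
```

Reporting the per-round rates next to the overall score makes it visible when a model only solves Qualification-level problems.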

## 🧪 Verification / duplicates
- Verified (Bui et al. ICSE 2017, IBM Project CodeNet 2021, Hugging Face Hub).
- Trust level B (semi-public, scraped).

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GCJ corpus + CodeNet usage + decontamination |