---
id: wiki-2026-0508-google-code-jam-dataset
title: Google Code Jam Dataset
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GCJ Dataset, Code Jam Solutions Corpus, GCJ-297]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [dataset, code-llm, benchmark, programming-competition, deduplication]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: huggingface-datasets
---

# Google Code Jam Dataset

## One-line summary

> **"The historical archive of Google Code Jam: a standard corpus for code clone detection and code LLM evaluation."** Google's annual programming competition (2003-2023) has been retired, but its solution corpus remains academically rich. Multiple solutions per problem, written in many languages, make it raw material for code clone detection, code translation, and code-LM benchmarks. The most frequently cited variant is GCJ-297 (Bui et al.), which covers 297 problems across multiple languages.

## Core

### What makes the dataset distinctive

- **Same-intent, varied implementations**: a single problem has thousands of correct solutions, so semantic equivalence serves as the ground truth.
- **Multi-language**: C++, Java, Python, Go, Kotlin, …
- **Difficulty stratification**: Qualification → Round 1/2/3 → World Finals.
- **Test cases**: official input/output is only partially public (samples only); the full test sets are hidden.

### Main variants

1. **GCJ-297** (Bui et al. 2017): 297 problems, ~120k solutions, code clone benchmark.
2. **CodeNet** (IBM 2021): ~14M solutions, 4,053 problems, 55 languages. It is drawn from AIZU Online Judge and AtCoder rather than from GCJ, but it is a much larger corpus with the same problem-grouped structure.
3. **MBXP / HumanEval-X**: not GCJ-derived, but benchmarks it is commonly compared against.
4. **APPS**: a mix of Codeforces, AtCoder, Kattis, and Codewars problems; an LLM coding benchmark.

### Use cases

- **Code clone detection**: ground truth for Type-1/2/3/4 clones.
- **Code LLM eval**: contamination risk is high, because Code Jam solutions are publicly indexed on GitHub.
- **Translation**: Java solution → Python solution.
- **Style transfer**: verbose vs. idiomatic.

## 💻 Patterns

### Loading via Hugging Face

```python
from datasets import load_dataset

# Project CodeNet, streamed so nothing is downloaded up front.
# The Hub repository id and column names are the ones recorded in this note; verify them before use.
ds = load_dataset("Project-CodeNet/codenet", split="train", streaming=True)
for ex in ds.take(3):
    print(ex["problem_id"], ex["language"], ex["status"], len(ex["code"]))
```

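### Filter for GCJ subset only

```python
# Assumes each row carries a `dataset_origin` provenance column, as recorded in this note.
# `ds` must be a corpus that actually contains Code Jam rows for this to match anything.
gcj = ds.filter(lambda x: x["dataset_origin"] == "google_code_jam")
print(gcj.info.splits)
```
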
### Group solutions by problem_id (clone-detection setup)

```python
import itertools
from collections import defaultdict

# Bucket accepted solutions by problem; each bucket is one semantic-equivalence class.
buckets = defaultdict(list)
for ex in gcj:
    if ex["status"] == "Accepted":
        buckets[ex["problem_id"]].append(ex)

# Pair within bucket = positive (clone), across bucket = negative
positive_pairs = [(a, b) for sols in buckets.values()
                  for a, b in itertools.combinations(sols, 2)]
```
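
The grouping above only builds positive pairs. A minimal sketch of drawing the across-bucket negatives follows; the helper name `sample_negative_pairs`, its parameters, and the toy records are hypothetical and not part of any GCJ tooling.

```python
import random

def sample_negative_pairs(buckets: dict, n: int, seed: int = 0) -> list[tuple]:
    """Draw n pairs whose two solutions come from different problems."""
    rng = random.Random(seed)
    # Ignore empty buckets so rng.choice never sees an empty list.
    pids = [pid for pid, sols in buckets.items() if sols]
    if len(pids) < 2:
        return []
    pairs = []
    for _ in range(n):
        pid_a, pid_b = rng.sample(pids, 2)  # two distinct problem ids
        pairs.append((rng.choice(buckets[pid_a]), rng.choice(buckets[pid_b])))
    return pairs

# Toy records for illustration only.
toy_buckets = {
    "p1": [{"problem_id": "p1", "code": "a"}, {"problem_id": "p1", "code": "b"}],
    "p2": [{"problem_id": "p2", "code": "c"}],
    "p3": [{"problem_id": "p3", "code": "d"}],
}
negatives = sample_negative_pairs(toy_buckets, n=4)
```

Sampling instead of enumerating keeps the negative set at a chosen size. The same pair can be drawn more than once, so deduplicate if that matters for the benchmark.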
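
### Decontamination check (LLM training data)

```python
import hashlib

def _stable_hash(text: str) -> int:
    # hashlib is stable across processes; built-in hash() on str is salted per run.
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")

def near_dup_hash(code: str, k: int = 5) -> set[int]:
    """Hash every k-token shingle of whitespace-split code."""
    tokens = code.split()
    # +1 so the final shingle is included.
    return {_stable_hash(" ".join(tokens[i:i + k])) for i in range(len(tokens) - k + 1)}

# `train_corpus` and `gcj_eval` are placeholders for your own lists of {"code": ...} records.
train_hashes = set()
for ex in train_corpus:
    train_hashes |= near_dup_hash(ex["code"])

def overlap(code: str) -> float:
    """Fraction of a sample's shingles that also occur in the training corpus."""
    h = near_dup_hash(code)
    return len(h & train_hashes) / max(1, len(h))

contaminated = [ex for ex in gcj_eval if overlap(ex["code"]) > 0.5]
print(f"contamination ratio: {len(contaminated) / max(1, len(gcj_eval)):.2%}")
```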
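
To make the overlap ratio concrete, here is a small worked example on two toy snippets that differ only in a loop variable name. It is a sketch, not the check above: `shingles` and `containment` are hypothetical helpers, it compares raw token tuples instead of hashes, and it uses `k=3` so the arithmetic stays readable.

```python
def shingles(code: str, k: int = 3) -> set[tuple[str, ...]]:
    """All k-token windows of whitespace-split code."""
    tokens = code.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def containment(sample: str, corpus: str, k: int = 3) -> float:
    """Share of the sample's shingles that also appear in the corpus text."""
    s = shingles(sample, k)
    return len(s & shingles(corpus, k)) / max(1, len(s))

original = "for i in range ( n ) : total += a [ i ]"
renamed = "for j in range ( n ) : total += a [ j ]"

# 14 tokens give 12 shingles; the rename touches 4 of them, leaving 8 shared.
score = containment(renamed, original)
```

Here `score` is 8/12, about 0.67. A single rename lowers the overlap but still clears a 0.5 threshold. Heavier rewrites would not, which is one reason to normalize identifiers before shingling.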

### Compile + run sandbox (judging on test cases)

```python
import pathlib
import subprocess
import tempfile

# Note: a temp directory is not a security sandbox; run untrusted code in a container or VM.
def judge(code: str, lang: str, stdin: str, expected: str, timeout=5):
    """Return True if the solution prints `expected` for `stdin` within `timeout` seconds."""
    with tempfile.TemporaryDirectory() as d:
        p = pathlib.Path(d) / ("sol." + {"python": "py", "cpp": "cpp"}[lang])
        p.write_text(code)
        if lang == "cpp":
            # A compile error counts as a failed submission instead of crashing the judge.
            build = subprocess.run(["g++", "-O2", "-std=c++20", str(p), "-o", f"{d}/a"],
                                   capture_output=True)
            if build.returncode != 0:
                return False
            cmd = [f"{d}/a"]
        else:
            cmd = ["python3", str(p)]
        try:
            r = subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=timeout)
            return r.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            return False
```
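
The exact string comparison in `judge` rejects answers that differ only in spacing or in the last digits of a floating-point value. A minimal sketch of a more tolerant comparison for `Case #x: y` style output follows; the function name `outputs_match` and the `1e-6` tolerance are assumptions, so use whatever each problem statement specifies.

```python
import math

def outputs_match(actual: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare output token by token, allowing a small numeric error."""
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        if a == e:
            continue
        try:
            # Numeric tokens may differ within the tolerance.
            if not math.isclose(float(a), float(e), rel_tol=tol, abs_tol=tol):
                return False
        except ValueError:
            return False  # non-numeric tokens must match exactly
    return True

ok = outputs_match("Case #1: 0.3333333\n", "Case #1: 0.33333334")
```

Splitting on whitespace also makes trailing newlines and repeated spaces irrelevant, which a plain `strip()` only handles at the ends.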

### Train/eval split for code translation

```python
import itertools
import random

random.seed(0)
problems = list(buckets.keys())
random.shuffle(problems)
# Split by problem id so no problem appears on both sides.
train_pids = set(problems[:int(0.9 * len(problems))])

train, eval_set = [], []  # `eval_set` avoids shadowing the built-in eval()
for pid, sols in buckets.items():
    java = [s for s in sols if s["language"] == "java"]
    py = [s for s in sols if s["language"] == "python"]
    pairs = list(itertools.product(java, py))
    (train if pid in train_pids else eval_set).extend(
        {"src": j["code"], "tgt": p["code"]} for j, p in pairs
    )
```
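
With thousands of solutions per problem, the full Java × Python product can dominate the split with a few popular problems. One possible mitigation, sketched under the assumption that a fixed per-problem cap is acceptable, is to sample index pairs instead of materializing the product; `capped_pairs` and its parameters are hypothetical.

```python
import random

def capped_pairs(src: list, tgt: list, limit: int, seed: int = 0) -> list[tuple]:
    """Return at most `limit` distinct (src, tgt) pairs without building the full product."""
    total = len(src) * len(tgt)
    rng = random.Random(seed)
    # Each integer in range(total) encodes one (src index, tgt index) combination.
    picks = rng.sample(range(total), min(limit, total))
    return [(src[i // len(tgt)], tgt[i % len(tgt)]) for i in picks]

pairs = capped_pairs(["j1", "j2", "j3"], ["p1", "p2"], limit=4)
```

Sampling from `range(total)` needs no extra memory for the product itself, and every returned pair is distinct because the sampled integers are distinct.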

## Decision criteria

| Situation | Approach |
|---|---|
| Code clone benchmark | GCJ-297 (Bui et al.) |
| LLM coding eval | APPS or HumanEval (less contaminated) |
| Code translation | CodeNet pair-wise |
| Style benchmark | GCJ multi-solution per problem |
| Live evaluation | NEVER use GCJ alone (contamination) |

**Default**: for LLM eval, use APPS/HumanEval as the main benchmark and GCJ as a supplement.

## 🔗 Graph

- Variant: [[HumanEval]]

## 🤖 LLM usage

**When to use**: writing dataset filter pipelines, designing contamination checks, problem-grouping logic.
**When not to use**: evaluating the LLM itself, since GCJ is very likely part of its training data (contamination).

## ❌ Anti-patterns

- **GCJ for SOTA LLM eval without dedup**: contamination inflates scores.
- **Using sample I/O only**: wrong-answer solutions can still pass the sample cases.
- **No timeout in judging**: an infinite loop leads to OOM or a hang.
- **Mixing accepted + WA**: degrades the accuracy of the ground truth.
- **Ignoring problem difficulty**: stratified eval is required.
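
For the last point, a minimal sketch of stratified reporting: pass rates are aggregated per contest round instead of as one global number. The record layout (`round`, `passed`) and the toy values are assumptions for illustration and not measured results.

```python
from collections import defaultdict

def pass_rate_by_round(results: list[dict]) -> dict[str, float]:
    """Aggregate pass/fail outcomes separately for each contest round."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["round"]] += 1
        passed[r["round"]] += int(r["passed"])
    return {rnd: passed[rnd] / total[rnd] for rnd in total}

# Toy outcomes, one record per judged problem.
results = [
    {"round": "Qualification", "passed": True},
    {"round": "Qualification", "passed": True},
    {"round": "Round 1", "passed": True},
    {"round": "Round 1", "passed": False},
    {"round": "World Finals", "passed": False},
]
rates = pass_rate_by_round(results)
```

On this toy input the overall pass rate is 60%, which hides that the only World Finals problem failed. The per-round breakdown makes that visible.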

## 🧪 Verification / duplicates

- Verified (Bui et al. ICSE 2017, IBM Project CodeNet 2021, Hugging Face Hub).
- Trust level B (semi-public, scraped).

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GCJ corpus + CodeNet usage + decontamination |