---
id: wiki-2026-0508-google-code-jam-dataset
title: Google Code Jam Dataset
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GCJ Dataset, Code Jam Solutions Corpus, GCJ-297]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [dataset, code-llm, benchmark, programming-competition, deduplication]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: huggingface-datasets
---

# Google Code Jam Dataset

## One-line summary

> **"The historical archive of Google Code Jam: a standard corpus for code clone detection and code LLM evaluation."** Google's annual programming competition (2003-2023) has been retired, but its solution corpus remains rich material for research: many solutions per problem, in many languages, serve as raw material for code clone detection, code translation, and code-LM benchmarks. The most frequently cited variant is GCJ-297 (Bui et al.), with 297 problems across multiple languages.

## Key points

### What makes the dataset distinctive

- **Same-intent, varied implementations**: a single problem has thousands of correct solutions, so semantic equivalence serves as ground truth.
- **Multi-language**: C++, Java, Python, Go, Kotlin, …
- **Difficulty stratification**: Qualification → Round 1/2/3 → World Finals.
- **Test cases**: official input/output is only partially public (samples only); the full sets are hidden.

### Main variants

1. **GCJ-297** (Bui et al. 2017): 297 problems, ~120k solutions, code clone benchmark.
2. **CodeNet** (IBM 2021): 14M solutions, 4053 problems, 55 languages. It is drawn from AIZU Online Judge and AtCoder rather than from GCJ, so treat it as a comparable many-solutions-per-problem corpus, not a GCJ superset.
3. **MBXP / HumanEval-X**: not GCJ-derived, but benchmarks that GCJ is commonly compared against.
4. **APPS**: a mix of competitive-programming sites (Codeforces, AtCoder, Kattis, Codewars); an LLM coding benchmark.

### Use cases

- **Code clone detection**: ground truth for Type-1/2/3/4 clones.
- **Code LLM eval**: contamination risk is high, because Code Jam solutions are publicly indexed on GitHub.
- **Translation**: Java solution → Python solution.
- **Style transfer**: verbose vs. idiomatic.
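
The Type-1/Type-2 distinction in the clone-detection use case can be illustrated with a minimal sketch for Python solutions. It assumes Type-1 means "identical up to comments and layout" and Type-2 means "identical once identifiers and literals are abstracted"; the helper names `normalized_tokens` and `clone_type` are hypothetical, and Type-3/4 clones need semantic methods that this sketch does not attempt.

```python
import io
import keyword
import tokenize


def normalized_tokens(source: str, abstract_names: bool) -> list[str]:
    """Token stream with comments and layout removed.

    With abstract_names=True, identifiers and literals are replaced by
    placeholders, which is the usual Type-2 normalization.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never affect clone type
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            out.append(tokenize.tok_name[tok.type])  # keep structure, drop exact whitespace
        elif abstract_names and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif abstract_names and tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif abstract_names and tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)
    return out


def clone_type(a: str, b: str) -> str:
    """Classify a pair as 'type-1', 'type-2', or 'other' (Type-3/4 or unrelated)."""
    if normalized_tokens(a, False) == normalized_tokens(b, False):
        return "type-1"
    if normalized_tokens(a, True) == normalized_tokens(b, True):
        return "type-2"
    return "other"
```

Pairs that fall into `other` but come from the same problem are the interesting Type-3/4 candidates, since the shared problem ID already guarantees the same intent.
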
## 💻 Patterns

### Loading via Hugging Face

```python
from datasets import load_dataset

# The dataset ID and the field names below come from this note as-is and are
# unverified; check the dataset card on the Hub before relying on them.
ds = load_dataset("Project-CodeNet/codenet", split="train", streaming=True)

for ex in ds.take(3):
    print(ex["problem_id"], ex["language"], ex["status"], len(ex["code"]))
```

### Filter for GCJ subset only

```python
# `dataset_origin` is an assumed field; .get() avoids a KeyError if it is absent.
gcj = ds.filter(lambda x: x.get("dataset_origin") == "google_code_jam")

# A streaming dataset has no precomputed split sizes, so peek at one example instead.
print(next(iter(gcj), None))
```

### Group solutions by problem_id (clone-detection setup)

```python
import itertools
from collections import defaultdict

buckets = defaultdict(list)
for ex in gcj:
    if ex["status"] == "Accepted":
        buckets[ex["problem_id"]].append(ex)

# Pairs within a bucket are positives (clones); pairs across buckets are negatives.
# The pair count grows quadratically per problem, so subsample large buckets.
positive_pairs = [
    (a, b)
    for sols in buckets.values()
    for a, b in itertools.combinations(sols, 2)
]
```

### Decontamination check (LLM training data)

```python
import hashlib


def near_dup_hash(code: str, k: int = 5) -> set[int]:
    """Stable 64-bit hashes of every k-token shingle.

    The built-in hash() is salted per process for strings, so hashlib is used
    to keep hashes comparable across runs.
    """
    tokens = code.split()
    return {
        int.from_bytes(
            hashlib.blake2b(" ".join(tokens[i:i + k]).encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(max(0, len(tokens) - k + 1))  # +1 includes the last shingle
    }


# `train_corpus` and `gcj_eval` are assumed to be lists of dicts with a "code" field.
train_hashes: set[int] = set()
for ex in train_corpus:
    train_hashes |= near_dup_hash(ex["code"])

contaminated = []
for ex in gcj_eval:
    shingles = near_dup_hash(ex["code"])  # compute once per example
    if len(shingles & train_hashes) / max(1, len(shingles)) > 0.5:
        contaminated.append(ex)

print(f"contamination ratio: {len(contaminated) / max(1, len(gcj_eval)):.2%}")
```

### Compile + run sandbox (judging on test cases)

```python
import pathlib
import subprocess
import tempfile


def judge(code: str, lang: str, stdin: str, expected: str, timeout: float = 5) -> bool:
    # Note: this runs untrusted code with only a timeout; it is not an isolated
    # sandbox. Wrap it in a container or similar before judging scraped code.
    with tempfile.TemporaryDirectory() as d:
        p = pathlib.Path(d) / ("sol." + {"python": "py", "cpp": "cpp"}[lang])
        p.write_text(code)
        if lang == "cpp":
            build = subprocess.run(
                ["g++", "-O2", "-std=c++20", str(p), "-o", f"{d}/a"],
                capture_output=True,
            )
            if build.returncode != 0:  # a compile error counts as a failed submission
                return False
            cmd = [f"{d}/a"]
        else:
            cmd = ["python3", str(p)]
        try:
            r = subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        # A non-zero exit (crash) fails even if the partial output happens to match.
        return r.returncode == 0 and r.stdout.strip() == expected.strip()
```

### Train/eval split for code translation

```python
import itertools
import random

random.seed(0)
problems = list(buckets.keys())
random.shuffle(problems)
train_pids = set(problems[:int(0.9 * len(problems))])

# Split by problem, not by pair, so no problem leaks across train and eval.
train_pairs, eval_pairs = [], []  # avoids shadowing the built-in eval()
for pid, sols in buckets.items():
    java = [s for s in sols if s["language"] == "java"]
    py = [s for s in sols if s["language"] == "python"]
    target = train_pairs if pid in train_pids else eval_pairs
    target.extend(
        {"src": j["code"], "tgt": p["code"]} for j, p in itertools.product(java, py)
    )
```

## Decision criteria

| Situation | Approach |
|---|---|
| Code clone benchmark | GCJ-297 (Bui et al.) |
| LLM coding eval | APPS or HumanEval (less contaminated) |
| Code translation | CodeNet pair-wise |
| Style benchmark | GCJ multi-solution per problem |
| Live evaluation | NEVER use GCJ alone (contamination) |

**Default**: for LLM eval, use APPS/HumanEval as the main benchmark and GCJ as a supplement.

## 🔗 Graph

- Variants: [[HumanEval]]

## 🤖 LLM usage

**When to use**: writing dataset filter pipelines, designing contamination checks, writing problem-grouping logic.

**When not to use**: evaluating the LLM itself; GCJ is likely to be in its training data (contamination).

## ❌ Anti-patterns

- **GCJ for SOTA LLM eval without dedup**: contamination inflates scores.
- **Using only sample I/O**: a wrong answer can still pass the sample cases.
- **No timeout in judging**: an infinite loop can hang the judge or exhaust memory.
- **Mixing accepted + WA**: degrades the accuracy of the ground truth.
- **Ignoring problem difficulty**: stratified evaluation is required.

## 🧪 Verification / duplicates

- Verified (Bui et al. ICSE 2017, IBM Project CodeNet 2021, Hugging Face Hub).
- Trust level B (semi-public, scraped).
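
The stratified-evaluation requirement from the anti-patterns can be sketched as a per-round pass rate. This is a minimal illustration, assuming each judged result carries a `round` label and a `passed` flag; both field names, and the helper `pass_rate_by_round`, are hypothetical and not part of any published GCJ schema.

```python
from collections import defaultdict

# Round names follow the difficulty ladder described in this note.
ROUND_ORDER = ["Qualification", "Round 1", "Round 2", "Round 3", "World Finals"]


def pass_rate_by_round(results):
    """Return {round: pass rate} in difficulty order.

    `results` is an iterable of dicts with the assumed keys "round" and "passed".
    Rounds with no results are omitted rather than reported as 0.
    """
    totals = defaultdict(lambda: [0, 0])  # round -> [passed, attempted]
    for r in results:
        counts = totals[r["round"]]
        counts[0] += int(r["passed"])
        counts[1] += 1
    return {
        rnd: totals[rnd][0] / totals[rnd][1]
        for rnd in ROUND_ORDER
        if rnd in totals
    }
```

Reporting the rates separately keeps a large pool of easy Qualification problems from masking weak performance on later rounds.
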
## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GCJ corpus + CodeNet usage + decontamination |