
---
id: wiki-2026-0508-google-code-jam-dataset
title: Google Code Jam Dataset
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GCJ Dataset, Code Jam Solutions Corpus, GCJ-297]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [dataset, code-llm, benchmark, programming-competition, deduplication]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: huggingface-datasets
---
# Google Code Jam Dataset
## One-line summary
> **"The historical archive of Google Code Jam: a standard corpus for code clone detection and code LLM evaluation."** Google's annual programming competition (2003-2023) has been retired, but its solution corpus remains valuable for research: multiple solutions per problem, in many languages, make it raw material for code clone, code translation, and code-LM benchmarks. The most frequently cited variant is GCJ-297 (Bui et al.), covering 297 problems across multiple languages.
## Key points
### What makes the dataset distinctive
- **Same-intent, varied implementations**: a single problem has thousands of correct solutions, so semantic equivalence serves as the ground truth.
- **Multi-language**: C++, Java, Python, Go, Kotlin, …
- **Difficulty stratification**: Qualification → Round 1/2/3 → World Finals.
- **Test cases**: official input/output is only partially public (samples only); the full test sets are hidden.
### Main variants
1. **GCJ-297** (Bui et al. 2017): 297 problems, ~120k solutions, code clone benchmark.
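2. **CodeNet** (IBM 2021): built from AIZU Online Judge and AtCoder submissions rather than GCJ; ~14M solutions, 4,053 problems, 55 languages. Similar in structure (many solutions per problem) but not a GCJ superset.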
3. **MBXP / HumanEval-X**: not GCJ-derived, but benchmarks it is commonly compared against.
4. **APPS**: drawn from Codeforces, AtCoder, Kattis, and Codewars (not Code Jam); an LLM coding benchmark.
### Use cases
- **Code clone detection**: ground truth for Type-1/2/3/4 clones.
- **Code LLM eval**: high contamination risk, because Code Jam solutions are publicly indexed on GitHub.
- **Translation**: Java solution → Python solution.
- **Style transfer**: verbose vs. idiomatic.
## 💻 Patterns
### Loading via Hugging Face
```python
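from datasets import load_dataset

# CodeNet: a large multi-language solution corpus.
# The repo id and column names below come from this note and are unverified;
# check the dataset card on the Hub before relying on them.
ds = load_dataset("Project-CodeNet/codenet", split="train", streaming=True)
for ex in ds.take(3):
    print(ex["problem_id"], ex["language"], ex["status"], len(ex["code"]))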
```
### Filter for GCJ subset only
```python
# Assumes the loaded corpus exposes a `dataset_origin` column; adjust to the actual schema.
gcj = ds.filter(lambda x: x["dataset_origin"] == "google_code_jam")
print(gcj.info.splits)
```
### Group solutions by problem_id (clone-detection setup)
```python
import itertools
from collections import defaultdict

# Keep only accepted solutions, grouped by problem.
buckets = defaultdict(list)
for ex in gcj:
    if ex["status"] == "Accepted":
        buckets[ex["problem_id"]].append(ex)

# Pairs within a bucket are positives (clones); pairs across buckets are negatives.
positive_pairs = [
    (a, b)
    for sols in buckets.values()
    for a, b in itertools.combinations(sols, 2)
]
```
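
The grouping above only materializes positive pairs. As a rough sketch of the "across bucket = negative" half, the function below samples pairs from two different problems. `sample_negative_pairs` is a hypothetical helper, and it assumes `buckets` maps each problem ID to a list of solution records, as in the snippet above.

```python
import random

def sample_negative_pairs(buckets, n, seed=0):
    """Sample n solution pairs whose members come from two different problems."""
    rng = random.Random(seed)
    # Only problems with at least one solution can contribute.
    pids = [pid for pid, sols in buckets.items() if sols]
    if len(pids) < 2:
        return []
    pairs = []
    for _ in range(n):
        pid_a, pid_b = rng.sample(pids, 2)  # two distinct problem IDs
        pairs.append((rng.choice(buckets[pid_a]), rng.choice(buckets[pid_b])))
    return pairs
```

Sampling rather than enumerating keeps the negative set roughly the same size as the positive set; the full cross product grows quadratically with corpus size.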
### Decontamination check (LLM training data)
```python
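import hashlib

def near_dup_hash(code: str, k: int = 5) -> set[str]:
    """Hash every k-token shingle of whitespace-tokenized code."""
    tokens = code.split()
    # hashlib digests are stable across runs, unlike hash(); +1 keeps the last shingle.
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(len(tokens) - k + 1)
    }

# train_corpus and gcj_eval are assumed to be in-memory lists of dicts with a "code" field.
train_hashes = set()
for ex in train_corpus:
    train_hashes |= near_dup_hash(ex["code"])

def overlap(code: str) -> float:
    shingles = near_dup_hash(code)
    return len(shingles & train_hashes) / max(1, len(shingles))

contaminated = [ex for ex in gcj_eval if overlap(ex["code"]) > 0.5]
print(f"contamination ratio: {len(contaminated) / len(gcj_eval):.2%}")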
```
### Compile + run sandbox (judging on test cases)
```python
import pathlib
import subprocess
import tempfile

def judge(code: str, lang: str, stdin: str, expected: str, timeout: float = 5) -> bool:
    """Compile if needed, run on one test case, compare stdout.

    Not a real sandbox: run untrusted code only inside a container or VM.
    """
    with tempfile.TemporaryDirectory() as d:
        p = pathlib.Path(d) / ("sol." + {"python": "py", "cpp": "cpp"}[lang])
        p.write_text(code)
        if lang == "cpp":
            subprocess.run(["g++", "-O2", "-std=c++20", str(p), "-o", f"{d}/a"], check=True)
            cmd = [f"{d}/a"]
        else:
            cmd = ["python3", str(p)]
        try:
            r = subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=timeout)
            # A crash (non-zero exit) counts as a failure even if stdout matches.
            return r.returncode == 0 and r.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            return False
```
### Train/eval split for code translation
```python
import itertools
import random

# Split by problem, not by pair, so no problem appears on both sides.
random.seed(0)
problems = list(buckets.keys())
random.shuffle(problems)
train_pids = set(problems[:int(0.9 * len(problems))])

train, eval_set = [], []  # eval_set avoids shadowing the builtin eval()
for pid, sols in buckets.items():
    java = [s for s in sols if s["language"] == "java"]
    py = [s for s in sols if s["language"] == "python"]
    pairs = itertools.product(java, py)
    (train if pid in train_pids else eval_set).extend(
        {"src": j["code"], "tgt": p["code"]} for j, p in pairs
    )
```
## Decision criteria
| Situation | Approach |
|---|---|
| Code clone benchmark | GCJ-297 (Bui et al.) |
| LLM coding eval | APPS or HumanEval (less contaminated) |
| Code translation | CodeNet pair-wise |
| Style benchmark | GCJ multi-solution per problem |
| Live evaluation | NEVER use GCJ alone (contamination) |
**Default**: for LLM eval, use APPS/HumanEval as the main benchmarks and GCJ as a supplement.
## 🔗 Graph
- Variant: [[HumanEval]]
## 🤖 LLM usage
**When to use**: writing dataset filter pipelines, designing contamination checks, and problem-grouping logic.
**When not to use**: evaluating the LLM itself, since GCJ is likely to be part of its training data (contamination).
## ❌ Anti-patterns
- **GCJ for SOTA LLM eval without dedup**: contamination inflates scores.
- **Using only sample I/O**: wrong-answer solutions can still pass the test cases.
- **No timeout in judging**: an infinite loop leads to OOM or a hang.
- **Mixing accepted + WA**: degrades the accuracy of the ground truth.
- **Ignoring problem difficulty**: stratified evaluation is required.
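
For the last point, a stratified report can be as simple as a pass rate per competition round. This is a minimal sketch: `stratified_pass_rate` is a hypothetical helper, and the `(round_name, passed)` input format is an assumption about how judging results are stored.

```python
from collections import defaultdict

def stratified_pass_rate(results):
    """Map each round name to the fraction of passing results.

    results: iterable of (round_name, passed) tuples.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for round_name, passed in results:
        totals[round_name] += 1
        passes[round_name] += bool(passed)
    return {name: passes[name] / totals[name] for name in totals}
```

Reporting the per-round dictionary alongside the overall average makes it visible when a high score comes mostly from Qualification-level problems.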
## 🧪 Verification / duplicates
- Verified (Bui et al. ICSE 2017, IBM Project CodeNet 2021, Hugging Face Hub).
- Trust level B (semi-public, scraped).
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — GCJ corpus + CodeNet usage + decontamination |