
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

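Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: ~2,400 fixes covering broken links, direct redirect targets, and concept mapping
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections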

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00



Google Code Jam Dataset

One-line summary

"The historical archive of Google Code Jam: a standard corpus for code clone detection and code LLM evaluation." Google's annual programming competition (2003-2023) has been retired, but its solution corpus remains valuable for academic work. Each problem has multiple solutions in a variety of languages, which makes the archive raw material for code clone, code translation, and code-LM benchmarks. The most frequently cited variant is GCJ-297 (Bui et al.), with 297 problems × multiple languages.

Key points

What makes the dataset distinctive

Same-intent, varied implementations: a single problem has thousands of correct solutions, so semantic equivalence serves as ground truth.
Multi-language: C++, Java, Python, Go, Kotlin, …
Difficulty stratification: Qualification → Round 1/2/3 → World Finals.
Test cases: the official input/output is only partially public (samples only); the full test sets are hidden.

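Main variants

GCJ-297 (Bui et al. 2017): 297 problems, ~120k solutions, code clone benchmark.
CodeNet (IBM 2021): built from AIZU + AtCoder, not from GCJ; ~14M solutions, 4,053 problems, 55 languages. A much larger corpus with the same multi-solution-per-problem structure.
MBXP / HumanEval-X: not GCJ-derived, but benchmarks it is commonly compared against.
APPS: problems drawn from competitive-programming sites such as Codeforces, Kattis, AtCoder, and Codewars; an LLM coding benchmark.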

Use cases

Code clone detection: ground truth for Type-1/2/3/4 clones.
Code LLM eval: high contamination risk, because Code Jam solutions are publicly indexed on GitHub.
Translation: Java solution → Python solution.
Style transfer: verbose vs. idiomatic.
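The clone types mentioned above can be made concrete with a small sketch. Type-1 clones differ only in layout and comments, and Type-2 clones additionally differ in identifier names and literals, so both can be detected by comparing normalized token streams. Same-problem pairs that survive this filter are closer to Type-3/4. The function names below are hypothetical, and the sketch only handles Python sources, using the standard-library `tokenize` module.

```python
import io
import keyword
import tokenize


def normalize_python(code: str) -> tuple[str, ...]:
    """Abstract Python source into a layout- and name-independent token tuple."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines do not matter for Type-1
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            out.append(tokenize.tok_name[tok.type])  # keep block structure only
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")   # Type-2: identifiers may be renamed
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")  # Type-2: literals may change
        else:
            out.append(tok.string)  # keywords and operators stay as-is
    return tuple(out)


def is_type1_or_type2_clone(a: str, b: str) -> bool:
    """True when two sources are identical after normalization."""
    return normalize_python(a) == normalize_python(b)
```

Two submissions to the same problem that pass this check are near-copies, so they are usually dropped or down-weighted when the goal is to measure semantic (Type-4) clone detection.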

💻 Patterns

Loading via Hugging Face

from datasets import load_dataset

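# Stream a multi-solution corpus from the Hugging Face Hub.
# NOTE: the repo ID is kept from the original note and is unverified; replace it
# with the ID of the mirror you actually use. CodeNet itself is built from AIZU
# and AtCoder, so GCJ solutions have to come from a GCJ-specific source.
ds = load_dataset("Project-CodeNet/codenet", split="train", streaming=True)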
for ex in ds.take(3):
    print(ex["problem_id"], ex["language"], ex["status"], len(ex["code"]))

Filter for GCJ subset only

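# `dataset_origin` and its value are assumed names for a provenance column;
# check your corpus schema for the real field before filtering.
gcj = ds.filter(lambda x: x["dataset_origin"] == "google_code_jam")
print(gcj.info.splits)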

Group solutions by problem_id (clone-detection setup)

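from collections import defaultdict
import itertools

# Keep only accepted submissions, grouped by problem.
buckets = defaultdict(list)
for ex in gcj:
    if ex["status"] == "Accepted":
        buckets[ex["problem_id"]].append(ex)

# Pairs within a bucket are positives (clones); pairs across buckets are negatives.
# The number of positive pairs grows quadratically with bucket size.
positive_pairs = [(a, b) for sols in buckets.values()
                  for a, b in itertools.combinations(sols, 2)]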
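The grouping above only builds positive pairs, while a clone-detection benchmark also needs negatives. A minimal sketch of cross-bucket negative sampling follows. It assumes `buckets` is a plain dict from problem ID to a list of solutions. The function name and the `(a, b, label)` tuple layout are illustrative choices, not part of any published GCJ benchmark.

```python
import itertools
import random


def sample_clone_pairs(buckets, n_negative, seed=0):
    """Return (positives, negatives) as lists of (a, b, label) tuples.

    buckets: dict mapping problem_id -> list of accepted solutions (assumed shape).
    """
    rng = random.Random(seed)
    # Positive: every unordered pair of solutions to the same problem.
    positives = [(a, b, 1) for sols in buckets.values()
                 for a, b in itertools.combinations(sols, 2)]
    # Negative: one solution from each of two different problems.
    pids = [pid for pid, sols in buckets.items() if sols]
    negatives = []
    if len(pids) >= 2:
        for _ in range(n_negative):
            p, q = rng.sample(pids, 2)  # two distinct problem IDs
            negatives.append((rng.choice(buckets[p]), rng.choice(buckets[q]), 0))
    return positives, negatives
```

Sampling negatives instead of enumerating them keeps the set balanced against the positives, since the number of cross-problem pairs is far larger than the number of same-problem pairs.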

Decontamination check (LLM training data)

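import hashlib

def near_dup_hash(code: str, k: int = 5) -> set[int]:
    """Hash every k-token shingle; md5 keeps values stable across processes."""
    tokens = code.split()
    return {
        int.from_bytes(hashlib.md5(" ".join(tokens[i:i + k]).encode()).digest()[:8], "big")
        for i in range(max(1, len(tokens) - k + 1))  # +1 so the last shingle is included
    }

# train_corpus and gcj_eval are iterables of {"code": ...} records defined elsewhere.
train_hashes = set()
for ex in train_corpus:
    train_hashes |= near_dup_hash(ex["code"])

def overlap(code: str) -> float:
    """Fraction of a sample's shingles that also occur in the training corpus."""
    shingles = near_dup_hash(code)
    return len(shingles & train_hashes) / max(1, len(shingles))

contaminated = [ex for ex in gcj_eval if overlap(ex["code"]) > 0.5]
print(f"contamination ratio: {len(contaminated) / max(1, len(gcj_eval)):.2%}")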

Compile + run sandbox (judging on test cases)

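import subprocess, tempfile, pathlib

def judge(code: str, lang: str, stdin: str, expected: str, timeout=5) -> bool:
    """Compile (if needed) and run one solution on one test case.

    NOTE: a temp directory is not a real sandbox; run untrusted code in a container or VM.
    """
    with tempfile.TemporaryDirectory() as d:
        p = pathlib.Path(d) / ("sol." + {"python": "py", "cpp": "cpp"}[lang])
        p.write_text(code)
        if lang == "cpp":
            build = subprocess.run(["g++", "-O2", "-std=c++20", str(p), "-o", f"{d}/a"],
                                   capture_output=True)
            if build.returncode != 0:  # a compile error counts as a failed submission
                return False
            cmd = [f"{d}/a"]
        else:
            cmd = ["python3", str(p)]
        try:
            r = subprocess.run(cmd, input=stdin, capture_output=True, text=True, timeout=timeout)
            # A crash (non-zero exit) is a failure even if stdout happens to match.
            return r.returncode == 0 and r.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            return False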
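A boolean result hides why a submission failed, which matters when separating accepted solutions from wrong answers. The sketch below runs an already-built command over several test cases and returns a coarse verdict. The verdict labels (`AC`, `WA`, `RE`, `TLE`) follow common online-judge conventions rather than an official GCJ API, and comparing outputs token by token is an assumption about how whitespace should be treated.

```python
import subprocess


def verdict(cmd, tests, timeout=5.0):
    """Run `cmd` on each (stdin, expected) pair and return AC, WA, RE or TLE.

    cmd:   argument list for an already compiled or interpretable solution.
    tests: iterable of (stdin_text, expected_stdout_text) pairs.
    """
    for stdin, expected in tests:
        try:
            r = subprocess.run(cmd, input=stdin, capture_output=True,
                               text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return "TLE"  # time limit exceeded; the child process is killed
        if r.returncode != 0:
            return "RE"   # runtime error: the process exited abnormally
        if r.stdout.split() != expected.split():
            return "WA"   # wrong answer, ignoring whitespace differences
    return "AC"           # every test case passed
```

Stopping at the first failing case keeps judging cheap. Only solutions that reach `AC` on every available test should enter the positive (clone) buckets.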

Train/eval split for code translation

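import itertools
import random

random.seed(0)
# Split by problem, not by solution, so no problem leaks across train and eval.
problems = list(buckets.keys())
random.shuffle(problems)
train_pids = set(problems[:int(0.9 * len(problems))])

train, eval_set = [], []  # `eval_set` avoids shadowing the built-in eval()
for pid, sols in buckets.items():
    # Language labels are corpus-specific ("Java" vs "java"); lowercase before comparing.
    java = [s for s in sols if s["language"].lower() == "java"]
    py   = [s for s in sols if s["language"].lower() == "python"]
    pairs = list(itertools.product(java, py))
    (train if pid in train_pids else eval_set).extend(
        {"src": j["code"], "tgt": p["code"]} for j, p in pairs
    )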
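The full Java × Python product can become very large for popular problems and lets a few problems dominate the split. One possible mitigation, sketched here under the assumption that a fixed per-problem cap is acceptable, is to sample pair indices without materializing the whole product. The function name and the `max_pairs` parameter are hypothetical.

```python
import random


def capped_pairs(src, tgt, max_pairs, seed=0):
    """Return at most `max_pairs` (src, tgt) pairs from the Cartesian product."""
    total = len(src) * len(tgt)
    if total <= max_pairs:
        picks = range(total)  # small enough: keep every pair
    else:
        # Sampling from a range avoids building the full list of pairs.
        picks = sorted(random.Random(seed).sample(range(total), max_pairs))
    # Map a flat index back to (row, column) positions in the product.
    return [(src[i // len(tgt)], tgt[i % len(tgt)]) for i in picks]
```

Calling `capped_pairs(java, py, max_pairs=50)` per problem bounds each problem's contribution while keeping the split itself at problem level.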

Decision criteria

Situation	Approach
Code clone benchmark	GCJ-297 (Bui et al.)
LLM coding eval	APPS or HumanEval (less contaminated)
Code translation	CodeNet pair-wise
Style benchmark	GCJ multi-solution per problem
Live evaluation	NEVER use GCJ alone (contamination)

Default: for LLM eval, use APPS/HumanEval as the main benchmarks and GCJ as a supplement.

🔗 Graph

Variant: HumanEval

🤖 Using LLMs

When to use: writing dataset filter pipelines, designing contamination checks, and writing problem-grouping logic.
When not to use: evaluating the LLM itself, because GCJ is likely to be part of its training data (contamination).

❌ Anti-patterns

GCJ for SOTA LLM eval without dedup: contamination inflates scores.
Using sample I/O only: wrong answers can still pass the test cases.
No timeout in judging: an infinite loop causes a hang or OOM.
Mixing accepted + WA: degrades the accuracy of the ground truth.
Ignoring problem difficulty: stratified eval is required.
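For the difficulty point in the list above, a minimal sketch of stratified reporting: results are aggregated per contest round instead of being pooled into one number. It assumes each result carries a round label such as "Qualification" or "Round 1". How that label is obtained from the corpus metadata is left open, and the function name is hypothetical.

```python
from collections import defaultdict


def pass_rate_by_round(results):
    """Aggregate (round_name, passed) pairs into a per-round pass rate."""
    totals = defaultdict(lambda: [0, 0])  # round -> [passed, attempted]
    for round_name, passed in results:
        totals[round_name][0] += int(passed)
        totals[round_name][1] += 1
    return {r: p / n for r, (p, n) in totals.items()}
```

Reporting one rate per round avoids a headline score that mostly reflects the easier Qualification problems.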

🧪 Verification / duplicates

Verified (Bui et al. ICSE 2017, IBM Project CodeNet 2021, Hugging Face Hub).
Confidence: B (semi-public, scraped).

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — GCJ corpus + CodeNet usage + decontamination
