---
id: wiki-2026-0508-code-stylometry-코드-문체론
title: Code Stylometry (코드 문체론)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Authorship Attribution, Code Fingerprinting, Programmer Identification]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [security, ml, forensics, privacy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn/transformers
---

# Code Stylometry (코드 문체론)

## One-line summary

> **"An ML technique that identifies the author of a piece of code from its stylistic features."** Caliskan-Islam et al. (USENIX Security 2015) used a random forest to reach 98% accuracy on 250 authors and 94% on 1,600 authors. In the modern era, classifiers built on CodeBERT/StarCoder embeddings have made the technique stronger still. It cuts both ways: a privacy threat (de-anonymizing anonymous contributors) and a defensive utility (malware attribution, plagiarism detection).

## Core concepts

### Feature classes

- **Lexical**: identifier naming (camelCase vs snake_case), keyword frequency.
- **Layout**: indentation, brace style, line length.
- **Syntactic**: AST node distribution, depth, n-grams of node types.
- **Idiomatic**: preferred constructs (`for` vs `map`, ternary vs `if`).
- **Embedding-based**: CodeBERT/StarCoder hidden states (2024+).

### Attack scenarios

- De-anonymizing an anonymous GitHub account.
- Linking a malware author across samples.
- Plagiarism detection in coursework.
- Insider threat attribution.

### Applications

1. Forensic attribution (FBI/Interpol cases).
2. Academic integrity (MOSS, JPlag).
3. Bug-injection-source detection (xz-style supply chain).
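As a rough illustration of the lexical and idiomatic feature classes listed earlier in this section, the sketch below counts keyword frequencies, identifier-naming shares, and a few construct preferences for Python source. The function names `lexical_features` and `idiom_features` and the feature keys are hypothetical, chosen for this note; real feature sets such as the one in Caliskan-Islam et al. are considerably richer.

```python
import ast
import io
import keyword
import tokenize
from collections import Counter


def lexical_features(src: str) -> dict:
    """Keyword frequencies and identifier-naming shares (illustrative only)."""
    names, kw_counts = [], Counter()
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type != tokenize.NAME:
            continue
        if keyword.iskeyword(tok.string):
            kw_counts[tok.string] += 1
        else:
            names.append(tok.string)
    n_kw = max(sum(kw_counts.values()), 1)
    n_names = max(len(names), 1)
    # Relative frequency of each keyword among all keyword tokens
    feats = {f"kw:{k}": c / n_kw for k, c in kw_counts.items()}
    # Share of identifier occurrences written in snake_case / camelCase
    feats["snake_share"] = sum(
        "_" in n.strip("_") and n.islower() for n in names
    ) / n_names
    feats["camel_share"] = sum(
        n[0].islower() and any(ch.isupper() for ch in n) for n in names
    ) / n_names
    return feats


def idiom_features(src: str) -> dict:
    """Counts of competing constructs (loop vs comprehension, ternary vs if)."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(src)))
    comps = sum(counts[k] for k in ("ListComp", "SetComp", "DictComp", "GeneratorExp"))
    return {
        "for_loops": counts["For"],
        "comprehensions": comps,
        "ternaries": counts["IfExp"],
        "if_statements": counts["If"],
    }


if __name__ == "__main__":
    sample = (
        "def my_func(someArg):\n"
        "    out = [x for x in someArg]\n"
        "    for x in someArg:\n"
        "        y = 1 if x else 0\n"
        "    return out\n"
    )
    print(lexical_features(sample))
    print(idiom_features(sample))
```

Both functions return flat dictionaries with string keys, so their outputs can be merged and passed to a dictionary-based vectorizer. Shares are used instead of raw counts so that long and short files remain comparable.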
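## 💻 Patterns

### Layout features

```python
import re

SNAKE = re.compile(r'[a-z]+(?:_[a-z0-9]+)+')
CAMEL = re.compile(r'[a-z]+(?:[A-Z][a-z0-9]*)+')

def layout_features(src: str) -> dict:
    lines = src.splitlines()
    n_lines = max(len(lines), 1)
    # Identifier-like tokens (keywords included; good enough for a rough ratio)
    idents = re.findall(r'\b[A-Za-z_]\w*\b', src)
    n_idents = max(len(idents), 1)
    return {
        'avg_line_len': sum(len(l) for l in lines) / n_lines,
        'tab_ratio': sum(l.startswith('\t') for l in lines) / n_lines,
        'blank_ratio': sum(not l.strip() for l in lines) / n_lines,
        # Normalized so these are true ratios, not raw counts
        'snake_ratio': sum(bool(SNAKE.fullmatch(t)) for t in idents) / n_idents,
        'camel_ratio': sum(bool(CAMEL.fullmatch(t)) for t in idents) / n_idents,
    }
```

### AST n-gram (Python)

```python
import ast
from collections import Counter

def ast_ngrams(src: str, n: int = 3) -> Counter:
    tree = ast.parse(src)  # raises SyntaxError on invalid source
    # ast.walk guarantees no particular order (breadth-first in CPython),
    # so these n-grams capture level-wise neighbors, not parent-child paths
    seq = [type(node).__name__ for node in ast.walk(tree)]
    return Counter(tuple(seq[i:i+n]) for i in range(len(seq) - n + 1))
```

### Random forest classifier (Caliskan-style)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Assumed to exist: samples (list[str]), authors (list[str]), and
# extract_all_features(src) -> dict with string keys, merging the extractors above.
src_train, src_test, y_train, y_test = train_test_split(
    samples, authors, test_size=0.2, stratify=authors, random_state=0)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform([extract_all_features(s) for s in src_train])
X_test = vec.transform([extract_all_features(s) for s in src_test])  # reuse fitted vocabulary
clf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy; varies with corpus and author count
```

### CodeBERT embedding classifier (2024+)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained('microsoft/codebert-base')
model = AutoModel.from_pretrained('microsoft/codebert-base').eval()

def embed(src: str) -> torch.Tensor:
    # Inputs longer than 512 tokens are truncated, so long files lose their tail
    inp = tok(src, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        out = model(**inp).last_hidden_state[:, 0]  # CLS token embedding
    return out.squeeze(0)

# Then train a linear classifier on the embeddings
```

### Defensive: code anonymizer

```python
# Normalize to defeat stylometry
import black

def anonymize(src: str) -> str:
    src = black.format_str(src, mode=black.Mode())  # uniform layout
    # TODO: rename identifiers via an AST transform
    # TODO: replace idiosyncratic constructs with a canonical form
    return src
```

## Decision criteria

| Situation | Approach |
|---|---|
| Small corpus (<50 authors) | RF on hand-crafted features |
| Large corpus, deep features | CodeBERT/StarCoder embedding + classifier |
| Defending privacy | Black/Prettier + identifier normalization |
| Adversarially robust attack | Limited: formatting tools defeat most attacks |
| Cross-language | Only embedding-based approaches are feasible |

**Default**: start with RF + AST n-grams as the baseline, then boost with embeddings.

## 🔗 Graph

- Parent: [[Authorship Attribution]]
- Applications: [[Supply Chain Security]]
- Adjacent: [[Code Obfuscation]] · [[AST]]

## 🤖 LLM usage

**When**: forensic context, plagiarism checks, OSS contributor analysis.

**When not**: identifying an anonymous whistleblower; refuse on ethical grounds.

## ❌ Anti-patterns

- **Single-feature reliance**: layout features alone are trivially defeated by an autoformatter.
- **Ignoring base rate**: when the true author is rare among candidates, even a small false-positive rate means most flagged matches are wrong (the base-rate fallacy). Screening many candidates also raises the multiple-comparisons problem, which Bonferroni-style corrections address.
- **Author-set assumption**: closed-world accuracy does not carry over to open-world settings, where the true author may not be in the candidate set.
- **Privacy ignored**: deploying on anonymous code without ethical review.

## 🧪 Verification / duplicates

- Verified (Caliskan USENIX 2015, Abuhamad 2018, CodeBERT papers).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: stylometry features + RF/CodeBERT pipelines |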