id: wiki-2026-0508-code-stylometry-코드-문체론
title: Code Stylometry (코드 문체론)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Authorship Attribution, Code Fingerprinting, Programmer Identification
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: security, ml, forensics, privacy
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python; framework scikit-learn/transformers

Code Stylometry (코드 문체론)

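In one line

"An ML technique that identifies the author of a piece of code from stylistic features." Caliskan et al. (USENIX Security 2015) used a random forest over AST-derived features and reported 98% accuracy with 250 candidate programmers and 94% with 1,600. More recently, classifiers built on CodeBERT/StarCoder embeddings have made attribution stronger still. The technique cuts both ways: a privacy threat (de-anonymizing anonymous contributors) and a defensive tool (malware attribution, plagiarism detection).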

Core concepts

Feature classes

  • Lexical: identifier naming (camelCase vs snake_case), keyword frequency.
  • Layout: indentation, brace style, line length.
  • Syntactic: AST node distribution, depth, n-gram of node types.
  • Idiomatic: preferred construct (for vs map, ternary vs if).
  • Embedding-based: hidden states from pretrained code models such as CodeBERT (2020) or StarCoder (2023).
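
The idiomatic class is the hardest to picture from a one-line description. The following is a minimal sketch, assuming Python sources and the standard `ast` module; the function name `idiom_features` and the particular construct pairs are illustrative choices, not taken from any cited feature set.

```python
import ast

def idiom_features(src: str) -> dict:
    """Count competing constructs that express similar intent (illustrative pairs)."""
    counts = {"for_loop": 0, "comprehension": 0, "map_call": 0, "ternary": 0, "if_stmt": 0}
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.For):
            counts["for_loop"] += 1
        elif isinstance(node, (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)):
            counts["comprehension"] += 1
        elif isinstance(node, ast.IfExp):
            counts["ternary"] += 1
        elif isinstance(node, ast.If):
            counts["if_stmt"] += 1
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id == "map"):
            counts["map_call"] += 1
    return counts

sample = """
squares = [x * x for x in range(10)]
doubled = list(map(lambda x: 2 * x, squares))
for s in squares:
    label = "big" if s > 50 else "small"
    if s == 0:
        print(label)
"""
features = idiom_features(sample)
```

In practice the raw counts would probably be normalized (for example, by the number of statements) so that long and short files are comparable.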

Attack scenarios

  • De-anonymizing GitHub anonymous account.
  • Linking malware author across samples.
  • Plagiarism detection in coursework.
  • Insider threat attribution.

Applications

  1. Forensic attribution (FBI/Interpol cases).
  2. Academic integrity (MOSS, JPlag).
  3. Bug-injection-source detection (xz-style supply chain).

💻 Patterns

Layout features

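import re

def layout_features(src: str) -> dict:
    """Layout and naming-style features for one source file."""
    lines = src.split('\n')
    n_lines = max(len(lines), 1)
    # All identifier-like tokens (keywords included); used to turn naming counts into ratios.
    idents = re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', src)
    n_idents = max(len(idents), 1)
    snake = sum(bool(re.fullmatch(r'[a-z0-9]+(?:_[a-z0-9]+)+', t)) for t in idents)
    camel = sum(bool(re.fullmatch(r'[a-z]+(?:[A-Z][a-z0-9]*)+', t)) for t in idents)
    return {
        'avg_line_len': sum(len(l) for l in lines) / n_lines,
        'tab_ratio': sum(l.startswith('\t') for l in lines) / n_lines,
        'blank_ratio': sum(not l.strip() for l in lines) / n_lines,
        'snake_ratio': snake / n_idents,
        'camel_ratio': camel / n_idents,
    }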

AST n-gram (Python)

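import ast
from collections import Counter

def ast_ngrams(src: str, n: int = 3) -> Counter:
    """Count n-grams over the sequence of AST node-type names."""
    tree = ast.parse(src)  # raises SyntaxError on unparsable input
    # ast.walk makes no ordering guarantee (CPython walks breadth-first), so these
    # n-grams capture level adjacency rather than parent-child paths.
    seq = [type(node).__name__ for node in ast.walk(tree)]
    return Counter(tuple(seq[i:i+n]) for i in range(len(seq) - n + 1))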
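
The syntactic class listed earlier also mentions node distribution and depth, which the n-gram function does not cover. A minimal sketch of those two features follows; the name `ast_shape_features` and the `freq_` key prefix are assumptions made for illustration.

```python
import ast
from collections import Counter

def ast_shape_features(src: str) -> dict:
    """Maximum tree depth plus the relative frequency of each node type."""
    tree = ast.parse(src)

    def depth(node: ast.AST) -> int:
        # Recursive; extremely deep trees could hit Python's recursion limit.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    types = Counter(type(n).__name__ for n in ast.walk(tree))
    total = sum(types.values())
    feats = {f"freq_{name}": count / total for name, count in types.items()}
    feats["max_depth"] = depth(tree)
    return feats

shape = ast_shape_features("def f(a):\n    return [x + a for x in range(3)]\n")
```

Frequencies rather than raw counts keep the features comparable across files of different sizes.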

Random forest classifier (Caliskan-style)

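from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# samples: list of source strings; authors: parallel list of author labels.
# extract_all_features is a placeholder for a function returning one flat dict per sample.
feats = [extract_all_features(s) for s in samples]
f_train, f_test, y_train, y_test = train_test_split(
    feats, authors, test_size=0.2, stratify=authors, random_state=0)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(f_train)
X_test = vec.transform(f_test)  # reuse the vocabulary learned on the training split
clf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy; varies with corpus and author count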
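
The classifier snippet calls `extract_all_features` without defining it. One possible shape for that helper is sketched below, assuming it should merge layout statistics with AST n-gram frequencies into a single flat dict; the helper names `layout_part` and `ast_ngram_part`, the `ast:` key prefix, and the `parse_failed` flag are all hypothetical.

```python
import ast
from collections import Counter

def layout_part(src: str) -> dict:
    lines = src.split("\n")
    n = max(len(lines), 1)
    return {
        "avg_line_len": sum(len(l) for l in lines) / n,
        "blank_ratio": sum(not l.strip() for l in lines) / n,
    }

def ast_ngram_part(src: str, n: int = 2) -> dict:
    seq = [type(node).__name__ for node in ast.walk(ast.parse(src))]
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = max(sum(grams.values()), 1)
    # Join each tuple into a string so every feature key has the same type
    # and keys from different extractors can be mixed and sorted.
    return {"ast:" + ">".join(g): c / total for g, c in grams.items()}

def extract_all_features(src: str) -> dict:
    feats = layout_part(src)
    try:
        feats.update(ast_ngram_part(src))
    except SyntaxError:
        # Keep unparsable samples instead of dropping them silently.
        feats["parse_failed"] = 1.0
    return feats

row = extract_all_features("def f(a):\n    return a + 1\n")
bad = extract_all_features("def broken(:\n")
```

Flagging parse failures as a feature is a design choice; dropping such samples is equally defensible when the corpus is large.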

CodeBERT embedding classifier

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained('microsoft/codebert-base')
model = AutoModel.from_pretrained('microsoft/codebert-base').eval()

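def embed(src: str) -> torch.Tensor:
    # Inputs beyond 512 tokens are truncated, so long files are only partly represented.
    inp = tok(src, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        out = model(**inp).last_hidden_state[:, 0]  # first token (<s>), used as a CLS-style summary
    return out.squeeze(0)  # 1-D vector of size hidden_dim

# Then train a linear classifier on the embeddings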
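
The final comment leaves the classifier step open. A minimal sketch of one way to do it, assuming scikit-learn's `LogisticRegression` as the linear model and using synthetic vectors in place of real CodeBERT embeddings so the snippet runs without downloading a model; `fit_linear_probe` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_linear_probe(embeddings: np.ndarray, authors: list):
    """Fit a logistic-regression probe on fixed embeddings; return (model, mean CV accuracy)."""
    probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv_acc = cross_val_score(probe, embeddings, authors, cv=3).mean()
    probe.fit(embeddings, authors)
    return probe, cv_acc

# Synthetic stand-in for embedding vectors: two "authors" with shifted means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 16)), rng.normal(3.0, 1.0, (30, 16))])
y = ["author_a"] * 30 + ["author_b"] * 30
probe, cv_acc = fit_linear_probe(X, y)
```

With real data, `X` would be built by stacking the output of `embed` for each sample, for example `torch.stack([embed(s) for s in samples]).numpy()`. The synthetic classes here are trivially separable, so the score says nothing about real-world accuracy.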

Defensive: code anonymizer

# Normalize style to weaken stylometric signals
import black

def anonymize(src: str) -> str:
    src = black.format_str(src, mode=black.Mode())  # uniform layout
    # TODO: rename identifiers via an AST transform
    # TODO: replace idiosyncratic constructs with a canonical form
    return src
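
The identifier-renaming step is only a comment in the anonymizer. A rough sketch of it follows, assuming Python 3.9+ for `ast.unparse`; `collect_bound_names`, `Renamer`, and `normalize_identifiers` are hypothetical names. It renames only function arguments and assignment targets, and it does not handle keyword arguments at call sites, `global`/`nonlocal` statements, attributes, or function and class names.

```python
import ast
import builtins

def collect_bound_names(tree: ast.AST) -> dict:
    """Map every argument name and assignment target to a neutral name v0, v1, ..."""
    mapping = {}
    for node in ast.walk(tree):
        name = None
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        # Leave shadowed builtins alone so other uses of the builtin stay valid.
        if name and name not in mapping and not hasattr(builtins, name):
            mapping[name] = f"v{len(mapping)}"
    return mapping

class Renamer(ast.NodeTransformer):
    def __init__(self, mapping: dict):
        self.mapping = mapping

    def visit_arg(self, node: ast.arg) -> ast.arg:
        self.generic_visit(node)  # also process annotations
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

def normalize_identifiers(src: str) -> str:
    tree = ast.parse(src)
    tree = Renamer(collect_bound_names(tree)).visit(tree)
    return ast.unparse(tree)  # also drops comments and original layout

a = normalize_identifiers("def f(total_count):\n    tmpVal = total_count + 1\n    return tmpVal\n")
b = normalize_identifiers("def f(n):\n    result = n + 1\n    return result\n")
```

Two functions that differ only in naming style collapse to the same text, which is the effect the anonymizer is after. Collecting names in a first pass avoids renaming a use before its definition has been seen.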

Decision criteria

| Situation | Approach |
|---|---|
| Small corpus (<50 authors) | RF on hand-crafted features |
| Large corpus, deep features | CodeBERT/StarCoder embedding + classifier |
| Defending privacy | Black/Prettier + identifier normalization |
| Adversarially robust attack | Limited; formatting tools defeat most layout-based signals |
| Cross-language | Only embedding-based approaches are feasible |

Default: start with an RF + AST n-gram baseline, then add embeddings for a boost.

🤖 LLM usage

When to use: forensic contexts, plagiarism checks, OSS contributor analysis. When not to use: identifying an anonymous whistleblower; refuse on ethical grounds.

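Anti-patterns

  • Single-feature reliance: relying on layout alone, which an autoformatter trivially defeats.
  • Ignoring base rate: when true authors are rare among candidates, most flagged matches are false positives even with an accurate classifier (base-rate fallacy); screening many candidates compounds this (multiple comparisons, cf. Bonferroni correction).
  • Author-set assumption: treating an open-world problem (the true author may be outside the candidate set) as closed-world.
  • Privacy ignored: deploying on anonymous code without ethical review.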
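
The base-rate point is easier to see with numbers. Below is a worked sketch using Bayes' rule; the prior, true-positive rate, and false-positive rate are assumed values chosen for illustration, not measurements from any study.

```python
def posterior_true_author(prior: float, tpr: float, fpr: float) -> float:
    """P(candidate is the author | classifier flags a match), by Bayes' rule."""
    flagged = tpr * prior + fpr * (1.0 - prior)
    return (tpr * prior) / flagged

# Assumed numbers: 1 true author among 10,000 candidates, 95% TPR, 1% FPR.
p = posterior_true_author(prior=1 / 10_000, tpr=0.95, fpr=0.01)
print(f"{p:.2%}")  # under 1%: nearly every flagged candidate is a false positive
```

Even a classifier that looks strong in a closed-set benchmark gives weak evidence on its own when the candidate pool is large, so any match should be corroborated by independent evidence.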

🧪 Verification / duplicates

  • Verified (Caliskan USENIX 2015, Abuhamad 2018, CodeBERT papers).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: stylometry features + RF/CodeBERT pipelines |