---
id: wiki-2026-0508-code-stylometry-코드-문체론
title: Code Stylometry (코드 문체론)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Authorship Attribution, Code Fingerprinting, Programmer Identification]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [security, ml, forensics, privacy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn/transformers
---

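# Code Stylometry (코드 문체론)

## In one line

> **"An ML technique that identifies the author of a piece of code from its stylistic features."** Caliskan et al. 2015 (USENIX) used a random forest to identify authors with about 98% accuracy among 250 programmers and about 94% among 1,600. More recently, classifiers built on CodeBERT/StarCoder embeddings have made attribution stronger still. The technique cuts both ways: it is a privacy threat (de-anonymizing anonymous contributors) and a defensive tool (malware attribution, plagiarism detection).
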
## Core concepts

### Feature classes

- **Lexical**: identifier naming (camelCase vs snake_case), keyword frequency.
- **Layout**: indentation, brace style, line length.
- **Syntactic**: AST node distribution, depth, n-gram of node types.
- **Idiomatic**: preferred construct (`for` vs `map`, ternary vs if).
- **Embedding-based**: CodeBERT/StarCoder hidden states (2024+).

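
The idiomatic class can be made concrete with a small counter over Python's `ast` module. This is a minimal sketch, not the feature set of any cited paper: the function name `idiom_features` and the six construct buckets are assumptions chosen for illustration.

```python
import ast
from collections import Counter

# Hypothetical bucket names; pick whichever constructs matter for your corpus.
IDIOM_KEYS = ('for_loop', 'comprehension', 'ternary', 'if_stmt', 'lambda', 'map_filter_call')

def idiom_features(src: str) -> dict:
    """Share of each construct among the counted constructs in `src`."""
    counts = Counter()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.For):
            counts['for_loop'] += 1
        elif isinstance(node, (ast.ListComp, ast.GeneratorExp, ast.SetComp, ast.DictComp)):
            counts['comprehension'] += 1
        elif isinstance(node, ast.IfExp):          # x if cond else y
            counts['ternary'] += 1
        elif isinstance(node, ast.If):             # if statement
            counts['if_stmt'] += 1
        elif isinstance(node, ast.Lambda):
            counts['lambda'] += 1
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id in ('map', 'filter')):
            counts['map_filter_call'] += 1
    total = max(sum(counts.values()), 1)           # avoid division by zero
    return {key: counts[key] / total for key in IDIOM_KEYS}

loop_style = "out = []\nfor x in xs:\n    if x > 0:\n        out.append(x * 2)\n"
functional_style = "out = list(map(lambda x: x * 2, filter(lambda x: x > 0, xs)))\n"
print(idiom_features(loop_style))
print(idiom_features(functional_style))
```

The two snippets compute the same result but land in disjoint buckets, which is the kind of signal this feature class is meant to capture. Normalizing by the total keeps the features comparable across files of different length.
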
### Attack scenarios

- De-anonymizing an anonymous GitHub account.
- Linking a malware author across samples.
- Plagiarism detection in coursework.
- Insider threat attribution.

### Applications

1. Forensic attribution (FBI/Interpol cases).
2. Academic integrity (MOSS, JPlag).
3. Bug-injection-source detection (xz-style supply chain).

## 💻 Patterns

### Layout features

```python
import re

def layout_features(src: str) -> dict:
    lines = src.split('\n')
    n_lines = max(len(lines), 1)
    # Identifier-like tokens (keywords included), used to normalize the naming counts.
    idents = re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', src)
    n_idents = max(len(idents), 1)
    return {
        'avg_line_len': sum(len(l) for l in lines) / n_lines,
        'tab_ratio': sum(l.startswith('\t') for l in lines) / n_lines,
        'blank_ratio': sum(not l.strip() for l in lines) / n_lines,
        # Share of identifier-like tokens written in snake_case / camelCase.
        'snake_ratio': sum(bool(re.fullmatch(r'[a-z]+(_[a-z0-9]+)+', t)) for t in idents) / n_idents,
        'camel_ratio': sum(bool(re.fullmatch(r'[a-z]+([A-Z][a-z0-9]*)+', t)) for t in idents) / n_idents,
    }
```

### AST n-gram (Python)

```python
import ast
from collections import Counter

def ast_ngrams(src: str, n: int = 3) -> Counter:
    tree = ast.parse(src)
    # ast.walk yields nodes in no guaranteed order (breadth-first in CPython),
    # so these are n-grams over that traversal sequence, not parent-child paths.
    seq = [type(node).__name__ for node in ast.walk(tree)]
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
```

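
Because n-grams over a flat traversal mix unrelated nodes, a structural variant may be useful alongside them. The sketch below counts parent-child node-type pairs and measures tree depth, which covers the "depth" item in the syntactic feature class. The names `ast_edge_bigrams` and `ast_max_depth` are hypothetical, and this is one possible formulation rather than the exact features of any cited paper.

```python
import ast
from collections import Counter

def ast_edge_bigrams(src: str) -> Counter:
    """Count (parent type, child type) pairs over every edge of the AST."""
    counts = Counter()
    for parent in ast.walk(ast.parse(src)):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts

def ast_max_depth(src: str) -> int:
    """Length of the longest root-to-leaf path, counting the Module node as 1."""
    def depth(node: ast.AST) -> int:
        return 1 + max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
    return depth(ast.parse(src))

sample = "def f(x):\n    return [i * 2 for i in x if i]\n"
print(ast_edge_bigrams(sample).most_common(3))
print(ast_max_depth(sample))
```

The recursive depth helper is fine for ordinary source files; extremely deep generated code could hit Python's recursion limit.
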
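### Random forest classifier (Caliskan-style)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Assumed to exist: `samples` (source strings), `authors` (one label per sample), and
# `extract_all_features(src)` returning a flat dict with string keys.
train_src, test_src, y_train, y_test = train_test_split(
    samples, authors, test_size=0.2, stratify=authors, random_state=0)

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform([extract_all_features(s) for s in train_src])
X_test = vec.transform([extract_all_features(s) for s in test_src])  # reuse fitted vocabulary

clf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy; ~90%+ on 100-author corpus
```
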
### CodeBERT embedding classifier (2024+)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained('microsoft/codebert-base')
model = AutoModel.from_pretrained('microsoft/codebert-base').eval()

def embed(src: str) -> torch.Tensor:
    # Anything beyond the first 512 tokens is dropped by truncation.
    inp = tok(src, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        out = model(**inp).last_hidden_state[:, 0]  # first token (<s>), used as CLS
    return out.squeeze(0)  # one vector per source file

# Then train linear classifier on embeddings
```

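
The "train a linear classifier on embeddings" step can be illustrated without any ML dependency. The sketch below is a plain-Python multinomial logistic regression fitted by batch gradient descent; the function names, learning rate, and epoch count are assumptions, and the 2-dimensional toy vectors merely stand in for real embedding vectors. In practice a library implementation (for example scikit-learn's `LogisticRegression`) would normally replace it.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_softmax(X, y, n_classes, lr=0.5, epochs=200, seed=0):
    """Fit weights W (n_classes x dim) and biases b by batch gradient descent."""
    rng = random.Random(seed)
    dim = len(X[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = softmax([sum(w * v for w, v in zip(W[c], x)) + b[c]
                             for c in range(n_classes)])
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)  # cross-entropy gradient
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c] / len(X)
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j] / len(X)
    return W, b

def predict(W, b, x):
    """Index of the highest-scoring class (author) for vector x."""
    scores = [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy stand-ins for embeddings: two authors, two samples each.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [0, 0, 1, 1]
W, b = train_softmax(X, y, n_classes=2)
print(predict(W, b, [0.95, 0.05]), predict(W, b, [0.05, 0.95]))
```

With real 768-dimensional embeddings the pure-Python loops become slow, so this is only meant to show what the linear layer on top of the frozen encoder computes.
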
### Defensive: code anonymizer

```python
# Normalize to defeat stylometry
import black

def anonymize(src: str) -> str:
    src = black.format_str(src, mode=black.Mode())  # uniform layout
    # Not implemented here: rename identifiers via AST transform
    # Not implemented here: replace idiosyncratic constructs with canonical form
    return src
```

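
The identifier-renaming step can be sketched with the standard library's `ast.NodeTransformer`. This is a deliberately narrow illustration under stated assumptions: the class `RenameIdentifiers` and helper `rename_identifiers` are hypothetical names, only plain variable names, function names, and parameters are rewritten, and builtins and imported names are left alone. It does not handle keyword arguments at call sites, attributes, class names, or `global`/`nonlocal` statements, so real code may break. It also needs Python 3.9+ for `ast.unparse`, and the AST round trip drops comments.

```python
import ast
import builtins

class RenameIdentifiers(ast.NodeTransformer):
    """Rename user-defined names to v0, v1, ... in order of first appearance."""

    def __init__(self, reserved):
        self.reserved = reserved   # names that must keep their spelling
        self.mapping = {}          # original name -> neutral alias

    def _alias(self, name):
        if name in self.reserved:
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)   # also visit names inside annotations
        return node

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

def rename_identifiers(src: str) -> str:
    tree = ast.parse(src)
    reserved = set(dir(builtins))
    # Keep imported module/function names intact so calls into them still resolve.
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            reserved.update((a.asname or a.name).split(".")[0] for a in node.names)
    return ast.unparse(RenameIdentifiers(reserved).visit(tree))

print(rename_identifiers("def area(width, height):\n    total = width * height\n    return total\n"))
```

Numbering aliases by first appearance keeps the mapping deterministic, so the same input always yields the same anonymized output.
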
## Decision criteria

| Situation | Approach |
|---|---|
| Small corpus (<50 authors) | RF on hand-crafted features |
| Large corpus, deep features | CodeBERT/StarCoder embedding + classifier |
| Defending privacy | Black/Prettier + identifier normalization |
| Adversarial robust attack | Limited: formatting tools defeat most layout-driven attacks |
| Cross-language | Only embedding-based approaches are viable |

**Default**: start with RF + AST n-grams as the baseline, then add embeddings for a boost.

## 🔗 Graph

- Parent: [[Authorship Attribution]] · [[Software Forensics]]
- Variants: [[Natural Language Stylometry]] · [[Binary Authorship Attribution]]
- Applications: [[Plagiarism Detection]] · [[Malware Attribution]] · [[Supply Chain Security]]
- Adjacent: [[Code Obfuscation]] · [[CodeBERT]] · [[AST]]

## 🤖 LLM usage

**When to use**: Forensic context, plagiarism check, OSS contributor analysis.
**When not to use**: Identifying an anonymous whistleblower; refuse on ethical grounds.

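## ❌ Anti-patterns

- **Single-feature reliance**: relying on layout alone is trivially defeated by an autoformatter.
- **Ignoring base rate**: when the true author is rare among the candidates, even an accurate classifier yields mostly false positives; when screening many candidates, also correct for multiple comparisons (e.g. Bonferroni).
- **Author-set assumption**: open-world attribution (the author may be unknown) is not the same problem as closed-world attribution.
- **Privacy ignored**: deploying on anonymous code without an ethical review.
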
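
The base-rate point is easiest to see with numbers. The sketch below applies Bayes' rule to hypothetical figures (94% sensitivity, 1% false positive rate, one true author per 10,000 candidates); none of these values come from a cited study, and the function names are assumptions for illustration.

```python
def positive_predictive_value(sensitivity, false_positive_rate, base_rate):
    """P(candidate is the true author | classifier flags the candidate)."""
    true_pos = sensitivity * base_rate
    false_pos = false_positive_rate * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

def bonferroni_alpha(alpha, n_candidates):
    """Per-candidate significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_candidates

ppv_rare = positive_predictive_value(0.94, 0.01, 1 / 10_000)   # author is 1 in 10,000
ppv_balanced = positive_predictive_value(0.94, 0.01, 0.5)      # two equally likely suspects
print(f"{ppv_rare:.4f} {ppv_balanced:.4f}")   # 0.0093 0.9895
print(bonferroni_alpha(0.05, 100))
```

Under these assumed rates, fewer than 1 in 100 flagged candidates is the real author when the pool is large, while the same classifier is right almost 99% of the time with two suspects. That gap is why a headline accuracy figure says little about open-ended de-anonymization.
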
## 🧪 Verification / Duplicates

- Verified (Caliskan USENIX 2015, Abuhamad 2018, CodeBERT papers).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: stylometry features + RF/CodeBERT pipelines |