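An ML technique that identifies the author of source code from stylistic features. Caliskan-Islam et al. (USENIX Security 2015) used a random forest to identify authors with 98% accuracy on a 250-author corpus and 94% on 1,600 authors. More recently, classifiers built on CodeBERT/StarCoder embeddings have made the technique more powerful. It is double-edged: a privacy threat (de-anonymizing anonymous contributors) and a defensive tool (malware attribution, plagiarism detection).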
Key points
Feature classes
Lexical: identifier naming (camelCase vs snake_case), keyword frequency.
Layout: indentation, brace style, line length.
Syntactic: AST node distribution, depth, n-gram of node types.
Idiomatic: preferred construct (for vs map, ternary vs if).
Embedding-based: CodeBERT/StarCoder hidden states (2024+).
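The hand-crafted feature classes above can be sketched as a single extractor. This is a minimal illustration for Python sources using only the standard library; the function name `extract_all_features`, the feature keys, and the specific ratios are assumptions chosen for this example, not the feature set of any published system.

```python
import ast
import re
from collections import Counter


def extract_all_features(src: str) -> dict:
    """Return a flat feature dict for one Python source string (illustrative only)."""
    feats: dict = {}
    lines = src.splitlines() or [""]

    # Layout: indentation style and mean line length.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    feats["layout.tab_indent_ratio"] = (
        sum(ln.startswith("\t") for ln in indented) / len(indented) if indented else 0.0
    )
    feats["layout.mean_line_len"] = sum(len(ln) for ln in lines) / len(lines)

    tree = ast.parse(src)

    # Lexical: naming convention of variables, functions, and classes.
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    names |= {
        n.name for n in ast.walk(tree) if isinstance(n, (ast.FunctionDef, ast.ClassDef))
    }
    total = len(names) or 1
    feats["lex.snake_ratio"] = sum("_" in n.strip("_") for n in names) / total
    feats["lex.camel_ratio"] = sum(bool(re.search(r"[a-z][A-Z]", n)) for n in names) / total

    # Syntactic: relative frequency of each AST node type, plus maximum depth.
    counts = Counter(type(n).__name__ for n in ast.walk(tree))
    n_nodes = sum(counts.values())
    for node_type, c in counts.items():
        feats[f"ast.{node_type}"] = c / n_nodes

    def depth(node: ast.AST) -> int:
        return 1 + max((depth(ch) for ch in ast.iter_child_nodes(node)), default=0)

    feats["ast.max_depth"] = float(depth(tree))

    # Idiomatic: comprehensions vs explicit loops, ternary expressions.
    feats["idiom.comprehensions"] = float(counts["ListComp"] + counts["GeneratorExp"])
    feats["idiom.for_loops"] = float(counts["For"])
    feats["idiom.ternaries"] = float(counts["IfExp"])
    return feats
```

The returned dict has string keys and numeric values, so it can be passed to a dict-based vectorizer. Embedding-based features are not covered here because they require a pretrained model.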
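    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.model_selection import train_test_split

    # samples: list of source strings; authors: matching list of author labels.
    # extract_all_features(src) must return a dict of feature name -> value.
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform([extract_all_features(s) for s in samples])
    # Hold out a test split so the score is not measured on training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, authors, test_size=0.2, stratify=authors, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=300, max_depth=20)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # ~90%+ on 100-author corpus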
CodeBERT embedding classifier (2024+)
    from transformers import AutoTokenizer, AutoModel
    import torch

    tok = AutoTokenizer.from_pretrained('microsoft/codebert-base')
    model = AutoModel.from_pretrained('microsoft/codebert-base').eval()

    def embed(src: str) -> torch.Tensor:
        inp = tok(src, truncation=True, max_length=512, return_tensors='pt')
        with torch.no_grad():
            out = model(**inp).last_hidden_state[:, 0]  # CLS
        return out.squeeze()

    # Then train linear classifier on embeddings
Defensive: code anonymizer
    # Normalize to defeat stylometry
    import black

    def anonymize(src: str) -> str:
        src = black.format_str(src, mode=black.Mode())  # uniform layout
        # rename identifiers via AST transform
        # replace idiosyncratic constructs with canonical form
        return src
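The "rename identifiers via AST transform" step in the anonymizer could look roughly like the sketch below. The class `RenameIdentifiers`, the function `rename_identifiers`, and the `v0, v1, ...` naming scheme are hypothetical choices for illustration. The sketch assumes a self-contained snippet: it does not handle imports, attribute names, keyword arguments at call sites, or class names, all of which a real anonymizer would need to treat carefully.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Map function, argument, and variable names to v0, v1, ... by first appearance."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        # Leave builtins (print, len, ...) untouched so the code keeps working.
        if hasattr(builtins, name):
            return name
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = self._alias(node.name)
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._alias(node.id)
        return node


def rename_identifiers(src: str) -> str:
    """Parse src, rename identifiers consistently, and emit normalized source."""
    tree = RenameIdentifiers().visit(ast.parse(src))
    return ast.unparse(tree)  # requires Python 3.9+
```

Because `ast.unparse` regenerates the source from the tree, comments and original layout are discarded as a side effect, which also removes those stylistic signals.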
Decision criteria

| Situation | Approach |
|---|---|
| Small corpus (<50 authors) | RF on hand-crafted features |
| Large corpus, deep features | CodeBERT/StarCoder embedding + classifier |
| Defending privacy | Black/Prettier + identifier normalization |
| Adversarially robust attack | Limited: formatting tools defeat most attacks |
| Cross-language | Embedding-based only |
Default: start with RF + AST n-grams as the baseline, then add embeddings for a boost.
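As one possible reading of "AST n-gram" for the baseline, the sketch below counts parent-to-child node-type pairs (bigrams) in a Python AST. The function name `ast_bigrams`, the choice of n = 2, and the `Parent>Child` key format are assumptions made for this example; longer paths or sibling sequences are equally valid definitions.

```python
import ast
from collections import Counter


def ast_bigrams(src: str) -> Counter:
    """Count parent>child node-type pairs in the AST of a Python source string."""
    counts: Counter = Counter()
    for parent in ast.walk(ast.parse(src)):
        for child in ast.iter_child_nodes(parent):
            counts[f"{type(parent).__name__}>{type(child).__name__}"] += 1
    return counts
```

Calling `dict(ast_bigrams(src))` gives a name-to-count mapping that can be merged with other hand-crafted features before vectorizing for the random forest.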