"매 digital fingerprint". 매 compiled binary 의 매 author style 의 detect. 매 control flow + 매 register usage + 매 idiom 의 unique. 매 malware forensic 의 critical. 매 modern: 매 "human vs AI-generated code" 의 detect.
📖 Key points
Caliskan-Islam et al. (2015 / 2018)
Stylometric features survive in the compiled binary.
About 96% accuracy across 100 authors.
Style remains detectable even after compilation and optimization, though accuracy typically drops.
Features
Source-level (when source is available or recovered)
Indentation and brace style.
Variable naming conventions.
Keyword frequency.
Operator preference.
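As a rough illustration of the source-level features above, the sketch below counts a few layout and lexical signals in C-like source text. The function name `source_style_features`, the keyword subset, and the regular expressions are assumptions made for this example, not the feature set of any cited paper. It works on raw text, so comments and string literals are counted too.

```python
import re
from collections import Counter

# A small, illustrative subset of C keywords (assumption, not exhaustive).
C_KEYWORDS = {"if", "else", "for", "while", "do", "switch", "case",
              "return", "break", "continue", "goto", "sizeof"}

def source_style_features(source: str) -> dict:
    """Count a few layout and lexical style signals in C-like source text."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    n_lines = max(len(lines), 1)
    # Brace style: '{' alone on its line vs. '{' ending a statement line.
    brace_own_line = sum(1 for ln in lines if ln.strip() == "{")
    brace_same_line = sum(1 for ln in lines
                          if ln.rstrip().endswith("{") and ln.strip() != "{")
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    names = [t for t in tokens if t not in C_KEYWORDS]
    n_names = max(len(names), 1)
    return {
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in lines) / n_lines,
        "space_indent_ratio": sum(ln.startswith(" ") for ln in lines) / n_lines,
        "brace_own_line": brace_own_line,
        "brace_same_line": brace_same_line,
        "snake_case_ratio": sum("_" in t.strip("_") for t in names) / n_names,
        "camel_case_ratio": sum(bool(re.search(r"[a-z][A-Z]", t)) for t in names) / n_names,
        "keyword_counts": dict(Counter(t for t in tokens if t in C_KEYWORDS)),
        # Operator preference: ++i vs. i++.
        "preincrement": len(re.findall(r"\+\+\s*[A-Za-z_]", source)),
        "postincrement": len(re.findall(r"[A-Za-z0-9_\]\)]\s*\+\+", source)),
    }

sample = (
    "int main()\n{\n\tint my_count = 0;\n"
    "\tfor (int i = 0; i < 3; ++i) {\n\t\tmy_count++;\n\t}\n"
    "\treturn my_count;\n}\n"
)
features = source_style_features(sample)
print(features)
```

Layout features such as indentation exist only in source text; they do not survive compilation, which is why binary attribution leans on the binary-level features instead.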
Binary-level
Control flow graph (CFG) structure.
Function call sequences.
Register usage patterns.
Instruction frequency (n-grams).
Calling conventions.
Padding and alignment.
The set of imported libraries.
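The import set is one of the simpler binary-level features to compare. A minimal sketch, assuming the imported symbol names have already been extracted by some ELF/PE parser (not shown); the helper names `jaccard` and `imports_to_vector` and the sample import lists are hypothetical:

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of import names (1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty import tables are treated as identical
    return len(a & b) / len(a | b)

def imports_to_vector(imports, vocabulary):
    """Binary indicator vector over a fixed, ordered vocabulary of import names."""
    present = set(imports)
    return [1 if name in present else 0 for name in vocabulary]

# Hypothetical import lists for two samples.
sample_a = ["CreateFileW", "WriteFile", "CryptEncrypt", "InternetOpenA"]
sample_b = ["CreateFileW", "WriteFile", "CryptEncrypt", "RegSetValueExW"]
vocab = sorted(set(sample_a) | set(sample_b))

print(jaccard(sample_a, sample_b))         # 3 shared names out of 5 distinct
print(imports_to_vector(sample_a, vocab))  # one slot per vocabulary entry
```

The indicator vector can be concatenated with other features before classification. On its own, an import set is easy to change, so it is a weak signal.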
Decompilation-aided
Reverse the binary with Ghidra or IDA to get an approximation of the source.
Apply stylometry to the decompiled output.
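ML approaches
Classical
Random Forest with hand-crafted features.
Caliskan-Islam et al. (2015): authorship attribution from binaries.
Deep learning
Binary embeddings: SAFE, Asm2Vec.
Graph neural networks on the CFG.
Transformers on instruction sequences.
Pre-trained and contrastive models
PalmTree (a pre-trained assembly language model) and jTrans (a Transformer fine-tuned contrastively) for binary code similarity.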
Applications
Forensics
Malware authorship attribution.
APT group identification.
Ransomware family classification.
Open source
Plagiarism detection.
License violation tracking.
Security research
Vulnerability fingerprinting.
N-day exploit detection.
AI-generated code detection
Code generated by GitHub Copilot or GPT models.
Classifying code as AI-generated (some papers report over 90% accuracy).
Statistical watermarking.
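To make "statistical watermarking" concrete: one family of schemes biases generation toward a pseudo-random "green" subset of tokens, and detection reduces to a one-proportion z-test on how many tokens are green. The sketch below shows only the detection side. The hashing scheme, the key `demo-key`, and the function names are assumptions for illustration, not the scheme of any specific product.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def watermark_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count against the no-watermark expectation gamma * n."""
    n = len(tokens) - 1  # the first token has no predecessor to seed the hash
    if n <= 0:
        return 0.0
    green = sum(is_green(p, t, gamma, key) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Hypothetical token stream; real use would apply the generator's own tokenizer.
tokens = "for i in range ( 10 ) : total += i".split()
print(watermark_z_score(tokens))
```

A large positive z-score suggests watermarked output. Unwatermarked text should score near zero on average, and short snippets give the test little power.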
Challenges
Compiler and optimization variation: the same source yields different binaries.
Stripped binaries: no symbols.
Obfuscation: acts as anti-stylometry.
Multiple authors: commits from several people are mixed together.
Transfer: different languages or platforms.
Anti-stylometry (defense)
Style anonymization: normalization.
Adversarial perturbation.
Code rewriters: syntactic transformation.
Ethics
Whistleblowers: risk of exposing the author of anonymous code.
Open source: revealing an author's identity.
Research participants: consent.
Governments: attributing code to dissidents.
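💻 Patterns
Feature extraction (CFG-based)

    import angr  # binary analysis
    import networkx as nx
    import numpy as np

    def extract_cfg_features(binary_path):
        """Summarize the CFG structure of a binary as a small feature dict."""
        proj = angr.Project(binary_path, auto_load_libs=False)
        cfg = proj.analyses.CFGFast()
        funcs = list(cfg.kb.functions.values())
        # Cyclomatic complexity per function: E - N + 2 on its block graph.
        complexity = [f.graph.number_of_edges() - f.graph.number_of_nodes() + 2
                      for f in funcs]
        # Approximate max call depth: longest path in the call graph after
        # collapsing recursion cycles into single nodes.
        call_dag = nx.condensation(nx.DiGraph(cfg.kb.functions.callgraph))
        return {
            'n_functions': len(funcs),
            'avg_basic_blocks': np.mean([len(f.block_addrs_set) for f in funcs]),
            'avg_function_size': np.mean([f.size for f in funcs]),
            'cyclomatic_complexity': sum(complexity),
            'call_depth_max': nx.dag_longest_path_length(call_dag),
        }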
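Instruction n-gram

    import collections

    import capstone

    def instruction_ngrams(binary, n=3):
        """Count mnemonic n-grams in raw x86-64 code bytes."""
        md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
        # 0x1000 is only the address assigned to the first instruction.
        # Disassembly stops at the first bytes capstone cannot decode.
        instructions = [insn.mnemonic for insn in md.disasm(binary, 0x1000)]  # mov, push, call, ...
        ngrams = collections.Counter()
        for i in range(len(instructions) - n + 1):
            ngrams[tuple(instructions[i:i + n])] += 1
        return ngrams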
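Once two binaries have n-gram counts, a simple way to compare them is cosine similarity between the count vectors. The sketch below works on plain mnemonic lists so it runs without a disassembler; `mnemonic_ngrams`, `cosine_similarity`, and the two mnemonic sequences are hypothetical names and data for illustration.

```python
import math
from collections import Counter

def mnemonic_ngrams(mnemonics, n=3):
    """Count n-grams over a list of instruction mnemonics."""
    return Counter(tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mnemonic sequences from two functions.
func_a = ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"]
func_b = ["push", "mov", "sub", "lea", "call", "add", "pop", "ret"]

profile_a = mnemonic_ngrams(func_a, n=2)
profile_b = mnemonic_ngrams(func_b, n=2)
print(cosine_similarity(profile_a, profile_b))
```

Raw counts favor long binaries; normalizing by total n-gram count or using TF-IDF weights is a common refinement.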
Binary embedding (SAFE-style)

    import torch
    import torch.nn as nn

    class BinaryEmbedding(nn.Module):
        """Instruction sequence -> vector."""

        def __init__(self, vocab_size=10000, dim=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(dim * 2, dim)

        def forward(self, instruction_ids):
            x = self.emb(instruction_ids)
            _, (h, _) = self.lstm(x)
            # h[0] and h[1] are the final forward and backward hidden states.
            h = torch.cat([h[0], h[1]], dim=-1)
            return self.proj(h)

    # Training: contrastive (same author close together, different authors far apart).
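The contrastive objective mentioned above can be illustrated with a triplet margin loss: pull an anchor embedding toward a sample by the same author and push it away from a sample by a different author. This is a plain-Python sketch of the arithmetic only, with hypothetical 2-D embeddings; the choice of triplet loss is an assumption, and in a real training loop the same quantity would be computed on tensors (PyTorch provides `nn.TripletMarginLoss`).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings: same-author sample is near, other-author sample is far.
anchor = [0.0, 0.0]
same_author = [0.0, 1.0]
other_author = [3.0, 4.0]

print(triplet_loss(anchor, same_author, other_author))  # already separated by more than the margin
print(triplet_loss(anchor, other_author, same_author))  # violated: positive is farther than negative
```

The loss is zero once the different-author sample is at least `margin` farther away than the same-author sample, so training effort concentrates on triplets that are still confused.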
Author classifier (RF on features)

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Collect binaries from N authors.
    # Assumes `dataset` maps author -> list of binaries and `extract_features`
    # returns a fixed-length numeric vector; neither is defined here.
    X, y = [], []
    for author, binaries in dataset.items():
        for b in binaries:
            features = extract_features(b)
            X.append(features)
            y.append(author)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    clf = RandomForestClassifier(n_estimators=500, random_state=42)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # held-out accuracy
AI vs Human code detection

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # GPTZero-style perplexity check applied to code.
    def detect_ai_code(code, perplexity_threshold=30):
        tok = GPT2Tokenizer.from_pretrained('gpt2')
        model = GPT2LMHeadModel.from_pretrained('gpt2')
        model.eval()
        # GPT-2 accepts at most 1024 tokens, so longer inputs are truncated.
        inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs['input_ids'])
        perplexity = torch.exp(outputs.loss).item()
        return perplexity < perplexity_threshold  # low perplexity is read as "AI"

→ Unreliable: many false positives.
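The false-positive problem follows from how perplexity is defined: it is the exponential of the negative mean log-probability per token, so any highly predictable text scores low regardless of who wrote it. A small sketch of the arithmetic, using made-up per-token probabilities (not measured from any model):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities.
boilerplate = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]  # predictable human-written boilerplate
unusual = [math.log(p) for p in (0.01, 0.05, 0.02, 0.01)]    # idiosyncratic code

threshold = 30
print(perplexity(boilerplate) < threshold)  # True: flagged as "AI" although human-written
print(perplexity(unusual) < threshold)      # False: passes as "human"
```

Getters, imports, and license headers are predictable by nature, which is one reason a fixed threshold misfires on ordinary human code.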
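Style normalization (anti-stylometry)

    import ast
    import re

    def normalize_code_style(source):
        """Reduce stylometric leakage from Python source (layout only)."""
        # Standardize indentation: expand each tab to four spaces.
        source = re.sub(r'\t', '    ', source)
        # Round-trip through the AST to normalize spacing, quoting, and line
        # breaks. This drops comments and needs Python 3.9+ for ast.unparse.
        tree = ast.parse(source)
        # Variable renaming would go here: visit the tree and rename
        # identifiers deterministically.
        return ast.unparse(tree)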
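The renaming step left as a comment above could look roughly like this for Python source. It is a sketch under simplifying assumptions: `normalize_identifiers` is a hypothetical name, only parameters and assigned variables are renamed, and it does not handle keyword arguments at call sites, `global`/`nonlocal` declarations, or existing names that already look like `v0`. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

def normalize_identifiers(source: str) -> str:
    """Rename parameters and assigned variables to v0, v1, ... deterministically."""
    tree = ast.parse(source)

    # Pass 1: collect every name that is bound by a parameter or an assignment.
    bound = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
    # Sorting makes the mapping independent of traversal order.
    mapping = {old: f"v{i}" for i, old in enumerate(sorted(bound))}

    # Pass 2: rewrite every occurrence of a bound name.
    class Renamer(ast.NodeTransformer):
        def visit_arg(self, node):
            node.arg = mapping.get(node.arg, node.arg)
            return node

        def visit_Name(self, node):
            node.id = mapping.get(node.id, node.id)
            return node

    return ast.unparse(Renamer().visit(tree))

original = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for item in values:\n"
    "        acc += item\n"
    "    return acc\n"
)
anonymized = normalize_identifiers(original)
print(anonymized)
```

Collecting names in a first pass avoids missing uses that appear before their binding in traversal order, such as the element expression of a list comprehension. Function and attribute names are left alone here because renaming them would change the module's public interface.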
When to use: malware analysis, plagiarism checks, AI code detection (with caution), forensic investigation.
When not to use: a single binary (insufficient sample); exposing an anonymous whistleblower (ethics).
❌ Anti-patterns
Relying on a single feature: one signal is easy to spoof.
High confidence on stripped binaries: less information is available.
Trusting production claims from closed-source tools: they cannot be independently verified.
Trusting AI detection (GPTZero-style on code) 100%: false positives are common.
Ignoring anti-stylometry: authors may actively resist attribution.