---
id: wiki-2026-0508-binary-author-identification
title: Binary Author Identification
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [binary attribution, code stylometry, malware authorship, GAN-detect, AI-vs-human code]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [security, forensics, binary-analysis, stylometry, malware-attribution, ai-vs-human-code, attribution]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: angr / radare2 / Ghidra / scikit-learn / PyTorch
---

# Binary Author Identification

## 📌 One-line insight
> **"A digital fingerprint."** Detect an author's style in a compiled binary. The combination of control flow, register usage, and idioms is distinctive per author. Critical for malware forensics. A modern variant: telling human-written code from AI-generated code.

## 📖 Core

### Caliskan-Islam et al. (2015 / 2018)
- Stylometric features survive in the binary.
- About 96% accuracy across 100 authors (as reported).
- The signal persists even after compilation and optimization.

### Features

#### Source-level (recovered from the binary)
- Indentation, brace style.
- Variable naming conventions.
- Keyword frequency.
- Operator preference.

#### Binary-level
- Control Flow Graph (CFG) structure.
- Function call sequences.
- Register usage patterns.
- Instruction frequency (n-grams).
- Calling convention.
- Padding / alignment.
- Set of imported libraries.

#### Decompilation-aided
- Reverse with Ghidra / IDA → source approximation.
- Stylometry on the decompiled output.

### ML approaches

#### Classical
- **Random Forest** + hand-crafted features.
- **Caliskan 2015**: binary attribution.

#### Deep learning
- **Binary embedding**: SAFE, Asm2Vec.
- **Graph NN** on the CFG.
- **Transformer** on the instruction sequence.

#### Contrastive
- **PalmTree, jTrans**: binary similarity.

### Applications

#### Forensic
- Malware authorship attribution.
- APT group identification.
- Ransomware family identification.

#### Open source
- Plagiarism detection.
- License violation tracking.

#### Security research
- Vulnerability fingerprinting.
- N-day exploit detection.

#### AI-generated code detection
- Code generated by GitHub Copilot / GPT.
- Detection accuracy (some papers report > 90%).
- Watermarks (statistical).

### Challenges
1. **Compiler / optimization variance**: the same source yields different binaries.
2. **Stripped binaries**: no symbols.
3. **Obfuscation**: anti-stylometry.
4. **Multi-author**: commits from several authors are mixed together.
5. **Transfer**: different languages / different platforms.

### Anti-stylometry (defense)
- **Style anonymization**: normalization.
- **Adversarial perturbation**.
- **Code rewriter**: syntactic transforms.

### Ethics
- **Whistleblowers**: risk of exposing the author of anonymous code.
- **Open source**: may reveal an author's identity.
- **Research participants**: consent.
- **Government**: attribution of dissident code.

## 💻 Patterns

### Feature extraction (CFG-based)
```python
import angr
import networkx as nx
import numpy as np

# Binary analysis: summarize CFG structure as a flat feature dict.
# Attribute names follow recent angr releases and may differ across versions.
def extract_cfg_features(binary_path):
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    funcs = list(cfg.kb.functions.values())
    if not funcs:
        return {}
    # Cyclomatic complexity per function: E - N + 2 on its block graph.
    complexity = [f.graph.number_of_edges() - f.graph.number_of_nodes() + 2 for f in funcs]
    # Collapse recursion cycles so the longest call chain is well defined.
    call_dag = nx.condensation(nx.DiGraph(cfg.kb.functions.callgraph))
    return {
        'n_functions': len(funcs),
        'avg_basic_blocks': np.mean([len(f.block_addrs_set) for f in funcs]),
        'avg_function_size': np.mean([f.size for f in funcs]),
        'cyclomatic_complexity': sum(complexity),
        'call_depth_max': nx.dag_longest_path_length(call_dag),
    }
```

### Instruction n-gram
```python
import collections

import capstone

def instruction_ngrams(binary, n=3):
    """Count mnemonic n-grams; `binary` is raw x86-64 code bytes (e.g. a .text section)."""
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    # 0x1000 is an arbitrary base address; it does not affect the mnemonics.
    instructions = [insn.mnemonic for insn in md.disasm(binary, 0x1000)]  # mov, push, call, ...
    ngrams = collections.Counter()
    for i in range(len(instructions) - n + 1):
        ngrams[tuple(instructions[i:i+n])] += 1
    return ngrams
```

### Binary embedding (SAFE-style)
```python
import torch
import torch.nn as nn

class BinaryEmbedding(nn.Module):
    """Instruction sequence → vector."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, instruction_ids):
        x = self.emb(instruction_ids)
        _, (h, _) = self.lstm(x)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.proj(h)

# Train with a contrastive loss (same author close, different authors far apart).
```

### Author classifier (RF on features)
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Collect binaries from N authors.
# `dataset` (author -> list of binaries) and `extract_features` are assumed to be defined elsewhere.
X, y = [], []
for author, binaries in dataset.items():
    for b in binaries:
        features = extract_features(b)
        X.append(features)
        y.append(author)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```

### AI vs Human code detection
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPTZero-style perplexity check applied to code
def detect_ai_code(code, perplexity_threshold=30):
    tok = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()
    # GPT-2 accepts at most 1024 tokens; truncate longer inputs.
    inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    perplexity = torch.exp(outputs.loss).item()
    return perplexity < perplexity_threshold  # low perplexity = likely AI
```
→ Unreliable (many false positives).
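
### Threshold calibration (false positive rate)
The perplexity check above is flagged as unreliable because of false positives. One way to put a number on that, sketched here under assumptions, is to score a set of code samples known to be human-written and measure how many fall below the threshold. The function names `false_positive_rate` and `threshold_for_fpr` are hypothetical helpers, not part of any library, and the perplexity values are expected to come from whatever scoring model is in use.

```python
def false_positive_rate(human_perplexities, threshold=30.0):
    """Fraction of known human-written samples that would be flagged as AI.

    A sample is flagged when its perplexity is strictly below `threshold`,
    mirroring the `perplexity < perplexity_threshold` rule.
    """
    if not human_perplexities:
        raise ValueError("need at least one human-written sample")
    flagged = sum(1 for p in human_perplexities if p < threshold)
    return flagged / len(human_perplexities)


def threshold_for_fpr(human_perplexities, max_fpr=0.05):
    """Largest sample-derived threshold whose false positive rate stays <= max_fpr."""
    if not human_perplexities:
        raise ValueError("need at least one human-written sample")
    ordered = sorted(human_perplexities)
    allowed = int(max_fpr * len(ordered))  # how many human samples may be flagged
    if allowed >= len(ordered):
        return float("inf")  # every sample may be flagged, so any threshold works
    # Only values strictly below ordered[allowed] are flagged: at most `allowed` of them.
    return ordered[allowed]
```

If `false_positive_rate` comes out high at the default threshold of 30, that is a sign the detector should not be trusted on that code base. A calibrated threshold only bounds false positives on the sample it was fitted to, so it should be checked again on held-out human code.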
### Style normalization (anti-stylometry)
```python
import ast

def normalize_code_style(source):
    """Reduce stylometric leakage in Python source."""
    # Standardize indentation: tabs become four spaces.
    source = source.replace('\t', '    ')
    tree = ast.parse(source)
    # Variable renaming would go here: a deterministic, scope-aware
    # ast.NodeTransformer pass (not implemented in this sketch).
    # ast.unparse (Python 3.9+) re-emits the code with uniform spacing,
    # quoting, and layout, and drops comments.
    return ast.unparse(tree)
```

### Malware family clustering
```python
import collections

from sklearn.cluster import DBSCAN

def cluster_malware(binaries):
    # `embed` (binary -> fixed-length vector) is assumed to be defined elsewhere.
    embeddings = [embed(b) for b in binaries]
    clusterer = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
    labels = clusterer.fit_predict(embeddings)

    families = collections.defaultdict(list)
    for b, lbl in zip(binaries, labels):
        if lbl >= 0:  # label -1 marks noise (no family assigned)
            families[lbl].append(b)
    return families
```

## 🤔 Decision criteria

| Application | Approach |
|---|---|
| Malware attribution | CFG features + RF |
| Plagiarism (binary) | SAFE / Asm2Vec embedding |
| AI code detection | Perplexity (unreliable) + watermark |
| Cross-compiler | Multi-binary aggregate |
| Stripped binary | Decompiler + style |
| Privacy protection | Style normalization |

**Default**: hand-crafted features + RF as the baseline. SAFE / transformers for SOTA.

## 🔗 Graph
- Parent: [[Security]] · [[ESLint-Static-Analysis|Static-Analysis-Linting]]
- Variant: [[Code-Stylometry]]
- Adjacent: [[Authenticity]]

## 🤖 LLM usage
**When**: malware analysis, plagiarism checks, AI code detection (with caution), forensic investigation.
**When not**: a single binary (insufficient samples); exposing an anonymous whistleblower (ethics).

## ❌ Anti-patterns
- **Single-feature reliance**: a single signal is easy to spoof.
- **High confidence on stripped binaries**: less information is available.
- **Production claims from closed-source tools**: cannot be verified.
- **100% trust in AI detection (GPTZero-on-code)**: false positives.
- **Ignoring anti-stylometry**: authors may actively resist attribution.

## 🧪 Verification / duplicates
- Verified (Caliskan-Islam 2015 USENIX, SAFE 2019, Asm2Vec).
- Trust level B (active research).
- Related: [[AI-Generated-Code-Assurance]] · [[Authenticity]] · [[ESLint-Static-Analysis|Static-Analysis-Linting]] · [[Watermarking]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: features + ML approaches + angr / capstone / RF / SAFE code + AI detection |