---
id: wiki-2026-0508-binary-author-identification
title: Binary Author Identification
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [binary attribution, code stylometry, malware authorship, GAN-detect, AI-vs-human code]
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: [security, forensics, binary-analysis, stylometry, malware-attribution, ai-vs-human-code, attribution]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: angr / radare2 / Ghidra / scikit-learn / PyTorch
---

# Binary Author Identification

## 📌 One-line insight
> **"A digital fingerprint."** A compiled binary retains traces of its author's style: control flow, register usage, and coding idioms can be distinctive enough to identify who wrote it. This is critical for malware forensics. A more recent application is telling human-written code from AI-generated code.

## 📖 Core concepts

### Caliskan-Islam et al. (2015 / 2018)
- Stylometric features survive in the compiled binary.
- Reported accuracy: about 96% on a set of 100 authors.
- The signal persists even after compilation and optimization.

### Features

#### Source-level (when source is available or approximated from the binary)
- Indentation and brace style.
- Variable naming convention.
- Keyword frequency.
- Operator preference.
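
The source-level features listed above can be turned into numbers with a few lines of text processing. The following is a minimal sketch for C-like source, not a reproduction of any published feature set: the function name `source_style_features`, the keyword subset, and the feature names are all assumptions made for illustration. It also counts words inside comments and string literals, which a real extractor would exclude.

```python
import re
from collections import Counter

# Illustrative subset of control-flow keywords, not a full C keyword list.
C_KEYWORDS = ("if", "else", "for", "while", "do", "switch", "return", "goto")


def _ratio(a, b):
    """Share of `a` in `a + b`, or 0.0 when both counts are zero."""
    return a / (a + b) if (a + b) else 0.0


def source_style_features(source):
    """Hypothetical helper: layout and lexical style features of C-like source."""
    lines = source.splitlines()

    # Indentation: how many indented lines start with a tab rather than a space.
    indented = [ln for ln in lines if ln.startswith((" ", "\t"))]
    tab_indented = sum(1 for ln in indented if ln.startswith("\t"))

    # Brace style: "{" alone on its line versus "{" closing a statement line.
    brace_own_line = sum(1 for ln in lines if ln.strip() == "{")
    brace_line_end = sum(1 for ln in lines if ln.rstrip().endswith("{")) - brace_own_line

    # Keyword frequency, relative to all profiled keywords.
    words = re.findall(r"[A-Za-z_]\w*", source)
    keyword_counts = Counter(w for w in words if w in C_KEYWORDS)
    n_keywords = sum(keyword_counts.values())

    # Operator preference: ++i versus i++.
    pre_inc = len(re.findall(r"\+\+\w", source))
    post_inc = len(re.findall(r"\w\+\+", source))

    features = {
        "tab_indent_ratio": tab_indented / len(indented) if indented else 0.0,
        "brace_own_line_ratio": _ratio(brace_own_line, brace_line_end),
        "pre_increment_ratio": _ratio(pre_inc, post_inc),
    }
    for kw in C_KEYWORDS:
        features[f"kw_{kw}"] = keyword_counts[kw] / n_keywords if n_keywords else 0.0
    return features
```

Layout features such as indentation do not survive compilation, so on decompiled output only the lexical part (keywords, operators) is likely to carry any signal.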

#### Binary-level
- Control Flow Graph (CFG) structure.
- Function call sequence.
- Register usage pattern.
- Instruction frequency (n-grams).
- Calling convention.
- Padding / alignment.
- Set of imported libraries.
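
Of the binary-level features above, the register usage pattern is one of the simplest to compute once a disassembler has produced operand text. The sketch below assumes x86-64 and takes plain operand strings (for example, the `op_str` field that capstone exposes per instruction). The function name `register_usage` and the choice to fold sub-registers such as `eax` and `al` onto `rax` are illustrative assumptions.

```python
import re
from collections import Counter


def _build_alias_map():
    """Map x86-64 general-purpose register names onto their 64-bit parent."""
    alias = {}
    for c in "abcd":
        for name in (f"r{c}x", f"e{c}x", f"{c}x", f"{c}l", f"{c}h"):
            alias[name] = f"r{c}x"
    for base in ("si", "di", "bp", "sp"):
        for name in (f"r{base}", f"e{base}", base, f"{base}l"):
            alias[name] = f"r{base}"
    for i in range(8, 16):
        for suffix in ("", "d", "w", "b"):
            alias[f"r{i}{suffix}"] = f"r{i}"
    return alias


_ALIAS = _build_alias_map()


def register_usage(operand_strings):
    """Hypothetical helper: relative frequency of each register family."""
    counts = Counter()
    for op_str in operand_strings:
        # Whole-word tokens only, so hex immediates like 0x10 are never split.
        for token in re.findall(r"\b[a-z][a-z0-9]*\b", op_str.lower()):
            if token in _ALIAS:
                counts[_ALIAS[token]] += 1
    total = sum(counts.values())
    return {reg: n / total for reg, n in counts.items()} if total else {}
```

Register allocation is decided largely by the compiler, so this histogram is probably most useful when compiler and optimization level are held constant or included as separate features.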

#### Decompilation-aided
- Reverse the binary with Ghidra / IDA to get an approximation of the source.
- Apply stylometric analysis to the decompiled output.

### ML approaches

#### Classical
- **Random Forest** on hand-crafted features.
- **Caliskan 2015**: binary authorship attribution.

#### Deep learning
- **Binary embedding**: SAFE, Asm2Vec.
- **Graph NN** on the CFG.
- **Transformer** on the instruction sequence.

#### Contrastive
- **PalmTree, jTrans**: binary similarity.

### Applications

#### Forensic
- Malware authorship attribution.
- APT group identification.
- Ransomware family identification.

#### Open source
- Plagiarism detection.
- License violation tracking.

#### Security research
- Vulnerability fingerprinting.
- N-day exploit detection.

#### AI-generated code detection
- Code generated by GitHub Copilot / GPT models.
- AI-vs-human classification (some papers report > 90% accuracy).
- Statistical watermarking.
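
To make "statistical watermarking" concrete: one family of schemes has the generator prefer tokens from a pseudo-random "green" subset that depends on the previous token, and the detector then tests whether a text contains more green tokens than chance would predict. The sketch below shows only the detection side as a one-sided z-test. The hash construction, the key `demo-key`, and the function names are assumptions made for illustration and do not correspond to any deployed system.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly assign `token` to the green list, seeded by `prev_token`."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    # Map the first 8 bytes to [0, 1); a fraction gamma of tokens is green.
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def watermark_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count against the no-watermark expectation."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    green = sum(is_green(prev, tok, gamma, key) for prev, tok in pairs)
    # Without a watermark, green ~ Binomial(n, gamma).
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Unwatermarked text should give a z-score near zero, while text generated with a strong green-list bias gives a large positive value. Detection only works when the generator cooperated and the detector knows the key, so it says nothing about code from models that do not watermark.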

### Challenges
1. **Compiler / optimization variation**: the same source yields different binaries.
2. **Stripped binaries**: no symbols are available.
3. **Obfuscation**: deliberate anti-stylometry.
4. **Multiple authors**: commits from several people are mixed in one binary.
5. **Transfer**: models generalize poorly across languages and platforms.

### Anti-stylometry (defense)
- **Style anonymization**: normalization.
- **Adversarial perturbation**.
- **Code rewriter**: syntactic transformation.

### Ethics
- **Whistleblowers**: attribution risks exposing the author of anonymously published code.
- **Open source**: an author who wanted to stay pseudonymous can be revealed.
- **Research participants**: consent is required.
- **Governments**: attribution can be turned against dissidents' code.

## 💻 Patterns

### Feature extraction (CFG-based)
```python
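import angr  # binary analysis framework
import networkx as nx
import numpy as np

def extract_cfg_features(binary_path):
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    funcs = list(cfg.kb.functions.values())

    # Cyclomatic complexity per function: E - N + 2 on its basic-block graph
    complexity = [f.graph.number_of_edges() - f.graph.number_of_nodes() + 2 for f in funcs]

    # Longest call chain; recursion cycles are collapsed into single nodes first
    call_dag = nx.condensation(nx.DiGraph(cfg.kb.functions.callgraph))

    return {
        'n_functions': len(funcs),
        'avg_basic_blocks': np.mean([len(f.block_addrs_set) for f in funcs]),
        'avg_function_size': np.mean([f.size for f in funcs]),
        'cyclomatic_complexity': sum(complexity),
        'call_depth_max': nx.dag_longest_path_length(call_dag),
    }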
```

### Instruction n-gram
```python
import collections

import capstone

def instruction_ngrams(code, n=3, base_addr=0x1000):
    """Count mnemonic n-grams in raw x86-64 machine code (e.g. the bytes of .text)."""
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    # Linear sweep: disassembly stops at the first bytes capstone cannot decode
    instructions = [insn.mnemonic for insn in md.disasm(code, base_addr)]  # mov, push, call, ...

    ngrams = collections.Counter()
    for i in range(len(instructions) - n + 1):
        ngrams[tuple(instructions[i:i+n])] += 1
    return ngrams
```
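
Raw n-gram counts become useful once two binaries can be compared. A simple option is cosine similarity between the two count profiles, sketched below in plain Python. The helper names `mnemonic_ngrams` and `cosine_similarity` are illustrative, and the mnemonic lists stand in for the output of a real disassembler.

```python
import math
from collections import Counter


def mnemonic_ngrams(mnemonics, n=2):
    """Count n-grams over an already-disassembled list of mnemonics."""
    return Counter(tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram Counters (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy function prologues that share most of their bigrams
f1 = mnemonic_ngrams(["push", "mov", "sub", "mov", "call", "leave", "ret"])
f2 = mnemonic_ngrams(["push", "mov", "sub", "lea", "call", "leave", "ret"])
similarity = cosine_similarity(f1, f2)
```

Cosine similarity ignores overall length, which helps when comparing binaries of very different sizes. For a classifier, the same Counters would instead be projected onto a fixed n-gram vocabulary to produce equal-length vectors.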

### Binary embedding (SAFE-style)
```python
import torch
import torch.nn as nn

class BinaryEmbedding(nn.Module):
    """Map a sequence of instruction IDs to a fixed-size vector."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * 2, dim)
    
    def forward(self, instruction_ids):
        x = self.emb(instruction_ids)
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.proj(h)

# Training: contrastive objective (same-author samples close, different-author samples far apart).
```
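
The closing comment in the block above names a contrastive objective without spelling it out. One common formulation is the triplet margin loss: an anchor embedding should be closer to a same-author sample (positive) than to a different-author sample (negative) by at least a margin. The sketch below writes the arithmetic in plain Python so it can be checked by hand. It is an assumption about how such a model could be trained, not the procedure of SAFE or any other cited system.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Same-author sample is near, different-author sample is far: no penalty
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the loss grows with the violation
bad = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In a PyTorch training loop, `torch.nn.TripletMarginLoss` computes essentially the same quantity on batched tensors, with the three inputs being `BinaryEmbedding` outputs for the anchor, positive, and negative samples.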

### Author classifier (RF on features)
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

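# Collect binaries from N authors.
# `dataset` (author -> list of binaries) and `extract_features` (binary -> fixed-length
# numeric vector) are placeholders that must be supplied by the caller.
X, y = [], []
for author, binaries in dataset.items():
    for b in binaries:
        features = extract_features(b)
        X.append(features)
        y.append(author)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)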

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```
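
The score printed above is closed-world accuracy: every test binary is assumed to come from one of the training authors. In forensic use that assumption often fails, so one guard is to abstain when the classifier's top probability is low or too close to the runner-up. The sketch below operates on a single row of `clf.predict_proba(...)` together with `clf.classes_`. The function name and both thresholds are hypothetical and would need tuning on held-out data.

```python
def attribute_with_abstention(proba_row, classes, min_confidence=0.6, min_margin=0.2):
    """Return the top author, or None when the evidence is too weak to attribute."""
    ranked = sorted(zip(proba_row, classes), key=lambda pc: pc[0], reverse=True)
    if not ranked:
        return None
    top_p, top_author = ranked[0]
    runner_up_p = ranked[1][0] if len(ranked) > 1 else 0.0
    # Abstain if the winner is weak or barely ahead of the second candidate.
    if top_p < min_confidence or (top_p - runner_up_p) < min_margin:
        return None
    return top_author


authors = ["alice", "bob", "carol"]
confident = attribute_with_abstention([0.7, 0.2, 0.1], authors)
uncertain = attribute_with_abstention([0.4, 0.35, 0.25], authors)
```

Returning `None` for ambiguous cases trades recall for fewer false attributions, which matters more than raw accuracy when an attribution can have consequences for a person.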

### AI vs Human code detection
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPTZero-style perplexity heuristic applied to code
def detect_ai_code(code, perplexity_threshold=30):
    tok = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()

    # GPT-2 accepts at most 1024 tokens, so longer inputs are truncated
    inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])

    perplexity = torch.exp(outputs.loss).item()
    return perplexity < perplexity_threshold  # low perplexity is read as "likely AI"
```

→ Unreliable: this heuristic produces many false positives.

### Style normalization (anti-stylometry)
```python
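import ast
import re

def normalize_code_style(source):
    """Reduce stylometric leakage from Python source (requires Python 3.9+)."""
    # Standardize indentation: each leading tab becomes four spaces
    source = re.sub(r'^\t+', lambda m: '    ' * len(m.group()), source, flags=re.MULTILINE)
    # Round-trip through the AST: drops comments and normalizes whitespace,
    # quoting, and redundant parentheses
    ast_tree = ast.parse(source)
    # Deterministic identifier renaming (an ast.NodeTransformer pass) would go
    # here; it is omitted from this sketch
    return ast.unparse(ast_tree)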
```

### Malware family clustering
```python
import collections

from sklearn.cluster import DBSCAN

def cluster_malware(binaries):
    # `embed` is a placeholder: it must map a binary to a fixed-length vector
    embeddings = [embed(b) for b in binaries]
    clusterer = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
    labels = clusterer.fit_predict(embeddings)

    families = collections.defaultdict(list)
    for b, lbl in zip(binaries, labels):
        if lbl >= 0:  # DBSCAN labels noise points as -1
            families[lbl].append(b)
    return families
```

## 🤔 Decision criteria
| Application | Approach |
|---|---|
| Malware attribution | CFG features + RF |
| Plagiarism (binary) | SAFE / Asm2Vec embedding |
| AI code detection | Perplexity (unreliable) + watermark |
| Cross-compiler | Multi-binary aggregate |
| Stripped binary | Decompiler + style |
| Privacy protection | Style normalization |

**Default**: start with a hand-crafted-feature + RF baseline. SAFE-style embeddings and transformers are the state of the art.
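
The "Multi-binary aggregate" row can be read in several ways. One plausible reading is to pool per-binary predictions over several binaries believed to share an author, such as the same project built with different compilers, so that no single build dominates the verdict. The sketch below sums log-probabilities per author. The function name, the input shape (one `author -> probability` dict per binary), and the probability floor are assumptions made for illustration.

```python
import math


def aggregate_author_scores(per_binary_probs, floor=1e-6):
    """Rank authors by summed log-probability across several binaries."""
    authors = set()
    for probs in per_binary_probs:
        authors.update(probs)
    scores = {
        # Missing or zero probabilities are floored so one binary cannot
        # push an author's total to minus infinity.
        author: sum(math.log(max(probs.get(author, 0.0), floor)) for probs in per_binary_probs)
        for author in authors
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = aggregate_author_scores([
    {"a": 0.5, "b": 0.5},   # e.g. a gcc -O0 build
    {"a": 0.6, "b": 0.4},   # e.g. a gcc -O2 build
    {"a": 0.3, "b": 0.7},   # e.g. a clang build
])
```

Summing log-probabilities treats the binaries as independent evidence, which overstates confidence when the builds are near-duplicates. Averaging the probabilities instead is a more conservative alternative.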

## 🔗 Graph
- Parent: [[Security]] · [[ESLint-Static-Analysis|Static-Analysis-Linting]]
- Variant: [[Code-Stylometry]]
- Adjacent: [[Authenticity]]

## 🤖 LLM usage
**When**: malware analysis, plagiarism checks, AI-code detection (with caution), forensic investigation.
**When not**: attribution from a single binary (insufficient sample); exposing an anonymous whistleblower (ethics).

## ❌ Anti-patterns
- **Single-feature reliance**: a single signal is easy to spoof.
- **High confidence on stripped binaries**: less information is available.
- **Trusting production claims from closed-source tools**: they cannot be verified.
- **Fully trusting AI detection (GPTZero-style on code)**: false positives are common.
- **Ignoring anti-stylometry**: authors may actively resist attribution.

## 🧪 Verification / duplicates
- Verified (Caliskan-Islam 2015 USENIX, SAFE 2019, Asm2Vec).
- Trust level B (active research area).
- Related: [[AI-Generated-Code-Assurance]] · [[Authenticity]] · [[ESLint-Static-Analysis|Static-Analysis-Linting]] · [[Watermarking]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: features + ML approaches + angr / capstone / RF / SAFE code + AI detection |