id: wiki-2026-0508-binary-author-identification
title: Binary Author Identification
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: binary attribution, code stylometry, malware authorship, GAN-detect, AI-vs-human code
duplicate_of: none
source_trust_level: B
confidence_score: 0.85
verification_status: applied
tags: security, forensics, binary-analysis, stylometry, malware-attribution, ai-vs-human-code, attribution
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: angr / radare2 / Ghidra / scikit-learn / PyTorch

Binary Author Identification

📌 One-line insight

"A digital fingerprint." Binary author identification detects an author's style in a compiled binary. Control flow, register usage, and idioms are distinctive to each author. The technique is critical for malware forensics. Its modern variant is distinguishing human-written from AI-generated code.

📖 Key points

Caliskan-Islam et al. (2015 / 2018)

  • Stylometric features survive in the binary.
  • About 96% accuracy across 100 authors.
  • Attribution remains possible even after compilation and optimization.

Features

Source-level (binary recovered)

  • Indentation, brace style.
  • Variable naming conventions.
  • Keyword frequency.
  • Operator preference.

Binary-level

  • Control Flow Graph (CFG) structure.
  • Function call sequences.
  • Register usage patterns.
  • Instruction frequency (n-grams).
  • Calling conventions.
  • Padding / alignment.
  • The set of imported libraries.
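
One of the binary-level features above, the register usage pattern, can be sketched with the standard library alone. This is a minimal illustration, not a method taken from the cited papers: it assumes the operand strings come from some disassembler (for example capstone's `op_str`), covers only x86-64 general-purpose registers, and folds sub-registers onto their 64-bit parent. The names `register_usage` and `_PARENT` are hypothetical.

```python
import re
from collections import Counter

# Assumption: x86-64 general-purpose registers only. Sub-registers
# (eax, ax, al, ...) are folded onto their 64-bit parent register.
_PARENT = {}
for base in ("a", "b", "c", "d"):
    for name in (f"r{base}x", f"e{base}x", f"{base}x", f"{base}l", f"{base}h"):
        _PARENT[name] = f"r{base}x"
for base in ("si", "di", "bp", "sp"):
    for name in (f"r{base}", f"e{base}", base, f"{base}l"):
        _PARENT[name] = f"r{base}"
for i in range(8, 16):
    for suffix in ("", "d", "w", "b"):
        _PARENT[f"r{i}{suffix}"] = f"r{i}"

# Whole alphanumeric words only, so hex immediates such as 0x10 never match.
_TOKEN = re.compile(r"\b[a-z][a-z0-9]*\b")


def register_usage(operand_strings):
    """Return the relative usage frequency of each 64-bit register."""
    counts = Counter()
    for ops in operand_strings:
        for tok in _TOKEN.findall(ops.lower()):
            if tok in _PARENT:
                counts[_PARENT[tok]] += 1
    total = sum(counts.values())
    return {reg: n / total for reg, n in counts.items()} if total else {}


if __name__ == "__main__":
    print(register_usage(["rax, rbx", "eax, 1", "qword ptr [rbp - 8], al"]))
```

The resulting frequency dict can be appended to the other hand-crafted features before classification. Register allocation is largely a compiler decision, so this signal is probably most useful when the training and test binaries share a toolchain.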

Decompilation-aided

  • Reverse the binary with Ghidra / IDA to obtain an approximation of the source.
  • Apply stylometry to the decompiled output.

ML approach

Classical

  • Random Forest on hand-crafted features.
  • Caliskan 2015: binary attribution.

Deep learning

  • Binary embeddings: SAFE, Asm2Vec.
  • Graph NN on CFG.
  • Transformer on instruction sequence.

Contrastive

  • PalmTree, jTrans: binary similarity.

Applications

Forensic

  • Malware authorship attribution.
  • APT group identification.
  • Ransomware family identification.

Open source

  • Plagiarism detection.
  • License violation tracking.

Security research

  • Vulnerability fingerprinting.
  • N-day exploit detection.

AI-generated code detection

  • Code generated by GitHub Copilot / GPT.
  • AI-authorship classification (some papers report > 90% accuracy).
  • Statistical watermarks.
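
The note does not say which watermarking scheme it has in mind. As one hedged illustration, a "green list" style detector hashes each (previous token, token) pair into a green or red bucket and tests whether green tokens are over-represented. Everything below is an assumption made for the sketch: the key, the SHA-256 bucketing, the whitespace tokenization, and the names `is_green` and `watermark_z_score`. It also operates on source tokens, not on compiled binaries.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly assign a token pair to the green list with probability gamma."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def watermark_z_score(tokens, gamma=0.5, key="demo-key"):
    """One-proportion z-test: is the green fraction higher than gamma?"""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    green = sum(is_green(prev, tok, gamma, key) for prev, tok in pairs)
    n = len(pairs)
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


if __name__ == "__main__":
    tokens = "def add ( a , b ) : return a + b".split()
    print(round(watermark_z_score(tokens), 3))
```

Unwatermarked text should score near zero on average, while a generator that favored green tokens pushes the score up with sequence length. A detector like this only works when the generator cooperated and the key is known, which is why the decision table treats watermarking as a complement to perplexity rather than a replacement.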

Challenges

  1. Compiler / optimization variation: the same source produces different binaries.
  2. Stripped binaries: symbols are missing.
  3. Obfuscation: deliberate anti-stylometry.
  4. Multi-author code: commits from several authors are mixed.
  5. Transfer: different languages or platforms.

Anti-stylometry (defense)

  • Style anonymization: normalization.
  • Adversarial perturbation.
  • Code rewriters: syntactic transforms.

Ethics

  • Whistleblowers: risk of exposing the author of anonymous code.
  • Open source: an author's identity may be revealed.
  • Research participants: consent is required.
  • Governments: attribution of dissidents' code.

💻 Patterns

Feature extraction (CFG-based)

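import angr  # binary analysis framework
import networkx as nx
import numpy as np

def extract_cfg_features(binary_path):
    """Summarize a binary's recovered CFG as a flat feature dict."""
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    funcs = list(cfg.kb.functions.values())
    if not funcs:
        return {}
    # Call depth: longest shortest path from the entry point in the call graph
    # (0 when the entry point was not recovered as a function).
    callgraph = cfg.kb.functions.callgraph
    depths = (nx.single_source_shortest_path_length(callgraph, proj.entry)
              if proj.entry in callgraph else {})

    return {
        'n_functions': len(funcs),
        'avg_basic_blocks': float(np.mean([len(f.block_addrs_set) for f in funcs])),
        'avg_function_size': float(np.mean([f.size for f in funcs])),
        # McCabe complexity per function (E - N + 2), summed over all functions.
        'cyclomatic_complexity': sum(
            f.graph.number_of_edges() - f.graph.number_of_nodes() + 2 for f in funcs),
        'call_depth_max': max(depths.values(), default=0),
    }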

Instruction n-gram

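import collections
import capstone

def instruction_ngrams(binary, n=3):
    """Count mnemonic n-grams in raw x86-64 code bytes."""
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    instructions = []
    # `binary` must be code bytes (e.g. the .text section), not the whole file.
    # 0x1000 is an assumed load address; linear disassembly stops at the first invalid byte.
    for insn in md.disasm(binary, 0x1000):
        instructions.append(insn.mnemonic)  # mov, push, call, ...

    ngrams = collections.Counter()
    for i in range(len(instructions) - n + 1):
        ngrams[tuple(instructions[i:i+n])] += 1
    return ngrams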
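
Once two binaries have been reduced to n-gram counters, comparing them needs a similarity measure. The sketch below uses cosine similarity over the sparse counts; this is one common choice, not something the note prescribes. `mnemonic_ngrams` is a hypothetical stand-in that builds the same kind of counter from an already extracted mnemonic list, so the example runs without capstone.

```python
import math
from collections import Counter


def mnemonic_ngrams(mnemonics, n=3):
    """Build an n-gram counter from a list of instruction mnemonics."""
    return Counter(tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1))


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse n-gram count profiles."""
    dot = sum(count * profile_b[gram]
              for gram, count in profile_a.items() if gram in profile_b)
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = mnemonic_ngrams(["push", "mov", "sub", "mov", "call", "leave", "ret"], n=2)
    b = mnemonic_ngrams(["push", "mov", "sub", "lea", "call", "leave", "ret"], n=2)
    print(round(cosine_similarity(a, b), 3))
```

For a classifier, the same counters are usually projected onto a fixed vocabulary of the most frequent n-grams so that every binary yields a vector of the same length.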

Binary embedding (SAFE-style)

import torch
import torch.nn as nn

class BinaryEmbedding(nn.Module):
    """Map an instruction-ID sequence to a fixed-length vector."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, instruction_ids):
        # instruction_ids: LongTensor of shape (batch, seq_len)
        x = self.emb(instruction_ids)
        _, (h, _) = self.lstm(x)
        # h[0] / h[1] are the final forward / backward hidden states.
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.proj(h)

# Training: contrastive (same author close together, different authors far apart).
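
The contrastive objective mentioned above can take several forms. A minimal sketch, assuming a triplet margin formulation (the note only says "contrastive"), is shown here with plain Python lists so it runs without PyTorch. The function names and the margin value are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the same-author pair is closer than the
    different-author pair by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]     # embedding of a binary by author A
    positive = [0.0, 1.0]   # another binary by author A
    negative = [3.0, 4.0]   # a binary by author B
    print(triplet_loss(anchor, positive, negative))  # well separated
    print(triplet_loss(anchor, negative, positive))  # roles swapped, penalized
```

In a PyTorch training loop, `torch.nn.TripletMarginLoss` computes the same quantity on batched tensors and supports backpropagation.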

Author classifier (RF on features)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Collect binaries from N authors.
# `dataset` (author -> list of binaries) and `extract_features`
# (binary -> fixed-length numeric vector) are placeholders to supply.
X, y = [], []
for author, binaries in dataset.items():
    for b in binaries:
        features = extract_features(b)
        X.append(features)
        y.append(author)

# Stratify so every author appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy

AI vs Human code detection

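import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Modern variant: GPTZero-style perplexity scoring applied to code.
def detect_ai_code(code, perplexity_threshold=30):
    tok = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    # GPT-2 accepts at most 1024 tokens, so longer inputs are truncated.
    inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])

    # outputs.loss is the mean token cross-entropy; exp() turns it into perplexity.
    perplexity = torch.exp(outputs.loss).item()
    return perplexity < perplexity_threshold  # low perplexity is read as "AI"

→ Unreliable (many false positives).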

Style normalization (anti-stylometry)

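import ast
import re

def normalize_code_style(source):
    """Reduce stylometric leakage in Python source (sketch)."""
    # Standardize indentation: tabs become four spaces.
    # Note: this also rewrites tabs inside string literals.
    source = re.sub(r'\t', '    ', source)
    # Parse and re-emit. ast.unparse (Python 3.9+) drops comments, blank
    # lines, and spacing/quoting choices, giving a canonical layout.
    ast_tree = ast.parse(source)
    # Deterministic variable renaming (visit + rename) would go here.
    # Brace-style normalization applies to C-like languages, not Python.
    return ast.unparse(ast_tree)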
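
The "visit + rename" step left open above can be sketched with `ast.NodeTransformer`. This is a simplified illustration under several assumptions: it renames every function parameter and every assigned name to `v0`, `v1`, ... in source order, uses a single global mapping (no per-scope handling), leaves function names, attributes, and imports untouched, and does not update keyword arguments at call sites. The function name `normalize_identifiers` is hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def normalize_identifiers(source):
    """Rename parameters and assigned variables to v0, v1, ... deterministically."""
    tree = ast.parse(source)

    # Pass 1: collect every bound name together with its source position.
    bound = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            bound.append((node.lineno, node.col_offset, node.arg))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.append((node.lineno, node.col_offset, node.id))

    # Number the names in order of first appearance in the source.
    mapping = {}
    for _, _, name in sorted(bound):
        mapping.setdefault(name, f"v{len(mapping)}")

    # Pass 2: rewrite every occurrence of a collected name.
    class Renamer(ast.NodeTransformer):
        def visit_arg(self, node):
            self.generic_visit(node)
            node.arg = mapping[node.arg]
            return node

        def visit_Name(self, node):
            node.id = mapping.get(node.id, node.id)
            return node

    return ast.unparse(Renamer().visit(tree))


if __name__ == "__main__":
    print(normalize_identifiers(
        "def add(first, second):\n    total = first + second\n    return total\n"))
```

Two functions that differ only in naming and layout normalize to the same text, which is the property an anonymizer needs. Collecting names in a first pass avoids missing uses that appear before their binding in the tree, such as the element expression of a comprehension.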

Malware family clustering

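import collections
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_malware(binaries):
    # `embed` is a placeholder: binary -> fixed-length vector
    # (for example, the output of a trained BinaryEmbedding).
    embeddings = np.array([embed(b) for b in binaries])
    clusterer = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
    labels = clusterer.fit_predict(embeddings)

    families = collections.defaultdict(list)
    for b, lbl in zip(binaries, labels):
        if lbl >= 0:  # DBSCAN labels noise points as -1
            families[lbl].append(b)
    return families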

🤔 Decision criteria

| Application | Approach |
| --- | --- |
| Malware attribution | CFG features + RF |
| Plagiarism (binary) | SAFE / Asm2Vec embedding |
| AI code detection | Perplexity (unreliable) + watermark |
| Cross-compiler | Multi-binary aggregate |
| Stripped binary | Decompiler + stylometry |
| Privacy protection | Style normalization |

Default: hand-crafted features + RF as the baseline; SAFE or a transformer for state-of-the-art results.
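
The "Multi-binary aggregate" entry can be read in more than one way. Under one reading, several binaries believed to share an author (for example, builds from different compilers) are classified separately and the per-binary probabilities are combined. The sketch below sums log-probabilities per author. It assumes each input dict covers the same set of authors, such as rows from a classifier's `predict_proba` output paired with its class labels. The function name and the `eps` clamp are illustrative.

```python
import math


def aggregate_author_scores(per_binary_probs, eps=1e-9):
    """Sum log-probabilities per author across binaries.

    Returns (best_author, scores); best_author is None for empty input.
    """
    scores = {}
    for probs in per_binary_probs:
        for author, p in probs.items():
            # Clamp to eps so a single zero probability cannot veto an author.
            scores[author] = scores.get(author, 0.0) + math.log(max(p, eps))
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    predictions = [
        {"alice": 0.6, "bob": 0.4},  # gcc -O2 build
        {"alice": 0.7, "bob": 0.3},  # clang -O2 build
        {"alice": 0.4, "bob": 0.6},  # gcc -O0 build
    ]
    best, scores = aggregate_author_scores(predictions)
    print(best, {k: round(v, 3) for k, v in scores.items()})
```

Summing log-probabilities treats the binaries as independent evidence, which overstates confidence when they are near-duplicates. Averaging probabilities is a more conservative alternative.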

🔗 Graph

🤖 LLM usage

When to use: malware analysis, plagiarism checks, AI code detection (with caution), forensic investigation. When not to use: a single binary (insufficient sample); exposing an anonymous whistleblower (ethics).

Anti-patterns

  • Relying on a single feature: a single signal is easy to spoof.
  • High confidence on stripped binaries: less information is available.
  • Production claims about closed-source tools: they cannot be verified.
  • Trusting AI detection (GPTZero-on-code) 100%: false positives are common.
  • Ignoring anti-stylometry: authors may actively resist attribution.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: features + ML approach + angr / capstone / RF / SAFE code + AI detection |