| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-binary-author-identification | Binary Author Identification | 10_Wiki/Topics | verified | self | | none | B | 0.85 | applied | | | 2026-05-10 | pending | |
Binary Author Identification
📌 One-line insight
A "digital fingerprint": detecting an author's coding style in a compiled binary. The combination of control flow, register usage, and idioms tends to be distinctive per author. This is critical for malware forensics. A modern extension is telling human-written code apart from AI-generated code.
📖 Core
Caliskan-Islam et al. (2015 / 2018)
- Stylometric features survive in the compiled binary.
- 96% accuracy across 100 authors.
- The signal persists even after compilation and optimization.
Features
Source-level (usable only as far as the source can be recovered; layout features such as indentation do not survive compilation)
- Indentation, brace style.
- Variable naming conventions.
- Keyword frequency.
- Operator preference.
Binary-level
- Control Flow Graph (CFG) structure.
- Function call sequences.
- Register usage patterns.
- Instruction frequency (n-grams).
- Calling convention.
- Padding / alignment.
- Set of imported libraries.
Decompilation-aided
- Reverse the binary with Ghidra / IDA to approximate the source.
- Apply stylometry to the decompiled output.
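Of the binary-level features above, the imported-library set is the simplest to compare directly. The following is a minimal sketch, assuming the import names have already been read from each binary's import table; the symbol lists and the `jaccard` helper are hypothetical and only illustrate the comparison.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical import lists, as they might be read from two binaries' import tables.
imports_a = {"CreateFileW", "ReadFile", "WriteFile", "CryptEncrypt"}
imports_b = {"CreateFileW", "ReadFile", "CryptEncrypt", "InternetOpenW"}

similarity = jaccard(imports_a, imports_b)  # 3 shared names out of 5 distinct names
```

On its own this is a weak signal, since imports are easy to change; it is more useful as one column in a larger feature vector.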
ML approaches
Classical
- Random Forest on hand-crafted features.
- Caliskan 2015: binary attribution.
Deep learning
- Binary embeddings: SAFE, Asm2Vec.
- Graph neural networks on the CFG.
- Transformers on instruction sequences.
Contrastive
- PalmTree, jTrans: binary code similarity.
Applications
Forensics
- Malware authorship attribution.
- APT group identification.
- Ransomware family classification.
Open source
- Plagiarism detection.
- License violation tracking.
Security research
- Vulnerability fingerprinting.
- N-day exploit detection.
AI-generated code detection
- Code generated by GitHub Copilot / GPT.
- AI-authorship detection (some papers report > 90%).
- Statistical watermarking.
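The statistical watermark item can be made concrete with a small sketch. The version below assumes a "green list" scheme: a keyed hash of the previous token marks a fraction `gamma` of candidate tokens as green, and the detector tests whether green tokens are over-represented. The scheme, the key, the toy generator, and every function name here are assumptions for illustration; the note does not specify a particular watermark.

```python
import hashlib
import math

def is_green(prev_token, token, key="demo-key", gamma=0.5):
    """Pseudo-randomly put `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    # Map the first 8 bytes to [0, 1) and compare against the green fraction.
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def watermark_z_score(tokens, key="demo-key", gamma=0.5):
    """z-score of the green-token count under the no-watermark null hypothesis."""
    n = len(tokens) - 1  # number of (previous, current) pairs that get scored
    if n <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

def toy_watermarked_sequence(vocab, length, key="demo-key", gamma=0.5):
    """Toy generator: always emit the first green candidate (falls back to vocab[0])."""
    tokens = [vocab[0]]
    for _ in range(length):
        prev = tokens[-1]
        tokens.append(next((t for t in vocab if is_green(prev, t, key, gamma)), vocab[0]))
    return tokens

vocab = [f"tok{i}" for i in range(50)]
marked = toy_watermarked_sequence(vocab, 200)
z = watermark_z_score(marked)  # a large positive z suggests a watermark under this key
```

Detection only works for whoever holds the key, and rewriting or reformatting the code changes the token pairs, so the score degrades under editing.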
Challenges
- Compiler / optimization variation: the same source yields different binaries.
- Stripped binaries: no symbols.
- Obfuscation: anti-stylometry.
- Multiple authors: commits from several people are mixed together.
- Transfer: different languages / different platforms.
Anti-stylometry (defense)
- Style anonymization: normalization.
- Adversarial perturbation.
- Code rewriters: syntactic transformations.
Ethics
- Whistleblowers: risk of exposing the author of anonymous code.
- Open source: revealing an author's identity.
- Research participants: consent.
- Governments: attributing code to dissidents.
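💻 Patterns
Feature extraction (CFG-based)
import angr  # binary analysis framework
import networkx as nx
import numpy as np

def extract_cfg_features(binary_path):
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    funcs = list(cfg.kb.functions.values())
    # Collapse call-graph cycles (recursion) so the longest path is well defined.
    call_dag = nx.condensation(nx.DiGraph(cfg.kb.functions.callgraph))
    return {
        'n_functions': len(funcs),
        'avg_basic_blocks': np.mean([len(f.block_addrs_set) for f in funcs]),
        'avg_function_size': np.mean([f.size for f in funcs]),
        # McCabe complexity per function: edges - nodes + 2 on its block graph.
        'cyclomatic_complexity': sum(
            f.graph.number_of_edges() - f.graph.number_of_nodes() + 2 for f in funcs),
        # Call depth is derived from the call graph rather than read from the CFG object.
        'call_depth_max': nx.dag_longest_path_length(call_dag),
    }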
Instruction n-gram
import collections

import capstone

def instruction_ngrams(binary, n=3):
    # `binary` is raw x86-64 code bytes (e.g. a .text section); 0x1000 is a placeholder load address.
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    instructions = []
    for insn in md.disasm(binary, 0x1000):
        instructions.append(insn.mnemonic)  # mov, push, call, ...
    ngrams = collections.Counter()
    for i in range(len(instructions) - n + 1):
        ngrams[tuple(instructions[i:i+n])] += 1
    return ngrams
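To show how the n-gram counts above might be compared between two binaries, here is a minimal sketch using cosine similarity over the count vectors. It works on plain mnemonic lists so that it runs without capstone; the two sequences and the helper names are made up for illustration.

```python
import collections
import math

def ngram_counts(mnemonics, n=3):
    """Count sliding-window n-grams over a list of instruction mnemonics."""
    return collections.Counter(
        tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts or Counters."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile carries no evidence either way
    return dot / (norm_a * norm_b)

# Hypothetical mnemonic sequences from two functions.
seq_a = ["push", "mov", "sub", "call", "add", "pop", "ret"]
seq_b = ["push", "mov", "sub", "call", "mov", "pop", "ret"]

sim = cosine_similarity(ngram_counts(seq_a), ngram_counts(seq_b))
```

In a real pipeline the same counts would typically be turned into fixed-width feature vectors (for example by keeping the most frequent n-grams) before being passed to a classifier.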
Binary embedding (SAFE-style)
import torch
import torch.nn as nn

class BinaryEmbedding(nn.Module):
    """Instruction sequence → vector."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, instruction_ids):
        x = self.emb(instruction_ids)
        _, (h, _) = self.lstm(x)
        # h[0] / h[1] are the final forward / backward hidden states.
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.proj(h)

# Train with a contrastive objective (same author close, different authors far apart).
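The contrastive objective mentioned in the closing comment can be illustrated with a triplet margin loss on cosine distance. This is one common choice, not necessarily what SAFE or the other cited models use, and it is written in plain Python on toy vectors so it runs without PyTorch; the function names and the margin are assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the same-author sample is closer than the other-author sample by `margin`."""
    return max(0.0,
               cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative)
               + margin)

# Toy embeddings: anchor and positive share an author, negative does not.
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_separated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

During training the loss would be averaged over many (anchor, positive, negative) triplets and back-propagated through the embedding model.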
Author classifier (RF on features)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Collect binaries from N authors.
# `dataset` ({author: [binary, ...]}) and `extract_features` are placeholders to be supplied.
X, y = [], []
for author, binaries in dataset.items():
    for b in binaries:
        features = extract_features(b)
        X.append(features)
        y.append(author)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
AI vs Human code detection
# Modern variant: a GPTZero-style perplexity check applied to code
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def detect_ai_code(code, perplexity_threshold=30):
    # Loading per call keeps the sketch short; cache the model in real use.
    tok = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    # GPT-2 accepts at most 1024 tokens, so longer inputs are truncated.
    inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    perplexity = torch.exp(outputs.loss).item()
    return perplexity < perplexity_threshold  # low perplexity is read as "AI-generated"
→ Unreliable (many false positives).
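The false-positive problem can be measured before trusting any threshold. The sketch below assumes perplexity scores have already been computed for code with known labels; the scores are made up for illustration and `rates_at_threshold` is a hypothetical helper.

```python
def rates_at_threshold(human_ppl, ai_ppl, threshold):
    """Return (false_positive_rate, true_positive_rate) for the rule: perplexity < threshold means AI."""
    false_positives = sum(p < threshold for p in human_ppl)
    true_positives = sum(p < threshold for p in ai_ppl)
    return false_positives / len(human_ppl), true_positives / len(ai_ppl)

# Made-up perplexity scores, for illustration only.
human_scores = [12.0, 25.0, 41.0, 58.0, 19.0]
ai_scores = [8.0, 14.0, 22.0, 35.0]

fpr, tpr = rates_at_threshold(human_scores, ai_scores, threshold=30)
```

With these toy numbers, three of five human samples fall below the threshold, which is the kind of overlap that makes a single perplexity cutoff hard to rely on: idiomatic, boilerplate-heavy human code is also highly predictable.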
Style normalization (anti-stylometry)
import ast
import re

def normalize_code_style(source):
    """Reduce stylometric leakage."""
    # Standardize indentation (tabs → spaces).
    source = re.sub(r'\t', '    ', source)
    # Variable renaming (left as a stub here).
    ast_tree = ast.parse(source)
    # Visit the tree and rename identifiers deterministically.
    # Normalize brace style, etc.
    return source
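The renaming step left as a stub above could look roughly like the following. This is a sketch under stated assumptions, not a complete anonymizer: it renames only function parameters and assigned names, relies on `ast.unparse` (Python 3.9+), and `normalize_identifiers` is a hypothetical name.

```python
import ast

def normalize_identifiers(source):
    """Rename parameters and assigned variables to v0, v1, ... in order of first binding."""
    tree = ast.parse(source)
    # Collect every binding site: function parameters and names in Store context.
    bound = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.arg)
        or (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store))
    ]
    bound.sort(key=lambda node: (node.lineno, node.col_offset))
    mapping = {}
    for node in bound:
        name = node.arg if isinstance(node, ast.arg) else node.id
        mapping.setdefault(name, f"v{len(mapping)}")
    # Apply the mapping to every occurrence, including reads.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    # Unparsing also discards comments and the original whitespace layout.
    return ast.unparse(tree)

src_a = "def total(values):\n\tacc = 0\n\tfor item in values:\n\t\tacc += item\n\treturn acc\n"
src_b = "def total(xs):\n  s = 0\n  for x in xs:\n      s += x\n  return s\n"
normalized_a = normalize_identifiers(src_a)
normalized_b = normalize_identifiers(src_b)
```

Known gaps: function and class names, attribute names, and keyword arguments at call sites are left untouched (so calls that pass renamed parameters by keyword would break), and a source that already uses names like `v0` could collide with the generated ones.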
Malware family clustering
import collections

from sklearn.cluster import DBSCAN

def cluster_malware(binaries):
    # `embed` is a placeholder for any binary → vector function (e.g. the embedding model above).
    embeddings = [embed(b) for b in binaries]
    clusterer = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
    labels = clusterer.fit_predict(embeddings)
    families = collections.defaultdict(list)
    for b, lbl in zip(binaries, labels):
        if lbl >= 0:  # DBSCAN labels noise points as -1
            families[lbl].append(b)
    return families
🤔 Decision criteria
| Application | Approach |
|---|---|
| Malware attribution | CFG features + RF |
| Plagiarism (binary) | SAFE / Asm2Vec embedding |
| AI code detection | Perplexity (unreliable) + watermark |
| Cross-compiler | Multi-binary aggregate |
| Stripped binary | Decompiler + style |
| Privacy protection | Style normalization |
Default: hand-crafted features + RF as the baseline; SAFE / transformer models for SOTA.
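The "multi-binary aggregate" row can be read as soft voting: classify each build of the same program separately, then average the per-author probabilities. The sketch below assumes each build already has a probability per author (for example from a classifier's `predict_proba`); the numbers, build labels, and function name are hypothetical.

```python
from collections import defaultdict

def aggregate_predictions(per_binary_probs):
    """Average per-binary {author: probability} dicts; return (best_author, mean_probs)."""
    if not per_binary_probs:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for probs in per_binary_probs:
        for author, p in probs.items():
            totals[author] += p
    n = len(per_binary_probs)
    mean = {author: total / n for author, total in totals.items()}
    return max(mean, key=mean.get), mean

# Hypothetical per-author probabilities for three builds of the same program.
builds = [
    {"alice": 0.60, "bob": 0.40},  # e.g. an unoptimized build
    {"alice": 0.45, "bob": 0.55},  # e.g. an optimized build, where the signal is weaker
    {"alice": 0.70, "bob": 0.30},  # e.g. a build from a different compiler
]
best_author, mean_probs = aggregate_predictions(builds)
```

Averaging lets a build where the style signal is weakened by optimization be outvoted by builds where it is clearer, at the cost of needing several binaries per sample.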
🔗 Graph
- Parent: Security · ESLint-Static-Analysis
- Variant: Code-Stylometry
- Adjacent: Authenticity
🤖 LLM usage
When to use: malware analysis, plagiarism checks, AI code detection (with caution), forensic investigation.
When not to use: a single binary (insufficient sample); exposing an anonymous whistleblower (ethics).
❌ Anti-patterns
- Relying on a single feature: a single signal is easy to spoof.
- High confidence on stripped binaries: less information is available.
- Production claims about closed-source tools: they cannot be verified.
- Trusting AI detection (GPTZero-on-code) 100%: false positives.
- Ignoring anti-stylometry: authors may actively resist attribution.
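🧪 Verification / duplicates
- Verified (Caliskan-Islam et al. 2015 USENIX Security for source-code stylometry, with the binary follow-up at NDSS 2018; SAFE 2019; Asm2Vec).
- Trust level B (active research).
- Related: AI-Generated-Code-Assurance · Authenticity · ESLint-Static-Analysis · Watermarking.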
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: features + ML approaches + angr / capstone / RF / SAFE code + AI detection |