Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00



Binary Author Identification

📌 One-line insight

A "digital fingerprint": detecting an author's style in a compiled binary. Control flow, register usage, and idioms are distinctive per author. This is critical for malware forensics. A modern variant is telling human-written code from AI-generated code.

📖 Core

Caliskan-Islam et al. (2015 / 2018)

Stylometric features survive in the binary.
96% accuracy across 100 authors.
This holds even after compilation and optimization.

Features

Source-level (recovered from the binary)

Indentation and brace style.
Variable naming conventions.
Keyword frequency.
Operator preferences.
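
The source-level features listed above can be turned into numbers with standard-library Python. The sketch below is illustrative only: the function name `source_style_features`, the keyword subset, and the three ratios are assumptions made for this example, not the feature set of any cited paper.

```python
import re
from collections import Counter

# Hypothetical keyword subset; a real feature set would cover the full language.
C_KEYWORDS = ("if", "else", "for", "while", "switch", "return", "goto")

def source_style_features(source: str) -> dict:
    """Compute a few layout and lexical style features from C-like source."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    counts = Counter(tokens)
    total = max(len(tokens), 1)

    # Brace style: opening brace on its own line vs. at the end of a line.
    own_line = sum(1 for ln in lines if ln.strip() == "{")
    end_line = sum(1 for ln in lines
                   if ln.rstrip().endswith("{") and ln.strip() != "{")

    # Naming convention: snake_case vs. camelCase among distinct identifiers.
    snake = sum(1 for t in set(tokens) if "_" in t.strip("_"))
    camel = sum(1 for t in set(tokens) if re.search(r"[a-z][A-Z]", t))

    features = {
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in lines) / max(len(lines), 1),
        "brace_own_line_ratio": own_line / max(own_line + end_line, 1),
        "snake_case_ratio": snake / max(snake + camel, 1),
    }
    # Keyword frequency relative to all identifier-like tokens.
    for kw in C_KEYWORDS:
        features[f"kw_{kw}"] = counts[kw] / total
    return features
```

Each ratio is normalized, so files of different lengths remain comparable.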

Binary-level

Control flow graph (CFG) structure.
Function call sequences.
Register usage patterns.
Instruction frequency (n-grams).
Calling conventions.
Padding and alignment.
The set of imported libraries.
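
Of the binary-level features above, register usage is one of the simplest to quantify: count how often each register appears in instruction operands and normalize. The sketch below assumes Intel-syntax operand strings such as those in the `op_str` attribute of Capstone instructions; the register subset and the name `register_histogram` are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed subset of x86-64 general-purpose registers; extend as needed.
REGISTERS = ("rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15")
_REG_RE = re.compile(r"\b(" + "|".join(REGISTERS) + r")\b")

def register_histogram(operand_strings):
    """Return the relative frequency of each register across operand strings."""
    counts = Counter()
    for ops in operand_strings:
        counts.update(_REG_RE.findall(ops))
    total = sum(counts.values())
    # All-zero histogram when no listed register occurs at all.
    return {reg: (counts[reg] / total if total else 0.0) for reg in REGISTERS}
```

Sub-registers such as `eax` or `r8d` are deliberately not matched here; whether to fold them into their 64-bit parents is a design choice this sketch leaves open.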

Decompilation-aided

Ghidra / IDA reverse the binary into approximate source.
Stylometry is then applied to the decompiled output.

ML approaches

Classical

Random Forest + hand-crafted features.
Caliskan 2015: binary attribution.

Deep learning

Binary embeddings: SAFE, Asm2Vec.
Graph neural networks on the CFG.
Transformers on instruction sequences.

Contrastive

PalmTree, jTrans: binary similarity.
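
A contrastive objective pulls embeddings of related binaries together and pushes unrelated ones apart. The sketch below shows one common form, a triplet margin loss on cosine distance, in plain Python. It is not the training loss of PalmTree, jTrans, or any other specific model; the margin of 0.2 and the function names are assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)
```

For authorship, the anchor and positive would be embeddings of two binaries by the same author, and the negative an embedding of a binary by someone else.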

Applications

Forensic

Malware authorship attribution.
APT group identification.
Ransomware family classification.

Open source

Plagiarism detection.
License violation tracking.
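
Plagiarism and license-violation checks reduce to scoring how similar two binaries are. A minimal sketch, assuming each binary has already been summarized as a `Counter` of instruction n-grams: cosine similarity between the two count vectors. The function name is hypothetical, and any decision threshold would have to be calibrated on known pairs.

```python
import math
from collections import Counter

def ngram_cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two n-gram count vectors; 0.0 if either is empty."""
    # Counter returns 0 for missing keys, so only shared n-grams contribute.
    dot = sum(count * b[ng] for ng, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```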

Security research

Vulnerability fingerprinting.
N-day exploit detection.

AI-generated code detection

Code generated by GitHub Copilot / GPT.
Detecting AI authorship (some papers report > 90%).
Statistical watermarking.

Challenges

Compiler / optimization variation: the same source yields different binaries.
Stripped binaries: no symbols.
Obfuscation: anti-stylometry.
Multi-author code: commits from several authors are mixed.
Transfer: different languages or platforms.
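
One way to soften the compiler and optimization variation listed above is to classify several builds of the same program and aggregate the results. The sketch below averages per-author probabilities across builds. This is one possible aggregation rule, assumed here for illustration; the function name and input shape are hypothetical.

```python
from collections import defaultdict

def aggregate_predictions(per_binary_probs):
    """Average author probabilities over several builds of the same program.

    per_binary_probs: list of dicts mapping author -> probability, one dict
    per build (for example gcc -O0, gcc -O2, clang -O2).
    """
    if not per_binary_probs:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for probs in per_binary_probs:
        for author, p in probs.items():
            totals[author] += p
    n = len(per_binary_probs)
    averaged = {author: s / n for author, s in totals.items()}
    # The author with the highest mean probability wins.
    best = max(averaged, key=averaged.get)
    return best, averaged
```

Averaging keeps a single outlier build from deciding the attribution on its own.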

Anti-stylometry (defense)

Style anonymization: normalization.
Adversarial perturbation.
Code rewriters: syntactic transforms.

Ethics

Whistleblowers: risk of exposing the author of anonymous code.
Open source: authors may be revealed.
Research participants: consent.
Governments: attribution of dissidents' code.

💻 Patterns

Feature extraction (CFG-based)

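import angr  # binary analysis framework
import networkx as nx
import numpy as np

def extract_cfg_features(binary_path):
    proj = angr.Project(binary_path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    funcs = list(cfg.kb.functions.values())

    # McCabe complexity per function from its block graph: E - N + 2.
    complexity = [f.graph.number_of_edges() - f.graph.number_of_nodes() + 2
                  for f in funcs]
    # The call graph can contain cycles (recursion), so collapse strongly
    # connected components before measuring the longest call chain.
    call_dag = nx.condensation(nx.DiGraph(cfg.kb.functions.callgraph))

    return {
        'n_functions': len(funcs),
        'avg_basic_blocks': np.mean([len(f.block_addrs_set) for f in funcs]),
        'avg_function_size': np.mean([f.size for f in funcs]),
        'cyclomatic_complexity': sum(complexity),
        'call_depth_max': nx.dag_longest_path_length(call_dag),
    }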

Instruction n-gram

import collections

import capstone

def instruction_ngrams(code, n=3, base_addr=0x1000):
    """Count mnemonic n-grams in raw x86-64 machine code (bytes)."""
    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    # Linear sweep; Capstone stops at the first undecodable byte.
    instructions = [insn.mnemonic for insn in md.disasm(code, base_addr)]  # mov, push, call, ...

    ngrams = collections.Counter()
    for i in range(len(instructions) - n + 1):
        ngrams[tuple(instructions[i:i+n])] += 1
    return ngrams

Binary embedding (SAFE-style)

import torch
import torch.nn as nn

class BinaryEmbedding(nn.Module):
    """Map an instruction-ID sequence to a fixed-size vector."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, instruction_ids):
        x = self.emb(instruction_ids)        # (batch, seq, dim)
        _, (h, _) = self.lstm(x)             # h: (2, batch, dim)
        h = torch.cat([h[0], h[1]], dim=-1)  # final forward + backward states
        return self.proj(h)                  # (batch, dim)

# Training: contrastive (same author close, different authors far apart).

Author classifier (RF on features)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Collect binaries from N authors.
# Assumed inputs: `dataset` maps author -> list of binaries, and
# `extract_features` returns a fixed-length numeric vector per binary.
X, y = [], []
for author, binaries in dataset.items():
    for b in binaries:
        X.append(extract_features(b))
        y.append(author)

# Stratify so every author appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
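
The classifier code calls `extract_features` without defining it. One way to build such a fixed-length vector, sketched here under assumptions, is to concatenate the scalar CFG features with instruction n-gram counts folded into a fixed number of buckets (feature hashing). The name `to_feature_vector` and the bucket count of 256 are hypothetical choices.

```python
import zlib
from collections import Counter

def to_feature_vector(cfg_features, ngram_counts, n_buckets=256):
    """Concatenate scalar CFG features with hashed, normalized n-gram counts."""
    # Sort keys so every binary yields the same column order.
    scalars = [float(cfg_features[k]) for k in sorted(cfg_features)]

    # Feature hashing: crc32 is stable across runs, unlike Python's hash().
    buckets = [0.0] * n_buckets
    total = sum(ngram_counts.values()) or 1
    for ngram, count in ngram_counts.items():
        idx = zlib.crc32(" ".join(ngram).encode()) % n_buckets
        buckets[idx] += count / total
    return scalars + buckets
```

Hashing trades a small collision risk for a vector length that does not depend on which n-grams happen to occur in the training set.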

AI vs Human code detection

# GPTZero-style perplexity check applied to code
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def detect_ai_code(code, perplexity_threshold=30):
    tok = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.eval()

    # GPT-2 accepts at most 1024 tokens, so longer inputs are truncated.
    inputs = tok(code, return_tensors='pt', truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])

    # Perplexity = exp(mean token cross-entropy).
    perplexity = torch.exp(outputs.loss).item()
    return perplexity < perplexity_threshold  # low perplexity is read as "AI"

→ Unreliable (many false positives).
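
The false-positive problem can be measured rather than assumed. A minimal sketch: given perplexity scores for samples known to be human-written, compute the share that a `perplexity < threshold` rule would wrongly flag as AI. The function name is hypothetical, and scores would come from whatever language model is used for detection.

```python
def false_positive_rate(human_perplexities, threshold=30):
    """Share of human-written samples that a `perplexity < threshold` rule flags as AI."""
    if not human_perplexities:
        raise ValueError("need at least one human-written sample")
    flagged = sum(1 for pp in human_perplexities if pp < threshold)
    return flagged / len(human_perplexities)
```

Short, idiomatic, or boilerplate-heavy human code tends to score low perplexity, so this rate is worth checking on a representative corpus before trusting any threshold.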

Style normalization (anti-stylometry)

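import ast
import re

def normalize_code_style(source):
    """Reduce stylometric leakage from Python source."""
    # Standardize indentation.
    source = re.sub(r'\t', '    ', source)
    # Parse and re-emit: ast.unparse (Python 3.9+) discards comments,
    # spacing, and quoting choices.
    ast_tree = ast.parse(source)
    # Variable renaming would be a deterministic visit-and-rename pass over
    # ast_tree; it is left out here because it must respect scopes.
    # For brace languages, brace style would be normalized as well.
    return ast.unparse(ast_tree)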

Malware family clustering

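import collections

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_malware(binaries):
    # `embed` is assumed to map a binary to a 1-D numeric vector,
    # e.g. the output of a trained embedding model.
    embeddings = np.vstack([embed(b) for b in binaries])
    clusterer = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
    labels = clusterer.fit_predict(embeddings)

    families = collections.defaultdict(list)
    for b, lbl in zip(binaries, labels):
        if lbl >= 0:  # DBSCAN labels noise points as -1
            families[lbl].append(b)
    return families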

🤔 Decision criteria

Use case	Approach
Malware attribution	CFG features + RF
Plagiarism (binary)	SAFE / Asm2Vec embedding
AI code detection	Perplexity (unreliable) + watermark
Cross-compiler	Multi-binary aggregate
Stripped binary	Decompiler + style
Privacy protection	Style normalization

Default: hand-crafted features + RF as the baseline. SAFE and transformer models are the state of the art.

🔗 Graph

Parent: Security · Forensics · Static-Analysis-Linting
Variants: Code-Stylometry · Malware-Attribution · AI-Code-Detection
Applications: CFG-Analysis · Binary-Embedding · Reverse-Engineering
Adjacent: Watermarking · AI-Generated-Code-Assurance · Authenticity · Anonymous-Communication

🤖 LLM usage

When: malware analysis, plagiarism checks, AI code detection (with caution), forensic investigation. When not: a single binary (insufficient sample), or exposing an anonymous whistleblower (ethics).

❌ Anti-patterns

Single-feature reliance: a single signal is easy to spoof.
High confidence on stripped binaries: less information is available.
Production claims about closed-source tools: they cannot be verified.
Fully trusting AI detection (GPTZero-on-code): false positives.
Ignoring anti-stylometry: authors may actively resist attribution.

🧪 Verification / duplicates

Verified (Caliskan-Islam 2015 USENIX, SAFE 2019, Asm2Vec).
Confidence: B (active research).
Related: AI-Generated-Code-Assurance · Authenticity · Static-Analysis-Linting · Watermarking.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: features + ML approaches + angr / capstone / RF / SAFE code + AI detection
