Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Synthetic Data

One-line summary

"Synthetic data is a statistical surrogate for real data: it preserves privacy and unlocks scale." By 2026, more than half of LLM training data is reportedly synthetic (Phi-4, Llama 4, Claude). Generation has evolved from GANs through diffusion models to LLM-generated data. The validation gap is the core risk, and preventing model collapse is the first priority.

Key points

Generation methods (2026)

LLM-augmented: Self-Instruct, Evol-Instruct, Magpie, persona-based generation. The dominant approach.
Diffusion (image/video): SDXL, FLUX, Sora-style models. The standard for images.
GAN: niche use only, e.g. tabular (CTGAN) and faces (StyleGAN3); being phased out.
Simulation: Unreal/Unity, NVIDIA Omniverse. Used for sim-to-real in robotics and autonomous vehicles (AV).
Rule/template: Faker-style generators, structured formats (JSON, SQL). A reliable baseline.
Distillation: teacher LLM → student dataset. The Phi-series approach.
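
As a minimal sketch of the rule/template baseline listed above, the following uses only the Python standard library to emit structured JSON records. The field names, value pools, and the `make_record` / `generate` helpers are hypothetical and chosen for illustration; a Faker-style library would typically supply richer value pools.

```python
import json
import random

# Hypothetical value pools; a real project would derive these from a domain schema.
DEPARTMENTS = ["cardiology", "oncology", "radiology"]
STATUSES = ["admitted", "discharged", "transferred"]

def make_record(rng, record_id):
    """Build one synthetic record from fixed templates and random draws."""
    return {
        "id": f"rec-{record_id:05d}",
        "department": rng.choice(DEPARTMENTS),
        "status": rng.choice(STATUSES),
        "age": rng.randint(18, 90),
        "length_of_stay_days": round(rng.uniform(0.5, 14.0), 1),
    }

def generate(n, seed=0):
    """Return n records; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_record(rng, i) for i in range(n)]

records = generate(3)
print(json.dumps(records[0], indent=2))
```

Because every value comes from an explicit rule, the output is always schema-valid, which is what makes this approach a dependable baseline before moving to learned generators.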

Use cases

LLM training: instruction tuning, RLHF, code (Magicoder), math (MetaMathQA).
Privacy: medical records (Synthea), financial data (DPSDA, differentially private synthesis).
Robotics: sim-to-real domain randomization, AV (Waymo Carcraft).
Edge cases: rare diseases, fraud, and other areas where real data is scarce.
Augmentation: minority-class oversampling, MixUp.
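
The MixUp augmentation mentioned above blends two training examples with a weight drawn from a Beta distribution. The sketch below is a plain-Python illustration under assumed inputs (list-valued features and one-hot labels); the `mixup` function name and the default `alpha=0.2` are illustrative choices, not values taken from this note.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples; features and one-hot labels share the same weight lam."""
    lam = rng.betavariate(alpha, alpha)  # lam ~ Beta(alpha, alpha), lies in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(x_mix, y_mix, lam)
```

Small `alpha` values push `lam` toward 0 or 1, so most mixed examples stay close to one of the two originals.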

Validation (critical)

Fidelity: marginal/joint distribution match (KS test, MMD, FID, KID).
Utility: TSTR (Train Synthetic Test Real) — downstream metric.
Privacy: membership inference, NN distance (DCR), k-anonymity check.
Diversity: coverage, mode collapse detection.
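
To illustrate the fidelity check listed above, here is a minimal standard-library sketch of the two-sample Kolmogorov-Smirnov statistic for a single numeric column. The `ks_statistic` helper and the Gaussian demo data are assumptions made for illustration; in practice `scipy.stats.ks_2samp` computes the same statistic together with a p-value.

```python
import bisect
import random

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    syn_sorted = sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the maximum gap occurs at a data point.
    for v in real_sorted + syn_sorted:
        cdf_real = bisect.bisect_right(real_sorted, v) / len(real_sorted)
        cdf_syn = bisect.bisect_right(syn_sorted, v) / len(syn_sorted)
        gap = max(gap, abs(cdf_real - cdf_syn))
    return gap

rng = random.Random(0)
real_col = [rng.gauss(0.0, 1.0) for _ in range(500)]
good_syn = [rng.gauss(0.0, 1.0) for _ in range(500)]     # same distribution
shifted_syn = [rng.gauss(1.5, 1.0) for _ in range(500)]  # mean shifted by 1.5

print("matching:", ks_statistic(real_col, good_syn))
print("shifted: ", ks_statistic(real_col, shifted_syn))
```

The statistic covers marginals only; joint structure needs a multivariate measure such as MMD or a downstream TSTR check.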

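Model collapse

Definition: repeated training on a model's own synthetic output narrows the learned distribution.
Mitigation: anchor every generation with real data. Shumailov et al. 2024 report that retaining 10% of the original data made degradation much milder; the ratio needed is task-dependent and should be validated.
Provenance: C2PA metadata / watermarking so that synthetic content can be detected in future corpora.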
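
The narrowing effect described above can be reproduced with a toy experiment: fit a Gaussian, sample from the fit, refit, and repeat. This is an illustrative sketch only; the `simulate` function, the sample size of 20, the 1000 generations, and the 20% anchor are assumptions chosen to make the effect visible, not settings from the cited paper.

```python
import random
import statistics

def simulate(generations, n, real_ratio, seed=0):
    """Refit a Gaussian each generation to samples drawn from the previous fit.

    real_ratio is the share of each generation's training pool that comes from
    the true N(0, 1) source instead of from the previous model.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(n * real_ratio)
    for _ in range(generations):
        pool = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        pool += [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        mu, sigma = statistics.fmean(pool), statistics.stdev(pool)
    return sigma

collapsed = simulate(generations=1000, n=20, real_ratio=0.0)
anchored = simulate(generations=1000, n=20, real_ratio=0.2)
print(f"synthetic-only std: {collapsed:.2e}, anchored std: {anchored:.2f}")
```

With no real data, estimation noise compounds across generations and the fitted standard deviation drifts toward zero; a fixed share of real samples keeps pulling the fit back toward the true spread.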

Applications

LLM instruction: Self-Instruct + critic filter → 100k high-quality pairs.
Tabular: CTGAN / TVAE → DP-protected medical records.
AV sim: CARLA / NVIDIA DRIVE Sim, millions of km of edge-case driving.
Image augmentation: SDXL + ControlNet → balanced classification dataset.

💻 Patterns

1. LLM Self-Instruct (2026 magpie style)

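import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-7"  # model ID as given in this note; substitute one you can access

def ask(prompt, max_tokens):
    """Send a single-turn user message and return the text of the reply."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def magpie_generate(seed_topics, n_per_topic=20):
    """Magpie-inspired two-step generation: the model writes the question, then answers it.

    Original Magpie feeds only the chat template's pre-query tokens to an
    open-weight model; a hosted chat API does not expose that, so a topic
    prompt stands in for the empty user turn here.
    """
    pairs = []
    for topic in seed_topics:
        for _ in range(n_per_topic):
            # First call: the model invents the user prompt
            user_msg = ask(
                f"Topic: {topic}\n\nGenerate one user question about this topic:", 200
            ).strip()
            # Second call: the model answers it
            answer = ask(user_msg, 800)
            pairs.append({"prompt": user_msg, "completion": answer})
    return pairs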

2. Evol-Instruct (depth/breadth evolution)

EVOLVE_PROMPT = """Rewrite the following instruction to make it more complex
(add constraints, deeper reasoning, edge cases). Output only the new instruction.

Original: {seed}
Evolved:"""

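def evol(seed: str, llm, rounds: int = 3) -> str:
    """Apply depth evolution `rounds` times and return the final instruction.

    `llm` is any callable that maps a prompt string to the model's text reply.
    Breadth evolution would use a second prompt that asks for a new instruction
    in the same domain instead of a harder version of the same one.
    """
    cur = seed
    for _ in range(rounds):
        cur = llm(EVOLVE_PROMPT.format(seed=cur)).strip()
    return cur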

3. Critic filter (rejection sampling)

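# Literal braces are doubled so that str.format only fills {q} and {a}
JUDGE = """Rate this instruction-response pair from 1 to 5 on
correctness, helpfulness, and absence of hallucination.
Output JSON {{"score": int, "reason": str}}.

Q: {q}
A: {a}"""

def filter_pairs(pairs, llm, threshold=4):
    """Keep pairs the judge scores at or above `threshold`.

    `llm` is any callable that maps a prompt string to the model's text reply.
    """
    keep = []
    for p in pairs:
        raw = llm(JUDGE.format(q=p["prompt"], a=p["completion"]))
        try:
            score = int(json.loads(raw)["score"])
        except (ValueError, KeyError, TypeError):
            continue  # unparseable verdict: reject rather than guess
        if score >= threshold:
            keep.append(p)
    return keep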
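
A critic filter scores pairs one at a time, so it cannot notice that many accepted pairs are near-identical. As a hedged complement, the sketch below adds two set-level checks: exact-duplicate removal on normalized prompts and a distinct-n ratio as a rough diversity signal. The `distinct_n` and `dedupe` helpers and the sample pairs are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams; values near 0 signal repetitive output."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def dedupe(pairs):
    """Drop pairs whose case- and whitespace-normalized prompt was already seen."""
    seen, kept = set(), []
    for p in pairs:
        key = " ".join(p["prompt"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

sample_pairs = [
    {"prompt": "Explain model collapse", "completion": "..."},
    {"prompt": "explain  model collapse", "completion": "..."},
    {"prompt": "What is TSTR validation", "completion": "..."},
]
unique_pairs = dedupe(sample_pairs)
print(len(unique_pairs), distinct_n([p["prompt"] for p in unique_pairs]))
```

Tracking the distinct-n ratio across generation batches gives an early warning when the generator starts to repeat itself.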

4. CTGAN tabular synthesis

import pandas as pd
from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("medical.csv")

# Infer column types; review the detected metadata before fitting
meta = SingleTableMetadata()
meta.detect_from_dataframe(real)

# CTGAN on its own gives no differential-privacy guarantee
syn = CTGANSynthesizer(meta, epochs=300, batch_size=500)
syn.fit(real)
fake = syn.sample(num_rows=10000)

# Quality check: overall score from column shapes and column pair trends
report = evaluate_quality(real, fake, meta)
print(report.get_score())

5. Diffusion-based image synth (FLUX)

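import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder condition list; replace with the classes you need to balance
conditions = ["pneumonia", "pleural effusion"]
prompts = [f"medical X-ray of {cond}, clear, anonymized" for cond in conditions]
# Returns one PIL image per prompt
images = pipe(prompts, num_inference_steps=20, guidance_scale=3.5).images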

6. TSTR validation

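from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Assumes syn_X, syn_y (synthetic features and binary labels) and
# real_X_test, real_y_test (a real held-out split the generator never saw)
# are already defined.

# Train on synthetic
clf = GradientBoostingClassifier()
clf.fit(syn_X, syn_y)

# Test on real held-out
auc = roc_auc_score(real_y_test, clf.predict_proba(real_X_test)[:, 1])
print(f"TSTR AUC: {auc:.3f}")  # close to the TRTR (train real, test real) baseline → high utility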

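7. Distance to Closest Record (privacy check)

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(real, synthetic):
    """Mean Distance to Closest Record; higher suggests less copying of real rows.

    Expects numeric arrays on a common scale (standardize first). DCR is a
    heuristic: it is not a membership inference attack or a formal guarantee.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dists, _ = nn.kneighbors(synthetic)
    return float(np.mean(dists))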

8. Real-data anchor (collapse prevention)

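import pandas as pd

def safe_mix(synthetic, real, real_ratio=0.1, seed=None):
    """Add real rows so they form `real_ratio` of the mix (anchor idea from Shumailov et al. 2024)."""
    n_real = int(len(synthetic) * real_ratio / (1 - real_ratio))
    # If `real` is too small, the achieved ratio falls below the target
    real_sample = real.sample(n=min(n_real, len(real)), random_state=seed)
    mixed = pd.concat([synthetic, real_sample], ignore_index=True)
    return mixed.sample(frac=1, random_state=seed).reset_index(drop=True)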

Decision criteria

Situation	Approach
LLM instruction tuning	Magpie + Evol-Instruct + critic filter
Tabular privacy	CTGAN + DP-SGD + DCR check
Image augmentation	FLUX/SDXL + ControlNet
Robotics	Simulation (Omniverse) + domain randomization
Fast structured data	Faker / template

Default: LLM-generated data + critic filter + real-data anchor (≥5%).

🔗 Graph

Parent: Data-Generation · Machine-Learning-Data
Variants: Self-Instruct · Evol-Instruct · CTGAN
Applications: LLM-Training · Privacy-Preserving-ML · Sim-to-Real
Adjacent: Differential-Privacy · Model-Collapse · Data-Augmentation

🤖 LLM usage

When to use: instruction generation (Self-Instruct), critic judging, edge-case ideation. When not to use: privacy-sensitive numeric synthesis (LLMs hallucinate numbers; use CTGAN or a DP method instead).

❌ Anti-patterns

Synthetic-only training with no real-data anchor: leads to model collapse as the distribution narrows.
Skipping validation: unsafe to deploy. Report at least three metrics, one each for utility, fidelity, and privacy (e.g. TSTR, FID, DCR).
Privacy claims without DP: synthetic does not mean private; membership inference can still leak training records.
Single-method generation: mode-collapse risk. Use an ensemble of generators and run diversity checks.
Ignoring watermarks / provenance: synthetic content becomes impossible to detect later. Attach C2PA metadata.
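
To make the "privacy claims without DP" point concrete, the following sketch releases a categorical column through the Laplace mechanism and then samples synthetic values from the noisy histogram. It is an illustration under stated assumptions: the helper names, the category list, and `epsilon=1.0` are hypothetical, the category list must be fixed in advance rather than read from the data, and floating-point Laplace sampling is not hardened for production use, where a vetted DP library is the safer choice.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) sampled as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram(values, categories, epsilon, rng):
    """Noisy counts; adding or removing one record changes one count by 1 (sensitivity 1)."""
    counts = Counter(values)
    noisy = {c: counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng) for c in categories}
    # Clipping at zero is post-processing, so it does not weaken the guarantee.
    return {c: max(v, 0.0) for c, v in noisy.items()}

def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic values in proportion to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    if sum(weights) <= 0:
        return [rng.choice(cats) for _ in range(n)]  # all counts clipped: fall back to uniform
    return rng.choices(cats, weights=weights, k=n)

rng = random.Random(0)
real_values = ["A"] * 80 + ["B"] * 20
noisy = dp_histogram(real_values, ["A", "B", "C"], epsilon=1.0, rng=rng)
synthetic_values = sample_synthetic(noisy, 50, rng)
print(noisy, Counter(synthetic_values))
```

Only the noisy histogram touches the real data; everything sampled from it afterwards inherits the same epsilon, which is the property a generator without DP does not provide.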

🧪 Verification / duplicates

Verified (Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024; Magpie paper 2024; Microsoft Phi-4 technical report 2024; NIST SP 800-188).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — synthetic data canonical (LLM-generated + GAN + diffusion + collapse)
