2nd/10_Wiki/Topics/AI_and_ML/Synthetic-Data.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
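Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections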

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-synthetic-data
title: Synthetic Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Synthetic Data Generation, Synthetic Dataset, 합성 데이터, Artificial Data
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: synthetic-data, data-generation, llm, gan, diffusion, privacy
raw_sources: (empty)
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python, framework: pytorch

Synthetic Data

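In one line

"Synthetic data is a statistical surrogate for real data: it preserves privacy and unlocks scale." By 2026, synthetic data reportedly makes up a large share of LLM training data, by some estimates more than half (examples cited: Phi-4, Llama 4, Claude). It is the current end point of the GAN → diffusion → LLM-generated evolution. The validation gap is the core risk, and preventing model collapse is the first priority.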

Key points

Generation methods (2026)

  • LLM-augmented: Self-Instruct, Evol-Instruct, Magpie, persona-based generation. The dominant approach.
  • Diffusion (image/video): SDXL, FLUX, Sora-style models. The standard for images.
  • GAN: now niche, limited to tabular data (CTGAN) and faces (StyleGAN3), and being phased out.
  • Simulation: Unreal/Unity, NVIDIA Omniverse. Used for sim-to-real in robotics and autonomous vehicles (AV).
  • Rule/template: Faker-style generators and structured formats (JSON, SQL). A reliable baseline.
  • Distillation: a teacher LLM generates the student's dataset. The Phi-series approach.
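
The rule/template baseline from the list above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for Faker: the value pools, the field names, and the `make_order` / `make_orders` helpers are hypothetical and chosen only for the example.

```python
import random

# Hypothetical value pools; replace with domain-specific vocabularies.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dara"]
PRODUCTS = ["keyboard", "monitor", "webcam"]


def make_order(rng: random.Random, order_id: int) -> dict:
    """Build one synthetic order record from a fixed template and random draws."""
    qty = rng.randint(1, 5)
    unit_price = round(rng.uniform(10.0, 200.0), 2)
    return {
        "order_id": order_id,
        "customer": rng.choice(FIRST_NAMES),
        "product": rng.choice(PRODUCTS),
        "quantity": qty,
        "unit_price": unit_price,
        # Derived field: consistent with the other columns by construction.
        "total": round(qty * unit_price, 2),
    }


def make_orders(n: int, seed: int = 0) -> list:
    """Generate n records; the seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_order(rng, i) for i in range(1, n + 1)]
```

Because every field comes from an explicit rule, records of this kind are always schema-valid, which is why the approach works as a baseline; the trade-off is that the data only contains the correlations that were written into the template.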

Use cases

  • LLM training: instruction tuning, RLHF, code (Magicoder), math (MetaMathQA).
  • Privacy: medical record (Synthea), financial (DPSDA differential privacy).
  • Robotics: sim-to-real domain randomization, AV (Waymo Carcraft).
  • Edge cases: rare diseases, fraud, and other areas where real data is scarce.
  • Augmentation: minority class oversampling, MixUp.
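
As an illustration of the augmentation item, MixUp blends two training examples and their labels with a weight drawn from a Beta distribution. The sketch below uses plain Python lists and one-hot labels; the function name, the `alpha=0.2` default, and the list-based representation are assumptions made for the example, not part of any particular library.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples: lam ~ Beta(alpha, alpha) weights features and one-hot labels."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

A call such as `mixup([1.0, 2.0], [1, 0], [3.0, 6.0], [0, 1], rng=random.Random(0))` returns a feature vector on the segment between the two inputs and a soft label whose entries still sum to 1.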

Validation (critical)

  • Fidelity: marginal/joint distribution match (KS test, MMD, FID, KID).
  • Utility: TSTR (Train Synthetic Test Real) — downstream metric.
  • Privacy: membership inference, NN distance (DCR), k-anonymity check.
  • Diversity: coverage, mode collapse detection.
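
For the fidelity item, a per-column check can be as simple as the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDFs of the real and synthetic values. The following is a standard-library sketch (the `ks_statistic` name is an assumption); in practice `scipy.stats.ks_2samp` also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for v in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A value of 0 means the two marginals are identical as empirical distributions and 1 means they do not overlap at all. Matching marginals says nothing about joint structure, which is why the list above pairs KS with joint-distribution metrics such as MMD.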

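Model collapse

  • Definition: training on synthetic data that was itself generated by a model trained on synthetic data narrows the learned distribution, with the tails disappearing first.
  • Mitigation: anchor every generation with real data. Shumailov et al. 2024 report that keeping about 10% of the original real data in each generation substantially reduces degradation, although it does not remove it entirely.
  • Provenance: C2PA metadata and watermarking, so that synthetic content can be detected later.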
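
The narrowing effect and the role of the real-data anchor can be reproduced with a toy experiment: repeatedly fit a Gaussian to samples drawn from the previous fit. This is a deliberately simplified sketch, not the setup of any cited paper; the function name, sample size, and generation count are assumptions.

```python
import random
import statistics


def run_generations(n=20, generations=500, real_per_gen=0, seed=0):
    """Refit a Gaussian each generation on samples drawn from the previous fit.

    real_per_gen fresh draws from the true N(0, 1) replace part of each
    generation's training set (the real anchor). Returns the final fitted std.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - real_per_gen)]
        real = [rng.gauss(0.0, 1.0) for _ in range(real_per_gen)]
        data = synthetic + real
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    return sigma


collapsed = run_generations(real_per_gen=0)   # synthetic-on-synthetic only
anchored = run_generations(real_per_gen=10)   # half of each generation is real
print(f"final std, no anchor: {collapsed:.2e}; with anchor: {anchored:.2f}")
```

Without an anchor, small estimation errors compound and the fitted standard deviation drifts toward zero; with fresh real samples in every generation it stays on the order of the true value. The 50% anchor here is chosen to make the effect obvious in a tiny simulation and is not a recommended ratio.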

Applications

  1. LLM instruction: Self-Instruct + critic filter → 100k high-quality pairs.
  2. Tabular: CTGAN / TVAE, combined with a DP mechanism → privacy-protected synthetic medical records (the generators alone give no DP guarantee).
  3. AV sim: CARLA / NVIDIA DRIVE Sim → millions of kilometres of edge-case driving.
  4. Image augmentation: SDXL + ControlNet → balanced classification dataset.

💻 Patterns

1. LLM Self-Instruct (2026 Magpie style)

from anthropic import Anthropic
import json

client = Anthropic()

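def magpie_generate(seed_topics, n_per_topic=20):
    """Magpie-inspired: the model writes the user question itself, then answers it.

    The original Magpie recipe feeds only the pre-query chat template to an
    open-weight aligned model; a hosted chat API needs a topic-seeded prompt instead.
    """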
    pairs = []
    for topic in seed_topics:
        for _ in range(n_per_topic):
            # First call: model invents user prompt
            user_msg = client.messages.create(
                model="claude-opus-4-7",
                max_tokens=200,
                messages=[{"role": "user", "content": f"Topic: {topic}\n\nGenerate one user question about this topic:"}],
            ).content[0].text

            # Second call: model answers it
            answer = client.messages.create(
                model="claude-opus-4-7",
                max_tokens=800,
                messages=[{"role": "user", "content": user_msg}],
            ).content[0].text

            pairs.append({"prompt": user_msg, "completion": answer})
    return pairs

2. Evol-Instruct (depth/breadth evolution)

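EVOLVE_PROMPT = """Rewrite the following instruction to make it more complex
(add constraints, deeper reasoning, edge cases). Output only the new instruction.

Original: {seed}
Evolved:"""

def llm(prompt: str, max_tokens: int = 800) -> str:
    """Send one user turn via the `client` from pattern 1 and return the reply text."""
    reply = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

def evol(seed: str, rounds: int = 3) -> str:
    """Apply the complexity-raising rewrite `rounds` times (depth evolution only)."""
    cur = seed
    for _ in range(rounds):
        cur = llm(EVOLVE_PROMPT.format(seed=cur)).strip()
    return cur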
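
The heading mentions breadth evolution, while the loop above only deepens an instruction. A minimal sketch of mixing both directions follows, with a simple elimination rule for failed evolutions. The prompt wording is illustrative rather than the exact text of the Evol-Instruct paper, and `evolve`, `breadth_prob`, and the injected `llm` callable are assumptions made so the example runs offline with a stub.

```python
import random
from typing import Callable, List, Optional

# Illustrative prompt wording, not the exact prompts from the Evol-Instruct paper.
DEPTH_PROMPT = (
    "Rewrite the instruction to be more complex (add a constraint or require "
    "deeper reasoning). Output only the new instruction.\n\nOriginal: {seed}\nEvolved:"
)
BREADTH_PROMPT = (
    "Write a brand-new instruction in the same domain as the one below, on a "
    "rarer topic. Output only the new instruction.\n\nOriginal: {seed}\nNew:"
)


def evolve(seed: str, llm: Callable[[str], str], rounds: int = 3,
           breadth_prob: float = 0.3,
           rng: Optional[random.Random] = None) -> List[str]:
    """Return the chain [seed, evolved_1, ...], skipping failed evolutions."""
    rng = rng or random.Random()
    chain = [seed]
    for _ in range(rounds):
        template = BREADTH_PROMPT if rng.random() < breadth_prob else DEPTH_PROMPT
        candidate = llm(template.format(seed=chain[-1])).strip()
        # Elimination rule: drop empty or unchanged outputs.
        if candidate and candidate != chain[-1]:
            chain.append(candidate)
    return chain
```

Keeping the whole chain rather than only the last instruction gives a dataset with a spread of difficulty levels, and passing `llm` as a parameter lets the same function be tested with a stub before any API calls are spent.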

3. Critic filter (rejection sampling)

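# Literal braces are doubled so str.format() leaves the JSON example intact.
JUDGE = """Rate this instruction-response pair 1-5 on:
- correctness, helpfulness, no hallucination.
Output JSON {{"score": int, "reason": str}}.

Q: {q}
A: {a}"""

def filter_pairs(pairs, threshold=4):
    """Keep pairs the judge scores at or above `threshold` (rejection sampling)."""
    # Assumes `import json` and an llm(prompt) -> str helper wrapping the chat client.
    keep = []
    for p in pairs:
        raw = llm(JUDGE.format(q=p["prompt"], a=p["completion"]))
        try:
            score = int(json.loads(raw)["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # unparseable verdicts are treated as rejections
        if score >= threshold:
            keep.append(p)
    return keep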

4. CTGAN tabular synthesis

import pandas as pd
from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("medical.csv")
meta = SingleTableMetadata()
meta.detect_from_dataframe(real)  # infers column types; review them before fitting

# Note: plain CTGAN gives no differential-privacy guarantee by itself.
syn = CTGANSynthesizer(meta, epochs=300, batch_size=500)
syn.fit(real)
fake = syn.sample(num_rows=10000)

# Quality check: overall score in [0, 1] from column shapes and column-pair trends
report = evaluate_quality(real, fake, meta)
print(report.get_score())

5. Diffusion-based image synth (FLUX)

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

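# Illustrative condition list; replace with the classes the dataset needs.
conditions = ["pneumonia", "rib fracture"]
prompts = [f"medical X-ray of {cond}, clear, anonymized" for cond in conditions]
images = pipe(prompts, num_inference_steps=20, guidance_scale=3.5).images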

6. TSTR validation

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

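# Assumes syn_X, syn_y (synthetic) and real_X_test, real_y_test (held-out real) already exist.
# Train on synthetic
clf = GradientBoostingClassifier()
clf.fit(syn_X, syn_y)

# Test on real held-out data (binary labels, so column 1 is the positive class)
auc = roc_auc_score(real_y_test, clf.predict_proba(real_X_test)[:, 1])
# Compare with a TRTR (Train Real, Test Real) baseline: a small gap means high utility
print(f"TSTR AUC: {auc:.3f}")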

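7. Distance to Closest Record (privacy check)

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(real, synthetic):
    """Mean Distance to Closest Record; higher means synthetic rows sit farther from real ones.

    DCR is a proximity proxy, not a membership inference attack. Scale numeric
    features and encode categoricals first so no single column dominates the distance.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dists, _ = nn.kneighbors(synthetic)
    return float(np.mean(dists))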
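
A raw DCR value is hard to interpret on its own. One common way to read it, sketched here under the assumption that a real holdout set not used for training is available, is to compare the synthetic-to-train distances with holdout-to-train distances. The `dcr` and `dcr_report` helpers are hypothetical names, and the pure-Python nearest-neighbor search is only suitable for small tables.

```python
import math
import statistics


def dcr(reference, candidates):
    """For each candidate row, Euclidean distance to its closest reference row."""
    return [min(math.dist(c, r) for r in reference) for c in candidates]


def dcr_report(train, holdout, synthetic):
    """Compare synthetic-to-train DCR against a holdout-to-train baseline."""
    syn = dcr(train, synthetic)
    base = dcr(train, holdout)
    return {
        "median_dcr_synthetic": statistics.median(syn),
        "median_dcr_holdout": statistics.median(base),
        "exact_copies": sum(d == 0.0 for d in syn),  # rows identical to a training row
    }
```

If the synthetic median is clearly below the holdout median, or if `exact_copies` is non-zero, the generator is likely reproducing training rows rather than sampling from the underlying distribution.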

8. Real-data anchor (collapse prevention)

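import pandas as pd

def safe_mix(synthetic, real, real_ratio=0.1):
    """Mix real rows into a synthetic DataFrame so they form `real_ratio` of the result.

    Shumailov et al. 2024: keeping some real data in each generation reduces collapse.
    """
    n_real = int(len(synthetic) * real_ratio / (1 - real_ratio))
    real_sample = real.sample(n=min(n_real, len(real)))  # capped by the available real rows
    return pd.concat([synthetic, real_sample]).sample(frac=1).reset_index(drop=True)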
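
The `n_real` expression in `safe_mix` follows from defining the real share over the final mixed set rather than over the synthetic set. A short derivation, with the symbols $r$ (target real share), $n_{\text{syn}}$ and $n_{\text{real}}$ introduced here for illustration:

```latex
r = \frac{n_{\text{real}}}{n_{\text{real}} + n_{\text{syn}}}
\quad\Longrightarrow\quad
n_{\text{real}} = \frac{r}{1 - r}\, n_{\text{syn}}

% Worked example: n_syn = 9000, r = 0.1
n_{\text{real}} = \frac{0.1}{0.9} \cdot 9000 = 1000,
\qquad
\frac{1000}{1000 + 9000} = 0.1
```

Using $r \cdot n_{\text{syn}}$ instead would give 900 real rows and a real share of about 9.1%, slightly under the target. Since `int()` truncates, the realized share can also be marginally below $r$, and it drops further whenever the real table has fewer than $n_{\text{real}}$ rows.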

Decision criteria

| Situation | Approach |
| --- | --- |
| LLM instruction tuning | Magpie + Evol-Instruct + critic filter |
| Tabular privacy | CTGAN + DP-SGD + DCR check |
| Image augmentation | FLUX/SDXL + ControlNet |
| Robotics | Simulation (Omniverse) + domain randomization |
| Fast structured data | Faker / template |

Default: LLM-generated + critic filter + real anchor (≥5%).

🔗 Graph

🤖 LLM usage

When to use: instruction generation (Self-Instruct), critic judging, edge-case ideation. When not to use: privacy-sensitive numeric synthesis (LLMs hallucinate numbers; use CTGAN or a DP method instead).

Anti-patterns

  • Synthetic-only training with no real-data anchor: leads to model collapse as the distribution narrows.
  • Skipping validation: unsafe to deploy. Report at least three metrics, for example TSTR, FID, and DCR.
  • Claiming privacy without DP: purely synthetic does not mean private, since membership inference can still leak training records.
  • Single-method generation: risks mode collapse. Combine generators and run a diversity check.
  • Ignoring watermarks and provenance: makes later detection impossible. Attach C2PA metadata.

🧪 Verification / duplicates

  • Verified against: Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024; Magpie paper 2024; Microsoft Phi-4 Technical Report 2024; NIST SP 800-188.
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: synthetic data canonical page (LLM-generated + GAN + diffusion + collapse) |