---
id: wiki-2026-0508-synthetic-data
title: Synthetic Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Synthetic Data Generation, Synthetic Dataset, 합성 데이터, Artificial Data]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [synthetic-data, data-generation, llm, gan, diffusion, privacy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Synthetic Data

## One-line summary
> **"Synthetic data is a statistical surrogate for real data: it preserves privacy and unlocks scale."** By 2026, more than half of LLM training data is reportedly synthetic (Phi-4, Llama 4, Claude). Generation has evolved from GANs to diffusion models to LLM-generated data. The key risk is the validation gap, and preventing model collapse is the first priority.

## Core concepts

### Generation methods (2026)
- **LLM-augmented**: Self-Instruct, Evol-Instruct, Magpie, persona-based generation. The dominant approach.
- **Diffusion (image/video)**: SDXL, FLUX, Sora-style models. The standard for images.
- **GAN**: now niche, mainly tabular (CTGAN) and faces (StyleGAN3); being retired elsewhere.
- **Simulation**: Unreal/Unity, NVIDIA Omniverse. Used for sim-to-real transfer in robotics and autonomous vehicles (AV).
- **Rule/template**: Faker-style generators and structured formats (JSON, SQL). A reliable baseline.
- **Distillation**: a teacher LLM generates the student's dataset. The Phi-series approach.

### Use cases
- **LLM training**: instruction tuning, RLHF, code (Magicoder), math (MetaMathQA).
- **Privacy**: medical records (Synthea), financial data (DPSDA, differential privacy).
- **Robotics**: sim-to-real domain randomization, AV (Waymo Carcraft).
- **Edge cases**: rare diseases, fraud, and other areas where real data is scarce.
- **Augmentation**: minority-class oversampling, MixUp.

### Validation (critical)
- **Fidelity**: marginal and joint distributions match the real data (KS test, MMD, FID, KID).
- **Utility**: TSTR (Train on Synthetic, Test on Real), measured with the downstream metric.
- **Privacy**: membership inference, nearest-neighbor distance (DCR), k-anonymity check.
- **Diversity**: coverage, mode collapse detection.
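
The fidelity check above can be sketched for the simplest case, per-column marginals. This is a minimal, dependency-free illustration of the two-sample KS statistic, not a full fidelity suite: the helper names `ks_statistic` and `marginal_fidelity` and the toy columns are hypothetical, and joint-distribution metrics (MMD, FID, KID) are out of scope here. In practice `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    syn_sorted = sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap is reached
    # at one of the observed values.
    for x in real_sorted + syn_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect_right(syn_sorted, x) / len(syn_sorted)
        gap = max(gap, abs(cdf_real - cdf_syn))
    return gap


def marginal_fidelity(real_cols, syn_cols):
    """Per-column KS statistic for dicts mapping column name -> list of numbers."""
    return {name: ks_statistic(real_cols[name], syn_cols[name]) for name in real_cols}


# Toy data (hypothetical): "age" is shifted in the synthetic set, "bmi" is identical.
real = {"age": [23, 35, 41, 58], "bmi": [19.5, 22.0, 27.3, 31.1]}
syn = {"age": [41, 58, 62, 70], "bmi": [19.5, 22.0, 27.3, 31.1]}
scores = marginal_fidelity(real, syn)
print(scores)  # {'age': 0.5, 'bmi': 0.0}
```

A statistic near 0 means the marginals are close, and values near 1 mean they barely overlap. Matching marginals alone does not imply that correlations between columns are preserved, which is why the joint metrics and TSTR are still needed.
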
### Model collapse
- **Definition**: training on synthetic data that was itself generated by earlier models progressively narrows the distribution; the tails are lost first.
- **Mitigation**: keep a real-data anchor in every training generation. Shumailov et al. (2024) report much milder degradation when 10% of the original data is preserved; treat the exact ratio as something to validate, not a guarantee.
- **Provenance**: C2PA metadata or watermarking, so that synthetic content can be detected later.

### Applications
1. **LLM instruction**: Self-Instruct + critic filter → 100k high-quality pairs.
2. **Tabular**: CTGAN / TVAE (with DP training) → privacy-protected medical records.
3. **AV sim**: CARLA / NVIDIA DRIVE Sim → millions of km of edge cases.
4. **Image augmentation**: SDXL + ControlNet → balanced classification dataset.

## 💻 Patterns

### 1. LLM Self-Instruct (2026 Magpie style)
```python
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-7"

def magpie_generate(seed_topics, n_per_topic=20):
    """Magpie-inspired generation: the model writes the instruction, then answers it.

    Original Magpie feeds an aligned open-weight model only the pre-query chat
    template; a hosted chat API cannot do that, so a topic prompt stands in.
    """
    pairs = []
    for topic in seed_topics:
        for _ in range(n_per_topic):
            # First call: the model invents a user question
            user_msg = client.messages.create(
                model=MODEL,
                max_tokens=200,
                messages=[{"role": "user", "content": f"Topic: {topic}\n\nGenerate one user question about this topic:"}],
            ).content[0].text
            # Second call: the model answers that question
            answer = client.messages.create(
                model=MODEL,
                max_tokens=800,
                messages=[{"role": "user", "content": user_msg}],
            ).content[0].text
            pairs.append({"prompt": user_msg, "completion": answer})
    return pairs
```

### 2. Evol-Instruct (depth/breadth evolution)
```python
EVOLVE_PROMPT = """Rewrite the following instruction to make it more complex
(add constraints, deeper reasoning, edge cases). Output only the new instruction.

Original: {seed}
Evolved:"""

def llm(prompt: str, max_tokens: int = 800) -> str:
    """Single-turn helper around the client and MODEL from pattern 1."""
    return client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

def evol(seed: str, rounds: int = 3) -> str:
    """Evolve an instruction by repeatedly asking the model to complicate it."""
    cur = seed
    for _ in range(rounds):
        cur = llm(EVOLVE_PROMPT.format(seed=cur))
    return cur
```

### 3. Critic filter (rejection sampling)
```python
# Literal braces are doubled because JUDGE is filled in with str.format
JUDGE = """Rate this instruction-response pair 1-5 on:
- correctness, helpfulness, no hallucination.
Output JSON {{"score": int, "reason": str}}.
Q: {q}
A: {a}"""

def filter_pairs(pairs, threshold=4):
    """Keep only the pairs the judge scores at or above the threshold."""
    keep = []
    for p in pairs:
        try:
            verdict = json.loads(llm(JUDGE.format(q=p["prompt"], a=p["completion"])))
        except json.JSONDecodeError:
            continue  # unparseable verdict: reject the pair
        if verdict.get("score", 0) >= threshold:
            keep.append(p)
    return keep
```

### 4. CTGAN tabular synthesis
```python
import pandas as pd
from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("medical.csv")

# Infer column types from the real table (SDV 1.x API)
meta = SingleTableMetadata()
meta.detect_from_dataframe(real)

syn = CTGANSynthesizer(meta, epochs=300, batch_size=500)
syn.fit(real)
fake = syn.sample(num_rows=10000)

# Quality check: overall score in [0, 1] from column shapes and pair trends
report = evaluate_quality(real, fake, meta)
print(report.get_score())
```

### 5. Diffusion-based image synth (FLUX)
```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

conditions = ["pneumonia", "healthy lungs"]  # hypothetical class labels
prompts = [f"medical X-ray of {cond}, clear, anonymized" for cond in conditions]
images = pipe(prompts, num_inference_steps=20, guidance_scale=3.5).images
```

### 6. TSTR validation
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Assumes syn_X / syn_y (synthetic) and real_X_test / real_y_test (real hold-out)
# are already prepared, with a binary target.

# Train on synthetic
clf = GradientBoostingClassifier()
clf.fit(syn_X, syn_y)

# Test on the real held-out set
auc = roc_auc_score(real_y_test, clf.predict_proba(real_X_test)[:, 1])
print(f"TSTR AUC: {auc:.3f}")  # close to the TRTR (train real, test real) baseline → high utility
```

### 7. Distance to Closest Record (privacy check)
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(real, synthetic):
    """Distance to Closest Record: higher mean distance = better privacy.

    Expects numeric, identically scaled feature arrays. The mean can hide a few
    near-copies, so also inspect the minimum or a low percentile of the distances.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dists, _ = nn.kneighbors(synthetic)
    return np.mean(dists)
```

### 8. Real-data anchor (collapse prevention)
```python
import pandas as pd

def safe_mix(synthetic, real, real_ratio=0.1):
    """Mix a real-data anchor into synthetic data (Shumailov et al. 2024: retaining real data mitigates collapse)."""
    # Number of real rows so that real rows make up real_ratio of the final mix
    n_real = int(len(synthetic) * real_ratio / (1 - real_ratio))
    real_sample = real.sample(n=min(n_real, len(real)))
    # Concatenate, shuffle, and rebuild the index
    return pd.concat([synthetic, real_sample]).sample(frac=1).reset_index(drop=True)
```

## Decision criteria

| Situation | Approach |
|---|---|
| LLM instruction tuning | Magpie + Evol-Instruct + critic filter |
| Tabular privacy | CTGAN + DP-SGD + DCR check |
| Image augmentation | FLUX/SDXL + ControlNet |
| Robotics | Simulation (Omniverse) + domain randomization |
| Fast structured data | Faker / templates |

**Default**: LLM-generated + critic filter + real anchor (≥5%).

## 🔗 Graph
- Applications: [[Privacy-Preserving-ML]]
- Adjacent: [[Differential-Privacy]] · [[Model-Collapse]] · [[Data-Augmentation]]

## 🤖 LLM usage
**When to use**: instruction generation (Self-Instruct), critic judging, edge-case ideation.
**When not to use**: privacy-sensitive numeric synthesis (LLMs hallucinate numbers; use CTGAN or a DP method instead).

## ❌ Anti-patterns
- **Synthetic-only training without a real-data anchor**: leads to model collapse as the distribution narrows.
- **Skipping validation**: unsafe to deploy. Check at least three metrics, one each for utility, fidelity, and privacy (e.g. TSTR, FID, DCR).
- **Privacy claims without DP**: purely synthetic does not mean private; membership inference can still leak training records.
- **Single-method generation**: risks mode collapse. Use an ensemble of generators and check diversity.
- **Ignoring watermarks / provenance**: makes later detection impossible. Attach C2PA metadata.

## 🧪 Verification / duplicates
- Verified (Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 2024; Magpie paper, 2024; Microsoft Phi-4 technical report, 2024; NIST SP 800-188).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: synthetic data canonical (LLM-generated + GAN + diffusion + collapse) |