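"Synthetic data is a statistical surrogate for real data: it preserves privacy and unlocks scale." In 2026, synthetic data reportedly makes up half or more of the training mix for some LLMs (Phi-4 is the best-documented case; Llama 4 and Claude are also commonly cited). It is the current end point of the GAN → diffusion → LLM-generated evolution. The central risk is the validation gap, and preventing model collapse is the first priority.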
Core
Generation methods (2026)
LLM-augmented: Self-Instruct, Evol-Instruct, Magpie, persona-based generation. The dominant approach.
Diffusion (image/video): SDXL, FLUX, Sora-style models. The standard for images.
GAN: now niche, mainly tabular data (CTGAN) and faces (StyleGAN3); being retired elsewhere.
Simulation: Unreal/Unity, NVIDIA Omniverse. Used for sim-to-real transfer in robotics and autonomous vehicles (AV).
Rule/template: Faker-style generators and structured formats (JSON, SQL). A reliable baseline.
Distillation: a teacher LLM produces the student's dataset. The Phi-series approach.
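As a minimal sketch of the rule/template baseline, the snippet below builds structured JSON records from fixed value pools using only the standard library. The field names, value pools, and the `make_order` / `generate_orders` helpers are hypothetical and chosen for illustration; a real project would more likely derive them from a schema or use a library such as Faker.

```python
import json
import random

# Hypothetical value pools; not taken from any real dataset.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
PRODUCTS = ["keyboard", "monitor", "webcam"]


def make_order(rng: random.Random, order_id: int) -> dict:
    """Build one synthetic order record from fixed templates."""
    quantity = rng.randint(1, 5)
    unit_price = round(rng.uniform(10.0, 200.0), 2)
    return {
        "order_id": order_id,
        "customer": rng.choice(FIRST_NAMES),
        "product": rng.choice(PRODUCTS),
        "quantity": quantity,
        "total": round(quantity * unit_price, 2),
    }


def generate_orders(n: int, seed: int = 0) -> list[dict]:
    """Generate n records; the seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_order(rng, i) for i in range(1, n + 1)]


if __name__ == "__main__":
    for row in generate_orders(3):
        print(json.dumps(row))
```

Seeding the generator keeps the dataset reproducible, which makes downstream validation runs comparable.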
Use cases
LLM training: instruction tuning, RLHF, code (Magicoder), math (MetaMathQA).
Privacy: medical records (Synthea), financial data (DPSDA, differentially private synthesis).
Robotics: sim-to-real domain randomization, AV (Waymo Carcraft).
Edge cases: rare diseases, fraud, and other areas where real data is scarce.
Augmentation: minority-class oversampling, MixUp.
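To make the MixUp entry concrete, here is a small standard-library sketch that blends two feature vectors and their one-hot labels with a weight drawn from Beta(alpha, alpha). The function name `mixup`, the default `alpha=0.2`, and the list-based representation are assumptions for illustration, not a reference implementation.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples; lam ~ Beta(alpha, alpha) weights the first one."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)
    x, y, lam = mixup([0.0, 0.0], [1, 0], [2.0, 4.0], [0, 1], rng=rng)
    print(f"lam={lam:.3f} x={x} y={y}")
```

With a small `alpha`, most draws of `lam` land near 0 or 1, so the mixed example stays close to one of its two parents.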
Validation (critical)
Fidelity: marginal and joint distribution match (KS test, MMD, FID, KID).
Utility: TSTR (Train on Synthetic, Test on Real), measured on the downstream metric.
Privacy: membership inference, nearest-neighbor distance (Distance to Closest Record, DCR), k-anonymity check.
Diversity: coverage, mode collapse detection.
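For the fidelity check on a single numeric column, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of the real and synthetic samples. The sketch below computes only the statistic (no p-value) with the standard library; `ks_statistic` is an illustrative name, and in practice `scipy.stats.ks_2samp` would normally be used instead.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both ECDFs are step functions that only change at observed values,
    # so it is enough to compare them at those values.
    for v in set(a) | set(b):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value near 0 means the two marginals are close; a value near 1 means they barely overlap. The statistic covers one column at a time, so joint structure still needs a metric such as MMD.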
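Model collapse
Definition: training on synthetic data produced by earlier models, generation after generation, narrows the learned distribution; the tails disappear first.
Mitigation: anchor on real data. Shumailov et al. (2024) report that keeping 10% of the original real data in each generation greatly reduces degradation, although it does not remove it entirely.
Provenance: C2PA metadata and watermarks, so that synthetic content can be detected in future training corpora.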
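The narrowing effect can be illustrated with a toy simulation: fit a Gaussian, sample from the fit, refit on those samples, and repeat. This is a deliberately simplified sketch, not a reproduction of any published experiment. The function name `simulate`, the sample size, the generation count, and the `real_fraction` parameter are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(generations=500, n=20, real_fraction=0.0, seed=0):
    """Refit a Gaussian each generation on samples from the previous fit.

    real_fraction is the share of each generation's training set drawn from
    the true N(0, 1) instead of from the previous model. Returns the fitted
    standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(n * real_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
    return sigma


if __name__ == "__main__":
    print("synthetic only :", simulate(real_fraction=0.0))
    print("20% real anchor:", simulate(real_fraction=0.2))
```

Without an anchor, the fitted standard deviation shrinks toward zero because each refit slightly underestimates the spread and the errors compound. Mixing in fresh real samples every generation keeps pulling the estimate back toward the true distribution.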
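1. Magpie-style generation
```python
import json  # used by the critic filter further down

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-7"  # model ID as given in this note; substitute any available model


def llm(prompt: str, max_tokens: int = 800) -> str:
    """Single-turn completion helper shared by the snippets in this note."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def magpie_generate(seed_topics, n_per_topic=20):
    """Magpie-style self-generation of instruction/response pairs.

    Original Magpie feeds an aligned model only the pre-query chat template so
    that it completes the user turn itself. A chat API does not expose that, so
    this approximation asks the model to invent a question per topic instead.
    """
    pairs = []
    for topic in seed_topics:
        for _ in range(n_per_topic):
            # First call: the model invents the user prompt
            user_msg = llm(
                f"Topic: {topic}\n\nGenerate one user question about this topic:",
                max_tokens=200,
            )
            # Second call: the model answers it
            answer = llm(user_msg)
            pairs.append({"prompt": user_msg, "completion": answer})
    return pairs
```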
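2. Evol-Instruct (depth/breadth evolution)
```python
EVOLVE_PROMPT = """Rewrite the following instruction to make it more complex
(add constraints, deeper reasoning, edge cases). Output only the new instruction.
Original: {seed}
Evolved:"""


def evol(seed: str, rounds: int = 3) -> str:
    """Apply the evolve prompt `rounds` times, feeding each output back in."""
    # llm(prompt) -> str is any single-turn completion helper,
    # for example a thin wrapper around client.messages.create.
    cur = seed
    for _ in range(rounds):
        cur = llm(EVOLVE_PROMPT.format(seed=cur))
    return cur
```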
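Evolved instructions can fail in simple ways: the model may return the seed unchanged, a truncated fragment, or text that echoes the evolve prompt itself. The sketch below shows one possible pre-filter to run before answering or judging. The function name `keep_evolved`, the `min_words` threshold, and the marker strings are hypothetical heuristics, not part of the Evol-Instruct method as published.

```python
def keep_evolved(seed: str, evolved: str, min_words: int = 4) -> bool:
    """Return True if an evolved instruction looks usable (heuristic only)."""
    text = evolved.strip()
    if not text or text.lower() == seed.strip().lower():
        return False  # nothing changed
    if len(text.split()) < min_words:
        return False  # likely truncated or degenerate
    # Scaffolding from the evolve prompt leaking into the output.
    leaked = ("original:", "evolved:", "rewrite the following")
    if any(marker in text.lower() for marker in leaked):
        return False
    return True


if __name__ == "__main__":
    seed = "Sort a list."
    print(keep_evolved(seed, "Sort a list of records by two keys, stably."))
    print(keep_evolved(seed, "Sort a list."))
```

Cheap string checks like these remove obvious failures before any further model calls are spent on them.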
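3. Critic filter (rejection sampling)
```python
import json

# Literal braces are doubled so that str.format leaves them alone.
JUDGE = """Rate this instruction-response pair 1-5 on:
- correctness, helpfulness, no hallucination.
Output JSON {{"score": int, "reason": str}}.
Q: {q}
A: {a}"""


def filter_pairs(pairs, threshold=4):
    """Keep only pairs the judge scores at or above `threshold`."""
    # llm(prompt) -> str is any single-turn completion helper.
    keep = []
    for p in pairs:
        raw = llm(JUDGE.format(q=p["prompt"], a=p["completion"]))
        try:
            verdict = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable verdict: reject the pair
        if verdict["score"] >= threshold:
            keep.append(p)
    return keep
```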
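Judge models often wrap their JSON in extra prose or a code fence, which makes a bare `json.loads` fail. The following sketch shows one tolerant way to extract and validate the verdict. The helper name `parse_verdict` and its "return `None` on anything unexpected" policy are assumptions for illustration.

```python
import json
import re


def parse_verdict(raw: str):
    """Extract {"score": int, "reason": str} from a judge reply, or None."""
    # Take everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        verdict = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    score = verdict.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 1 <= score <= 5:
        return None
    return {"score": score, "reason": str(verdict.get("reason", ""))}


if __name__ == "__main__":
    print(parse_verdict('Sure! {"score": 4, "reason": "accurate"} Hope that helps.'))
```

Treating a malformed verdict as a rejection keeps the filter conservative: a pair is kept only when the judge clearly scored it.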
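```python
# Diffusion-based image generation with FLUX via diffusers
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# `conditions` is not defined in the original snippet; these labels are placeholders.
conditions = ["pneumonia", "pleural effusion"]
prompts = [f"medical X-ray of {cond}, clear, anonymized" for cond in conditions]
images = pipe(prompts, num_inference_steps=20, guidance_scale=3.5).images
```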
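6. TSTR validation
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Assumed inputs: syn_X, syn_y (synthetic training set) and
# real_X_test, real_y_test (real held-out set) for a binary task.

# Train on synthetic
clf = GradientBoostingClassifier()
clf.fit(syn_X, syn_y)

# Test on real held-out data
auc = roc_auc_score(real_y_test, clf.predict_proba(real_X_test)[:, 1])
print(f"TSTR AUC: {auc:.3f}")  # close to the TRTR (train real, test real) baseline → high utility
```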
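7. Privacy check (DCR, a proxy for membership-inference risk)
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def dcr_score(real, synthetic):
    """Distance to Closest Record: mean distance from each synthetic row to its nearest real row.

    Higher suggests less copying of real records. DCR is a distance-based proxy,
    not a membership inference attack, and assumes comparably scaled features.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dists, _ = nn.kneighbors(synthetic)
    return float(np.mean(dists))
```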
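A raw DCR value is hard to interpret on its own, so a common way to read it is against a baseline: how close are real held-out rows to the training rows? If synthetic rows sit much closer to the training data than unseen real rows do, that hints at memorization. The sketch below uses only the standard library. The names `closest_distances` and `dcr_report` and the choice of the median as the summary are illustrative assumptions.

```python
import math
import statistics


def closest_distances(reference, queries):
    """For each query row, the Euclidean distance to its nearest reference row."""
    return [min(math.dist(q, r) for r in reference) for q in queries]


def dcr_report(train, holdout, synthetic):
    """Compare synthetic-to-train DCR with a real holdout-to-train baseline."""
    syn = closest_distances(train, synthetic)
    base = closest_distances(train, holdout)
    return {
        "synthetic_median_dcr": statistics.median(syn),
        "holdout_median_dcr": statistics.median(base),
        # Exact copies of training rows are the clearest red flag.
        "exact_copies": sum(d == 0.0 for d in syn),
    }


if __name__ == "__main__":
    train = [(0.0, 0.0), (10.0, 10.0)]
    holdout = [(1.0, 0.0), (10.0, 12.0)]
    synthetic = [(0.0, 0.0), (10.0, 11.0)]
    print(dcr_report(train, holdout, synthetic))
```

The brute-force search is quadratic in the number of rows, so it only suits small samples; a nearest-neighbor index is the usual choice at scale.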
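8. Real-data anchor (collapse prevention)
```python
import pandas as pd


def safe_mix(synthetic, real, real_ratio=0.1):
    """Mix real rows into synthetic so that real data is about real_ratio of the result.

    Shumailov et al. (2024): retaining real data mitigates collapse.
    """
    if not 0 <= real_ratio < 1:
        raise ValueError("real_ratio must be in [0, 1)")
    # Solve n_real / (n_syn + n_real) = real_ratio for n_real.
    n_real = int(len(synthetic) * real_ratio / (1 - real_ratio))
    # Capped by the real rows available, so the achieved ratio can be lower.
    real_sample = real.sample(n=min(n_real, len(real)))
    # Shuffle so that real and synthetic rows are interleaved.
    return pd.concat([synthetic, real_sample]).sample(frac=1)
```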
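The expression for `n_real` follows from requiring that real rows make up a fraction $r$ (`real_ratio`) of the final mixed set rather than of the synthetic set alone. A short derivation, with an illustrative row count that is not taken from any dataset:

```latex
\begin{aligned}
r &= \frac{n_{\text{real}}}{n_{\text{syn}} + n_{\text{real}}}
\;\Longrightarrow\;
n_{\text{real}} = n_{\text{syn}} \cdot \frac{r}{1 - r} \\[4pt]
\text{e.g. } n_{\text{syn}} = 9000,\; r = 0.1:\quad
n_{\text{real}} &= 9000 \cdot \frac{0.1}{0.9} = 1000,
\qquad \frac{1000}{9000 + 1000} = 0.1
\end{aligned}
```

If fewer real rows are available than the formula asks for, the achieved ratio falls below $r$, which is worth logging rather than ignoring.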
Decision criteria

| Situation | Approach |
|---|---|
| LLM instruction tuning | Magpie + Evol-Instruct + critic filter |
| Tabular privacy | CTGAN + DP-SGD + DCR check |
| Image augmentation | FLUX/SDXL + ControlNet |
| Robotics | Simulation (Omniverse) + domain randomization |
| Fast structured data | Faker / templates |
Default: LLM-generated data + critic filter + real-data anchor (≥5%).
When to use an LLM: instruction generation (Self-Instruct), critic judging, edge-case ideation.
When not to: privacy-sensitive numeric synthesis. LLMs hallucinate numbers, so use CTGAN or a DP method instead.
❌ Anti-patterns
Synthetic-only training with no real-data anchor: leads to model collapse as the distribution narrows.
Skipping validation: unsafe to deploy. Use at least three metrics, for example TSTR, FID, and DCR.
Privacy claims without DP: purely synthetic does not mean private; membership inference can still leak training records.
Single-method generation: risks mode collapse. Use an ensemble of generators and run diversity checks.
Ignoring watermarks / provenance: makes future detection impossible. Attach C2PA metadata.
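🧪 Verification / duplicates
Verified against: Shumailov et al., "AI models collapse when trained on recursively generated data," Nature 2024; the Magpie paper (2024); the Microsoft Phi-4 technical report (2024); NIST SP 800-188.
Confidence: A.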
🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: canonical synthetic data note (LLM-generated + GAN + diffusion + collapse) |