id: ai-synthetic-data
title: Synthetic Data: generating train / test / fixture data with LLMs
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, synthetic-data, vibe-coding
tech_stack: language: TS / Python; applicable_to: Backend
applied_in: (none listed)
aliases: synthetic data, LLM-generated data, test fixtures, data augmentation, anonymization

Synthetic Data

An LLM generates fake data for test fixtures, ML training, user-facing demos, and anonymization. This sidesteps the privacy, cost, and scale constraints of real data.

📖 Key Concepts

  • Generation: the LLM produces data that follows a schema.
  • Augmentation: variations of existing data.
  • Anonymization: remove PII while keeping the data realistic.
  • Distillation: a large model produces training data for a small model.

💻 Code Patterns

Generating fixtures with an LLM

import { z } from 'zod';
import OpenAI from 'openai';
import { zodResponseFormat } from 'openai/helpers/zod';

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

const User = z.object({
  email: z.string().email(),
  name: z.string(),
  bio: z.string().max(200),
  interests: z.array(z.string()).max(5),
  age: z.number().int().min(18).max(80),
});

async function generateUsers(count: number): Promise<z.infer<typeof User>[]> {
  const r = await openai.beta.chat.completions.parse({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Generate diverse, realistic test user profiles. Vary demographics, names, bios.' },
      { role: 'user', content: `Generate ${count} users.` },
    ],
    response_format: zodResponseFormat(z.object({ users: z.array(User) }), 'users'),
  });
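  // parsed is null when the model refuses or the output is cut off.
  const parsed = r.choices[0].message.parsed;
  if (!parsed) throw new Error('Model returned no parsed output');
  return parsed.users;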
}

const users = await generateUsers(50);

→ Free-text fields such as names and bios read more realistically than Faker.js output.

Diverse generation

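// Naive: one generic prompt tends to return near-duplicate records.
// Better: vary the prompt along several axes to force diversity.

const prompts = [
  'Generate users from different countries',
  'Generate users with different age groups',
  'Generate users with different income levels',
];

// generateWithPrompt is a placeholder for a variant of generateUsers
// that takes the system prompt as an argument.
const all: z.infer<typeof User>[] = [];
for (const prompt of prompts) {
  const batch = await generateWithPrompt(prompt, 20);
  all.push(...batch);
}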
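
A fixed list of prompts covers one axis at a time. The sketch below crosses several axes so every combination gets its own prompt. The axis names and values, `combinations`, and `buildPrompt` are all hypothetical and would be replaced with the dimensions that matter for your data.

```typescript
// Hypothetical axes; replace with the dimensions that matter for your domain.
const axes: Record<string, string[]> = {
  country: ['Korea', 'Brazil', 'Germany'],
  ageGroup: ['18-29', '30-49', '50-80'],
};

// Cartesian product of all axis values, one object per combination.
function combinations(input: Record<string, string[]>): Record<string, string>[] {
  let result: Record<string, string>[] = [{}];
  for (const [name, values] of Object.entries(input)) {
    result = result.flatMap(partial => values.map(v => ({ ...partial, [name]: v })));
  }
  return result;
}

// Turn one combination into a prompt that pins every axis to a value.
function buildPrompt(combo: Record<string, string>, perCombo: number): string {
  const constraints = Object.entries(combo).map(([k, v]) => `${k}=${v}`).join(', ');
  return `Generate ${perCombo} user profiles with ${constraints}.`;
}

const diversePrompts = combinations(axes).map(c => buildPrompt(c, 5));
```

With three countries and three age groups this yields nine prompts, so no single demographic can dominate the batch.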

Schema-driven (any entity)

const Order = z.object({
  id: z.string().uuid(),
  userId: z.string().uuid(),
  items: z.array(z.object({
    productId: z.string().uuid(),
    quantity: z.number().int().positive(),
    price: z.number().positive(),
  })).min(1).max(10),
  status: z.enum(['pending', 'paid', 'shipped', 'delivered', 'cancelled']),
  createdAt: z.string().datetime(),
});

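// generateFromSchema is a placeholder: a generic version of generateUsers
// that accepts any Zod schema.
const orders = await generateFromSchema(Order, 100);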
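
Whatever produces the records, it helps to validate them before use rather than trusting the model's JSON. The following is a dependency-free sketch of that step. `isOrderItem` is a hand-written stand-in for a schema library check, and `splitValid` and the sample response are hypothetical.

```typescript
type OrderItem = { productId: string; quantity: number; price: number };

// Hand-written type guard standing in for a schema library check.
function isOrderItem(x: unknown): x is OrderItem {
  if (typeof x !== 'object' || x === null) return false;
  const o = x as Record<string, unknown>;
  return (
    typeof o.productId === 'string' &&
    typeof o.quantity === 'number' && Number.isInteger(o.quantity) && o.quantity > 0 &&
    typeof o.price === 'number' && o.price > 0
  );
}

// Parse a raw model response and split records into accepted and rejected.
function splitValid<T>(raw: string, guard: (x: unknown) => x is T): { valid: T[]; rejected: unknown[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable output is rejected as a whole.
    return { valid: [], rejected: [raw] };
  }
  const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const valid: T[] = [];
  const rejected: unknown[] = [];
  for (const item of items) {
    if (guard(item)) valid.push(item);
    else rejected.push(item);
  }
  return { valid, rejected };
}

// Hand-written sample response: the second record has quantity 0 and is rejected.
const sample = '[{"productId":"p1","quantity":2,"price":9.5},{"productId":"p2","quantity":0,"price":3}]';
const { valid, rejected } = splitValid(sample, isOrderItem);
```

Tracking the rejected count per batch gives an early signal when a prompt or model change starts producing off-schema records.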

Faker.js (deterministic, fast)

import { faker } from '@faker-js/faker';

faker.seed(42);  // deterministic

const user = {
  id: faker.string.uuid(),
  name: faker.person.fullName(),
  email: faker.internet.email(),
  address: {
    street: faker.location.streetAddress(),
    city: faker.location.city(),
    zip: faker.location.zipCode(),
  },
};

→ Fast and consistent, but the patterns are obvious (less realistic than LLM output).

Hybrid (Faker + LLM)

// Faker = structure (id, email, address)
// LLM = creative (bio, review text)

const user = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
  bio: await llm.generate('Write a 100-character bio for a freelance designer'),
  reviews: await llm.generate('Write 3 realistic product reviews'),
};

Test database seed

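async function seed() {
  // Delete children before parents so foreign keys do not block the cleanup.
  await db.order.deleteMany();
  await db.user.deleteMany();

  // The User schema above has no id, so assign one before inserting.
  const users = (await generateUsers(100)).map(u => ({ id: faker.string.uuid(), ...u }));
  await db.user.createMany({ data: users });

  const orders = await generateOrders(500, users.map(u => u.id));
  await db.order.createMany({ data: orders });

  console.log(`Seeded ${users.length} users, ${orders.length} orders`);
}

Run it with:

yarn seed

→ The test environment becomes production-like.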

Anonymization (real → synthetic)

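// Real user data → similar but anonymized.
// Caution: this sends the real record to the LLM provider, so it is only
// appropriate when that provider is cleared to receive the data.
type UserRecord = z.infer<typeof User>;

async function anonymizeWithLlm(user: UserRecord): Promise<UserRecord> {
  const r = await llm.complete({
    system: 'Generate a realistic user profile similar to this one but with all PII changed. Respond in JSON.',
    user: `Original: ${JSON.stringify(user)}`,
    response_format: { type: 'json_object' },
  });
  // Validate instead of trusting the model's JSON.
  return User.parse(JSON.parse(r));
}

// Or simpler: overwrite the PII fields with Faker
function anonymizeWithFaker<T extends object>(user: T) {
  return {
    ...user,
    name: faker.person.fullName(),
    email: faker.internet.email(),
    phone: faker.phone.number(),
    // Non-PII fields are kept (purchase history, preferences)
  };
}

→ Test on prod-like data without exposing real users.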
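
Random replacement breaks joins: the same real email becomes a different fake value in every table. A minimal sketch of consistent pseudonymization follows, together with a leak check. `fnv1a`, `SALT`, `pseudonymEmail`, and `leakedFields` are hypothetical names, and the hash is a simple non-cryptographic one chosen to keep the example dependency-free. A keyed cryptographic hash would be the safer choice for real data.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small and stable across runs.
// Not cryptographic; use a keyed hash (HMAC) against a motivated attacker.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hypothetical salt; keep it secret and out of the test dataset.
const SALT = 'rotate-me';

// The same real email always maps to the same fake email, so joins still work.
function pseudonymEmail(realEmail: string): string {
  const token = fnv1a(SALT + realEmail.toLowerCase()).toString(16).padStart(8, '0');
  return `user_${token}@example.com`;
}

// Returns the PII keys whose value survived anonymization unchanged.
function leakedFields(
  original: Record<string, string>,
  anonymized: Record<string, string>,
  piiKeys: string[],
): string[] {
  return piiKeys.filter(k => original[k] !== undefined && anonymized[k] === original[k]);
}
```

Running `leakedFields` over every anonymized record in CI is a cheap guard against a field being added to the schema and forgotten in the anonymizer.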

ML training data augmentation

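// Few-shot examples → more generated examples
async function augmentDataset(examples: Example[], targetSize: number) {
  const augmented: Example[] = [...examples];
  const maxRounds = 100;  // guard against looping forever if the model returns nothing usable

  for (let round = 0; round < maxRounds && augmented.length < targetSize; round++) {
    // Rotate through the seed examples so each call sees a different few-shot window.
    const start = (round * 5) % examples.length;
    const shots = examples.slice(start, start + 5);
    const batch = await llm.generate({
      system: 'Generate similar examples to these, with variations. Return JSON: {"examples": [...]}.',
      user: shots.map(e => JSON.stringify(e)).join('\n'),
      response_format: { type: 'json_object' },
    });
    augmented.push(...JSON.parse(batch).examples);
  }

  return augmented.slice(0, targetSize);
}

→ For example, 100 seed examples → 1,000.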
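
Augmentation loops tend to produce near-duplicates, which inflate the dataset without adding signal. A small sketch of a deduplication pass follows. `normalize` and `dedupe` are hypothetical helpers, and the normalization only handles case, whitespace, and common ASCII punctuation.

```typescript
// Normalize text so trivial differences (case, punctuation, spacing) do not hide duplicates.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[\s.,!?;:'"()-]+/g, ' ').trim();
}

// Keep the first occurrence of each normalized text; input order is preserved.
function dedupe<T>(items: T[], textOf: (item: T) => string): T[] {
  const seen = new Set<string>();
  const kept: T[] = [];
  for (const item of items) {
    const key = normalize(textOf(item));
    if (key === '' || seen.has(key)) continue; // skip empties and repeats
    seen.add(key);
    kept.push(item);
  }
  return kept;
}

const reviews = [
  { text: 'Great product!!' },
  { text: 'great  product.' },
  { text: 'Too slow' },
];
const uniqueReviews = dedupe(reviews, r => r.text);
```

This catches only surface-level repeats. Paraphrases need an embedding or fuzzy-match comparison, which is outside this sketch.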

Distillation (big → small model)

// 1. The big model (GPT-4o) generates answers
// 2. Each (input, output) pair becomes a training example
// 3. Fine-tune the small model (Llama 8B) on those pairs

async function generateTrainingData(inputs: string[]) {
  const data: { input: string; output: string | null }[] = [];
  for (const input of inputs) {
    const output = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: input }],
    });
    data.push({ input, output: output.choices[0].message.content });
  }
  return data;
}

// Then fine-tune the small model on the collected pairs.

→ Lower inference cost, with quality that can approach the big model on the targeted task.
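
Before fine-tuning, the collected pairs usually need to be written in whatever format the trainer expects. The sketch below assumes a chat-style JSONL layout (one `{"messages": [...]}` object per line), which OpenAI's fine-tuning API uses. Check your own trainer's documentation, since `toChatJsonl` and the `Pair` type are illustrative.

```typescript
type Pair = { input: string; output: string | null };

// Convert (input, output) pairs to chat-format JSONL, one training example per line.
// Pairs with an empty or missing output are dropped rather than trained on.
function toChatJsonl(pairs: Pair[]): string {
  return pairs
    .filter((p): p is { input: string; output: string } => !!p.output && p.output.trim() !== '')
    .map(p => JSON.stringify({
      messages: [
        { role: 'user', content: p.input },
        { role: 'assistant', content: p.output },
      ],
    }))
    .join('\n');
}

const jsonl = toChatJsonl([
  { input: 'What is 2 + 2?', output: '4' },
  { input: 'Unanswered question', output: null },
]);
```

Dropping empty outputs matters because a refusal or truncated answer from the big model would otherwise be taught to the small one.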

Edge case generation

async function generateEdgeCases(schema: unknown, count: number) {
  return await llm.generate({
    system: `Generate ${count} edge case test inputs based on this schema, as JSON.
Include: empty, very long, special chars, boundary values, unicode, malformed.`,
    user: JSON.stringify(schema),
    response_format: { type: 'json_object' },
  });
}
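
Boundary values do not need an LLM: they follow mechanically from the constraints, and generating them in code keeps them deterministic. A minimal sketch, using the `age` (18 to 80) and `bio` (max 200 characters) limits from the User schema above; `intBoundaries` and `stringBoundaries` are hypothetical helpers.

```typescript
// Boundary values for an integer field constrained to [min, max]:
// the limits and their inner neighbors are valid, the first value outside each limit is not.
function intBoundaries(min: number, max: number): { valid: number[]; invalid: number[] } {
  return { valid: [min, min + 1, max - 1, max], invalid: [min - 1, max + 1] };
}

// Boundary strings for a field that only has a maximum length.
function stringBoundaries(maxLength: number): { valid: string[]; invalid: string[] } {
  return {
    valid: ['', 'a'.repeat(maxLength)],
    invalid: ['a'.repeat(maxLength + 1)],
  };
}

const ageCases = intBoundaries(18, 80);
const bioCases = stringBoundaries(200);
```

The LLM is then left with the cases that are hard to enumerate by rule, such as odd unicode or plausible-but-malformed input.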

Adversarial (security test)

async function generateAdversarial(target: string, count: number) {
  return await llm.generate({
    system: `Generate ${count} adversarial inputs for security testing.
Include: SQL injection attempts, XSS, command injection, long strings, unicode tricks.`,
    user: `Target: ${target}`,
  });
}

→ Input for penetration testing of systems you are authorized to test.

Validation (does the synthetic data resemble the real data?)

// Statistical check: compare one numeric column at a time
const realAges = realData.map(r => r.age);
const synthAges = syntheticData.map(r => r.age);

// Distribution similarity (two-sample KS statistic, etc.); ksDistance is a placeholder helper
expect(ksDistance(realAges, synthAges)).toBeLessThan(0.1);
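
For reference, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical CDFs. The sketch below is one straightforward way to compute it for a single numeric column. `ksStatistic` is a hypothetical name, and it returns only the statistic, not a p-value.

```typescript
// Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
// empirical CDFs of the two samples. 0 means identical ECDFs, 1 means no overlap.
function ksStatistic(a: number[], b: number[]): number {
  if (a.length === 0 || b.length === 0) throw new Error('both samples must be non-empty');
  const x = [...a].sort((p, q) => p - q);
  const y = [...b].sort((p, q) => p - q);
  let i = 0;
  let j = 0;
  let maxGap = 0;
  while (i < x.length && j < y.length) {
    const v = Math.min(x[i], y[j]);
    // Advance past every value equal to v in both samples so ties are handled together.
    while (i < x.length && x[i] === v) i++;
    while (j < y.length && y[j] === v) j++;
    maxGap = Math.max(maxGap, Math.abs(i / x.length - j / y.length));
  }
  return maxGap;
}

const identical = ksStatistic([1, 2, 3], [1, 2, 3]);       // 0
const disjoint = ksStatistic([1, 2, 3], [4, 5, 6]);        // 1
const shifted = ksStatistic([1, 2, 3, 4], [3, 4, 5, 6]);   // 0.5
```

A per-column check like this says nothing about correlations between columns, so it complements rather than replaces a joint comparison.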

Privacy guarantee

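GDPR / HIPAA:
- Synthetic data is meant to be untraceable to any individual, but a generator can still reproduce real records, so verify it
- Differential privacy provides a stronger, formal guarantee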

Tools:
- gretel.ai
- Mostly AI
- YData

Use cases

✅ Test fixtures (unit / integration / e2e)
✅ Demo / sandbox
✅ Load test data
✅ ML training augmentation
✅ Privacy-preserving sharing
✅ Edge case generation
✅ Adversarial testing

❌ Replacing production data (the real distribution differs)
❌ Statistical analysis (the generator's bias carries over)

LLM-as-judge (validating synthetic data)

async function evaluateSynthetic(real: any[], synthetic: any[]) {
  return await llm.complete({
    user: `Compare these two datasets:
Real: ${JSON.stringify(real.slice(0, 10))}
Synthetic: ${JSON.stringify(synthetic.slice(0, 10))}

Are they similar in style, distribution, realism? Score 1-10. Output JSON.`,
    response_format: { type: 'json_object' },
  });
}

Cost

Assuming a blended price of $5 per 1M tokens:

1,000 records × 100 tokens × $5/1M = $0.50

→ Cheap.

ML training data:
10K records × 500 tokens × $5/1M = $25

→ Still cheap compared with human labeling.
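
The same arithmetic as a small helper, so the estimate can be rerun when the record count or price changes. `estimateCostUsd` is a hypothetical name, and the $5 per 1M tokens figure is the assumption used above, not a quoted price.

```typescript
// Estimated spend in USD. All three inputs are assumptions to be replaced
// with your own record count, average tokens per record, and current price.
function estimateCostUsd(records: number, tokensPerRecord: number, usdPerMillionTokens: number): number {
  return (records * tokensPerRecord * usdPerMillionTokens) / 1_000_000;
}

const fixtureCost = estimateCostUsd(1_000, 100, 5);   // 0.5
const trainingCost = estimateCostUsd(10_000, 500, 5); // 25
```

The estimate counts only generated tokens per record; prompt tokens, retries, and rejected records add to the real bill.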

Reproducibility

// Seed
const seed = 42;
faker.seed(seed);

// LLM = non-deterministic. Use temperature 0 + cache.
const r = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  temperature: 0,
  seed: 42,  // only some models support this, and determinism is best-effort
  messages: [...],
});
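
Because even `temperature: 0` does not guarantee identical output, caching is what actually makes a fixture reproducible. A minimal in-memory sketch follows. It is synchronous for brevity, whereas a real model call would be async and the cache would normally be persisted to disk. `LlmRequest`, `cacheKey`, `withCache`, and `fakeModel` are all hypothetical.

```typescript
type LlmRequest = { model: string; temperature: number; prompt: string };

// Stable cache key: serialize the request fields in a fixed order.
function cacheKey(req: LlmRequest): string {
  return JSON.stringify([req.model, req.temperature, req.prompt]);
}

// Wraps any generator function so repeated identical requests reuse the first result.
function withCache(generate: (req: LlmRequest) => string): (req: LlmRequest) => string {
  const cache = new Map<string, string>();
  return req => {
    const key = cacheKey(req);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const fresh = generate(req);
    cache.set(key, fresh);
    return fresh;
  };
}

// Stand-in for a real model call: returns a different string on every invocation.
let calls = 0;
const fakeModel = (req: LlmRequest) => `${req.prompt} #${++calls}`;
const cachedModel = withCache(fakeModel);
```

Committing the persisted cache alongside the tests means CI never calls the model at all, which also removes the cost and the flakiness.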

Volume

// 10K records, generated in batches
const BATCH = 50;
const total = 10000;

const all: any[] = [];
for (let i = 0; i < total; i += BATCH) {
  // The last batch may be smaller when total is not a multiple of BATCH.
  const batch = await generate(Math.min(BATCH, total - i));
  all.push(...batch);
  console.log(`${all.length}/${total}`);
}

→ Watch rate limits and cost.
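
When a batch hits a rate limit, the usual response is to retry with exponential backoff. The sketch below computes only the delay schedule, leaving the actual sleep and retry loop to the caller. `backoffMs`, `jitteredBackoffMs`, and the 500 ms base and 30 s cap are assumptions to tune against your provider's limits.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// "Full jitter": pick a random delay in [0, backoff) so parallel workers do not retry in lockstep.
// `random` is injectable so the schedule can be tested deterministically.
function jitteredBackoffMs(attempt: number, random: () => number = Math.random): number {
  return Math.floor(random() * backoffMs(attempt));
}

const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map(a => backoffMs(a));
```

The cap keeps a long outage from pushing waits into minutes, and the jitter matters mostly when several generation workers run in parallel.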

Streaming (large dataset)

async function* generateStream(count: number) {
  for (let i = 0; i < count; i += 50) {
    const batch = await generate(Math.min(50, count - i));
    for (const item of batch) yield item;
  }
}

for await (const item of generateStream(10000)) {
  await db.insert(item);
}

Tools

- Mockaroo (web): schema → CSV/JSON
- Faker.js / Faker (Python)
- gretel.ai: privacy-preserving synthetic
- SDV (Synthetic Data Vault): tabular ML
- LLM (GPT-4o, Claude, local)

Best practices

1. Schema first (Zod / Pydantic)
2. Diverse prompts (variation)
3. Validate that the output resembles the real distribution
4. Verify privacy (no PII leak)
5. Version synthetic datasets too
6. Cost monitoring

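🤔 Decision Criteria

| Use case | Recommendation |
| --- | --- |
| Unit test | Faker (deterministic) |
| E2E test | Faker + LLM combined |
| Demo / sandbox | LLM (realistic) |
| ML training | LLM + augmentation |
| Privacy preservation | gretel / Mostly AI |
| Large volume | Faker (cost) |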

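Anti-patterns

  • Treating real PII as synthetic without transforming it: privacy violation.
  • Using an LLM for everything (high cost): Faker is often good enough.
  • Assuming the distribution matches real data: validate it.
  • No reproducibility: flaky tests.
  • No seed (random): different results on every run.
  • No edge cases: only typical cases get generated.
  • Shipping to production after testing only on synthetic data: it is not real data.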

🤖 LLM Usage Hints

  • Schema-driven (Zod) + LLM = realistic.
  • Faker (cheap) + LLM (creative) hybrid.
  • Diverse prompt (multiple variation).
  • Privacy-aware (no PII generation).

🔗 Related Documents