2nd/10_Wiki/Topics/AI_and_ML/Data-Flywheel-Effect.md
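koriweb d8a80f6272 chore(wiki): normalize dangling links to canonical titles (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs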

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00

id: wiki-2026-0508-data-flywheel
title: Data Flywheel Effect
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: data flywheel, network effect, data moat, AI moat, defensibility, cold start
duplicate_of: none
source_trust_level: B
confidence_score: 0.88
verification_status: applied
tags: business-strategy, data-flywheel, moat, network-effect, ai-strategy, cold-start, defensibility
raw_sources: (none listed)
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: business strategy
  applicable_to: AI Product Strategy, Defensibility, Growth

Data Flywheel Effect

In one line

"Model → product → users → data → better model." This loop is the source of a defensible moat for AI products. The cold start is the hardest part. In the LLM era it becomes a quality flywheel (RLHF, user feedback). Critique: data quantity alone is not a moat.

Core cycle

  1. Better model.
  2. Better product / UX.
  3. More users.
  4. More data (interaction).
  5. Model improvement, which feeds back into step 1.

Conditions for a flywheel

  • Network effect of data: data from one user benefits other users.
  • Reinvestment: collected data is actually fed back into model improvement.
  • Speed: how quickly each turn of the cycle completes.
  • Quality matters: more noise degrades the model.
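
To make the interplay of these conditions concrete, here is a toy simulation of the cycle. It is only a sketch: the function name, the 10 interactions per user, and every coefficient are illustrative assumptions, not measured values or part of the original note.

```python
def simulate_flywheel(cycles, reinvest_rate, noise_rate, users=1000.0, quality=0.5):
    """Toy flywheel: users create data, clean data improves the model,
    and model quality drives user growth. All constants are assumptions."""
    history = []
    for _ in range(cycles):
        data = users * 10                    # assumed: 10 interactions per user per cycle
        clean = data * (1 - noise_rate)      # only the non-noisy share is useful
        # Reinvested clean data improves quality with diminishing returns.
        gain = 0.05 * reinvest_rate * clean / (clean + 10_000)
        penalty = 0.02 * noise_rate          # noise degrades the model
        quality = min(1.0, max(0.0, quality + gain - penalty))
        # Users grow when quality is above 0.5 and shrink below it.
        users *= 1 + 0.2 * (quality - 0.5)
        history.append((round(users), round(quality, 3)))
    return history


if __name__ == "__main__":
    print("reinvesting:", simulate_flywheel(5, reinvest_rate=1.0, noise_rate=0.0)[-1])
    print("hoarding:   ", simulate_flywheel(5, reinvest_rate=0.0, noise_rate=0.0)[-1])
    print("noisy:      ", simulate_flywheel(5, reinvest_rate=0.0, noise_rate=0.5)[-1])
```

With no reinvestment the loop stands still even though data keeps accumulating, which is the "data hoarding" failure mode in miniature.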

Examples

Strong flywheel

  • Google Search: clicks → ranking.
  • Tesla FSD: miles driven → model.
  • Spotify: listens → recommendations.
  • Waze: traffic reports → routing.
  • Duolingo: mistakes → spaced repetition (SRS).

Weak / Failed

  • Many AI startups: data is collected but never used.
  • Generic chatbots: no user feedback is collected.

Moat strength factors

  • Data exclusivity: data only you own.
  • Data quality: noise filtering.
  • Data freshness: update speed.
  • Network density: interaction among users.
  • Switching cost: lock-in.
  • Privacy compliance: GDPR.

Cold start strategies

  1. Hand-curate: handle the first 1,000 users manually.
  2. Synthetic data: simulate the data you do not have yet.
  3. Open data: Wikipedia, Common Crawl.
  4. Acquisition: buy datasets.
  5. Lighthouse customer: data from one large customer.
  6. Product-led growth: a free tier.

Modern (LLM era)

  • RLHF: collect user preferences.
  • Feedback signals: thumbs up / down (explicit), dwell time (implicit).
  • A/B: compare model variants.
  • User correction: manual edits.

Risks

  • Bias amplification: the model reinforces the biases of its own user base.
  • Echo chamber: outputs narrow over time.
  • Privacy: PII.
  • Regulatory: EU AI Act.
  • Model collapse: training on synthetic data.
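
One common mitigation for the model-collapse risk is to cap how much of a training mix may be synthetic. The sketch below assumes a hypothetical helper `mix_training_data` and an arbitrary 30% default cap; neither comes from the original note.

```python
import random


def mix_training_data(real, synthetic, max_synthetic_share=0.3, seed=0):
    """Return real examples plus a random sample of synthetic ones, so that
    synthetic data makes up at most `max_synthetic_share` of the result."""
    if not 0 <= max_synthetic_share < 1:
        raise ValueError("max_synthetic_share must be in [0, 1)")
    # s / (r + s) <= p  is equivalent to  s <= p * r / (1 - p)
    limit = int(max_synthetic_share * len(real) / (1 - max_synthetic_share))
    rng = random.Random(seed)  # seeded so the mix is reproducible
    sampled = rng.sample(synthetic, min(limit, len(synthetic)))
    return list(real) + sampled


if __name__ == "__main__":
    real = [("real", i) for i in range(70)]
    synthetic = [("syn", i) for i in range(100)]
    mixed = mix_training_data(real, synthetic)
    n_syn = sum(1 for tag, _ in mixed if tag == "syn")
    print(f"{n_syn} synthetic of {len(mixed)} total")
```

The cap is tied to the amount of real data, so the synthetic budget only grows as the flywheel brings in real interactions.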

Critique

  • "Data is not the new oil — it's the new sand." (cheap, abundant)
  • In the LLM era, base models are commoditized.
  • Quality > quantity.
  • Differentiation happens at the application layer.

💻 Patterns

Flywheel measurement

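def flywheel_health(metrics):
    """Year-over-year flywheel indicators; `metrics` is any object exposing the attributes used below."""
    return {
        # The three growth figures are relative changes (0.25 means +25%).
        'data_growth_rate': (metrics.data_now - metrics.data_year_ago) / metrics.data_year_ago,
        'model_improvement_rate': (metrics.eval_now - metrics.eval_year_ago) / metrics.eval_year_ago,
        'user_growth_rate': (metrics.users_now - metrics.users_year_ago) / metrics.users_year_ago,
        'data_per_user': metrics.data_now / metrics.users_now,
        # Share of interactions that produced any feedback signal.
        'feedback_rate': metrics.feedback_count / metrics.user_interaction_count,
    }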
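
The function above reads its inputs as attributes and divides by year-ago values, which are zero for a product younger than a year. A possible input container and a guarded growth helper are sketched here; `FlywheelMetrics` and `growth_rate` are hypothetical names that mirror the attribute names used above.

```python
from dataclasses import dataclass


@dataclass
class FlywheelMetrics:
    """Assumed shape of the `metrics` argument (field names mirror the code above)."""
    data_now: float
    data_year_ago: float
    eval_now: float
    eval_year_ago: float
    users_now: float
    users_year_ago: float
    feedback_count: int
    user_interaction_count: int


def growth_rate(now, before):
    """Relative change from `before` to `now`; None when there is no baseline."""
    if before == 0:
        return None
    return (now - before) / before


if __name__ == "__main__":
    m = FlywheelMetrics(150.0, 100.0, 0.8, 0.75, 1200.0, 1000.0, 30, 600)
    print(growth_rate(m.data_now, m.data_year_ago))
    print(growth_rate(m.users_now, 0))
```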

Implicit feedback collection

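from datetime import datetime, timezone

def collect_implicit_feedback(user_id, response_id, signal_type, value):
    """Store one implicit signal: dwell time, scroll depth, copy, share."""
    # `db` is assumed to be an application-level database handle (not defined in this note).
    db.feedback.insert({
        'user_id': user_id,
        'response_id': response_id,
        'signal': signal_type,  # 'dwell', 'copy', 'share', 'edit'
        'value': value,
        'timestamp': datetime.now(timezone.utc),  # timezone-aware UTC timestamp
    })

# Heuristic: dwell > 30 sec → positive signal.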
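
The 30-second dwell heuristic can be turned into a small labeling function. This is a sketch only: treating `copy` and `share` as positive and `edit` as negative are assumptions made for illustration, and `label_signal` is a hypothetical name.

```python
DWELL_POSITIVE_SECONDS = 30  # threshold taken from the heuristic above


def label_signal(signal_type, value):
    """Map a raw implicit signal to +1 (positive), -1 (negative) or 0 (neutral)."""
    if signal_type == "dwell":
        return 1 if value > DWELL_POSITIVE_SECONDS else 0
    if signal_type in ("copy", "share"):
        # Assumption: reusing a response means the user found it valuable.
        return 1 if value else 0
    if signal_type == "edit":
        # Assumption: a manual edit means the response needed correction.
        return -1 if value else 0
    raise ValueError(f"unknown signal type: {signal_type!r}")


if __name__ == "__main__":
    for signal, value in [("dwell", 45), ("dwell", 5), ("copy", True), ("edit", True)]:
        print(signal, value, label_signal(signal, value))
```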

RLHF data pipeline

def rlhf_pipeline():
    """Outline of the loop; every helper called here is a placeholder, not a real API."""
    # 1. Collect user interactions
    interactions = collect_interactions()

    # 2. Build preference pairs
    pairs = []
    for i in interactions:
        if i.has_thumbs_up_and_down_in_session:
            pairs.append({
                'prompt': i.prompt,
                'chosen': i.thumbs_up_response,
                'rejected': i.thumbs_down_response,
            })

    # 3. Quality filter
    pairs = filter_quality(pairs)

    # 4. Train with DPO / RLHF
    train_dpo(pairs)

    # 5. Shadow deploy
    shadow_test_new_model()

    # 6. Gradual rollout
    canary_deploy(percentage=5)
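
Step 2 of the pipeline leaves pair construction abstract. A minimal sketch of how such pairs could be built from raw rating events follows, assuming each event is a dict with the hypothetical keys `session_id`, `prompt`, `response`, and `rating` (`'up'` or `'down'`).

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(events):
    """Pair every thumbs-up response with every thumbs-down response
    given to the same prompt within the same session."""
    grouped = defaultdict(lambda: {"up": [], "down": []})
    for e in events:
        if e["rating"] in ("up", "down"):
            grouped[(e["session_id"], e["prompt"])][e["rating"]].append(e["response"])

    pairs = []
    for (_, prompt), rated in grouped.items():
        for chosen, rejected in product(rated["up"], rated["down"]):
            if chosen != rejected:  # skip contradictory ratings of the same text
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    events = [
        {"session_id": "s1", "prompt": "Summarize X", "response": "A", "rating": "up"},
        {"session_id": "s1", "prompt": "Summarize X", "response": "B", "rating": "down"},
        {"session_id": "s2", "prompt": "Summarize X", "response": "C", "rating": "up"},
    ]
    print(build_preference_pairs(events))
```

Restricting pairs to one session and one prompt keeps the comparison like-for-like; pairing across users would mix different intents.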

Cold start: synthetic data

def bootstrap_cold_start(use_case, n=1000):
    """Generate synthetic examples to train a first model."""
    # `llm` and `generate_seed_for` are assumed helpers, not defined in this note.
    examples = []
    for _ in range(n):
        seed = generate_seed_for(use_case)  # vary each prompt so examples are not duplicates
        synthetic = llm.generate(f"""Generate a realistic example for: {use_case}
Seed: {seed}
Input: ...
Expected output: ...""")
        examples.append(synthetic)
    return examples

A/B test (model improvement signal)

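import hashlib

def ab_test_model(model_old, model_new, traffic_pct=10):
    def assign(user_id):
        # Built-in hash() is salted per process for strings, so use a stable digest instead.
        bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
        return 'new' if bucket < traffic_pct else 'old'

    metrics = collect_metrics_by_variant(assign)
    # Promote only when the difference is significant and favors the new model.
    if statistical_significance(metrics) and metrics['new'] > metrics['old']:
        promote(model_new)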
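
The significance check is left abstract above. One standard option, when the metric is a success rate (for example, the thumbs-up rate per variant), is a two-proportion z-test. The sketch below uses only the standard library; the function name and argument layout are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_old, n_old, success_new, n_new):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    p_old = success_old / n_old
    p_new = success_new / n_new
    pooled = (success_old + success_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:  # all successes or all failures: no evidence of a difference
        return 0.0, 1.0
    z = (p_new - p_old) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z_test(success_old=100, n_old=1000, success_new=150, n_new=1000)
    print(f"z={z:.2f}, p={p:.4f}")
```

A positive `z` with a small `p` favors the new model. The normal approximation needs reasonably large samples in both variants, so very small canary groups call for a different test.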

Data quality scoring

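def score_training_example(example, base_model):
    """Estimate the quality of one training example as a weighted score in [0, 1]."""
    # Each helper is assumed to return a bool or a value in [0, 1]; `base_model` is unused in this sketch.
    score = 0
    score += has_diverse_vocab(example) * 0.2
    score += not_repetitive(example) * 0.2
    score += factually_consistent(example) * 0.3
    score += task_clarity(example) * 0.3
    return score

# Select the top-K scored examples for training.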
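
The top-K selection step can be sketched with `heapq.nlargest`. The helper name `select_top_k` and the optional `min_score` floor are assumptions; `score_fn` stands in for any scoring function such as the one above.

```python
import heapq


def select_top_k(examples, score_fn, k, min_score=0.0):
    """Return up to k examples with the highest score, best first.
    Each example is scored once; ties keep their original order."""
    scored = [(score_fn(ex), i, ex) for i, ex in enumerate(examples)]
    kept = [t for t in scored if t[0] >= min_score]
    # Sort key: higher score first, then lower original index.
    best = heapq.nlargest(k, kept, key=lambda t: (t[0], -t[1]))
    return [ex for _, _, ex in best]


if __name__ == "__main__":
    # String length stands in for a real quality score.
    print(select_top_k(["a", "bbb", "cc", "dd"], len, k=2))
```

Scoring each example once matters when the scorer itself calls a model.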

Privacy-preserving learning

# Federated learning
def federated_update(global_model, client_data_chunks):
    local_updates = []
    for client_chunk in client_data_chunks:
        local_model = global_model.copy()
        local_model.train(client_chunk)  # training runs on the client's own data
        local_updates.append(local_model.weights - global_model.weights)

    # Only the averaged update is applied; raw data never leaves the client.
    global_model.weights += avg(local_updates)
    return global_model
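
The aggregation step can be made concrete without a model class. The sketch below averages client deltas weighted by each client's sample count, as federated averaging is usually described; the original code uses a plain mean, so the weighting is a variation. Weights are plain Python lists, and `federated_average` is a hypothetical name.

```python
def federated_average(global_weights, client_updates):
    """Apply the sample-weighted mean of client deltas to the global weights.

    client_updates: list of (delta, n_samples) pairs, where each delta has
    the same length as global_weights.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        return list(global_weights)  # nothing to aggregate
    new_weights = list(global_weights)
    for delta, n in client_updates:
        for i, d in enumerate(delta):
            new_weights[i] += d * n / total
    return new_weights


if __name__ == "__main__":
    updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]  # second client has 3x the data
    print(federated_average([0.0, 0.0], updates))
```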

Defensibility audit

def defensibility_score(metrics):
    score = 0
    if metrics.proprietary_data_exclusivity: score += 3
    if metrics.user_lock_in > 0.5: score += 2
    if metrics.network_density > 0.7: score += 2
    if metrics.data_quality_unique: score += 2
    if metrics.regulatory_barrier: score += 1
    return f'Moat strength: {score}/10'

Decision criteria

Situation | Strategy
Cold start | Synthetic + open data + lighthouse customer
Growing | Implicit feedback + A/B
Scale | RLHF + automation
Sensitive | Federated + DP
Specialized | Quality > quantity (curate)
Generic | Network effect (UGC)
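
For the "Federated + DP" row, the usual building block is to clip each client update to a maximum L2 norm and add Gaussian noise before aggregation. The sketch below shows only that step, with a hypothetical function name and arbitrary defaults. It does not by itself give a formal privacy guarantee: the noise scale has to be calibrated to the clip norm and tracked with privacy accounting, neither of which is shown.

```python
import math
import random


def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip `update` to L2 norm <= clip_norm, then add Gaussian noise per coordinate."""
    rng = rng or random.Random()
    norm = math.sqrt(sum(x * x for x in update))
    # Scale down only when the update exceeds the clip norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale + rng.gauss(0.0, noise_std) for x in update]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    print(privatize_update([3.0, 4.0], clip_norm=1.0, noise_std=0.1, rng=rng))
```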

Default: implicit feedback + quality classifier + RLHF + A/B test.

🔗 Graph

🤖 LLM usage

When to use: AI startup strategy, product roadmaps, moat assessment, fundraising differentiators. When not to use: commodity products (no flywheel possible).

Anti-patterns

  • Data hoarding (no use): no flywheel.
  • Ignoring quality: noise gets amplified.
  • No feedback collection: the cycle breaks.
  • Privacy violation: regulatory exposure plus loss of trust.
  • Unconditional belief that "data is the moat": LLMs are a commodity.
  • Synthetic data only: model collapse.

🧪 Verification / Duplicates

🕓 Changelog

Date | Change
2026-05-08 | Phase 1
2026-05-10 | Manual cleanup: cycle + cold start + RLHF / A/B / federated / quality code