| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-data-flywheel | Data Flywheel Effect | 10_Wiki/Topics | verified | self | | none | B | 0.88 | applied | | | 2026-05-10 | pending | |
Data Flywheel Effect
One-liner
"Model → product → users → data → better model." This loop is the source of a defensible moat in AI. The cold start is the hardest part. In the LLM era it becomes a quality flywheel (RLHF, user feedback). Critique: data quantity alone is not a moat.
Core cycle
1. Better model.
2. Better product / UX.
3. More users.
4. More data (interaction).
5. Model improvement, which feeds back into step 1.
Conditions for a flywheel
- Network effect of data: what user 1 contributes benefits user 2.
- Reinvestment: collected data is fed back into the model-improvement loop.
- Speed: how quickly each cycle completes.
- Quality matters: more noise degrades the model.
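The conditions above can be made concrete with a toy simulation. This is a minimal sketch, not a calibrated model: the name `simulate_flywheel`, the saturation constant `10_000.0`, and the growth coefficient `0.2` are all illustrative assumptions. It only shows the direction of the effects: more reinvestment and less noise compound into higher quality.

```python
def simulate_flywheel(cycles, reinvest=0.5, noise=0.1, users=1000.0, quality=0.5):
    """Toy flywheel: users create data, clean data lifts quality, quality attracts users.

    Every constant here is an illustrative assumption, not a measured value.
    `reinvest` and `noise` are expected to lie in [0, 1].
    """
    history = []
    for _ in range(cycles):
        data = users                      # assume one interaction per user per cycle
        clean = data * (1.0 - noise)      # noisy interactions are discarded
        # Diminishing returns: the gain shrinks as quality approaches 1.0.
        gain = reinvest * (1.0 - quality) * clean / (clean + 10_000.0)
        quality += gain
        users *= 1.0 + 0.2 * quality      # a better product attracts more users
        history.append((users, quality))
    return history


if __name__ == "__main__":
    for label, noise in (("clean", 0.1), ("noisy", 0.9)):
        final_users, final_quality = simulate_flywheel(10, noise=noise)[-1]
        print(f"{label}: users={final_users:.0f} quality={final_quality:.3f}")
```

Setting `reinvest=0.0` leaves quality flat while users still grow, which is the "data hoarding" case: data accumulates but the loop never closes.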
Examples
Strong flywheel
- Google Search: clicks → ranking.
- Tesla FSD: miles driven → model.
- Spotify: listens → recommendations.
- Waze: traffic reports → routing.
- Duolingo: mistakes → SRS (spaced repetition scheduling).
Weak / Failed
- Many AI startups: data is collected but never used.
- Generic chatbots: no user feedback is captured.
Moat strength factors
- Data exclusivity: data that only you own.
- Data quality: noise filtering.
- Data freshness: update speed.
- Network density: how much users interact with each other.
- Switching cost: lock-in.
- Privacy compliance: GDPR.
Cold start strategies
- Hand-curate: serve the first 1,000 users manually.
- Synthetic data: simulate the interactions you do not have yet.
- Open data: Wikipedia, CommonCrawl.
- Acquisition: buy a dataset.
- Lighthouse customer: bootstrap from one large customer's data.
- Product-led growth: a free tier.
Modern (LLM era)
- RLHF: collect user preferences.
- Implicit feedback: dwell time and similar behavioral signals, complemented by explicit thumbs up / down.
- A/B: compare model variants.
- User correction: manual edits.
Risks
- Bias amplification: the biases of your own user base get reinforced.
- Echo chamber: the data narrows over time.
- Privacy: PII.
- Regulatory: EU AI Act.
- Model collapse: training on synthetic data.
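One common guard against the model-collapse risk listed above is to cap the share of synthetic examples in any training mix. The sketch below assumes a hypothetical helper `mix_training_set`; the default cap of 0.3 is an illustrative assumption, not a recommended value.

```python
def mix_training_set(real, synthetic, max_synthetic_ratio=0.3):
    """Return real + synthetic examples with the synthetic share capped.

    The cap value is an assumption for illustration; tune it per use case.
    """
    if not 0.0 <= max_synthetic_ratio < 1.0:
        raise ValueError("max_synthetic_ratio must be in [0, 1)")
    # Largest k with k / (len(real) + k) <= ratio,
    # i.e. k <= ratio * len(real) / (1 - ratio).
    k = int(max_synthetic_ratio * len(real) / (1.0 - max_synthetic_ratio))
    # Guard against floating-point rounding pushing the share over the cap.
    while k > 0 and k / (len(real) + k) > max_synthetic_ratio:
        k -= 1
    return list(real) + list(synthetic[:k])
```

With no real data the function returns an empty mix, which makes a synthetic-only training set an explicit decision rather than an accident.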
Critique
- "Data is not the new oil — it's the new sand." (cheap, abundant)
- Base models are commoditizing in the LLM era.
- Quality > quantity.
- Differentiation happens at the application layer.
💻 Patterns
Flywheel measurement
def flywheel_health(metrics):
    """Year-over-year flywheel indicators; `metrics` is any object exposing the fields below."""
    return {
        'data_growth_rate': (metrics.data_now - metrics.data_year_ago) / metrics.data_year_ago,
        'model_improvement_rate': (metrics.eval_now - metrics.eval_year_ago) / metrics.eval_year_ago,
        # Relative growth, consistent with the two rates above.
        'user_growth_rate': (metrics.users_now - metrics.users_year_ago) / metrics.users_year_ago,
        'data_per_user': metrics.data_now / metrics.users_now,
        'feedback_rate': metrics.feedback_count / metrics.user_interaction_count,
    }
Implicit feedback collection
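from datetime import datetime, timezone

def collect_implicit_feedback(user_id, response_id, signal_type, value):
    """Record one implicit signal: dwell time, scroll depth, copy, share."""
    # `db` is assumed to be an already-configured document-store client.
    db.feedback.insert({
        'user_id': user_id,
        'response_id': response_id,
        'signal': signal_type,  # 'dwell', 'copy', 'share', 'edit'
        'value': value,
        'timestamp': datetime.now(timezone.utc),  # timezone-aware UTC
    })

# Heuristic: dwell > 30 sec counts as a positive signal.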
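The dwell-time heuristic can be turned into a small labeling function. This is a sketch under assumptions: only the 30-second dwell threshold comes from the note above, while the treatment of `copy`, `share`, and `edit` and the name `label_signal` are illustrative choices.

```python
POSITIVE_DWELL_SECONDS = 30  # threshold taken from the heuristic above


def label_signal(signal_type, value):
    """Map one implicit signal to +1 (positive), -1 (negative), or 0 (no evidence)."""
    if signal_type == "dwell":
        return 1 if value > POSITIVE_DWELL_SECONDS else 0
    if signal_type in ("copy", "share"):
        return 1   # assumption: reusing the answer implies it was useful
    if signal_type == "edit":
        return -1  # assumption: a manual correction implies the answer fell short
    return 0       # unknown signals carry no evidence either way
```

Short dwell times map to 0 rather than -1 because a quick exit can also mean the answer was found immediately.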
RLHF data pipeline
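def rlhf_pipeline():
    """End-to-end feedback loop; every helper called here is assumed to exist elsewhere."""
    # 1. Collect user interactions.
    interactions = collect_interactions()
    # 2. Build preference pairs from sessions that contain both a thumbs-up and a thumbs-down.
    pairs = []
    for i in interactions:
        if i.has_thumbs_up_and_down_in_session:
            pairs.append({
                'prompt': i.prompt,
                'chosen': i.thumbs_up_response,
                'rejected': i.thumbs_down_response,
            })
    # 3. Filter low-quality pairs.
    pairs = filter_quality(pairs)
    # 4. Train with DPO / RLHF.
    train_dpo(pairs)
    # 5. Shadow deploy: run the new model alongside production without serving its output.
    shadow_test_new_model()
    # 6. Gradual rollout, starting at 5% of traffic.
    canary_deploy(percentage=5)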
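Step 2 of the pipeline can be sketched in a self-contained form. The `Feedback` record and `build_preference_pairs` are hypothetical names introduced for illustration; the only thing taken from the pipeline above is the output shape (`prompt`, `chosen`, `rejected`).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One explicit vote on one response (hypothetical record shape)."""
    session_id: str
    prompt: str
    response: str
    thumbs_up: bool


def build_preference_pairs(events):
    """Pair each thumbs-up response with each thumbs-down response
    given to the same prompt within the same session."""
    grouped = defaultdict(lambda: {"up": [], "down": []})
    for e in events:
        key = (e.session_id, e.prompt)
        grouped[key]["up" if e.thumbs_up else "down"].append(e.response)

    pairs = []
    for (_, prompt), votes in grouped.items():
        for chosen in votes["up"]:
            for rejected in votes["down"]:
                if chosen != rejected:  # skip contradictory votes on identical text
                    pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Grouping by session and prompt keeps the comparison fair: both responses answered the same question for the same user, so the vote difference reflects the responses rather than the context.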
Cold start: synthetic data
def bootstrap_cold_start(use_case, n=1000):
    """Generate synthetic examples to train a first model."""
    # `llm` and `generate_seed_for` are assumed to be provided elsewhere.
    examples = []
    for _ in range(n):
        seed = generate_seed_for(use_case)  # varies the prompt so outputs differ
        synthetic = llm.generate(
            f"Generate a realistic example for: {use_case}\n"
            f"Seed: {seed}\n"
            "Input: ...\n"
            "Expected output: ..."
        )
        examples.append(synthetic)
    return examples
A/B test (model improvement signal)
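import hashlib

def ab_test_model(model_old, model_new, traffic_pct=10):
    def assign(user_id):
        # Built-in hash() is salted per process for strings, so use a stable digest.
        bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
        return 'new' if bucket < traffic_pct else 'old'

    # `collect_metrics_by_variant`, `statistical_significance`, `promote` are assumed helpers.
    metrics = collect_metrics_by_variant(assign)
    if statistical_significance(metrics) and metrics['new'] > metrics['old']:
        promote(model_new)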
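The A/B pattern leaves the significance check undefined. For a conversion-style metric (for example, the share of responses that receive a thumbs-up), one standard option is a two-proportion z-test. The sketch below is an assumption about how that check could look, not the method the pattern prescribes; `assign_variant` and `two_proportion_z_test` are hypothetical names.

```python
import hashlib
import math


def assign_variant(user_id: str, traffic_pct: int = 10) -> str:
    """Deterministically bucket a user; the same ID always gets the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < traffic_pct else "old"


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

A promotion rule would then read `z > 0 and p_value < 0.05`, where the 0.05 threshold is a conventional choice rather than something the source specifies.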
Data quality scoring
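def score_training_example(example, base_model):
    """Estimate the quality of one training example as a weighted sum."""
    # The four helpers are assumed to exist and to return values in [0, 1];
    # the weights sum to 1.0, so the score also lies in [0, 1].
    score = 0
    score += has_diverse_vocab(example) * 0.2
    score += not_repetitive(example) * 0.2
    score += factually_consistent(example) * 0.3
    score += task_clarity(example) * 0.3
    return score

# Select the top-K scored examples for training.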
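The top-K selection mentioned above can be sketched with the standard library. `select_top_k` and the stand-in scorer `vocab_diversity` are hypothetical; any scoring function that returns a number can be passed instead.

```python
import heapq


def vocab_diversity(text):
    """Stand-in scorer: share of distinct words (an assumption for the demo)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def select_top_k(examples, score_fn, k, min_score=0.0):
    """Return up to k examples scoring at least min_score, best first."""
    scored = [(score_fn(e), e) for e in examples]
    kept = [pair for pair in scored if pair[0] >= min_score]
    best = heapq.nlargest(k, kept, key=lambda pair: pair[0])
    return [example for _, example in best]
```

The `min_score` floor matters when the pool is small: without it, top-K will happily return low-quality examples just to fill the quota.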
Privacy-preserving learning
import numpy as np

# Federated learning: unweighted averaging of client updates.
# Assumes `weights` is a NumPy array and `copy()` returns an independent model.
def federated_update(global_model, client_data_chunks):
    local_updates = []
    for client_chunk in client_data_chunks:
        local_model = global_model.copy()
        local_model.train(client_chunk)
        local_updates.append(local_model.weights - global_model.weights)
    # Only the averaged update is applied; raw data never leaves the client.
    global_model.weights += np.mean(local_updates, axis=0)
    return global_model
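The same averaging step can be checked by hand with a dependency-free version. This is a minimal sketch over plain lists of floats; `federated_round` and `toy_local_train` are hypothetical names, and the "training" is a stand-in that moves a single weight halfway toward the local data mean.

```python
def federated_round(global_weights, client_datasets, local_train):
    """One round of unweighted update averaging over plain lists of floats."""
    if not client_datasets:
        return list(global_weights)
    deltas = []
    for data in client_datasets:
        # Each client trains on a copy; only the weight delta is shared.
        local = local_train(list(global_weights), data)
        deltas.append([l - g for l, g in zip(local, global_weights)])
    avg_delta = [sum(column) / len(deltas) for column in zip(*deltas)]
    return [g + d for g, d in zip(global_weights, avg_delta)]


def toy_local_train(weights, data):
    """Stand-in for real training: move the single weight halfway to the local mean."""
    mean = sum(data) / len(data)
    return [weights[0] + 0.5 * (mean - weights[0])]
```

Starting from `[0.0]` with client data `[2.0, 2.0]` and `[4.0]`, the local results are 1.0 and 2.0, so the averaged update moves the global weight to 1.5. Weighting clients by sample count, as FedAvg does, would be a natural extension.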
Defensibility audit
def defensibility_score(metrics):
    # Weighted checklist; the maximum total is 10.
    score = 0
    if metrics.proprietary_data_exclusivity:
        score += 3
    if metrics.user_lock_in > 0.5:
        score += 2
    if metrics.network_density > 0.7:
        score += 2
    if metrics.data_quality_unique:
        score += 2
    if metrics.regulatory_barrier:
        score += 1
    return f'Moat strength: {score}/10'
Decision criteria
| Situation | Strategy |
|---|---|
| Cold start | Synthetic + open data + lighthouse customer |
| Growing | Implicit feedback + A/B |
| Scale | RLHF + automation |
| Sensitive | Federated + DP |
| Specialized | Quality > quantity (curate) |
| Generic | Network effect (UGC) |
Default: implicit feedback + quality classifier + RLHF + A/B test.
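The decision table can be encoded as a simple lookup so that a planning script or agent can query it. This is a sketch: the key names and the function `recommend_strategy` are assumptions, while the strategies and the fallback come from the table and default above.

```python
STRATEGY_BY_SITUATION = {
    "cold_start": ["synthetic data", "open data", "lighthouse customer"],
    "growing": ["implicit feedback", "A/B test"],
    "scale": ["RLHF", "automation"],
    "sensitive": ["federated learning", "differential privacy"],
    "specialized": ["curation (quality > quantity)"],
    "generic": ["network effect (UGC)"],
}

DEFAULT_STRATEGY = ["implicit feedback", "quality classifier", "RLHF", "A/B test"]


def recommend_strategy(situation):
    """Return the strategies for a situation, falling back to the default set."""
    key = situation.strip().lower().replace(" ", "_")
    return STRATEGY_BY_SITUATION.get(key, DEFAULT_STRATEGY)
```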
🔗 Graph
- Parent: Defensibility
- Variants: Network-Effect · Data-Moat · Cold-Start
- Applications: RLHF · DPO · Federated-Learning
- Adjacent: Concept-Drift · Cost-Benefit Analysis in AI · Asset-Specific-Knowledge · Algorithmic Fairness
🤖 LLM usage
When: AI startup strategy, product roadmaps, moat assessment, and articulating a fundraising differentiator. When not: commodity products where no flywheel is possible.
❌ Anti-patterns
- Data hoarding (no use): no flywheel.
- Ignoring quality: noise gets amplified.
- No feedback collection: the cycle breaks.
- Privacy violation: regulatory exposure plus loss of trust.
- Unconditional faith that "data is the moat": LLMs are commoditizing.
- Synthetic data only: model collapse.
🧪 Verification / duplicates
- Verified (Andreessen Horowitz "Data Network Effects", Reid Hoffman, Tesla / Google case studies).
- Trust level: B.
- Related: Cost-Benefit Analysis in AI · Concept-Drift · Asset-Specific-Knowledge · CV_Synthesis · Algorithmic Fairness.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: cycle + cold start + RLHF / A/B / federated / quality code |