---
id: wiki-2026-0508-comment-harvester
title: comment_harvester (YouTube/Reddit Comment Scraper)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Comment Scraper, Comment Pipeline, Social Comment ETL]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [scraping, youtube-api, reddit-api, etl, sentiment, llm-pipeline]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python 3.12 / TypeScript
  framework: yt-dlp + YouTube Data API v3 / PRAW / DuckDB
---

# comment_harvester (YouTube/Reddit Comment Scraper)

## In one line

> **"A pipeline that fetches comments from YouTube, Reddit, and similar platforms through paginated APIs, normalizes and stores them, then enriches them with batch LLM sentiment and topic extraction."** Standard 2026 stack: yt-dlp / YouTube Data API + PRAW + DuckDB + Claude/GPT batch API. Use cases: market research, content idea mining, brand monitoring.

## Core

### Source matrix

| Platform | Auth | Rate limit | Library |
|---|---|---|---|
| YouTube | OAuth/API key | 10k units/day default | google-api-python-client, yt-dlp |
| Reddit | OAuth (PRAW) | 100 req/min | praw, asyncpraw |
| Twitter/X | API tier (paid) | varies | tweepy |
| TikTok | unofficial | volatile | TikTokApi |
| Instagram | private API | very volatile | instagrapi |

### Pipeline stages

1. **Source resolve**: video URL/ID, subreddit, channel.
2. **Fetch**: paginated, with `nextPageToken` (YouTube) / `after` (Reddit).
3. **Normalize**: `{ id, parentId, author, text, ts, likes, replies, sourceMeta }`.
4. **Dedupe + store**: DuckDB / Postgres.
5. **Enrich**: LLM batch (sentiment, topic, language, toxicity).
6. **Serve**: SQL / Streamlit / API.

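
One way to pin down the normalized record from step 3 is a small dataclass plus a converter. This is a minimal sketch: the field names follow the schema listed above, while `Comment`, `parse_ts`, `normalize`, and the `[deleted]` placeholder are hypothetical choices, not part of any platform SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Comment:
    id: str
    parent_id: Optional[str]
    author: str
    text: str
    ts: datetime  # always timezone-aware UTC
    likes: int
    replies: int = 0
    source_meta: dict[str, Any] = field(default_factory=dict)


def parse_ts(value) -> datetime:
    """Accept ISO-8601 strings (YouTube style) or epoch seconds (Reddit style)."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def normalize(raw: dict, source: str) -> Comment:
    """Map a raw fetcher dict onto the shared schema (hypothetical helper)."""
    return Comment(
        id=str(raw["id"]),
        parent_id=str(raw["parent_id"]) if raw.get("parent_id") else None,
        author=raw.get("author") or "[deleted]",  # assumed placeholder for missing authors
        text=(raw.get("text") or "").strip(),
        ts=parse_ts(raw["ts"]),
        likes=int(raw.get("likes") or 0),
        replies=int(raw.get("replies") or 0),
        source_meta={"source": source},
    )
```

Converting every timestamp to timezone-aware UTC at this stage keeps later comparisons (for example in an incremental run) free of mixed-format bugs.
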
### Ethical / legal

- Collect public comments only; respect robots.txt and each platform's ToS.
- Redact PII (emails, phone numbers) that appears in comment text.
- GDPR: honor deletion requests.
- Commercial use is subject to platform-specific restrictions; read the terms carefully.

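
The PII redaction point can be sketched with two regular expressions. This is an illustrative, deliberately loose filter rather than a complete PII solution: the patterns, the `[EMAIL]` / `[PHONE]` placeholders, and the 9 to 15 digit threshold are all assumptions, and real deployments usually need locale-aware rules.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Candidate phone numbers: optional "+", then digits mixed with spaces, dots, dashes, or parentheses.
PHONE_RE = re.compile(r"(?<!\w)\+?\(?\d[\d\s().-]{6,}\d(?!\w)")


def _mask_phone(match: re.Match) -> str:
    # Only mask candidates with a plausible digit count, so short dates survive.
    digits = sum(ch.isdigit() for ch in match.group())
    return "[PHONE]" if 9 <= digits <= 15 else match.group()


def redact_pii(text: str) -> str:
    """Replace emails first, then phone-like digit runs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub(_mask_phone, text)
```

Running the redaction before storage means the raw identifiers never reach the database or the LLM enrichment step.
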
### Applications

1. Channel-level sentiment trend (per video).
2. Topic clustering (text embeddings, e.g. voyage-3, + UMAP + HDBSCAN).
3. Auto-FAQ from a creator's recurring questions.
4. Competitor brand mentions.

## 💻 Patterns

### YouTube Data API v3

```python
import os

from googleapiclient.discovery import build

# The API key comes from the environment, never from source code.
yt = build('youtube', 'v3', developerKey=os.environ['YT_KEY'])

def fetch_comments(video_id: str, max_pages=100):
    """Yield top-level comments and their inline replies, newest thread first."""
    page_token = None
    for _ in range(max_pages):
        resp = yt.commentThreads().list(
            part='snippet,replies', videoId=video_id, order='time',
            maxResults=100, pageToken=page_token, textFormat='plainText',
        ).execute()
        for item in resp['items']:
            top = item['snippet']['topLevelComment']['snippet']
            yield {
                'id': item['id'],
                'parent_id': None,
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'ts': top['publishedAt'],
                'likes': top['likeCount'],
            }
            # `replies.comments` may hold only part of a long thread;
            # use comments().list(parentId=...) when every reply is needed.
            for r in item.get('replies', {}).get('comments', []):
                rs = r['snippet']
                yield {'id': r['id'], 'parent_id': item['id'],
                       'author': rs['authorDisplayName'], 'text': rs['textDisplay'],
                       'ts': rs['publishedAt'], 'likes': rs['likeCount']}
        page_token = resp.get('nextPageToken')
        if not page_token:
            break
```

### Reddit (asyncpraw)

```python
import asyncio

import asyncpraw

async def fetch_subreddit(name: str, limit=200):
    # Credentials are elided; load them from the environment or a secret manager.
    async with asyncpraw.Reddit(client_id=..., client_secret=..., user_agent='harvester/1.0') as reddit:
        sub = await reddit.subreddit(name)
        async for submission in sub.new(limit=limit):
            # Listing results carry no comment data; load() fetches the full
            # submission first (assumed necessary for asyncpraw's lazy objects).
            await submission.load()
            await submission.comments.replace_more(limit=0)  # drop "load more" stubs
            for c in submission.comments.list():
                yield {'id': c.id, 'parent_id': c.parent_id, 'author': str(c.author),
                       'text': c.body, 'ts': c.created_utc,  # epoch seconds (UTC)
                       'likes': c.score,
                       'submission_id': submission.id}
```

### DuckDB sink

```python
import duckdb

con = duckdb.connect('comments.db')
con.execute("""
    CREATE TABLE IF NOT EXISTS comments(
        id VARCHAR PRIMARY KEY, parent_id VARCHAR, source VARCHAR,
        source_id VARCHAR, author VARCHAR, text VARCHAR,
        ts TIMESTAMP, likes INT, lang VARCHAR, sentiment FLOAT, topic VARCHAR
    );
""")

def upsert(rows):
    # rows: (id, parent_id, source, source_id, author, text, ts, likes) tuples
    con.executemany(
        "INSERT OR REPLACE INTO comments(id,parent_id,source,source_id,author,text,ts,likes) VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )
```

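
The primary key in the sink above only catches repeated IDs. To avoid paying for enrichment of identical texts posted under different IDs, a content fingerprint can be computed before storing. The sketch below is one possible approach; `text_fingerprint`, `dedupe`, and the `fingerprint` / `duplicate_of` keys are hypothetical names that are not part of the table schema above.

```python
import hashlib
import re


def text_fingerprint(text: str) -> str:
    """SHA-256 of the case-folded, whitespace-collapsed text."""
    canonical = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(comments):
    """Drop repeated IDs and mark repeated texts with the ID of their first occurrence."""
    seen_ids = set()
    first_by_text = {}
    unique = []
    for c in comments:
        if c["id"] in seen_ids:
            continue  # same comment fetched twice
        seen_ids.add(c["id"])
        fp = text_fingerprint(c["text"])
        # duplicate_of is None for the first comment carrying this text
        unique.append({**c, "fingerprint": fp, "duplicate_of": first_by_text.get(fp)})
        first_by_text.setdefault(fp, c["id"])
    return unique
```

Rows with a non-empty `duplicate_of` can skip the LLM call and copy the enrichment result from the comment they point to.
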
### LLM batch enrich (Claude Message Batches)

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `rows` is the list of comment dicts that still need enrichment.
requests = [{
    "custom_id": row['id'],
    "params": {
        "model": "claude-opus-4-7",
        "max_tokens": 200,
        "messages": [{"role": "user", "content":
            f"Output JSON {{lang, sentiment(-1..1), topic(<=3 words)}} for: {row['text']}"}],
    },
} for row in rows]

batch = client.messages.batches.create(requests=requests)
# Poll client.messages.batches.retrieve(batch.id) until processing_status == "ended",
# then iterate client.messages.batches.results(batch.id) and parse each result.
```

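
Model replies to the prompt above are not guaranteed to be clean JSON, so the parsing step benefits from a defensive helper. The following is a minimal sketch that assumes the reply text has already been pulled out of each batch result; `parse_enrichment` and its clamping and truncation rules are hypothetical, chosen to match the `lang`, `sentiment(-1..1)`, and `topic(<=3 words)` fields requested in the prompt.

```python
import json
import re


def parse_enrichment(raw: str):
    """Return (lang, sentiment, topic), or None when the reply is unusable."""
    # Tolerate prose or code fences around the JSON object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    try:
        sentiment = float(data["sentiment"])
    except (KeyError, TypeError, ValueError):
        return None
    sentiment = max(-1.0, min(1.0, sentiment))  # clamp to the requested range
    lang = str(data.get("lang", "")).strip().lower() or None
    topic = " ".join(str(data.get("topic", "")).split()[:3]) or None  # keep at most 3 words
    return lang, sentiment, topic
```

Returning `None` instead of raising lets the caller leave the row unenriched and retry it in a later batch.
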
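### Incremental cron

```python
# crontab: 0 */6 * * *
import sys, datetime as dt

vid = sys.argv[1]  # video ID, assumed to be passed on the command line
last = con.execute("SELECT max(ts) FROM comments WHERE source='yt' AND source_id=?", [vid]).fetchone()[0]
# Naive UTC throughout; the first run falls back to a 30-day window.
since = last or dt.datetime.now(dt.timezone.utc).replace(tzinfo=None) - dt.timedelta(days=30)
for c in fetch_comments(vid):
    ts = dt.datetime.fromisoformat(c['ts'].rstrip('Z'))
    # Threads arrive newest first, so stop at the first one already stored.
    # New replies under older threads are not picked up by this early exit.
    if ts <= since:
        break
    upsert([(c['id'], c['parent_id'], 'yt', vid, c['author'], c['text'], ts, c['likes'])])
```
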
### Topic clustering

```python
import umap, hdbscan, numpy as np

texts = [r[0] for r in con.execute("SELECT text FROM comments WHERE topic IS NULL LIMIT 5000").fetchall()]
embeds = []  # placeholder: one vector per text from an embedding API, e.g. voyage-3
proj = umap.UMAP(n_components=10, metric='cosine').fit_transform(np.array(embeds))
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(proj)  # -1 marks noise
```

## Decision criteria

| Situation | Approach |
|---|---|
| ≤ 100 videos / day | YT Data API + key (free quota) |
| Heavy crawl | yt-dlp `--write-comments` (no API quota, but slower) |
| Reddit live monitor | PRAW streaming `subreddit.stream.comments()` |
| Storage | DuckDB (single-node analytics), Postgres (multi-tenant) |
| Enrichment cost | Claude Batch (50% off) over the realtime API |
| Real-time alert | Reddit stream + Slack webhook |

**Default**: YT Data API + DuckDB + Claude Batch enrichment + 6h incremental cron.

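
Whether the free quota is enough can be estimated with simple arithmetic. The sketch below assumes one `commentThreads.list` page costs 1 quota unit and returns up to 100 threads; the function names and the 20% reserve are hypothetical, and extra calls for full reply threads are not counted.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # default quota from the source matrix
UNITS_PER_LIST_CALL = 1     # assumption: one commentThreads.list page costs 1 unit
THREADS_PER_PAGE = 100      # maxResults used in the fetch pattern


def pages_needed(thread_count: int) -> int:
    """Pages required for one video; even an empty video costs one call."""
    return max(1, math.ceil(thread_count / THREADS_PER_PAGE))


def videos_per_day(avg_threads_per_video: int, reserve_pct: int = 20) -> int:
    """Videos that fit in the daily quota, keeping reserve_pct percent as headroom for retries."""
    usable = DAILY_QUOTA_UNITS * (100 - reserve_pct) // 100
    return usable // (pages_needed(avg_threads_per_video) * UNITS_PER_LIST_CALL)
```

Under these assumptions, videos averaging 5,000 comment threads allow 160 videos per day, which suggests switching to yt-dlp only when the crawl clearly exceeds that range.
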
## 🔗 Graph

- Variants: [[my_videos_check]] · [[WebHooks_and_Notifications|telegram_notify]]
- Applications: [[Sentiment-Analysis]]
- Adjacent: [[DuckDB]]

## 🤖 LLM usage

**When**: pipeline scaffold, normalization schema, batch prompt design.
**When not**: ToS / legal review; leave that to a lawyer who knows the specific platform.

## ❌ Anti-patterns

- **No rate-limit handling**: burns the quota, after which requests are blocked until it resets (up to 24 h).
- **Storing raw text without dedupe**: storage balloons and duplicates are enriched, and billed, twice.
- **Realtime LLM per comment**: roughly 2× the cost of batch, given the 50% batch discount.
- **Ignoring deleted-comment lifecycle**: stale data + GDPR violation.
- **API key in code**: use `.env` + a secret manager instead.

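
For the rate-limit anti-pattern, a common remedy is retrying with exponential backoff and jitter. The sketch below is generic and not tied to any SDK: `RateLimited` is a hypothetical exception that a caller would raise when a platform answers with HTTP 429 or a similar short-window limit error, and the delay parameters are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical marker for a short-window rate-limit response (e.g. HTTP 429)."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run call(), retrying on RateLimited with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries so parallel workers do not wake up together.
            sleep(delay + random.uniform(0, delay / 2))
```

This suits per-minute limits such as Reddit's. An exhausted daily quota will not recover within a retry loop, so in that case it is usually better to stop the run and resume after the reset.
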
## 🧪 Verification / duplicates

- Verified (YouTube Data API v3, PRAW 7.7+, Claude Message Batches docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: comment harvest pipeline + LLM enrichment |