---
id: wiki-2026-0508-comment-harvester
title: comment_harvester (YouTube/Reddit Comment Scraper)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Comment Scraper, Comment Pipeline, Social Comment ETL]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [scraping, youtube-api, reddit-api, etl, sentiment, llm-pipeline]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python 3.12 / TypeScript
  framework: yt-dlp + YouTube Data API v3 / PRAW / DuckDB
---

# comment_harvester (YouTube/Reddit Comment Scraper)

## One-liner

> **"A pipeline that fetches comments from YouTube, Reddit, and similar platforms through paginated APIs, normalizes and stores them, then enriches them with batch LLM sentiment/topic extraction."** 2026 standard stack: yt-dlp/YouTube Data API + PRAW + DuckDB + Claude/GPT batch API. Use cases: market research, content idea mining, brand monitoring.

## Core

### Source matrix

| Platform | Auth | Rate limit | Library |
|---|---|---|---|
| YouTube | OAuth/API key | 10k units/day default | google-api-python-client, yt-dlp |
| Reddit | OAuth (PRAW) | 100 req/min | praw, asyncpraw |
| Twitter/X | API tier (paid) | varies | tweepy |
| TikTok | unofficial | volatile | TikTokApi |
| Instagram | private API | very volatile | instagrapi |

### Pipeline stages

1. **Source resolve**: video URL/ID, subreddit, channel.
2. **Fetch**: paginated, with `nextPageToken` / `after`.
3. **Normalize**: `{ id, parentId, author, text, ts, likes, replies, sourceMeta }`.
4. **Dedupe + store**: DuckDB / Postgres.
5. **Enrich**: LLM batch (sentiment, topic, language, toxicity).
6. **Serve**: SQL / Streamlit / API.

### Ethical / legal

- Public comments only. Respect robots.txt and each platform's ToS.
- Redact PII (emails, phone numbers in text).
- GDPR: honor deletion requests.
- Commercial use carries platform-specific restrictions; read the terms carefully.

### Applications

1. Channel-level sentiment trend (per video).
2. Topic clustering (embeddings, e.g. voyage-3, + UMAP + HDBSCAN).
3. Auto-FAQ from a creator's recurring questions.
4. Competitor brand mentions.

## 💻 Patterns

### YouTube Data API v3

```python
import os
from googleapiclient.discovery import build

yt = build('youtube', 'v3', developerKey=os.environ['YT_KEY'])

def fetch_comments(video_id: str, max_pages=100):
    """Yield top-level comments and their inline replies as flat dicts."""
    page_token = None
    for _ in range(max_pages):
        resp = yt.commentThreads().list(
            part='snippet,replies',
            videoId=video_id,
            maxResults=100,
            order='time',  # newest threads first; the incremental sync relies on this
            pageToken=page_token,
            textFormat='plainText',
        ).execute()
        for item in resp['items']:
            top = item['snippet']['topLevelComment']['snippet']
            yield {
                'id': item['id'], 'parent_id': None,
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'ts': top['publishedAt'],
                'likes': top['likeCount'],
            }
            # `replies` carries only a subset of each thread's replies.
            for r in item.get('replies', {}).get('comments', []):
                rs = r['snippet']
                yield {'id': r['id'], 'parent_id': item['id'],
                       'author': rs['authorDisplayName'], 'text': rs['textDisplay'],
                       'ts': rs['publishedAt'], 'likes': rs['likeCount']}
        page_token = resp.get('nextPageToken')
        if not page_token:
            break
```

### Reddit (asyncpraw)

```python
import asyncpraw

async def fetch_subreddit(name: str, limit=200):
    # `async with` closes the HTTP session once the generator is exhausted.
    async with asyncpraw.Reddit(client_id=..., client_secret=...,
                                user_agent='harvester/1.0') as reddit:
        sub = await reddit.subreddit(name)
        async for submission in sub.new(limit=limit):
            # Listing items may arrive without their comment forest; load() fetches it.
            await submission.load()
            await submission.comments.replace_more(limit=0)  # drop "load more" stubs
            for c in submission.comments.list():
                # created_utc is epoch seconds; convert before writing to a TIMESTAMP column.
                yield {'id': c.id, 'parent_id': c.parent_id,
                       'author': str(c.author), 'text': c.body,
                       'ts': c.created_utc, 'likes': c.score,
                       'submission_id': submission.id}
```

### DuckDB sink

```python
import duckdb

con = duckdb.connect('comments.db')
con.execute("""
CREATE TABLE IF NOT EXISTS comments(
  id VARCHAR PRIMARY KEY,
  parent_id VARCHAR,
  source VARCHAR,
  source_id VARCHAR,
  author VARCHAR,
  text VARCHAR,
  ts TIMESTAMP,
  likes INT,
  lang VARCHAR,
  sentiment FLOAT,
  topic VARCHAR
);
""")

def upsert(rows):
    # The primary key on id makes re-fetched comments overwrite instead of duplicate.
    con.executemany(
        "INSERT OR REPLACE INTO comments(id,parent_id,source,source_id,author,text,ts,likes) "
        "VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )
```

### LLM batch enrich (Claude Message Batches)

```python
from anthropic import Anthropic

client = Anthropic()

def submit_enrichment(rows):
    """Submit one Message Batches job; `rows` are dicts with 'id' and 'text'."""
    requests = [{
        # custom_id must satisfy the API's format rules; remap comment IDs if rejected.
        "custom_id": row['id'],
        "params": {
            "model": "claude-opus-4-7",
            "max_tokens": 200,
            "messages": [{"role": "user", "content":
                f"Output JSON {{lang, sentiment(-1..1), topic(<=3 words)}} for: {row['text']}"}],
        },
    } for row in rows]
    return client.messages.batches.create(requests=requests)

# Poll client.messages.batches.retrieve(job.id) until processing_status == "ended",
# then iterate client.messages.batches.results(job.id) and parse each result.
```

### Incremental cron

```python
# crontab: 0 */6 * * *
# Reuses con, fetch_comments, and upsert from the blocks above.
import datetime as dt

def sync_video(vid: str):
    """Pull only comments newer than the latest one stored for this video."""
    last = con.execute(
        "SELECT max(ts) FROM comments WHERE source='yt' AND source_id=?", [vid]
    ).fetchone()[0]
    # Naive UTC throughout, matching the TIMESTAMP column; the first run looks back 30 days.
    now = dt.datetime.now(dt.timezone.utc).replace(tzinfo=None)
    since = last or now - dt.timedelta(days=30)
    for c in fetch_comments(vid):
        if dt.datetime.fromisoformat(c['ts'].rstrip('Z')) <= since:
            # Threads arrive newest first, so the rest is already stored.
            # Late replies on older threads are not picked up by this cutoff.
            break
        upsert([(c['id'], c['parent_id'], 'yt', vid,
                 c['author'], c['text'], c['ts'], c['likes'])])
```

### Topic clustering

```python
import numpy as np
import umap, hdbscan

texts = [r[0] for r in con.execute(
    "SELECT text FROM comments WHERE topic IS NULL LIMIT 5000").fetchall()]

# One vector per text, from an embedding API such as voyage-3; left unfilled here.
embeds = []

if embeds:  # UMAP cannot fit an empty array
    proj = umap.UMAP(n_components=10, metric='cosine').fit_transform(np.array(embeds))
    labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(proj)  # -1 marks noise
```

## Decision criteria

| Situation | Approach |
|---|---|
| ≤ 100 videos / day | YT Data API + key (free quota) |
| Heavy crawl | yt-dlp `--write-comments` (no API quota, but slower) |
| Reddit live monitor | PRAW streaming `subreddit.stream.comments()` |
| Storage | DuckDB (single-node analytics), Postgres (multi-tenant) |
| Enrichment cost | Claude Batch (50% off) over the realtime API |
| Real-time alert | Reddit stream + Slack webhook |

**Default**: YT Data API + DuckDB + Claude Batch enrichment + 6h incremental cron.

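The first two rows of the decision table can be turned into a rough quota check. The sketch below is illustrative only: `pages_needed`, `choose_fetcher`, and the 20% `reserve` are hypothetical names and values, and the cost of one unit per `commentThreads.list` call is an assumption to confirm against the current quota documentation.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # default daily quota from the source matrix
UNITS_PER_LIST_CALL = 1     # assumption: one commentThreads.list call costs 1 unit
COMMENTS_PER_PAGE = 100     # matches maxResults=100 in fetch_comments


def pages_needed(thread_count: int) -> int:
    """List calls needed to page through a video's top-level threads."""
    return max(1, math.ceil(thread_count / COMMENTS_PER_PAGE))


def choose_fetcher(thread_counts: list[int], reserve: float = 0.2) -> str:
    """Pick the Data API if the crawl fits the quota minus a reserve, else yt-dlp."""
    budget = DAILY_QUOTA_UNITS * (1 - reserve)
    cost = sum(pages_needed(n) for n in thread_counts) * UNITS_PER_LIST_CALL
    return "yt-data-api" if cost <= budget else "yt-dlp"


if __name__ == "__main__":
    # 100 videos with about 500 threads each: 500 list calls in total.
    print(choose_fetcher([500] * 100))
    # 10 videos with about 100,000 threads each: 10,000 list calls in total.
    print(choose_fetcher([100_000] * 10))
```

The reserve leaves headroom for retries and for any other endpoints that share the same quota, so a crawl near the limit falls back to yt-dlp rather than stalling mid-run.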
## 🔗 Graph

- Variants: [[my_videos_check]] · [[WebHooks_and_Notifications|telegram_notify]]
- Applications: [[Sentiment-Analysis]]
- Adjacent: [[DuckDB]]

## 🤖 LLM usage

**When**: pipeline scaffold, normalization schema, batch prompt design.

**When not**: ToS / legal review; have a lawyer familiar with the specific platform read the terms.

## ❌ Anti-patterns

- **No rate-limit handling**: burns through the quota and can lock the pipeline out until the daily reset.
- **Storing raw text without dedupe**: storage balloons and every duplicate is enriched (and billed) twice.
- **Realtime LLM per comment**: roughly 2× the cost of batch, given the 50% batch discount.
- **Ignoring deleted-comment lifecycle**: stale data and a potential GDPR violation.
- **API key in code**: keep keys in `.env` plus a secret manager.

## 🧪 Verification / duplicates

- Verified (YouTube Data API v3, PRAW 7.7+, Claude Message Batches docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: comment harvest pipeline + LLM enrichment |