---
id: wiki-2026-0508-comment-harvester
title: comment_harvester (YouTube/Reddit Comment Scraper)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Comment Scraper, Comment Pipeline, Social Comment ETL]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [scraping, youtube-api, reddit-api, etl, sentiment, llm-pipeline]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python 3.12 / TypeScript
  framework: yt-dlp + YouTube Data API v3 / PRAW / DuckDB
---
# comment_harvester (YouTube/Reddit Comment Scraper)
## One-line summary
> **"A pipeline that fetches comments from YouTube, Reddit, and similar platforms through paginated APIs, normalizes and stores them, then enriches them with batch LLM sentiment/topic extraction."** Standard 2026 stack: yt-dlp / YouTube Data API + PRAW + DuckDB + Claude/GPT batch API. Use cases: market research, content idea mining, brand monitoring.
## Core
### Source matrix
| Platform | Auth | Rate limit | Library |
|---|---|---|---|
| YouTube | OAuth/API key | 10k units/day default | google-api-python-client, yt-dlp |
| Reddit | OAuth (PRAW) | 100 req/min | praw, asyncpraw |
| Twitter/X | API tier (paid) | varies | tweepy |
| TikTok | unofficial | volatile | TikTokApi |
| Instagram | private API | very volatile | instagrapi |
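To make the YouTube row concrete, here is a rough budget calculation. It assumes each `commentThreads.list` call costs 1 quota unit and returns at most 100 threads (the figures in the YouTube quota documentation at the time of writing; check the current values). The function name is illustrative.

```python
def max_threads_per_day(daily_units: int = 10_000,
                        units_per_call: int = 1,
                        threads_per_call: int = 100) -> int:
    """Upper bound on top-level comment threads fetchable per day."""
    calls = daily_units // units_per_call
    return calls * threads_per_call


print(max_threads_per_day())  # 1000000
```

Other API calls made with the same key draw on the same daily quota, so the practical ceiling is lower.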
### Pipeline stages
1. **Source resolve**: video URL/ID, subreddit, channel.
2. **Fetch**: paginated, with `nextPageToken` / `after`.
3. **Normalize**: `{ id, parentId, author, text, ts, likes, replies, sourceMeta }`.
4. **Dedupe + store**: DuckDB / Postgres.
5. **Enrich**: LLM batch (sentiment, topic, language, toxicity).
6. **Serve**: SQL / Streamlit / API.
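The normalize step can be sketched as follows. This is a minimal sketch, not the pipeline's actual code: it assumes raw records shaped like the fetcher outputs in the Patterns section (`id`, `parent_id`, `author`, `text`, `ts`, `likes`), and the source-prefixed IDs are an illustrative choice to avoid cross-platform collisions. `to_utc` and `normalize` are hypothetical names.

```python
from datetime import datetime, timezone


def to_utc(ts) -> datetime:
    """Accept an ISO-8601 string (YouTube) or epoch seconds (Reddit)."""
    if isinstance(ts, (int, float)):
        return datetime.fromtimestamp(ts, tz=timezone.utc)
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def normalize(raw: dict, source: str) -> dict:
    """Map a raw fetcher record onto the common comment schema."""
    parent = raw.get("parent_id")
    return {
        "id": f"{source}:{raw['id']}",
        "parentId": f"{source}:{parent}" if parent else None,
        "author": raw.get("author"),
        "text": (raw.get("text") or "").strip(),
        "ts": to_utc(raw["ts"]),
        "likes": int(raw.get("likes") or 0),
        "replies": int(raw.get("replies") or 0),
        "sourceMeta": {"source": source},
    }
```

Converting timestamps here means the store only ever sees one time representation, whichever platform the comment came from.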
### Ethical / legal
- Public comments only. Respect robots.txt and ToS.
- Redact PII (emails and phone numbers in text).
- GDPR: honor deletion requests.
- Commercial use is subject to platform-specific restrictions; read each platform's terms carefully.
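The PII redaction point above could start from a regex pass like the one below. The patterns are deliberately simple assumptions: they will miss obfuscated addresses and can mistake long digit runs for phone numbers, so treat this as a first filter rather than a compliance guarantee. `redact_pii` is a hypothetical helper name.

```python
import re

# Simplified patterns for illustration; tune them for the locales you harvest.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)
```

Running the redaction before storage, rather than at query time, keeps raw PII out of the database and out of any LLM prompt built from it.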
### Applications
1. Channel-level sentiment trend (per video).
2. Topic clustering (text embeddings, e.g. voyage-3, + UMAP + HDBSCAN).
3. Auto-FAQ from creator's recurring questions.
4. Competitor brand mention.
## 💻 Patterns
### YouTube Data API v3
```python
import os

from googleapiclient.discovery import build

yt = build('youtube', 'v3', developerKey=os.environ['YT_KEY'])

def fetch_comments(video_id: str, max_pages=100):
    """Yield top-level comments and their inline replies, newest thread first."""
    page_token = None
    for _ in range(max_pages):
        resp = yt.commentThreads().list(
            part='snippet,replies', videoId=video_id,
            maxResults=100, pageToken=page_token, textFormat='plainText',
        ).execute()
        for item in resp['items']:
            top = item['snippet']['topLevelComment']['snippet']
            yield {
                'id': item['id'],
                'parent_id': None,
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'ts': top['publishedAt'],
                'likes': top['likeCount'],
            }
            # Only the replies embedded in the thread resource are covered here
            for r in item.get('replies', {}).get('comments', []):
                rs = r['snippet']
                yield {'id': r['id'], 'parent_id': item['id'],
                       'author': rs['authorDisplayName'], 'text': rs['textDisplay'],
                       'ts': rs['publishedAt'], 'likes': rs['likeCount']}
        # Follow pagination until the API stops returning a token
        page_token = resp.get('nextPageToken')
        if not page_token:
            break
```
### Reddit (asyncpraw)
```python
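import asyncpraw

async def fetch_subreddit(name: str, limit=200):
    """Yield every comment from the newest `limit` submissions of a subreddit."""
    # Replace the ... placeholders with credentials loaded from the environment
    reddit = asyncpraw.Reddit(client_id=..., client_secret=..., user_agent='harvester/1.0')
    try:
        sub = await reddit.subreddit(name)
        async for submission in sub.new(limit=limit):
            # limit=0 drops the "load more comments" stubs instead of fetching them
            await submission.comments.replace_more(limit=0)
            # Some asyncpraw versions may require awaiting list(); check the installed version
            for c in submission.comments.list():
                yield {'id': c.id, 'parent_id': c.parent_id,
                       'author': str(c.author) if c.author else None,  # None for deleted accounts
                       'text': c.body, 'ts': c.created_utc, 'likes': c.score,
                       'submission_id': submission.id}
    finally:
        await reddit.close()  # release the underlying HTTP session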
```
### DuckDB sink
```python
import duckdb

con = duckdb.connect('comments.db')
con.execute("""
    CREATE TABLE IF NOT EXISTS comments(
        id VARCHAR PRIMARY KEY, parent_id VARCHAR, source VARCHAR,
        source_id VARCHAR, author VARCHAR, text VARCHAR,
        ts TIMESTAMP, likes INT, lang VARCHAR, sentiment FLOAT, topic VARCHAR
    );
""")

def upsert(rows):
    """Insert rows, replacing any existing row with the same id."""
    if not rows:
        return  # nothing to write
    con.executemany(
        "INSERT OR REPLACE INTO comments(id,parent_id,source,source_id,author,text,ts,likes) VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )
```
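The primary key only catches repeated IDs. For the "dedupe" half of step 4, one option is to also drop reposted text before it reaches the sink. The sketch below is an assumption about what "duplicate" should mean (same author and same whitespace-normalized, lowercased text); `content_key` and `dedupe` are hypothetical helpers.

```python
import hashlib


def content_key(row: dict) -> str:
    """Hash of author + whitespace-normalized, lowercased text."""
    text = " ".join(row["text"].lower().split())
    raw = f"{row.get('author')}\x1f{text}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dedupe(rows):
    """Yield each row once, checking the id first and the content hash second."""
    seen_ids, seen_content = set(), set()
    for row in rows:
        key = content_key(row)
        if row["id"] in seen_ids or key in seen_content:
            continue
        seen_ids.add(row["id"])
        seen_content.add(key)
        yield row
```

Because the enrichment step is billed per comment, filtering here also avoids paying to classify the same text twice.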
### LLM batch enrich (Claude Message Batches)
```python
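from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def submit_enrich_batch(rows):
    """Submit one batch request per comment; rows are dicts with 'id' and 'text'."""
    requests = [{
        "custom_id": row['id'],
        "params": {
            "model": "claude-opus-4-7",
            "max_tokens": 200,
            "messages": [{"role": "user", "content":
                f"Output JSON {{lang, sentiment(-1..1), topic(<=3 words)}} for: {row['text']}"}],
        },
    } for row in rows]
    return client.messages.batches.create(requests=requests)

# Poll client.messages.batches.retrieve(batch.id) until processing_status == "ended",
# then read and parse client.messages.batches.results(batch.id)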
```
### Incremental cron
```python
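# crontab: 0 */6 * * *  (every 6 hours)
# Assumes `con`, `upsert`, and `fetch_comments` from the blocks above are in scope.
import sys, datetime as dt

vid = sys.argv[1]  # assumption: the video ID is passed on the command line
last = con.execute("SELECT max(ts) FROM comments WHERE source='yt' AND source_id=?", [vid]).fetchone()[0]
# Naive UTC, to match the TIMESTAMP column; fall back to a 30-day window on first run
since = last or dt.datetime.now(dt.timezone.utc).replace(tzinfo=None) - dt.timedelta(days=30)
for c in fetch_comments(vid):
    # Threads arrive newest-first, so stop at the first comment already stored.
    # New replies on older threads are not picked up by this shortcut.
    if dt.datetime.fromisoformat(c['ts'].rstrip('Z')) <= since:
        break
    upsert([(c['id'], c['parent_id'], 'yt', vid, c['author'], c['text'], c['ts'], c['likes'])])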
```
### Topic clustering
```python
import hdbscan
import numpy as np
import umap

# Assumes `con` from the DuckDB sink block above
texts = [r[0] for r in con.execute("SELECT text FROM comments WHERE topic IS NULL LIMIT 5000").fetchall()]
# Placeholder: fill with one vector per text from an embedding API (e.g. voyage-3)
embeds = []
# Reduce to 10 dimensions with cosine distance before density clustering
proj = umap.UMAP(n_components=10, metric='cosine').fit_transform(np.array(embeds))
# Label -1 marks noise points that belong to no cluster
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(proj)
```
## Decision criteria
| Situation | Approach |
|---|---|
| ≤ 100 videos / day | YT Data API + key (free quota) |
| Heavy crawl | yt-dlp `--write-comments` (no API quota, but slower) |
| Reddit live monitor | PRAW streaming `subreddit.stream.comments()` |
| Storage | DuckDB (single-node analytics), Postgres (multi-tenant) |
| Enrichment cost | Claude Batch (50% off) > realtime API |
| Real-time alert | Reddit stream + Slack webhook |
**Default**: YT Data API + DuckDB + Claude Batch enrichment + 6h incremental cron.
## 🔗 Graph
- Variants: [[my_videos_check]] · [[WebHooks_and_Notifications|telegram_notify]]
- Applications: [[Sentiment-Analysis]]
- Adjacent: [[DuckDB]]
## 🤖 LLM usage
**When**: pipeline scaffold, normalization schema, batch prompt design.
**When not**: ToS / legal review; have a lawyer who knows the specific platform read the terms.
## ❌ Anti-patterns
- **No rate-limit handling**: burns the quota and can get you blocked for 24h.
- **Storing raw text without dedupe**: storage explodes and duplicates are enriched (and billed) twice.
- **Realtime LLM call per comment**: costs roughly 2× batch pricing, given the 50% batch discount.
- **Ignoring deleted-comment lifecycle**: stale data plus a GDPR violation.
- **API key in code**: use `.env` plus a secret manager.
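For the rate-limit anti-pattern above, a generic retry wrapper with capped exponential backoff and jitter is a common starting point. This is a minimal sketch under assumptions: `with_backoff` is a hypothetical helper, and which exceptions count as retryable depends on the client library you wrap.

```python
import random
import time


def with_backoff(call, retries: int = 5, base: float = 1.0, cap: float = 60.0,
                 retry_on: tuple = (Exception,), sleep=time.sleep):
    """Run call(); on a retryable error wait base * 2**attempt (capped, jittered) and retry."""
    for attempt in range(retries):
        try:
            return call()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, base * 2 ** attempt)
            # Jitter spreads out retries from workers that failed at the same time
            sleep(delay * random.uniform(0.5, 1.0))
```

With the YouTube client, for example, the request could be wrapped as `with_backoff(lambda: request.execute(), retry_on=(HttpError,))` using `HttpError` from `googleapiclient.errors`. An exhausted daily quota will not recover within seconds, so that case is better handled by stopping the run than by retrying.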
## 🧪 Verification / duplicates
- Verified (YouTube Data API v3, PRAW 7.7+, Claude Message Batches docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — comment harvest pipeline + LLM enrichment |