2nd/10_Wiki/Topics/Backend/comment_harvester.md
2026-05-10 22:08:15 +09:00


id: wiki-2026-0508-comment-harvester
title: comment_harvester (YouTube/Reddit Comment Scraper)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Comment Scraper, Comment Pipeline, Social Comment ETL
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: scraping, youtube-api, reddit-api, etl, sentiment, llm-pipeline
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python 3.12 / TypeScript
  framework: yt-dlp + YouTube Data API v3 / PRAW / DuckDB

comment_harvester (YouTube/Reddit Comment Scraper)

One-line summary

"A pipeline that fetches comments from YouTube, Reddit, and similar platforms through paginated APIs, normalizes and stores them, then enriches them with batched LLM sentiment and topic extraction." 2026 standard stack: yt-dlp or YouTube Data API v3 + PRAW + DuckDB + Claude/GPT batch API. Use cases: market research, content idea mining, brand monitoring.

Core concepts

Source matrix

| Platform | Auth | Rate limit | Library |
| --- | --- | --- | --- |
| YouTube | OAuth / API key | 10k units/day (default) | google-api-python-client, yt-dlp |
| Reddit | OAuth (PRAW) | 100 req/min | praw, asyncpraw |
| Twitter/X | API tier (paid) | varies | tweepy |
| TikTok | unofficial | volatile | TikTokApi |
| Instagram | private API | very volatile | instagrapi |
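
The per-minute limits in the matrix are easiest to respect by pacing requests on the client side. The following is a minimal token-bucket sketch, assuming the 100 req/min Reddit figure from the table; the class name `TokenBucket` and its injectable `clock` and `sleep` parameters are illustrative, not part of any library.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` acquisitions per `per` seconds, refilling continuously."""

    def __init__(self, rate: int, per: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.capacity = float(rate)
        self.refill_per_sec = rate / per
        self.tokens = float(rate)          # start with a full bucket
        self.clock, self.sleep = clock, sleep
        self.last = clock()

    def acquire(self) -> float:
        """Block until one token is available; return the seconds waited."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        waited = 0.0
        if self.tokens < 1.0:
            # Sleep exactly long enough for the missing fraction of a token to refill.
            waited = (1.0 - self.tokens) / self.refill_per_sec
            self.sleep(waited)
            self.last = self.clock()
            self.tokens = 1.0
        self.tokens -= 1.0
        return waited
```

Calling `bucket.acquire()` before every API request keeps a burst of 100 calls free and then spaces later calls about 0.6 s apart.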

Pipeline stages

  1. Source resolve: video URL/ID, subreddit, channel.
  2. Fetch: paginated, following nextPageToken (YouTube) or after (Reddit).
  3. Normalize: { id, parentId, author, text, ts, likes, replies, sourceMeta }.
  4. Dedupe + store: DuckDB / Postgres.
  5. Enrich: LLM batch (sentiment, topic, language, toxicity).
  6. Serve: SQL / Streamlit / API.
  • Public comments only. Respect robots.txt and each platform's ToS.
  • Redact PII (emails, phone numbers in text).
  • GDPR: honor deletion requests.
  • Commercial use carries platform-specific restrictions; read the terms carefully.
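
The normalize stage and the PII rule above can be sketched together. This is a minimal sketch, assuming raw records arrive as dicts with `id`, `parent_id`, `author`, `text`, `ts`, and `likes` keys. The `Comment` dataclass, the `normalize` and `redact_pii` helpers, the regex patterns, and the 9-digit phone threshold are all illustrative assumptions, and the patterns are deliberately crude.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Candidate phone numbers: digit runs with common separators.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def _mask_phone(match: "re.Match[str]") -> str:
    # Require 9+ digits so short runs such as dates (2026-05-01) are left alone.
    digits = sum(ch.isdigit() for ch in match.group())
    return "[phone]" if digits >= 9 else match.group()


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and likely phone numbers with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub(_mask_phone, text)


@dataclass
class Comment:
    """Source-independent comment record, mirroring the schema in stage 3."""
    id: str
    parentId: Optional[str]
    author: str
    text: str
    ts: str
    likes: int = 0
    replies: int = 0
    sourceMeta: dict[str, Any] = field(default_factory=dict)


def normalize(raw: dict[str, Any], source: str) -> Comment:
    """Map one raw record onto the shared schema, redacting PII on the way in."""
    return Comment(
        id=str(raw["id"]),
        parentId=raw.get("parent_id"),
        author=raw.get("author") or "[deleted]",
        text=redact_pii(raw.get("text", "")),
        ts=str(raw["ts"]),
        likes=int(raw.get("likes") or 0),
        replies=int(raw.get("replies") or 0),
        sourceMeta={"source": source},
    )
```

Redacting before storage means the unredacted text never reaches the database or the enrichment prompts.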

Applications

  1. Channel-level sentiment trend (per video).
  2. Topic clustering (text embeddings, e.g. voyage-3, + UMAP + HDBSCAN).
  3. Auto-FAQ from questions that recur in a creator's comments.
  4. Competitor brand mention tracking.
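
The first application reduces to a group-by over enriched rows. Below is a minimal sketch, assuming each row is a `(source_id, ts, sentiment)` tuple with ISO-8601 timestamps in one consistent format and `None` for comments not yet enriched. The function name `sentiment_trend` is illustrative.

```python
from collections import defaultdict
from statistics import fmean
from typing import Iterable, Optional


def sentiment_trend(
    rows: Iterable[tuple[str, str, Optional[float]]],
) -> list[tuple[str, float, int]]:
    """Return (video_id, mean sentiment, n) per video, oldest video first."""
    scores: dict[str, list[float]] = defaultdict(list)
    first_seen: dict[str, str] = {}
    for video_id, ts, sentiment in rows:
        if sentiment is None:          # not enriched yet
            continue
        scores[video_id].append(sentiment)
        # Same-format ISO-8601 strings sort chronologically as plain strings.
        first_seen[video_id] = min(ts, first_seen.get(video_id, ts))
    ordered = sorted(scores, key=first_seen.__getitem__)
    return [(vid, round(fmean(scores[vid]), 3), len(scores[vid])) for vid in ordered]
```

Ordering by each video's earliest comment is a stand-in for publish date, which this schema does not store.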

💻 Patterns

YouTube Data API v3

import os
from typing import Iterator

from googleapiclient.discovery import build

# The API key is read from the environment, never hard-coded.
yt = build('youtube', 'v3', developerKey=os.environ['YT_KEY'])

def fetch_comments(video_id: str, max_pages: int = 100) -> Iterator[dict]:
    """Yield top-level comments and their inline replies, newest thread first."""
    page_token = None
    for _ in range(max_pages):
        # One call returns up to 100 threads; order='time' is newest first.
        resp = yt.commentThreads().list(
            part='snippet,replies', videoId=video_id, order='time',
            maxResults=100, pageToken=page_token, textFormat='plainText',
        ).execute()
        for item in resp['items']:
            top = item['snippet']['topLevelComment']['snippet']
            yield {
                'id': item['id'],
                'parent_id': None,
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'ts': top['publishedAt'],
                'likes': top['likeCount'],
            }
            # `replies` may hold only a subset of a long thread; use
            # comments.list(parentId=...) when every reply is needed.
            for r in item.get('replies', {}).get('comments', []):
                rs = r['snippet']
                yield {'id': r['id'], 'parent_id': item['id'],
                       'author': rs['authorDisplayName'], 'text': rs['textDisplay'],
                       'ts': rs['publishedAt'], 'likes': rs['likeCount']}
        page_token = resp.get('nextPageToken')
        if not page_token:
            break
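
Paginated fetches like the one above fail intermittently, so the network call is usually wrapped in a retry. This is a minimal sketch of exponential backoff with full jitter. The helper name `with_backoff` and its parameters are illustrative, and which errors count as retryable is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T],
                 is_retryable: Callable[[Exception], bool],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying retryable failures with exponentially growing jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying cannot fix.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            # Full jitter: wait a random time in [0, base_delay * 2**attempt].
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

For the YouTube client, one plausible setup wraps the `.execute()` call and treats transient statuses (such as 500 or 503 on `HttpError.resp.status`) as retryable. An exhausted daily quota should not be retried, because it will not recover until the quota resets.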

Reddit (asyncpraw)

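import os
from datetime import datetime, timezone

import asyncpraw

async def fetch_subreddit(name: str, limit: int = 200):
    # Environment variable names are placeholders; supply your own credentials.
    async with asyncpraw.Reddit(
        client_id=os.environ['REDDIT_CLIENT_ID'],
        client_secret=os.environ['REDDIT_CLIENT_SECRET'],
        user_agent='harvester/1.0',
    ) as reddit:  # the context manager closes the HTTP session
        sub = await reddit.subreddit(name)
        async for submission in sub.new(limit=limit):
            await submission.load()  # assumed: listing items arrive without their comment tree
            await submission.comments.replace_more(limit=0)  # drop "load more" stubs
            for c in submission.comments.list():
                # parent_id is a fullname: "t3_<id>" (submission) or "t1_<id>" (comment).
                is_top = c.parent_id.startswith('t3_')
                yield {'id': c.id, 'parent_id': None if is_top else c.parent_id[3:],
                       'author': str(c.author), 'text': c.body,
                       'ts': datetime.fromtimestamp(c.created_utc, tz=timezone.utc),
                       'likes': c.score, 'submission_id': submission.id}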

DuckDB sink

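import duckdb

con = duckdb.connect('comments.db')
# id alone is the key, so ids are assumed not to collide across sources.
con.execute("""
CREATE TABLE IF NOT EXISTS comments(
  id VARCHAR PRIMARY KEY, parent_id VARCHAR, source VARCHAR,
  source_id VARCHAR, author VARCHAR, text VARCHAR,
  ts TIMESTAMP, likes INT, lang VARCHAR, sentiment FLOAT, topic VARCHAR
);
""")

def upsert(rows):
    """Insert new comments; on re-harvest refresh text and likes, keep enrichment columns."""
    if not rows:  # nothing fetched this run
        return
    con.executemany(
        """INSERT INTO comments(id,parent_id,source,source_id,author,text,ts,likes)
           VALUES (?,?,?,?,?,?,?,?)
           ON CONFLICT (id) DO UPDATE SET text = excluded.text, likes = excluded.likes""",
        rows,
    )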
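
Primary-key upserts remove repeated ids, but identical texts under different ids ("first!", "great video") would still each be sent for enrichment. A content hash can flag those repeats. This is a minimal sketch, assuming rows are dicts with `source` and `text` keys; `content_key` and `split_fresh` are illustrative names.

```python
import hashlib
import unicodedata
from typing import Iterable


def content_key(source: str, text: str) -> str:
    """Hash of the source plus case-folded, whitespace-collapsed text."""
    norm = unicodedata.normalize("NFKC", text).casefold()
    norm = " ".join(norm.split())
    # \x1f (unit separator) keeps "a" + "bc" distinct from "ab" + "c".
    return hashlib.sha256(f"{source}\x1f{norm}".encode("utf-8")).hexdigest()


def split_fresh(rows: Iterable[dict], seen: set[str]) -> tuple[list[dict], list[dict]]:
    """Split rows into texts not seen before and repeats; `seen` is updated in place."""
    fresh, repeats = [], []
    for row in rows:
        key = content_key(row["source"], row["text"])
        if key in seen:
            repeats.append(row)
        else:
            seen.add(key)
            fresh.append(row)
    return fresh, repeats
```

Only the `fresh` rows need an LLM call. The repeats can still be stored under their own ids and inherit the enrichment of the matching text.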

LLM batch enrich (Claude Message Batches)

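import time

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# rows: comment dicts awaiting enrichment (e.g. selected WHERE sentiment IS NULL).
# custom_id is positional because raw ids (YouTube reply ids contain '.') may be
# rejected by the API's custom_id character rules.
requests = [{
    "custom_id": f"c{i}",
    "params": {
        "model": "claude-opus-4-7",
        "max_tokens": 200,
        "messages": [{"role": "user", "content":
            f"Output JSON {{lang, sentiment(-1..1), topic(<=3 words)}} for: {row['text']}"}],
    },
} for i, row in enumerate(rows)]

batch = client.messages.batches.create(requests=requests)

# Poll until processing ends, then map each result back to its comment.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        row = rows[int(entry.custom_id[1:])]
        raw_json = entry.result.message.content[0].text  # parse, then UPDATE comments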
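
Model replies are not guaranteed to be clean JSON, so validating them before writing to the store is a reasonable precaution. This is a minimal sketch, assuming the reply text contains one JSON object with the `lang`, `sentiment`, and `topic` fields requested in the prompt above. The function name `parse_enrichment` is illustrative.

```python
import json
from typing import Any, Optional


def parse_enrichment(raw: str) -> Optional[dict[str, Any]]:
    """Extract and validate {lang, sentiment, topic}; return None if unusable."""
    # Tolerate prose or code fences around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
        sentiment = float(data["sentiment"])
    except (ValueError, TypeError, KeyError):
        return None
    if sentiment != sentiment:  # NaN never equals itself
        return None
    lang, topic = data.get("lang"), data.get("topic")
    return {
        "lang": lang if isinstance(lang, str) and lang else None,
        # Clamp to the -1..1 range requested in the prompt.
        "sentiment": max(-1.0, min(1.0, sentiment)),
        # Enforce the three-word topic limit.
        "topic": " ".join(topic.split()[:3]) if isinstance(topic, str) and topic.strip() else None,
    }
```

Rows for which this returns `None` can stay un-enriched and be retried in the next batch.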

Incremental cron

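# crontab: 0 */6 * * *
import sys, datetime as dt

vid = sys.argv[1]  # video id from the command line (assumed invocation)
last = con.execute("SELECT max(ts) FROM comments WHERE source='yt' AND source_id=?", [vid]).fetchone()[0]
# DuckDB returns naive datetimes, so the comparison stays naive UTC throughout.
since = last or dt.datetime.now(dt.timezone.utc).replace(tzinfo=None) - dt.timedelta(days=30)
fresh = []
for c in fetch_comments(vid):  # newest first, so stop at the first already-stored comment
    ts = dt.datetime.fromisoformat(c['ts'].rstrip('Z'))
    if ts <= since: break
    fresh.append((c['id'], c['parent_id'], 'yt', vid, c['author'], c['text'], ts, c['likes']))
upsert(fresh)  # note: new replies on older threads are missed by this early exit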

Topic clustering

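import numpy as np
import umap, hdbscan
import voyageai  # assumption: Voyage embeddings, following the voyage-3 hint

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

texts = [r[0] for r in con.execute("SELECT text FROM comments WHERE topic IS NULL LIMIT 5000").fetchall()]
embeds = []
for i in range(0, len(texts), 128):  # chunk to stay under per-request input limits
    embeds.extend(vo.embed(texts[i:i + 128], model="voyage-3", input_type="document").embeddings)
proj = umap.UMAP(n_components=10, metric='cosine').fit_transform(np.array(embeds))
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(proj)  # label -1 marks noise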
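
Cluster ids are not readable topics by themselves. One inexpensive way to name them is to pick the terms most concentrated in each cluster. This is a minimal sketch, assuming `texts` and `labels` are parallel sequences and that noise points carry label -1, as HDBSCAN assigns. The tokenizer targets English text, and the scoring rule is an illustrative heuristic rather than a standard method.

```python
import re
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def label_clusters(texts: Sequence[str], labels: Sequence[Hashable],
                   top_k: int = 3) -> dict[Hashable, list[str]]:
    """Return the top_k most cluster-specific terms for each non-noise cluster."""
    doc_freq: Counter = Counter()
    per_cluster: dict[Hashable, Counter] = defaultdict(Counter)
    for text, label in zip(texts, labels):
        if label == -1:                      # skip HDBSCAN noise points
            continue
        tokens = set(re.findall(r"[a-z0-9']{3,}", text.lower()))
        doc_freq.update(tokens)              # documents containing the term, all clusters
        per_cluster[label].update(tokens)    # documents containing the term, this cluster
    named = {}
    for label, counts in per_cluster.items():
        # Score = in-cluster count weighted by the share of the term's
        # occurrences that fall inside this cluster; ties break alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t] * counts[t] / doc_freq[t], t))
        named[label] = ranked[:top_k]
    return named
```

The resulting term lists can be written to the `topic` column as they are or passed to an LLM for a more readable label.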

Decision criteria

| Situation | Approach |
| --- | --- |
| ≤ 100 videos / day | YT Data API + key (free quota) |
| Heavy crawl | yt-dlp --write-comments (no API quota, but slower) |
| Reddit live monitor | PRAW streaming: subreddit.stream.comments() |
| Storage | DuckDB (single-node analytics), Postgres (multi-tenant) |
| Enrichment cost | Claude Batch (50% off) preferred over realtime API |
| Real-time alert | Reddit stream + Slack webhook |

Default: YT Data API + DuckDB + Claude Batch enrichment + 6h incremental cron.
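
The enrichment-cost row comes down to simple arithmetic. This is a minimal sketch, assuming the 50% batch discount quoted in the table. The function name `enrichment_cost`, the token counts, and the per-million-token prices in the usage line are placeholders, not quoted rates.

```python
def enrichment_cost(n_comments: int, in_tokens: int, out_tokens: int,
                    usd_per_mtok_in: float, usd_per_mtok_out: float,
                    batch_discount: float = 0.5) -> dict[str, float]:
    """Estimate realtime vs batch spend for enriching n_comments comments."""
    per_comment = in_tokens * usd_per_mtok_in + out_tokens * usd_per_mtok_out
    realtime = n_comments * per_comment / 1_000_000
    return {"realtime_usd": realtime, "batch_usd": realtime * (1 - batch_discount)}


# Placeholder figures: 100k comments, 150 input + 40 output tokens each.
estimate = enrichment_cost(100_000, 150, 40, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
```

With these placeholder numbers the realtime route comes to 105 USD and the batch route to 52.50 USD; substitute the current price sheet before relying on the figures.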

🔗 Graph

🤖 LLM usage

When to use: pipeline scaffold, normalization schema, batch prompt design. When not to: ToS / legal review; have a lawyer who knows each platform read the terms.

Anti-patterns

  • No rate-limit handling: burns the quota and locks the job out until the next reset (up to 24h).
  • Storing raw text without dedupe: storage explodes and comments get enriched (and billed) twice.
  • Realtime LLM call per comment: roughly 2× the batch price, since batch is billed at 50% off.
  • Ignoring the deleted-comment lifecycle: stale data plus a GDPR violation.
  • API key in code: use .env + a secret manager instead.
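
The deleted-comment anti-pattern is typically handled by comparing stored ids against a fresh full scan. This is a minimal sketch, assuming both id sets are already in memory; the function name `ids_to_purge` and the `scan_complete` flag are illustrative.

```python
def ids_to_purge(stored_ids: set[str], live_ids: set[str], scan_complete: bool) -> set[str]:
    """Return stored comment ids that no longer exist upstream.

    A partial scan (quota exhausted, early exit, page cap reached) proves
    nothing about absence, so it must never trigger deletions.
    """
    if not scan_complete:
        return set()
    return stored_ids - live_ids
```

The returned ids can then feed a `DELETE FROM comments WHERE id IN (...)` statement. The 6h incremental harvest exits early by design, so this check belongs in a separate, less frequent full pass.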

🧪 Verification / duplicates

  • Verified (YouTube Data API v3, PRAW 7.7+, Claude Message Batches docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: comment harvest pipeline + LLM enrichment |