2nd/10_Wiki/Topics/AI_and_ML/Research.md
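koriweb d8a80f6272 chore(wiki): normalize dangling links to canonical titles (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs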

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


---
id: wiki-2026-0508-research
title: Research
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Research-Methodology, Literature-Review, Deep-Research]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [research, methodology, literature, ai-aided]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: {language: python, framework: anthropic-sdk}
---

Research

In one line

"Every answer has already been reformulated by someone." Research is a disciplined loop of question → literature → synthesis → novel contribution. In 2026, AI-aided synthesis (Claude Opus 4.7 deep research, GPT-5 with browsing, Elicit, Consensus, undermind.ai) cuts weeks of work down to hours.

Core

Phases

  1. Question framing — vague curiosity → specific testable question (PICO, FINER criteria).
  2. Literature scoping — keywords, citation graph (forward/backward), Connected Papers / Litmaps.
  3. Reading & extraction — structured notes (Zettelkasten, claim-evidence-source).
  4. Synthesis — themes, gaps, contradictions.
  5. Hypothesis / contribution — what novel claim this work adds.
  6. Validation — experiment / proof / case study.
  7. Communication — paper, blog, talk.

Modern toolchain (2026)

  • Search: Semantic Scholar API, Google Scholar, OpenAlex.
  • Discovery: Connected Papers, Litmaps, Inciteful (citation graph viz).
  • AI synthesis: Claude Opus 4.7 deep-research mode, GPT-5 deep research, Elicit (extracts data per paper), Consensus (claim-level), undermind.ai (deep retrieval).
  • Notes: Obsidian + Zotero integration; Logseq; Reflect.
  • Reproducibility: Quarto, Jupyter Book, Code Ocean.

AI-aided literature review pattern

  1. Seed papers (3-5 known relevant) → Connected Papers graph.
  2. Snowball (citations both ways) → ~100 candidates.
  3. LLM screens abstracts: relevance score 0-10.
  4. Top 30 → full-text PDF → AI structured extraction (claim, method, evidence, limitations).
  5. AI cluster into themes; human reviews + writes synthesis.
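
Steps 3 and 4 of the pattern above amount to a score-then-rank pass. The sketch below is one possible shape for it, not a definitive implementation: `parse_score`, `screen`, and the `score_fn` callback are hypothetical names, and `score_fn` stands in for whatever LLM call returns a reply containing a 0-10 relevance score.

```python
import re
from typing import Callable, Optional

def parse_score(reply: str) -> Optional[int]:
    """Pull the first standalone integer 0-10 out of a model reply; None if absent."""
    m = re.search(r"\b(10|[0-9])\b", reply)
    return int(m.group(1)) if m else None

def screen(papers: list[dict], score_fn: Callable[[dict], str],
           top_k: int = 30, min_score: int = 0) -> list[dict]:
    """Score every abstract, then keep the top_k highest-scoring papers."""
    scored = []
    for p in papers:
        score = parse_score(score_fn(p))
        if score is None:
            continue  # unparseable reply: leave this paper for manual screening
        if score >= min_score:
            scored.append({**p, "relevance": score})
    # Python's sort is stable, so ties keep their original order.
    return sorted(scored, key=lambda p: p["relevance"], reverse=True)[:top_k]
```

Passing the scorer in as a callback keeps the ranking logic testable without network access; papers whose reply cannot be parsed are dropped from the automatic ranking rather than silently scored as 0.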

Safeguards (required)

  • Hallucination: AI-fabricated citations are common, so verifying every DOI is a must.
  • Echo chamber: AI synthesis over-weights popular sources, so sample diverse sources deliberately and by hand.
  • Confirmation bias: AI tends to align with the user's hypothesis; counter it with an explicit "steelman the opposite" prompt.
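
One way to make the echo-chamber safeguard concrete is to sample across citation-count strata instead of taking the most-cited papers. This is a minimal sketch under assumptions: `diverse_sample` is a hypothetical helper, the three-bucket split is arbitrary, and each paper dict is assumed to carry a `cites` count like the one `search_s2` returns.

```python
import random

def diverse_sample(papers: list[dict], k: int, buckets: int = 3,
                   seed: int = 0) -> list[dict]:
    """Round-robin draw from citation-count strata (low to high)."""
    if not papers or k <= 0:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    ranked = sorted(papers, key=lambda p: p.get("cites") or 0)
    size = -(-len(ranked) // buckets)  # ceiling division
    strata = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    for s in strata:
        rng.shuffle(s)
    picked = []
    # Take one paper per stratum per round until k are picked or all are used.
    while len(picked) < k and any(strata):
        for s in strata:
            if s and len(picked) < k:
                picked.append(s.pop())
    return picked
```

Stratifying by year or venue instead of citation count would follow the same shape; the point is only that low-citation work gets a guaranteed share of the reading list.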

Applications

  1. PhD literature review.
  2. Industry tech radar / market research.
  3. Due diligence (M&A, investment).
  4. Pre-implementation prior-art search (patents, OSS).

💻 Patterns

Claude deep-research synthesis (verify-first)

import re

import httpx
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def synthesize(question: str, papers: list[dict]) -> str:
    """papers: [{title, abstract, doi, year}]"""
    # Number each paper so the model can cite it as [index].
    corpus = "\n\n".join(
        f"[{i}] {p['title']} ({p['year']}, doi:{p['doi']})\n{p['abstract']}"
        for i, p in enumerate(papers)
    )
    msg = client.messages.create(
        model="claude-opus-4-7", max_tokens=4096,
        system=("Synthesize evidence. Cite EVERY claim with [index]. "
                "If evidence is weak/contradictory, say so explicitly. "
                "Never fabricate citations."),
        messages=[{"role": "user", "content": f"Q: {question}\n\nPapers:\n{corpus}"}],
    )
    return msg.content[0].text

def verify_dois(text: str, papers: list[dict]) -> list[str]:
    """Hallucination check: every cited DOI must exist in our set."""
    # \S+ also swallows trailing punctuation such as "doi:10.1/x).", so strip it.
    cited = [d.rstrip(".,;:)]}\"'")
             for d in re.findall(r"doi:\s*(10\.\d+/\S+)", text, flags=re.I)]
    # DOIs are case-insensitive; papers without a DOI are skipped.
    valid = {p["doi"].lower() for p in papers if p.get("doi")}
    return [d for d in cited if d.lower() not in valid]  # offenders
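
Because the system prompt asks for `[index]` citations rather than DOIs, the same check can be applied to the bracketed indices. A minimal sketch: `verify_indices` is a hypothetical helper, and it assumes citations appear as `[3]` or `[1, 4]`, numbered from 0 as in `synthesize`.

```python
import re

def verify_indices(text: str, n_papers: int) -> list[int]:
    """Return cited indices outside 0..n_papers-1, sorted and unique."""
    cited: set[int] = set()
    # Match [3] and grouped forms such as [1, 4]; non-numeric brackets are ignored.
    for group in re.findall(r"\[(\d+(?:\s*,\s*\d+)*)\]", text):
        cited.update(int(tok) for tok in group.split(","))
    return sorted(i for i in cited if i >= n_papers)
```

An empty result only shows that every index points at a real paper; whether the cited paper actually supports the claim still needs a human read.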

Semantic Scholar fetch

def search_s2(query: str, limit: int = 50) -> list[dict]:
    resp = httpx.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit,
                "fields": "title,abstract,year,citationCount,externalIds"},
        timeout=30.0,
    )
    resp.raise_for_status()  # surface HTTP errors (e.g. rate limits) instead of a KeyError
    # externalIds can be null; tolerate a missing "data" key as well.
    return [{"title": p["title"], "abstract": p.get("abstract") or "",
             "year": p.get("year"), "doi": (p.get("externalIds") or {}).get("DOI"),
             "cites": p.get("citationCount") or 0}
            for p in resp.json().get("data") or []]
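
Running several queries returns overlapping hits, so results usually need merging before screening. The following is a sketch of that step, assuming the dict shape returned by `search_s2`; `merge_results` is a hypothetical helper, and the fallback to a normalized title when a paper has no DOI is a simplification.

```python
def merge_results(*batches: list[dict]) -> list[dict]:
    """Deduplicate papers across batches by DOI, else by normalized title."""
    merged: dict[str, dict] = {}
    for batch in batches:
        for p in batch:
            doi = (p.get("doi") or "").lower()
            # Title fallback: lowercase and collapse whitespace.
            title_key = " ".join(p["title"].lower().split())
            key = f"doi:{doi}" if doi else f"title:{title_key}"
            # Keep the first copy seen; later duplicates are dropped.
            merged.setdefault(key, p)
    # Most-cited first is a convenience ordering, not a relevance signal.
    return sorted(merged.values(), key=lambda p: p.get("cites") or 0, reverse=True)
```

For example, `merge_results(search_s2("dense retrieval"), search_s2("BM25 hybrid search"))` would yield one list with each paper appearing once.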

Snowball expansion

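def snowball(seed_ids: list[str], depth: int = 2) -> set[str]:
    """Expand seeds along the citation graph in both directions."""
    base = "https://api.semanticscholar.org/graph/v1/paper"
    # references = papers this one cites (backward); citations = papers citing it (forward)
    edges = {"references": "citedPaper", "citations": "citingPaper"}
    frontier, seen = set(seed_ids), set(seed_ids)
    for _ in range(depth):
        next_frontier = set()
        for pid in frontier:
            for edge, key in edges.items():
                r = httpx.get(f"{base}/{pid}/{edge}",
                              params={"fields": "paperId", "limit": 100},
                              timeout=30.0).json()
                # "data" may be null, and unresolved entries carry no paperId.
                next_frontier.update(item[key]["paperId"]
                                     for item in (r.get("data") or [])
                                     if (item.get(key) or {}).get("paperId"))
        frontier = next_frontier - seen
        seen.update(frontier)
    return seen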

Structured extraction prompt

EXTRACT_PROMPT = """Extract from this paper as JSON:
{
  "claim": "main thesis in one sentence",
  "method": "how they tested it",
  "evidence": "key result with numbers",
  "n": "sample size",
  "limitations": ["limit1", "limit2"],
  "novelty": "what this adds vs prior work"
}
If field unknown, use null. Don't invent."""
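
Model replies to a prompt like this often wrap the JSON in prose or a code fence, so the reply needs defensive parsing before it goes into a notes database. A minimal sketch: `parse_extraction` and `EXPECTED_KEYS` are hypothetical names, and the key set simply mirrors the prompt above.

```python
import json

EXPECTED_KEYS = ("claim", "method", "evidence", "n", "limitations", "novelty")

def parse_extraction(reply: str) -> dict:
    """Parse the outermost {...} span of a model reply and normalize its fields."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError (a ValueError subclass) propagates if the span is malformed.
    data = json.loads(reply[start:end + 1])
    # Missing fields become None, matching the prompt's "use null" rule;
    # unexpected extra keys are dropped.
    record = {k: data.get(k) for k in EXPECTED_KEYS}
    if record["limitations"] is None:
        record["limitations"] = []
    return record
```

Raising on malformed output, rather than returning a half-filled record, keeps bad extractions out of the synthesis step and makes them easy to retry.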

Steelman opposite (debias)

def steelman(claim: str) -> str:
    return client.messages.create(
        model="claude-opus-4-7", max_tokens=1024,
        messages=[{"role": "user", "content":
            f"Claim: {claim}\n\nWrite the strongest argument AGAINST this, "
            f"citing actual contrary evidence. Be a hostile reviewer."}],
    ).content[0].text

Zettelkasten note (atomic)

---
id: 2026-05-10-1432
tags: [retrieval, rag]
source: [[Lewis-2020-RAG]]
---
# Dense retrieval beats BM25 only when query-doc lexical overlap is low

In Lewis 2020 (Table 3), DPR > BM25 on NaturalQuestions (+6 EM)
but BM25 ≥ DPR on TriviaQA where queries copy doc tokens.

→ Hybrid search is robust: pick BM25 for lexical, dense for paraphrase.

Connects to: [[Hybrid Search]] · [[BM25]] · [[Dense-Retrieval]]
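
The `[[wikilinks]]` in a note like this form a graph that can be extracted mechanically, for example to find notes with no connections. A minimal sketch, assuming Obsidian-style links with optional `#heading` and `|alias` parts; `WIKILINK` and `extract_links` are hypothetical names.

```python
import re

# Capture the target only; a "#heading" suffix and "|alias" are matched but discarded.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)(?:#[^\]\|]*)?(?:\|[^\]]*)?\]\]")

def extract_links(note: str) -> list[str]:
    """Return unique link targets in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for target in WIKILINK.findall(note):
        seen.setdefault(target.strip(), None)
    return list(seen)
```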

Decision criteria

| Situation | Approach |
|---|---|
| Quick scan (1 h) | Elicit / Consensus / Claude deep-research |
| Deep dive (1 week) | Manual snowball + AI extraction |
| Systematic review (PRISMA) | PRISMA flow + Covidence + AI screening |
| Cutting-edge (preprints) | arXiv-sanity + Twitter/Bluesky + Semantic Scholar alerts |
| Industry / OSS | GitHub trending + State of X reports + AI synthesis |

Default: Connected Papers seed → S2 snowball → AI extract → manual synthesis with steelman.

🔗 Graph

🤖 LLM usage

When to use: literature scan, abstract screening, structured extraction, synthesis draft, steelmanning. When not to use: the final assertion of a novelty claim (an LLM is not ground truth), quantitative meta-analysis (use proper stats software), or any citation that has not been verified.

Anti-patterns

  • Cite-without-verify: passing along a fake DOI that the AI made up.
  • Single-source synthesis: treating one paper as the truth and ignoring replication.
  • Recency bias: reading only the latest preprints and staying ignorant of foundational work.
  • No gap analysis: a literature dump with no "what's missing", which leaves the contribution unclear.
  • Hypothesis fishing: starting from the data and building a post-hoc theory (HARKing).

🧪 Verification / duplicates

  • Verified (PRISMA 2020 statement, Semantic Scholar API docs, Claude Opus 4.7 deep research, Elicit methodology).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full rewrite covering methodology + AI-aided synthesis pipeline |