Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
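Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: ~2,400 fixes covering broken links, links repointed directly past redirects, and concept mapping
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections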

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-march-2026-research-drop
title: March 2026 Research Drop
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: March 2026 AI Research, Q1 2026 ML Drop
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: research-snapshot, ai-2026, frontier-models, periodic-review
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language = none, framework = none

March 2026 Research Drop

One-line summary

"Frontier AI/ML research highlights for Q1 2026." A quarterly snapshot: an architect-level summary of papers, model releases, and tooling shifts, kept as a working reference that feeds production decisions (model selection, infra, eval). The content is a deliberate snapshot, a frozen view as of March 2026; future drops get separate entries.

Key points

Frontier model landscape (Mar 2026)

  • Anthropic: Claude Opus 4.7 (1M context default), Claude Sonnet 4.6 (cost-optimal middle tier).
  • OpenAI: GPT-5 main, GPT-5-mini for cost. Native multi-modal video reasoning.
  • Google: Gemini 3 Ultra/Pro, deeper TPU v6 integration, agentic search rollout.
  • Meta: Llama 4 (open-weights, 600B MoE + 70B dense).
  • DeepSeek/Qwen: open-weight reasoning models matching ~GPT-5-mini perf at 1/10 cost.
  • xAI: Grok 4, real-time X data fine-tuning.

Architecture and capability trends

  1. MoE goes mainstream: frontier models are sparse, with 600B+ total and 30-70B active parameters.
  2. Long context as default: 1M tokens is standard, 10M is experimental.
  3. Native multimodal: video, audio, and image share a unified token space.
  4. Reasoning models: the deliberation budget is tunable (low/medium/high).
  5. Tool use as first-class: models understand tool schemas and plan around them.
  6. Agent runtimes: Claude Agent SDK, OpenAI Responses API, Gemini Agents.

Inference and tooling

  • vLLM 0.7+: continuous batching + chunked prefill by default.
  • MLX 0.20+: Apple Silicon training/inference parity for <70B models.
  • TensorRT-LLM: the NVIDIA stack for H200/B200.
  • Speculative decoding: a production standard (Medusa, Eagle3).
  • Quantization: FP4/FP6 is the inference standard, with minimal quality loss.
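
To make the MoE and quantization points concrete, here is a rough sizing sketch. It assumes the weight footprint is simply parameter count times bits per parameter, ignoring KV cache, activations, and runtime overhead; the function name `weight_gb` is illustrative, not from any library. Note that a sparse model still has to hold all of its total parameters in memory, even though only the active subset is used per token.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Sizes taken from the landscape list above; formats are the ones it names.
    for label, params in [("70B dense", 70), ("600B MoE (total)", 600)]:
        for fmt, bits in [("FP16", 16), ("FP6", 6), ("FP4", 4)]:
            print(f"{label:18s} {fmt}: {weight_gb(params, bits):7.1f} GB")
```

Under these assumptions a 70B dense model at FP4 needs about 35 GB for weights, while a 600B MoE at FP4 needs about 300 GB, which is why the sparse models remain multi-GPU deployments.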

Applications (architect implications)

  1. Reconsider the default model in 2026 designs (Sonnet 4.6 by default, Opus for hard tasks).
  2. Caching strategy is a major cost driver: target a prompt cache hit rate above 70%.
  3. RAG gets simpler: a 1M context replaces RAG for small knowledge bases.
  4. Agent workflows are first-class: a tool-using model plus a sandbox.
  5. Open-weight models are viable on-prem (Llama 4, DeepSeek) for regulated workloads.
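
The >70% cache hit rate target only helps if it is measured. Below is a minimal sketch that computes a token-weighted hit rate from per-request usage records. It assumes each record carries the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` counters reported in Anthropic Messages API usage; defining the hit rate as cached-read tokens over all input tokens is an assumption of this sketch, and `cache_hit_rate` is a hypothetical helper name.

```python
from typing import Iterable, Mapping


def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Token-weighted hit rate: cached-read tokens / all input tokens."""
    read = written = uncached = 0
    for u in usages:
        read += u.get("cache_read_input_tokens", 0)
        written += u.get("cache_creation_input_tokens", 0)
        uncached += u.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


if __name__ == "__main__":
    # One cold request that writes the cache, then two warm requests that read it.
    sample = [
        {"input_tokens": 100, "cache_creation_input_tokens": 900},
        {"input_tokens": 100, "cache_read_input_tokens": 900},
        {"input_tokens": 100, "cache_read_input_tokens": 900},
    ]
    print(f"hit rate: {cache_hit_rate(sample):.0%}")
```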

💻 Patterns

1. Model selection decision (Mar 2026)

Task → Model
─────────────
Code generation, complex reasoning  → Claude Opus 4.7
Default chat, RAG, summarization     → Claude Sonnet 4.6 / GPT-5-mini
High volume, latency-sensitive       → Haiku 4.5 / Gemini Flash
Open-weight on-prem, regulated       → Llama 4 70B / DeepSeek-V3.5
Vision-heavy multimodal              → GPT-5 / Gemini 3 Ultra
Long-form video understanding        → Gemini 3 Ultra
Cost floor, embedded                 → Phi-5 / Llama 4 8B (quantized FP4)
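
The table above can be kept as data so that routing decisions live in one place. This is only a sketch: the task keys and the `pick_model` helper are hypothetical names, and the values are the table's entries copied as display strings rather than real API model identifiers.

```python
# Task category -> model choice, copied from the Mar 2026 table.
MODEL_BY_TASK = {
    "code_or_complex_reasoning": "Claude Opus 4.7",
    "default_chat_rag_summarization": "Claude Sonnet 4.6 / GPT-5-mini",
    "high_volume_low_latency": "Haiku 4.5 / Gemini Flash",
    "on_prem_regulated": "Llama 4 70B / DeepSeek-V3.5",
    "vision_heavy_multimodal": "GPT-5 / Gemini 3 Ultra",
    "long_form_video": "Gemini 3 Ultra",
    "cost_floor_embedded": "Phi-5 / Llama 4 8B (quantized FP4)",
}
DEFAULT_TASK = "default_chat_rag_summarization"


def pick_model(task: str) -> str:
    """Return the table's choice for a task, falling back to the default tier."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK[DEFAULT_TASK])


if __name__ == "__main__":
    print(pick_model("long_form_video"))
    print(pick_model("unknown_task"))  # falls back to the default tier
```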

2. Prompt caching (cost-critical 2026)

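# Anthropic SDK with cache_control
import anthropic

# Placeholders: supply a long, stable system prompt and the user's message.
# Prompts shorter than the model's minimum cacheable length are not cached.
LARGE_SYSTEM_PROMPT = "..."
user_msg = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        # Everything up to and including this block is cached as a prefix.
        {"type": "text", "text": LARGE_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_msg}],
)
# resp.usage.cache_read_input_tokens > 0 indicates a cache hit.
# Ephemeral entries have a 5-min TTL by default; a 1-hour TTL is also available.
# Target >70% cache hit rate. Cached reads are billed at about a tenth of the base
# input price, so ~10x is the ceiling on savings for the cached prefix.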
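
How much a given hit rate actually saves can be worked out with simple arithmetic. The sketch below assumes cache writes cost 1.25x the base input price and cache reads 0.10x (the multipliers Anthropic has published for 5-minute caching; treat them as assumptions and check current pricing). It covers only the cacheable prefix, not output tokens or the uncached tail, and `relative_prefix_cost` is a hypothetical helper name.

```python
def relative_prefix_cost(hit_rate: float,
                         write_mult: float = 1.25,
                         read_mult: float = 0.10) -> float:
    """Cost of the cacheable prefix relative to no caching (1.0 = uncached)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Misses pay the write premium; hits pay the discounted read price.
    return (1.0 - hit_rate) * write_mult + hit_rate * read_mult


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.7, 0.95, 1.0):
        cost = relative_prefix_cost(rate)
        print(f"hit rate {rate:.0%}: {cost:.3f}x cost, {1 / cost:.1f}x cheaper")
```

Under these assumptions a 70% hit rate brings prefix cost down to roughly 0.45x of the uncached price, and the 10x figure is approached only as the hit rate nears 100%. With a 0% hit rate caching is a net loss, since every request pays the write premium.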

3. Reasoning budget (Claude/GPT-5/Gemini)

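# Claude extended thinking
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=20000,  # required; must be larger than budget_tokens
    thinking={"type": "enabled", "budget_tokens": 16000},
    messages=[...],  # conversation messages, elided in the original
)
# Tradeoff: a larger budget gives better reasoning but is slower and costs more.
# Rule of thumb: budget_tokens = 4k for routine, 16k for hard, 32k for research-grade tasks.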
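
The three budget tiers can be captured in a small helper that also keeps `max_tokens` above the thinking budget, which the Messages API requires. This is a sketch: the tier names, the `answer_tokens` headroom of 4,000, and the `thinking_params` helper are assumptions for illustration.

```python
# Tier -> thinking budget, following the 4k / 16k / 32k rule of thumb.
THINKING_BUDGETS = {"routine": 4_000, "hard": 16_000, "research": 32_000}


def thinking_params(tier: str, answer_tokens: int = 4_000) -> dict:
    """Build keyword arguments for messages.create for a given difficulty tier."""
    budget = THINKING_BUDGETS[tier]
    return {
        "thinking": {"type": "enabled", "budget_tokens": budget},
        # max_tokens covers thinking plus the visible answer.
        "max_tokens": budget + answer_tokens,
    }


if __name__ == "__main__":
    print(thinking_params("hard"))
```

The result can be splatted into a call, for example `client.messages.create(model=..., messages=..., **thinking_params("hard"))`.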

4. Tool use loop (agent pattern)

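// Hypothetical interfaces; substitute your SDK's client and tool executor.
type Message = { role: "user" | "assistant"; content: unknown };
type Tool = { name: string; description: string; input_schema: object };
type Reply = { stop_reason: string; content: unknown; tool_uses: unknown[] };
declare const model: {
  complete(req: { messages: Message[]; tools: Tool[] }): Promise<Reply>;
};
declare function execute(toolUse: unknown): Promise<unknown>;

async function agentLoop(task: string, tools: Tool[], maxSteps = 30) {
  const messages: Message[] = [{ role: "user", content: task }];
  for (let i = 0; i < maxSteps; i++) {
    const r = await model.complete({ messages, tools });
    if (r.stop_reason === "end_turn") return r.content;
    if (r.stop_reason !== "tool_use") {
      // e.g. "max_tokens": fail fast instead of resending the same request.
      throw new Error(`unexpected stop_reason: ${r.stop_reason}`);
    }
    // Run the requested tool calls concurrently and feed the results back.
    const results = await Promise.all(r.tool_uses.map((t) => execute(t)));
    messages.push({ role: "assistant", content: r.content });
    messages.push({ role: "user", content: results });
  }
  throw new Error(`max steps (${maxSteps}) exceeded`);
}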
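
The loop above leaves the tool executor abstract. One possible shape for it, sketched in Python, is shown below. It assumes Anthropic-style content blocks (`tool_use` blocks with `id`, `name`, and `input`, answered by `tool_result` blocks carrying `tool_use_id`); the `TOOLS` registry, the `add` tool, and `run_tool_uses` are hypothetical names for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool registry; real tools would call out to a sandbox.
TOOLS: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
}


def run_tool_uses(content_blocks: List[dict]) -> List[dict]:
    """Turn tool_use blocks into tool_result blocks for the next user message."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip text and other non-tool blocks
        result = {"type": "tool_result", "tool_use_id": block["id"]}
        try:
            result["content"] = str(TOOLS[block["name"]](**block["input"]))
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            result["content"] = f"error: {exc}"
            result["is_error"] = True
        results.append(result)
    return results


if __name__ == "__main__":
    blocks = [
        {"type": "text", "text": "Let me compute that."},
        {"type": "tool_use", "id": "t1", "name": "add", "input": {"a": 2, "b": 3}},
    ]
    print(run_tool_uses(blocks))
```

Returning errors as tool results lets the model retry or change approach, while the step cap in the loop still bounds total cost.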

5. Speculative decoding (vLLM 0.7+)

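from vllm import LLM

# Draft-model speculative decoding: the 8B model proposes tokens, the 70B model verifies them.
# These keyword arguments follow the vLLM 0.7-era API; later releases group them under a
# speculative_config dict, so check the docs for your installed version.
llm = LLM(
    model="meta-llama/Llama-4-70B-Instruct",
    speculative_model="meta-llama/Llama-4-8B-Instruct",
    num_speculative_tokens=5,      # tokens drafted per verification step
    enable_chunked_prefill=True,
)
# Up to 2-3x throughput on long generations; the gain depends on the draft acceptance rate.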
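
To show why a draft model can speed up decoding without changing the output, here is a toy sketch of the greedy variant of speculative decoding. It is not how vLLM implements it: models are plain Python functions mapping a token list to the next token, verification is done one position at a time rather than in a single batched pass, and sampled decoding (which needs a rejection-sampling rule) is out of scope. All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """One greedy speculative step; returns the tokens accepted this step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
    # 3. All k drafts accepted: the target also yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 5) -> List[int]:
    """Generate n_new tokens; output equals plain greedy decoding with target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10  # counts upward modulo 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 4, to exercise the mismatch path.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], n_new=8))
```

Each step emits at least one target-approved token, so correctness never depends on the draft; the draft's quality only affects how many tokens are accepted per step.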

6. Eval harness (production must-have)

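# 2026 norm: continuous eval against a frozen test set
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match, model_graded_qa
from inspect_ai.solver import generate

@task
def eval_summarization():
    return Task(
        dataset=json_dataset("evals/summarization_v3.json"),  # frozen test set
        solver=generate(),                                    # single model call per sample
        scorer=[match(), model_graded_qa()],                  # exact match + model-graded
    )

# Run per release. Track regressions. Block deploy on a >2pp drop.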
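
The "block deploy on >2pp drop" rule can be expressed as a small gate that runs after the eval. The sketch below assumes scores are fractions between 0 and 1 keyed by metric name, and that a metric missing from the candidate run counts as 0; `regressed_metrics` is a hypothetical helper, not part of inspect_ai.

```python
from typing import Dict, List


def regressed_metrics(baseline: Dict[str, float],
                      candidate: Dict[str, float],
                      max_drop_pp: float = 2.0) -> List[str]:
    """Names of metrics that fell by more than max_drop_pp percentage points."""
    regressions = []
    for name, base in baseline.items():
        drop_pp = (base - candidate.get(name, 0.0)) * 100
        if drop_pp > max_drop_pp:
            regressions.append(name)
    return regressions


if __name__ == "__main__":
    baseline = {"match": 0.80, "model_graded_qa": 0.70}
    candidate = {"match": 0.79, "model_graded_qa": 0.66}
    failed = regressed_metrics(baseline, candidate)
    if failed:
        raise SystemExit(f"deploy blocked, regressions in: {failed}")
```

Exiting non-zero lets a CI pipeline treat the regression as a failed step.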

7. RAG-vs-long-context decision (Mar 2026)

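def choose_retrieval(corpus_tokens: int, query_tokens: int) -> str:
    """Pick a retrieval strategy from corpus size (thresholds are rules of thumb)."""
    # Corpus and query share the 1M window; 800k leaves headroom for the answer.
    if corpus_tokens + query_tokens < 800_000:
        return "long_context"          # fits in 1M, simpler
    if corpus_tokens < 50_000_000:
        return "hybrid_rag"             # BM25 + embedding + rerank
    return "agent_search"               # search-as-tool + iterative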

Decision criteria

Situation → 2026 choice
─────────────
Model choice for a new feature       → Sonnet 4.6 (default), measure, escalate.
Knowledge base <1M tokens            → Long-context, no RAG.
Knowledge base >50M tokens           → Agent search + RAG hybrid.
Regulated / on-prem                  → Llama 4 70B FP4 on H200.
Cost-floor edge                      → Phi-5 mini quantized.
Multi-step task                      → Agent loop with tool use, max_steps = 30.
Research-grade reasoning             → Opus 4.7 with 32k thinking budget.

Default: a Mar 2026 design starts at Sonnet 4.6 + ephemeral prompt caching + evals. Escalate to Opus 4.7 only when measurements justify it.

🔗 Graph

🤖 LLM usage

When to use: Q2 2026 architecture review, model upgrade plan, infra cost re-baselining, eval harness drafting. When not to use: once this entry is stale (>6 months old); refer to a newer drop or a specific paper entry instead.
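
The staleness rule can be checked mechanically against the entry's `last_reinforced` field. This is a minimal sketch that approximates six months as 183 days; the `is_stale` helper and that threshold are assumptions, not part of the wiki tooling.

```python
from datetime import date


def is_stale(last_reinforced: date, today: date, max_age_days: int = 183) -> bool:
    """True once the entry is older than roughly six months."""
    return (today - last_reinforced).days > max_age_days


if __name__ == "__main__":
    # last_reinforced taken from this entry's metadata.
    print(is_stale(date(2026, 5, 10), date.today()))
```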

Anti-patterns

  • Frozen choices from 2024: production lock-in on GPT-4 / Claude 3.5, which no longer matches the 2026 cost/quality frontier.
  • No prompt caching: overspending by 5x or more.
  • RAG when long-context fits: unnecessary vector DB infrastructure.
  • Agent loop without max_steps: runaway tool use and cost explosion.
  • No eval harness: silent regressions on model upgrades.
  • Open-weight without an inference plan: downloading Llama 4 and only then discovering that it needs an H200 cluster, a surprise cost.

🧪 Verification / duplicates

  • Verified (Anthropic/OpenAI/Google official model cards Mar 2026, vLLM 0.7 release notes, Stanford CRFM HELM Mar 2026).
  • Trust level A (vendor announcements) / B (community benchmarks).

🕓 Changelog

Date Change
2026-05-08 Phase 1
2026-05-10 Manual cleanup — Mar 2026 frontier landscape + decision matrices