[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,98 +2,178 @@
 id: wiki-2026-0508-sparse-attention
 title: Sparse Attention
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-SATT-001]
+aliases: [Sparse Self-Attention, Local Attention, Efficient Attention]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [auto-reinforced, sparse-attention, dsa, attention-complexity, efficiency, deepseek]
+confidence_score: 0.9
+verification_status: applied
+tags: [transformer, attention, long-context, efficient-llm, sparse]
 raw_sources: []
-last_reinforced: 2026-05-04
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: PyTorch/FlashAttention
 ---

-# [[Sparse Attention|Sparse Attention]]
+# Sparse Attention

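-## 📌 One-line insight (The Karpathy Summary)
-> "Selection and focus in intelligence: instead of wastefully comparing every token with every other, sparse attention picks out only the tokens that matter most in context, and these 'sparse connections' cut computational complexity from $O(n^2)$ to roughly $O(n)$, a model of efficient intelligence."
+## One-line summary
+> **"Reduce O(n²) attention to O(n·k) by computing only a subset of token pairs."** Sparse attention is an enabler of long-context Transformers. Sliding window plus global tokens (Longformer) is the base pattern; BigBird adds random attention and LongNet adds dilated attention. As of 2026, native sparse attention (DeepSeek NSA, MoBA), SSM hybrids (Mamba2), and hardware-efficient kernels with windowed or block-sparse masks (FlashAttention, FlexAttention) are in production use.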

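-## 📖 Structured knowledge (Synthesized Content)
-Sparse Attention is a technique that sharply reduces compute and memory cost by selectively attending to only some tokens, chosen by a specific pattern or by importance, instead of computing the relation between every pair of tokens.
+## Core concepts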

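-1.  **Basic patterns**:
-    *   **Sliding Window**: Focuses only on adjacent tokens (local context).
-    *   **Global Tokens**: Tokens at important positions (such as the start of a sentence) are shared by all tokens to provide a global view.
-    *   **Random/Fixed Patterns**: Predefined rules or random connections compensate for long-range dependencies.
-2.  **DSA (DeepSeek Sparse Attention)**:
-    *   **Indexer-Selector mechanism**: Rather than looking at fixed positions, an 'indexer' first finds the relevant tokens and a 'selector' then runs attention over only that subset.
-    *   **Significance**: Makes it possible to scale to ultra-long contexts of more than 1 million tokens while minimizing accuracy loss.
-3.  **Advantages**:
-    *   Holds the growth of compute with sequence length to linear ($O(n)$), enabling large-scale data processing.
-    *   Relieves memory pressure on the KV cache, improving inference efficiency.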
+### Sparsity patterns
+- **Sliding window**: each token attends to the tokens within ±w positions. Local context (Longformer, Mistral SWA).
+- **Global tokens**: [CLS] and other selected tokens attend to every token, and every token attends to them.
+- **Dilated**: attend at stride k, reaching long range without the full O(n²) cost.
+- **Random**: each token attends to k random tokens. A component of BigBird.
+- **Block sparse**: block-diagonal plus selected blocks. FlashAttention-friendly.
+- **Learned/adaptive**: a routing network decides where to be sparse (NSA, MoBA, 2025).

-## ⚠️ Contradictions & Updates
-*   **Risk of information loss**: If important tokens are missed, the model's reasoning ability can degrade (for example, the lost-in-the-middle phenomenon). Preventing this calls for carefully designed hybrid architectures (for example, Gemma 4's interleaved local-global scheme).
-*   **Implementation complexity**: Compared with standard dense attention, the indexing and selection logic makes the architecture more complex, so system integration and optimization demand considerable engineering skill.
+### Historical landmarks
+- **Sparse Transformer** (OpenAI 2019): factorized attention.
+- **Longformer** (AllenAI 2020): SWA + global. 4k→16k+ tokens.
+- **BigBird** (Google 2020): random + window + global. Shown theoretically to approximate full attention.
+- **LongNet** (Microsoft 2023): dilated attention; claims scaling to 1B tokens.
+- **NSA** (DeepSeek 2025): natively trainable sparse attention, used during pretraining.
+- **MoBA** (Moonshot 2025): mixture-of-block-attention, hierarchical sparsity.

-## 🔗 Knowledge graph (Graph)
-*   **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
-*   **Compared with**: [[Flash Attention|Flash Attention]] (I/O optimization vs. reducing the number of operations)
-*   **Related techniques**: [[Sliding Window Attention|Sliding Window Attention]], [[Mixture of Experts (MoE)|Mixture of Experts (MoE)]], [[KV Cache|KV Cache]]
+### Applications
+1. Long-document QA / summarization.
+2. Codebase-wide LLM analysis (e.g. Claude 1M context).
+3. Genomics / DNA Transformer.
+4. Video Transformer (frames as tokens).

---
-*Last updated: 2026-05-04*
+## 💻 Patterns

-## 🤖 LLM usage hints (How to Use This Knowledge)
+### Sliding window mask
+```python
+import torch

-**When to use this knowledge:**
- *(TODO)*
-
-**When not to use it:**
- *(TODO)*
-
-## 🧪 Validation status (Validation)
-
- **Information status:** needs_review
- **Source trust level:** A
- **Reason for review:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
-
-## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template and missing fields filled in.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust level |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+def sliding_window_mask(n: int, w: int, device='cuda') -> torch.Tensor:
+    """Boolean mask: True = allow attention."""
+    idx = torch.arange(n, device=device)
+    diff = idx.unsqueeze(0) - idx.unsqueeze(1)
+    return diff.abs() <= w
 ```

-## 🤔 Decision Criteria
+### Longformer-style (SWA + global)
+```python
+def longformer_mask(n: int, w: int, global_idx: list[int], device='cuda'):
+    mask = sliding_window_mask(n, w, device)
+    g = torch.zeros(n, dtype=torch.bool, device=device)
+    g[global_idx] = True
+    # global attends to all + all attend to global
+    mask = mask | g.unsqueeze(0) | g.unsqueeze(1)
+    return mask
+```

-**When to choose option A:**
- *(TODO)*
+### BigBird random component
+```python
+def random_attention_mask(n: int, k: int, device='cuda') -> torch.Tensor:
+    """Each token attends to k random others."""
+    mask = torch.zeros(n, n, dtype=torch.bool, device=device)
+    for i in range(n):
+        idx = torch.randperm(n, device=device)[:k]
+        mask[i, idx] = True
+    return mask
+```
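
The dilated pattern from the sparsity list can be sketched in the same boolean-mask style as the functions above. This is a minimal illustration under stated assumptions: the name `dilated_mask` and the parameters `w` (hops per side) and `stride` are hypothetical, the device defaults to CPU so it runs without a GPU, and the rule used here (keys at multiples of `stride`, up to `w` hops away) is one simple reading of "dilated", not LongNet's exact segment-wise scheme.

```python
import torch

def dilated_mask(n: int, w: int, stride: int, device: str = "cpu") -> torch.Tensor:
    """Boolean mask: True = allow attention.

    Query i may attend to key j when |i - j| is a multiple of `stride`
    and at most w * stride (w hops on each side, plus the token itself).
    """
    idx = torch.arange(n, device=device)
    # diff[i, j] = |j - i|, the distance between query i and key j
    diff = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return (diff % stride == 0) & (diff <= w * stride)

mask = dilated_mask(n=16, w=2, stride=4)
print(mask.sum(dim=-1))  # number of keys each query can see
```

Each query sees at most `2 * w + 1` keys while covering a span of `2 * w * stride` positions, which is the trade the pattern is meant to make. Such a mask can be OR-ed with a sliding-window mask in the same way the Longformer-style example combines masks.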

-**When to choose option B:**
- *(TODO)*
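+### FlashAttention-2 with a sliding window
+```python
+from flash_attn import flash_attn_func
+# q, k, v: (batch, seqlen, heads, head_dim), fp16/bf16 CUDA tensors (assumed defined elsewhere)
+# flash_attn_func takes no arbitrary mask; local attention is set through window_size.
+# For custom or block-sparse masks, use FlexAttention's mask_mod (PyTorch 2.5+).
+out = flash_attn_func(q, k, v, causal=True, window_size=(512, 0))
+# window_size=(left, right): each query sees 512 tokens to its left, Mistral-style SWA
+```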

-**Default:**
-> *(TODO)*
+### FlexAttention (PyTorch 2.5+)
+```python
+from torch.nn.attention.flex_attention import flex_attention, create_block_mask

-## ❌ Anti-Patterns
+def sliding_window(b, h, q_idx, kv_idx):
+    return (q_idx - kv_idx).abs() <= 512

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+block_mask = create_block_mask(sliding_window, B=None, H=None, Q_LEN=8192, KV_LEN=8192)
+out = flex_attention(q, k, v, block_mask=block_mask)
+```
+
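+### Block-sparse selection (NSA-style sketch)
+```python
+import torch
+import torch.nn.functional as F
+
+def block_sparse_attn(q, k, v, block_size=64, top_k_blocks=8):
+    # q, k, v: (b, s, h, d), s divisible by block_size. Non-causal dense reference
+    # of block selection only; not the fused kernel a real implementation needs.
+    b, s, h, d = k.shape
+    n_blocks = s // block_size
+    # 1. Compute block-level importance via mean-pooled K
+    k_blocks = k.reshape(b, n_blocks, block_size, h, d).mean(dim=2)
+    scores = torch.einsum('bnhd,bmhd->bnmh', q, k_blocks)
+    # 2. Select top-k blocks per query and head
+    top_idx = scores.topk(min(top_k_blocks, n_blocks), dim=2).indices
+    # 3. Expand the chosen blocks to a token-level mask of shape (b, h, s_q, s_k)
+    chosen = torch.zeros_like(scores, dtype=torch.bool).scatter_(2, top_idx, True)
+    mask = chosen.repeat_interleave(block_size, dim=2).permute(0, 3, 1, 2)
+    # 4. Masked attention; SDPA expects (b, h, s, d)
+    out = F.scaled_dot_product_attention(
+        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), attn_mask=mask)
+    return out.transpose(1, 2)
+```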
+
+### Mistral SWA (HuggingFace)
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained(
+    "mistralai/Mistral-7B-v0.3",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",  # requires the flash-attn package
+    sliding_window=4096,  # forwarded to the model config as an override
+)
+```
+
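+### Adaptive top-k token selection (dense reference)
+```python
+import torch
+
+def topk_attention(q, k, v, top_k=128):
+    # q, k, v: (b, h, s, d). Reference only: the full (s, s) score matrix is
+    # still computed, so this shows the selection rule, not the savings.
+    scores = q @ k.transpose(-2, -1) / q.shape[-1]**0.5
+    # Keep the top_k highest-scoring keys per query (clamped to the key count)
+    top_v, top_i = scores.topk(min(top_k, scores.shape[-1]), dim=-1)
+    sparse = torch.full_like(scores, float('-inf'))
+    sparse.scatter_(-1, top_i, top_v)
+    attn = sparse.softmax(dim=-1)
+    return attn @ v
+```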
+
+## Decision criteria
+| Situation | Approach |
+|---|---|
+| 4-32k context, mostly local dependencies | Sliding window (Mistral SWA) |
+| Long-document QA with key tokens | Longformer (SWA + global) |
+| 100k+ context, hardware-friendly | Block-sparse + FlashAttention |
+| Native long-context pretraining | NSA / MoBA (2025+) |
+| Inference-only swap | Top-k token sparsification |
+
+**Default**: for inference, SWA + FlashAttention; for pretraining, native sparse attention (NSA-like).
+
+## 🔗 Graph
+- Parent: [[Attention-Mechanism]] · [[Transformer]]
+- Variants: [[FlashAttention]] · [[Linear-Attention]] · [[Mamba]]
+- Applications: [[Long-Context-LLM]] · [[Longformer]] · [[BigBird]] · [[Mistral]]
+- Adjacent: [[KV-Cache]] · [[Position-Encoding]] · [[RoPE]] · [[Native-Sparse-Attention]]
+
+## 🤖 LLM usage
+**When to use**: explaining the rationale for an attention-pattern choice, drafting mask code, distilling papers (NSA/MoBA).
+**When not to use**: writing production kernels (use FA-3 / FlexAttention instead), measuring performance (run a real benchmark).
+
+## ❌ Anti-patterns
+- **Naïve mask + softmax**: memory is still O(n²). Masking with -inf does not reduce compute.
+- **Random sparsity only**: the quality drop is catastrophic. A hybrid (window + global) is needed.
+- **Fixed window for all heads**: different heads have different needs. Per-head adaptive windows work better.
+- **No global token**: in long-document QA, the [CLS]/question token loses access to the full text.
+- **Window too small**: perplexity degrades relative to the dense baseline.
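
To put a number on the first anti-pattern, here is a back-of-the-envelope count. It is a sketch under assumptions: 2-byte (fp16/bf16) scores, a single head, a symmetric window of ±w, and hypothetical helper names `dense_score_bytes` and `window_score_bytes`. It counts score entries only and ignores how any particular kernel actually stores them.

```python
def dense_score_bytes(n: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one head's full (n, n) score matrix."""
    return n * n * bytes_per_elem

def window_score_bytes(n: int, w: int, bytes_per_elem: int = 2) -> int:
    """Bytes if only the |i - j| <= w band is stored (band clipped at the edges)."""
    allowed = sum(min(i + w, n - 1) - max(i - w, 0) + 1 for i in range(n))
    return allowed * bytes_per_elem

n, w = 8192, 512
dense = dense_score_bytes(n)
banded = window_score_bytes(n, w)
print(f"dense:  {dense / 2**20:.1f} MiB per head")
print(f"banded: {banded / 2**20:.1f} MiB per head")
print(f"ratio:  {dense / banded:.1f}x")
```

A dense boolean mask followed by softmax pays the first figure regardless of how many entries are masked out; the second figure is only reachable with a kernel that skips masked regions.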
+
+## 🧪 Validation / duplicates
+- Verified (Longformer arXiv:2004.05150; BigBird arXiv:2007.14062; FlashAttention-3 2024; DeepSeek NSA 2025; PyTorch FlexAttention).
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — full content (SWA/BigBird/NSA + FlashAttention/FlexAttention patterns) |