---
id: wiki-2026-0508-sparse-attention
title: Sparse Attention
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Sparse Self-Attention, Local Attention, Efficient Attention]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [transformer, attention, long-context, efficient-llm, sparse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/FlashAttention
---

# Sparse Attention

## One-liner
> **"Reduce O(n²) attention to O(n·k): compute only a subset of token pairs."** Sparse attention is what enables long-context Transformers. Sliding window + global tokens (Longformer) is the base pattern; BigBird adds random attention and LongNet adds dilated attention. As of 2026, the production options are native sparse attention (DeepSeek NSA, MoBA), SSM hybrids (Mamba2), and FlashAttention-3 with sparse masks.

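
To make the O(n²) to O(n·k) reduction concrete, the sketch below counts the query-key pairs scored by dense attention and by a symmetric sliding window of ±w tokens. The function names and the example sizes are illustrative assumptions, not taken from any library.

```python
def full_pairs(n: int) -> int:
    """Number of (query, key) pairs scored by dense attention."""
    return n * n


def window_pairs(n: int, w: int) -> int:
    """Pairs (i, j) with |i - j| <= w, clipped at the sequence edges."""
    total = 0
    for i in range(n):
        lo = max(0, i - w)
        hi = min(n - 1, i + w)
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    n, w = 8192, 256  # illustrative sizes
    dense = full_pairs(n)
    sparse = window_pairs(n, w)
    print(f"dense:  {dense:,} pairs")
    print(f"window: {sparse:,} pairs ({sparse / dense:.1%} of dense)")
```

With these sizes the window scores roughly 6% of the dense pairs, and the ratio keeps shrinking as n grows while w stays fixed.
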
## Core

### Sparsity patterns
- **Sliding window**: each token attends to the ±w tokens around it. Local context (Longformer, Mistral SWA).
- **Global tokens**: [CLS] plus selected tokens attend to every token, and every token attends to them.
- **Dilated**: attend at stride k. Long-range reach without full O(n²).
- **Random**: each token attends to k random tokens. A component of BigBird.
- **Block sparse**: block-diagonal plus selected blocks. FlashAttention-friendly.
- **Learned/adaptive**: a routing network decides where to be sparse (NSA, MoBA 2025).

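
The dilated and block-diagonal patterns in the list above can be written as boolean masks, where True means "allow attention". This is a minimal sketch on CPU tensors; `dilated_mask` and `block_diagonal_mask` are hypothetical names, and a real kernel would avoid materializing the full n×n mask.

```python
import torch


def dilated_mask(n: int, w: int, stride: int) -> torch.Tensor:
    """True where |i - j| is a multiple of `stride` and at most w * stride."""
    idx = torch.arange(n)
    diff = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return (diff % stride == 0) & (diff <= w * stride)


def block_diagonal_mask(n: int, block_size: int) -> torch.Tensor:
    """True where query and key fall in the same block."""
    block_id = torch.arange(n) // block_size
    return block_id.unsqueeze(0) == block_id.unsqueeze(1)
```

Each row of `dilated_mask` has at most 2w + 1 allowed keys but spans a distance of w · stride, which is the "long range without full O(n²)" trade-off.
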
### Historical landmarks
- **Sparse Transformer** (OpenAI 2019): factorized attention.
- **Longformer** (AllenAI 2020): SWA + global. 4k tokens (16k with the LED variant).
- **BigBird** (Google 2020): random + window + global. Theoretically approximates full attention.
- **LongNet** (Microsoft 2023): dilated attention, with a claim of scaling to 1B tokens.
- **NSA** (DeepSeek 2025): natively sparse attention used during pretraining.
- **MoBA** (Moonshot 2025): mixture-of-block-attention, hierarchical sparsity.

### Applications
1. Long-document QA / summarization.
2. Codebase-wide LLM analysis (Claude 1M context).
3. Genomics / DNA Transformer.
4. Video Transformer (frames as tokens).

## 💻 Patterns

### Sliding window mask
```python
import torch


def sliding_window_mask(n: int, w: int, device='cuda') -> torch.Tensor:
    """Boolean mask: True = allow attention."""
    idx = torch.arange(n, device=device)
    # diff[i, j] = j - i (key index minus query index)
    diff = idx.unsqueeze(0) - idx.unsqueeze(1)
    return diff.abs() <= w
```

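
As a usage sketch, a boolean mask like this can be passed to PyTorch's `torch.nn.functional.scaled_dot_product_attention`, where True marks positions that take part in attention. The shapes below are illustrative assumptions, and the mask helper is repeated with a CPU default so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F


def sliding_window_mask(n: int, w: int, device="cpu") -> torch.Tensor:
    """Boolean mask: True = allow attention."""
    idx = torch.arange(n, device=device)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= w


torch.manual_seed(0)
b, h, n, d = 2, 4, 16, 8  # illustrative shapes
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))

mask = sliding_window_mask(n, w=2)  # (n, n), broadcast over batch and heads
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

This checks correctness of the pattern only. Passing a dense mask gives no guarantee of skipping masked pairs, so it should not be read as a speedup.
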
### Longformer-style (SWA + global)
```python
def longformer_mask(n: int, w: int, global_idx: list[int], device='cuda') -> torch.Tensor:
    """Sliding window plus global tokens. Reuses sliding_window_mask from above."""
    mask = sliding_window_mask(n, w, device)
    g = torch.zeros(n, dtype=torch.bool, device=device)
    g[global_idx] = True
    # g.unsqueeze(0): every query attends to the global keys (columns)
    # g.unsqueeze(1): global queries attend to every key (rows)
    mask = mask | g.unsqueeze(0) | g.unsqueeze(1)
    return mask
```

### BigBird random component
```python
def random_attention_mask(n: int, k: int, device='cuda') -> torch.Tensor:
    """Each query attends to k distinct random keys (possibly including itself)."""
    mask = torch.zeros(n, n, dtype=torch.bool, device=device)
    for i in range(n):
        # First k entries of a random permutation: k distinct key indices
        idx = torch.randperm(n, device=device)[:k]
        mask[i, idx] = True
    return mask
```

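
The three BigBird components (window, global, random) are used together as a union. The sketch below combines them at token level on CPU; `bigbird_mask` and its `seed` parameter are hypothetical, and the published BigBird implementation works on blocks of tokens, so this is a simplification.

```python
import torch


def bigbird_mask(n: int, w: int, k: int, global_idx: list[int],
                 seed: int = 0) -> torch.Tensor:
    """Union of window, global, and random components (True = attend)."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)

    # Window: tokens within distance w.
    window = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= w

    # Global: chosen tokens see everything and are seen by everything.
    g = torch.zeros(n, dtype=torch.bool)
    g[global_idx] = True
    glob = g.unsqueeze(0) | g.unsqueeze(1)

    # Random: k distinct keys per query, drawn via top-k of random scores.
    rand_idx = torch.rand(n, n, generator=gen).topk(k, dim=-1).indices
    rand = torch.zeros(n, n, dtype=torch.bool)
    rand[idx.unsqueeze(1), rand_idx] = True

    return window | glob | rand


mask = bigbird_mask(n=32, w=2, k=3, global_idx=[0])
print(f"density: {mask.float().mean().item():.2%}")
```

Drawing the random keys with `topk` over random scores avoids the per-row Python loop while still giving k distinct keys per query.
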
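### FlashAttention-2/3 with custom mask
```python
import torch
from flash_attn import flash_attn_func

# q, k, v: (b, s, h, d) in fp16/bf16 on a CUDA device (placeholder shapes)
q, k, v = (torch.randn(1, 8192, 8, 64, dtype=torch.bfloat16, device='cuda') for _ in range(3))
# flash_attn_func takes no arbitrary mask, only causal and window_size.
# For custom block-sparse masks, use a mask_mod with FlexAttention (PyTorch 2.5+).
# window_size=(left, right): Mistral-style SWA, 512 tokens back and 0 ahead
out = flash_attn_func(q, k, v, causal=True, window_size=(512, 0))
```
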
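### FlexAttention (PyTorch 2.5+)
```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def sliding_window(b, h, q_idx, kv_idx):
    # mask_mod: return True where the query may attend to the key
    return (q_idx - kv_idx).abs() <= 512

# B=None, H=None: the same mask is broadcast over batch and heads
block_mask = create_block_mask(sliding_window, B=None, H=None, Q_LEN=8192, KV_LEN=8192)
# q, k, v: (b, h, s, d) on the same device as block_mask (placeholder shapes)
q, k, v = (torch.randn(1, 8, 8192, 64, dtype=torch.bfloat16, device='cuda') for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
```
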
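
A `mask_mod` is a plain function of index tensors, so it can be checked densely on small sizes before it is handed to `create_block_mask`. The sketch below does that for a causal sliding window; `causal_sliding_window` and `materialize` are hypothetical helpers, and nothing here calls FlexAttention itself.

```python
import torch


def causal_sliding_window(b, h, q_idx, kv_idx, window: int = 512):
    """mask_mod-style predicate: key at or before the query, within `window`."""
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= window)


def materialize(mask_mod, n: int, **kwargs) -> torch.Tensor:
    """Evaluate a mask_mod on every (q_idx, kv_idx) pair for inspection."""
    q_idx = torch.arange(n).unsqueeze(1)   # (n, 1)
    kv_idx = torch.arange(n).unsqueeze(0)  # (1, n)
    return mask_mod(0, 0, q_idx, kv_idx, **kwargs)


dense = materialize(causal_sliding_window, 6, window=2)
print(dense.int())
```

To use a different window with `create_block_mask`, wrap the predicate in a closure that takes only `(b, h, q_idx, kv_idx)`.
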
### Block-sparse (DeepSeek NSA pseudo)
```python
def block_sparse_attn(q, k, v, block_size=64, top_k_blocks=8):
    # q, k, v: (b, s, h, d); s must be divisible by block_size
    # 1. Compute block-level importance via mean-pooled K
    n_blocks = k.shape[1] // block_size
    k_blocks = k.view(*k.shape[:1], n_blocks, block_size, *k.shape[2:]).mean(dim=2)
    scores = torch.einsum('bnhd,bmhd->bnmh', q, k_blocks)  # (b, s, n_blocks, h)
    # 2. Select top-k blocks per query
    _, top_idx = scores.topk(top_k_blocks, dim=2)
    # 3. Gather + dense attn within
    # _gather_and_attend is a placeholder (not defined here): it should gather
    # the selected K/V blocks and run dense attention over them.
    return _gather_and_attend(q, k, v, top_idx, block_size)
```

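
One way to fill in the gather step is sketched below as a self-contained reference version. It is an assumption-laden simplification, not DeepSeek's NSA: it is non-causal, covers only the block-selection idea, requires the sequence length to be divisible by `block_size`, and materializes the gathered keys and values, so it is useful for checking logic on small inputs rather than for speed.

```python
import torch


def block_sparse_attn(q, k, v, block_size=64, top_k_blocks=8):
    """q, k, v: (b, s, h, d) with s divisible by block_size. Non-causal."""
    b, s, h, d = k.shape
    n_blocks = s // block_size
    top_k_blocks = min(top_k_blocks, n_blocks)

    # 1. Block-level importance: each query against mean-pooled keys.
    kb = k.reshape(b, n_blocks, block_size, h, d)
    vb = v.reshape(b, n_blocks, block_size, h, d)
    scores = torch.einsum("bnhd,bmhd->bhnm", q, kb.mean(dim=2))  # (b, h, s, n_blocks)

    # 2. Top-k key blocks per query and head.
    top_idx = scores.topk(top_k_blocks, dim=-1).indices  # (b, h, s, t)

    # 3. Gather the selected blocks, then run dense attention inside them.
    kb = kb.permute(0, 3, 1, 2, 4)  # (b, h, n_blocks, block_size, d)
    vb = vb.permute(0, 3, 1, 2, 4)
    bi = torch.arange(b, device=q.device).view(b, 1, 1, 1)
    hi = torch.arange(h, device=q.device).view(1, h, 1, 1)
    k_sel = kb[bi, hi, top_idx].flatten(3, 4)  # (b, h, s, t * block_size, d)
    v_sel = vb[bi, hi, top_idx].flatten(3, 4)

    qh = q.permute(0, 2, 1, 3)  # (b, h, s, d)
    logits = torch.einsum("bhsd,bhsjd->bhsj", qh, k_sel) / d ** 0.5
    out = torch.einsum("bhsj,bhsjd->bhsd", logits.softmax(dim=-1), v_sel)
    return out.permute(0, 2, 1, 3)  # back to (b, s, h, d)
```

A convenient sanity check is that selecting every block (`top_k_blocks` equal to the number of blocks) reproduces dense attention, since softmax is invariant to the order of the keys.
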
### Mistral SWA (HuggingFace)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    sliding_window=4096,  # passed through as a config override
)
```

### Adaptive top-k token (Native Sparse)
```python
def topk_attention(q, k, v, top_k=128):
    # q, k, v: (b, h, s, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1]**0.5
    # Keep top_k per query (clamped so short sequences do not raise)
    top_k = min(top_k, scores.shape[-1])
    top_v, top_i = scores.topk(top_k, dim=-1)
    sparse = torch.full_like(scores, float('-inf'))
    sparse.scatter_(-1, top_i, top_v)
    attn = sparse.softmax(dim=-1)
    # Note: the full (s, s) score matrix is still materialized here
    return attn @ v
```

## Decision criteria
| Situation | Approach |
|---|---|
| 4-32k context, mostly local | Sliding window (Mistral SWA) |
| Long-doc QA with key tokens | Longformer (SWA + global) |
| 100k+ context, hardware-friendly | Block-sparse + FlashAttention |
| Native long-context pretraining | NSA / MoBA (2025+) |
| Inference-only swap | Top-k token sparsification |

**Default**: for inference, SWA + FlashAttention; for pretraining, native sparse (NSA-like).

## 🔗 Graph
- Parents: [[Attention Mechanism]] · [[Transformer]]
- Variants: [[Flash Attention]] · [[Mamba]]
- Adjacent: [[KV-Cache]]

## 🤖 LLM usage
**When**: rationale for choosing an attention pattern, drafting mask code, distilling papers (NSA/MoBA).
**When not**: writing production kernels (use FA-3 / FlexAttention), measuring performance (run a real benchmark).

## ❌ Anti-patterns
- **Naïve mask + softmax**: still O(n²) memory. Masking with -inf does not reduce compute.
- **Random sparsity only**: quality drops catastrophically. A hybrid (window + global) is needed.
- **Fixed window for all heads**: each head has different needs. Per-head adaptive windows work better.
- **No global token**: in long-doc QA, the [CLS]/question token loses access to the full text.
- **Window too small**: perplexity degrades relative to the full-attention baseline.

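
The first anti-pattern is easy to quantify: a naïve mask still needs the full score tensor. The sketch below estimates its size; the function name and the choice of 32 heads with 2-byte elements are illustrative assumptions.

```python
def score_matrix_bytes(n: int, heads: int, batch: int = 1,
                       bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize a dense (batch, heads, n, n) score tensor."""
    return batch * heads * n * n * bytes_per_elem


if __name__ == "__main__":
    for n in (4_096, 32_768, 131_072):
        gib = score_matrix_bytes(n, heads=32) / 2**30
        print(f"n={n:>7,}: {gib:,.1f} GiB")
```

Under these assumptions the score tensor takes 1 GiB at 4k tokens and 64 GiB at 32k tokens per layer, which is why fused kernels that never materialize it matter more than the mask itself.
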
## 🧪 Verification / duplicates
- Verified (Longformer arXiv:2004.05150; BigBird arXiv:2007.14062; FlashAttention-3 2024; DeepSeek NSA 2025; PyTorch FlexAttention).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (SWA/BigBird/NSA + FlashAttention/FlexAttention patterns) |