--- id: wiki-2026-0508-sparse-attention title: Sparse Attention category: 10_Wiki/Topics status: verified canonical_id: self aliases: [Sparse Self-Attention, Local Attention, Efficient Attention] duplicate_of: none source_trust_level: A confidence_score: 0.9 verification_status: applied tags: [transformer, attention, long-context, efficient-llm, sparse] raw_sources: [] last_reinforced: 2026-05-10 github_commit: pending tech_stack: language: Python framework: PyTorch/FlashAttention --- # Sparse Attention ## 매 한 줄 > **"매 O(n²) attention 의 O(n·k) 의 reduce — token pair 의 subset 만 compute"**. Sparse attention 매 long-context Transformer 의 enabler, sliding-window + global tokens (Longformer) 매 base, BigBird/LongNet 매 random + dilated. 매 2026 매 native sparse (DeepSeek NSA, MoBA) + SSM hybrid (Mamba2) + FlashAttention-3 sparse mask 매 production. ## 매 핵심 ### 매 sparsity pattern - **Sliding window**: 매 ±w tokens. Local context. (Longformer, Mistral SWA). - **Global tokens**: 매 [CLS] + 특정 token 매 모든 token 의 attend. - **Dilated**: 매 stride k. Long-range w/o full O(n²). - **Random**: 매 random k tokens. BigBird 의 component. - **Block sparse**: 매 block-diagonal + selected blocks. FlashAttention 친화. - **Learned/adaptive**: 매 routing network 의 어디 sparse 의 decide (NSA, MoBA 2025). ### 매 historical landmarks - **Sparse Transformer** (OpenAI 2019): factorized attention. - **Longformer** (AllenAI 2020): SWA + global. 4k→16k+ tokens. - **BigBird** (Google 2020): random + window + global. Theoretically 의 full-attn approximate. - **LongNet** (Microsoft 2023): dilated → 1B token claim. - **NSA** (DeepSeek 2025): native sparse 매 pretraining. - **MoBA** (Moonshot 2025): mixture-of-block-attention, hierarchical sparsity. ### 매 응용 1. Long-document QA / summarization. 2. Code-base wide LLM analysis (Claude 1M context). 3. Genomics / DNA Transformer. 4. Video Transformer (frames as tokens). ## 💻 패턴 ### Sliding window mask ```python import torch def sliding_window_mask(n: int, w: int, device='cuda') -> torch.Tensor: """Boolean mask: True = allow attention.""" idx = torch.arange(n, device=device) diff = idx.unsqueeze(0) - idx.unsqueeze(1) return diff.abs() <= w ``` ### Longformer-style (SWA + global) ```python def longformer_mask(n: int, w: int, global_idx: list[int], device='cuda'): mask = sliding_window_mask(n, w, device) g = torch.zeros(n, dtype=torch.bool, device=device) g[global_idx] = True # global attends to all + all attend to global mask = mask | g.unsqueeze(0) | g.unsqueeze(1) return mask ``` ### BigBird random component ```python def random_attention_mask(n: int, k: int, device='cuda') -> torch.Tensor: """Each token attends to k random others.""" mask = torch.zeros(n, n, dtype=torch.bool, device=device) for i in range(n): idx = torch.randperm(n, device=device)[:k] mask[i, idx] = True return mask ``` ### FlashAttention-2/3 with custom mask ```python from flash_attn import flash_attn_func # (b, s, h, d) — fp16/bf16 # FA-3 supports block-sparse mask via mask_mod (PyTorch 2.5+ FlexAttention) out = flash_attn_func(q, k, v, causal=True, window_size=(512, 0)) # window_size=(left, right) — Mistral-style SWA ``` ### FlexAttention (PyTorch 2.5+) ```python from torch.nn.attention.flex_attention import flex_attention, create_block_mask def sliding_window(b, h, q_idx, kv_idx): return (q_idx - kv_idx).abs() <= 512 block_mask = create_block_mask(sliding_window, B=None, H=None, Q_LEN=8192, KV_LEN=8192) out = flex_attention(q, k, v, block_mask=block_mask) ``` ### Block-sparse (DeepSeek NSA pseudo) ```python def block_sparse_attn(q, k, v, block_size=64, top_k_blocks=8): # 1. Compute block-level importance via mean-pooled K n_blocks = k.shape[1] // block_size k_blocks = k.view(*k.shape[:1], n_blocks, block_size, *k.shape[2:]).mean(dim=2) scores = torch.einsum('bnhd,bmhd->bnmh', q, k_blocks) # 2. Select top-k blocks per query _, top_idx = scores.topk(top_k_blocks, dim=2) # 3. Gather + dense attn within return _gather_and_attend(q, k, v, top_idx, block_size) ``` ### Mistral SWA (HuggingFace) ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained( "mistralai/Mistral-7B-v0.3", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", sliding_window=4096, ) ``` ### Adaptive top-k token (Native Sparse) ```python def topk_attention(q, k, v, top_k=128): # (b, h, s, d) scores = q @ k.transpose(-2, -1) / q.shape[-1]**0.5 # Keep top_k per query top_v, top_i = scores.topk(top_k, dim=-1) sparse = torch.full_like(scores, float('-inf')) sparse.scatter_(-1, top_i, top_v) attn = sparse.softmax(dim=-1) return attn @ v ``` ## 매 결정 기준 | 상황 | Approach | |---|---| | 4-32k context, local-mostly | Sliding window (Mistral SWA) | | Long-doc QA w/ key-token | Longformer (SWA + global) | | 100k+ context, hardware-friendly | Block-sparse + FlashAttention | | Native long-context pretraining | NSA / MoBA (2025+) | | Inference-only swap | Top-k token sparsification | **기본값**: 매 inference 매 SWA + FlashAttention; 매 pretraining 매 native sparse (NSA-like). ## 🔗 Graph - 부모: [[Attention Mechanism]] · [[Transformer]] - 변형: [[Flash Attention]] · [[Mamba]] - Adjacent: [[KV-Cache]] ## 🤖 LLM 활용 **언제**: 매 attention pattern selection rationale, 매 mask code draft, 매 paper distillation (NSA/MoBA). **언제 X**: 매 production kernel write (use FA-3 / FlexAttention), 매 perf measurement (real benchmark). ## ❌ 안티패턴 - **Naïve mask + softmax**: 매 O(n²) memory still. 매 -inf masking 매 helps compute 의 X. - **Random sparsity only**: 매 quality drop 매 catastrophic. 매 hybrid (window + global) needed. - **Fixed window for all heads**: 매 head 마다 different need. 매 per-head adaptive 의 better. - **No global token**: 매 long-doc QA 매 [CLS]/question token 의 全文 access 의 lose. - **Window too small**: 매 perplexity 매 baseline 의 break. ## 🧪 검증 / 중복 - Verified (Longformer arXiv:2004.05150; BigBird arXiv:2007.14062; FlashAttention-3 2024; DeepSeek NSA 2025; PyTorch FlexAttention). - 신뢰도 A. ## 🕓 Changelog | 날짜 | 변경 | |---|---| | 2026-05-08 | Phase 1 | | 2026-05-10 | Manual cleanup — full content (SWA/BigBird/NSA + FlashAttention/FlexAttention patterns) |