---
id: wiki-2026-0508-sparse-attention
title: Sparse Attention
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Sparse Self-Attention, Local Attention, Efficient Attention]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [transformer, attention, long-context, efficient-llm, sparse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/FlashAttention
---

# Sparse Attention

## One-line summary
> **"Reduce O(n²) attention to O(n·k) by computing only a subset of token pairs."** Sparse attention is a key enabler of long-context Transformers. Sliding window plus global tokens (Longformer) is the baseline pattern; BigBird adds random attention and LongNet adds dilated attention. As of 2026, the production options are natively trained sparse attention (DeepSeek NSA, MoBA), SSM hybrids (Mamba2), and FlashAttention-3 with sparse masks.
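
To make the O(n²) versus O(n·k) claim concrete, the sketch below counts query-key pairs for a symmetric sliding window of half-width `w`. The function names are illustrative, and the closed form assumes `0 <= w < n`.

```python
def dense_pairs(n: int) -> int:
    """Number of query-key pairs scored by full attention."""
    return n * n


def window_pairs(n: int, w: int) -> int:
    """Pairs (i, j) with |i - j| <= w, assuming 0 <= w < n.

    Every row would hold 2w + 1 entries, minus the entries clipped at the
    two sequence boundaries: 2 * (1 + 2 + ... + w) = w * (w + 1).
    """
    return n * (2 * w + 1) - w * (w + 1)


def window_pairs_bruteforce(n: int, w: int) -> int:
    """Same count by direct enumeration; only practical for small n."""
    return sum(1 for i in range(n) for j in range(n) if abs(i - j) <= w)


if __name__ == "__main__":
    n, w = 131_072, 512
    ratio = window_pairs(n, w) / dense_pairs(n)
    print(f"dense: {dense_pairs(n):,} pairs")
    print(f"window (w={w}): {window_pairs(n, w):,} pairs ({ratio:.2%} of dense)")
```

For a fixed window the count grows linearly in `n`, so the saving over dense attention widens as the context gets longer.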

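## Core concepts

### Sparsity patterns
- **Sliding window**: each token attends to its ±w neighbors. Captures local context. (Longformer, Mistral SWA.)
- **Global tokens**: [CLS] and other designated tokens attend to every token, and every token attends to them.
- **Dilated**: the window skips positions with stride k. Reaches long range without the full O(n²) cost.
- **Random**: each token attends to k randomly chosen tokens. A component of BigBird.
- **Block sparse**: block-diagonal plus selected off-diagonal blocks. Maps well onto FlashAttention-style kernels.
- **Learned/adaptive**: a routing network decides where to be sparse (NSA, MoBA, 2025).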
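
The dilated and block-diagonal patterns from the list above can be visualized with a small dependency-free sketch. The helper names and the exact dilation rule (offsets that are multiples of `stride`, up to `w` steps each way) are assumptions chosen for illustration; papers differ in the details.

```python
def dilated_mask(n: int, w: int, stride: int) -> list[list[bool]]:
    """True where |i - j| is a multiple of `stride` and at most w * stride."""
    return [
        [abs(i - j) % stride == 0 and abs(i - j) <= w * stride for j in range(n)]
        for i in range(n)
    ]


def block_diagonal_mask(n: int, block: int) -> list[list[bool]]:
    """True where i and j fall in the same contiguous block of size `block`."""
    return [[i // block == j // block for j in range(n)] for i in range(n)]


def density(mask: list[list[bool]]) -> float:
    """Fraction of query-key pairs that are kept."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


if __name__ == "__main__":
    n = 16
    patterns = [
        ("dilated", dilated_mask(n, w=2, stride=4)),
        ("block", block_diagonal_mask(n, block=4)),
    ]
    for name, m in patterns:
        print(f"{name}: density={density(m):.3f}")
        for row in m:
            print("".join("#" if x else "." for x in row))
```

Printing the masks as `#`/`.` grids makes it easy to see why block patterns suit tiled kernels: whole tiles are either computed or skipped.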

### Historical landmarks
- **Sparse Transformer** (OpenAI 2019): factorized attention.
- **Longformer** (AllenAI 2020): SWA + global. 4k→16k+ tokens.
- **BigBird** (Google 2020): random + window + global. Shown in theory to approximate full attention.
- **LongNet** (Microsoft 2023): dilated attention; claims scaling to 1B tokens.
- **NSA** (DeepSeek 2025): native sparse attention, used from pretraining.
- **MoBA** (Moonshot 2025): mixture-of-block-attention, hierarchical sparsity.

### Applications
1. Long-document QA / summarization.
2. Code-base wide LLM analysis (Claude 1M context).
3. Genomics / DNA Transformer.
4. Video Transformer (frames as tokens).

## 💻 Patterns

### Sliding window mask
```python
import torch

def sliding_window_mask(n: int, w: int, device='cuda') -> torch.Tensor:
    """Boolean mask: True = allow attention."""
    idx = torch.arange(n, device=device)
    diff = idx.unsqueeze(0) - idx.unsqueeze(1)
    return diff.abs() <= w
```

### Longformer-style (SWA + global)
```python
def longformer_mask(n: int, w: int, global_idx: list[int], device='cuda'):
    mask = sliding_window_mask(n, w, device)
    g = torch.zeros(n, dtype=torch.bool, device=device)
    g[global_idx] = True
    # global attends to all + all attend to global
    mask = mask | g.unsqueeze(0) | g.unsqueeze(1)
    return mask
```

### BigBird random component
```python
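def random_attention_mask(n: int, k: int, device='cuda') -> torch.Tensor:
    """Each token attends to k random tokens (distinct per row, may include itself)."""
    # Top-k over i.i.d. uniform noise picks k distinct random columns per row
    # without a Python loop over the n rows.
    idx = torch.rand(n, n, device=device).topk(k, dim=-1).indices
    mask = torch.zeros(n, n, dtype=torch.bool, device=device)
    mask.scatter_(1, idx, True)
    return mask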
```
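
BigBird's pattern, as listed under the landmarks, is the union of window, global, and random components. The following is a token-level, dependency-free sketch of that union under stated assumptions: `bigbird_mask` and its parameters are illustrative names, the RNG is seeded only for reproducibility, and the published model applies the pattern to blocks of tokens rather than to single tokens.

```python
import random


def bigbird_mask(n: int, w: int, global_idx: list[int], k: int, seed: int = 0) -> list[list[bool]]:
    """Token-level union of window, global, and random attention (True = attend)."""
    rng = random.Random(seed)
    is_global = [False] * n
    for g in global_idx:
        is_global[g] = True
    mask = []
    for i in range(n):
        # k distinct random key positions for this query (may overlap the window)
        random_cols = set(rng.sample(range(n), min(k, n)))
        mask.append([
            abs(i - j) <= w          # sliding window
            or is_global[i]          # a global token attends to everything
            or is_global[j]          # everything attends to a global token
            or j in random_cols      # random component
            for j in range(n)
        ])
    return mask


if __name__ == "__main__":
    for row in bigbird_mask(n=16, w=1, global_idx=[0], k=2):
        print("".join("#" if x else "." for x in row))
```

Setting `k=0` and `global_idx=[]` reduces the mask to a plain sliding window, which is a convenient way to check each component in isolation.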

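### FlashAttention-2/3 with a sliding window
```python
import torch
from flash_attn import flash_attn_func  # window_size needs flash-attn >= 2.3

# Inputs are (batch, seqlen, heads, head_dim), fp16/bf16, on a CUDA device.
q, k, v = (torch.randn(1, 4096, 8, 64, device='cuda', dtype=torch.bfloat16) for _ in range(3))
# flash_attn_func takes no arbitrary mask: it supports causal and local windows only.
# For custom block-sparse patterns, use FlexAttention's mask_mod (PyTorch 2.5+).
# window_size=(left, right): Mistral-style SWA over the previous 512 tokens.
out = flash_attn_func(q, k, v, causal=True, window_size=(512, 0))
```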
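
As a reading aid for `window_size=(left, right)`, the sketch below builds the equivalent boolean mask in plain Python. It assumes the semantics described in the flash-attn docstring for equal query and key lengths: query `i` sees keys in `[i - left, i + right]`, and `-1` means unbounded on that side. `local_window_mask` is a hypothetical helper, not part of the library.

```python
def local_window_mask(n: int, left: int, right: int) -> list[list[bool]]:
    """Reference mask for a (left, right) local window; True = attend.

    Assumed semantics: query i attends to keys j with i - left <= j <= i + right,
    and a value of -1 on either side means no limit on that side.
    """
    def allowed(i: int, j: int) -> bool:
        left_ok = left < 0 or j >= i - left
        right_ok = right < 0 or j <= i + right
        return left_ok and right_ok

    return [[allowed(i, j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Causal sliding window, as in window_size=(2, 0): self plus the 2 previous tokens.
    for row in local_window_mask(6, left=2, right=0):
        print("".join("#" if x else "." for x in row))
```

With `right=0` the window is already causal, so each query sees itself and at most `left` earlier tokens.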

### FlexAttention (PyTorch 2.5+)
```python
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

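def sliding_window(b, h, q_idx, kv_idx):
    # mask_mod: return True where the query at q_idx may attend to the key at kv_idx
    return (q_idx - kv_idx).abs() <= 512

# B=None / H=None broadcast the same mask over batch and heads
block_mask = create_block_mask(sliding_window, B=None, H=None, Q_LEN=8192, KV_LEN=8192)
# q, k, v: (batch, heads, seq_len, head_dim); wrap flex_attention in torch.compile for the fused kernel
out = flex_attention(q, k, v, block_mask=block_mask)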
```

### Block-sparse (DeepSeek NSA pseudo)
```python
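def block_sparse_attn(q, k, v, block_size=64, top_k_blocks=8):
    """Simplified, non-causal reference. q, k, v: (b, s, h, d), s divisible by block_size."""
    b, s, h, d = k.shape
    n_blocks = s // block_size
    # 1. Compute block-level importance via mean-pooled K
    k_blocks = k.reshape(b, n_blocks, block_size, h, d).mean(dim=2)  # (b, m, h, d)
    scores = torch.einsum('bnhd,bmhd->bhnm', q, k_blocks)            # (b, h, s, m)
    # 2. Select top-k blocks per query
    top_idx = scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices
    # 3. Expand the selection to a token-level mask and attend densely.
    #    A real kernel gathers only the selected blocks; this mask version is still O(n²).
    keep = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, top_idx, True)
    mask = keep.repeat_interleave(block_size, dim=-1)                # (b, h, s, s)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))                 # (b, h, s, d)
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out.transpose(1, 2)                                       # back to (b, s, h, d)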
```

### Mistral SWA (HuggingFace)
```python
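import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    sliding_window=4096,  # overrides config.sliding_window; check the checkpoint's own value first
)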
```

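### Adaptive top-k token sparsification
```python
def topk_attention(q, k, v, top_k=128):
    # q, k, v: (b, h, s, d). Reference only: the full score matrix is still
    # computed here; native sparse methods (NSA) select blocks before scoring.
    scores = q @ k.transpose(-2, -1) / q.shape[-1]**0.5
    # Keep top_k per query (clamped so short sequences do not raise)
    top_k = min(top_k, scores.shape[-1])
    top_v, top_i = scores.topk(top_k, dim=-1)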
    sparse = torch.full_like(scores, float('-inf'))
    sparse.scatter_(-1, top_i, top_v)
    attn = sparse.softmax(dim=-1)
    return attn @ v
```

## Decision criteria
| Situation | Approach |
|---|---|
| 4-32k context, local-mostly | Sliding window (Mistral SWA) |
| Long-doc QA w/ key-token | Longformer (SWA + global) |
| 100k+ context, hardware-friendly | Block-sparse + FlashAttention |
| Native long-context pretraining | NSA / MoBA (2025+) |
| Inference-only swap | Top-k token sparsification |

**Default**: for inference, SWA + FlashAttention; for pretraining, native sparse attention (NSA-like).

## 🔗 Graph
- Parent: [[Attention Mechanism]] · [[Transformer]]
- Variants: [[Flash Attention]] · [[Mamba]]
- Adjacent: [[KV-Cache]]

## 🤖 LLM usage
**When to use**: rationale for choosing an attention pattern, drafting mask code, distilling papers (NSA/MoBA).
**When not to use**: writing production kernels (use FA-3 / FlexAttention instead), measuring performance (run a real benchmark).

## ❌ Anti-patterns
- **Naïve mask + softmax**: memory is still O(n²). Masking with -inf does not reduce compute.
- **Random sparsity only**: quality drops catastrophically. A hybrid with window + global attention is needed.
- **Fixed window for all heads**: different heads have different needs. Per-head adaptive windows work better.
- **No global token**: in long-document QA, the [CLS]/question token loses access to the full text.
- **Window too small**: perplexity degrades sharply relative to the dense baseline.
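
The first anti-pattern can be illustrated without any framework. In this pure-Python sketch (single head, no batching, illustrative function names), the masked dense version scores every pair before discarding most of them, while the banded version scores only the keys inside each window. Both return the same output; only the work differs.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _softmax_weighted_sum(scores, values):
    """Numerically stable softmax over `scores`, applied to rows of `values`."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(wt * row[c] for wt, row in zip(weights, values)) / z for c in range(dim)]


def masked_dense_attention(q, k, v, w):
    """Anti-pattern: score all n*n pairs, then mask out-of-window ones with -inf."""
    n, scale = len(q), 1.0 / math.sqrt(len(q[0]))
    out, n_scores = [], 0
    for i in range(n):
        scores = [_dot(q[i], k[j]) * scale for j in range(n)]
        n_scores += n
        scores = [s if abs(i - j) <= w else float("-inf") for j, s in enumerate(scores)]
        out.append(_softmax_weighted_sum(scores, v))
    return out, n_scores


def banded_attention(q, k, v, w):
    """Score only the keys inside each query's window; no n*n buffer is built."""
    n, scale = len(q), 1.0 / math.sqrt(len(q[0]))
    out, n_scores = [], 0
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = [_dot(q[i], k[j]) * scale for j in range(lo, hi)]
        n_scores += hi - lo
        out.append(_softmax_weighted_sum(scores, v[lo:hi]))
    return out, n_scores


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, w = 64, 8, 4
    q, k, v = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)] for _ in range(3))
    _, dense_cost = masked_dense_attention(q, k, v, w)
    _, band_cost = banded_attention(q, k, v, w)
    print(f"scores computed: dense={dense_cost}, banded={band_cost}")
```

Fused kernels such as FlashAttention and FlexAttention apply the same idea at tile granularity, skipping whole blocks that the mask rules out.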

## 🧪 Verification / duplicates
- Verified (Longformer arXiv:2004.05150; BigBird arXiv:2007.14062; FlashAttention-3 2024; DeepSeek NSA 2025; PyTorch FlexAttention).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (SWA/BigBird/NSA + FlashAttention/FlexAttention patterns) |