"매 O(n²) attention 의 O(n·k) 의 reduce — token pair 의 subset 만 compute". Sparse attention 매 long-context Transformer 의 enabler, sliding-window + global tokens (Longformer) 매 base, BigBird/LongNet 매 random + dilated. 매 2026 매 native sparse (DeepSeek NSA, MoBA) + SSM hybrid (Mamba2) + FlashAttention-3 sparse mask 매 production.
Core
Sparsity patterns
Sliding window: each token attends to the ±w tokens around it. Local context. (Longformer, Mistral SWA).
Global tokens: [CLS] plus selected tokens attend to, and are attended by, every token.
Dilated: attend at stride k. Long-range reach without full O(n²).
Random: each token attends to k random tokens. A component of BigBird.
Block sparse: block-diagonal plus selected blocks. FlashAttention-friendly.
Learned/adaptive: a routing network decides where to be sparse (NSA, MoBA 2025).
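The sliding-window and dilated patterns above can be sketched as dense boolean masks (True = may attend). This is a minimal sketch for reasoning about the patterns, not an efficient kernel. The `sliding_window_mask` signature is assumed from how the Longformer snippet calls it; `dilated_mask` and its `stride` parameter are hypothetical names introduced here.

```python
import torch


def sliding_window_mask(n: int, w: int, device: str = "cpu") -> torch.Tensor:
    """(n, n) bool mask that is True where |i - j| <= w."""
    idx = torch.arange(n, device=device)
    return (idx[:, None] - idx[None, :]).abs() <= w


def dilated_mask(n: int, w: int, stride: int, device: str = "cpu") -> torch.Tensor:
    """True where i - j is a multiple of `stride` and at most w steps away."""
    idx = torch.arange(n, device=device)
    offset = idx[:, None] - idx[None, :]
    return (offset % stride == 0) & (offset.abs() <= w * stride)


# Same number of keys per query, but the dilated pattern spans a wider range.
local = sliding_window_mask(8, 1)
dilated = dilated_mask(8, 1, stride=3)
print(local.int())
print(dilated.int())
```

Both masks keep at most 2w + 1 keys per query; dilation trades local detail for reach, which is why the two are usually combined across heads or layers.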
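```python
import torch

# Relies on a sliding_window_mask(n, w, device) helper that returns an
# (n, n) bool mask, True where a query may attend within the ±w band.
def longformer_mask(n: int, w: int, global_idx: list[int], device="cuda"):
    mask = sliding_window_mask(n, w, device)
    g = torch.zeros(n, dtype=torch.bool, device=device)
    g[global_idx] = True
    # g.unsqueeze(0): every token attends to the global columns;
    # g.unsqueeze(1): global rows attend to every token.
    mask = mask | g.unsqueeze(0) | g.unsqueeze(1)
    return mask
```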
BigBird random component
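```python
import torch

def random_attention_mask(n: int, k: int, device="cuda") -> torch.Tensor:
    """Each token attends to k random others."""
    mask = torch.zeros(n, n, dtype=torch.bool, device=device)
    for i in range(n):
        # k distinct key indices for query i (the draw may include i itself)
        idx = torch.randperm(n, device=device)[:k]
        mask[i, idx] = True
    return mask
```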
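The random component is only one part of the BigBird pattern, so it helps to see it combined with the window and global parts. The following is a token-level sketch under stated assumptions: the BigBird implementation operates on blocks, whereas this version works per token for readability. The function name `bigbird_style_mask` and the `seed` parameter are hypothetical, and the mask is built on the CPU.

```python
import torch


def bigbird_style_mask(n: int, w: int, k: int, global_idx: list, seed: int = 0) -> torch.Tensor:
    """Union of window (±w), k random keys per query, and global tokens."""
    gen = torch.Generator().manual_seed(seed)  # reproducible random links
    idx = torch.arange(n)

    window = (idx[:, None] - idx[None, :]).abs() <= w

    # Top-k over i.i.d. uniform noise gives k distinct random keys per row.
    rand_idx = torch.rand(n, n, generator=gen).topk(k, dim=1).indices
    random = torch.zeros(n, n, dtype=torch.bool)
    random[idx[:, None], rand_idx] = True

    is_global = torch.zeros(n, dtype=torch.bool)
    is_global[global_idx] = True

    return window | random | is_global[None, :] | is_global[:, None]


mask = bigbird_style_mask(n=64, w=2, k=3, global_idx=[0])
print(f"density: {mask.float().mean().item():.3f}")
```

Drawing the random keys with `topk` over noise avoids the per-row Python loop, at the cost of a temporary n × n float tensor, which is acceptable for drafting masks but not for long sequences.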
When to use: reasoning about which attention pattern to choose, drafting mask code, distilling papers (NSA/MoBA).
When not to use: writing production kernels (use FA-3 / FlexAttention), measuring performance (run a real benchmark).
❌ Anti-patterns
Naïve mask + softmax: memory is still O(n²). Masking with -inf does not reduce compute.
Random sparsity only: the quality drop is catastrophic. A hybrid (window + global) is needed.
Fixed window for all heads: different heads have different needs. Per-head adaptive windows work better.
No global token: in long-document QA, the [CLS]/question token loses access to the full text.
Window too small: perplexity degrades relative to the baseline.
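The naïve mask + softmax anti-pattern can be illustrated by comparing it with a banded computation that never builds the n × n score matrix. This is a single-head sketch, assuming a symmetric ±w window and inputs of shape (n, d). The function names are hypothetical, and a real kernel (FA-3, FlexAttention) would also avoid the window copies made here.

```python
import math

import torch
import torch.nn.functional as F


def dense_masked_attention(q, k, v, w):
    """Anti-pattern: builds the full (n, n) score matrix, then masks it."""
    n, d = q.shape
    scores = q @ k.T / math.sqrt(d)  # (n, n): quadratic memory and compute
    idx = torch.arange(n, device=q.device)
    band = (idx[:, None] - idx[None, :]).abs() <= w
    return scores.masked_fill(~band, float("-inf")).softmax(dim=-1) @ v


def banded_attention(q, k, v, w):
    """Computes only the (n, 2w+1) scores inside the window."""
    n, d = q.shape
    size = 2 * w + 1
    # Pad w zero rows on each side so every query has a full window of keys.
    k_win = F.pad(k, (0, 0, w, w)).unfold(0, size, 1)  # (n, d, 2w+1)
    v_win = F.pad(v, (0, 0, w, w)).unfold(0, size, 1)  # (n, d, 2w+1)
    scores = torch.einsum("nd,ndw->nw", q, k_win) / math.sqrt(d)
    # Slot s of query i maps to key position i + s - w; mask the padding slots.
    pos = (torch.arange(n, device=q.device)[:, None]
           + torch.arange(size, device=q.device)[None, :] - w)
    scores = scores.masked_fill((pos < 0) | (pos >= n), float("-inf"))
    return torch.einsum("nw,ndw->nd", scores.softmax(dim=-1), v_win)


torch.manual_seed(0)
q, k, v = torch.randn(3, 128, 16).unbind(0)
diff = (dense_masked_attention(q, k, v, 4) - banded_attention(q, k, v, 4)).abs().max()
print(f"max abs difference: {diff.item():.2e}")
```

Both functions should agree up to floating-point error; the difference is that the banded version's score tensor grows as n·(2w+1) rather than n².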