"매 attention 의 sequence 의 N 의 device 의 ring 의 split — context length scales linearly with devices.". Liu, Zaharia, Abbeel 2023 ("Ring Attention with Blockwise Transformers") 의 propose, 매 1M+ context window (Gemini 1.5 Pro, Claude Opus 4.7 1M) 의 training-time enabler 의, 매 communication overlap with compute 의 near-zero overhead.
Core idea
The sequence is split across N devices (each device holds 1/N of the tokens of Q, K, V).
Each device computes attention with its local Q against rotating K, V blocks.
K, V blocks travel around the ring for N steps; communication overlaps with attention compute.
Result: exact attention over the full sequence, with N devices fitting an N-times-longer context.
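The four points above can be checked in a single process. The following is a minimal sketch that simulates the devices with array shards and confirms the ring result matches ordinary full attention. NumPy stands in for torch.distributed so it runs without multiple processes; that swap, the non-causal setting, and all function names are assumptions of the sketch.

```python
import numpy as np


def attention_block(q, k, v):
    """Softmax attention of q against one K/V block; returns (out, lse) per query row."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    denom = p.sum(axis=-1, keepdims=True)
    return (p @ v) / denom, (m + np.log(denom)).squeeze(-1)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two normalized partial results using their log-sum-exp weights."""
    lse = np.logaddexp(lse_a, lse_b)
    w_a = np.exp(lse_a - lse)[:, None]
    w_b = np.exp(lse_b - lse)[:, None]
    return w_a * out_a + w_b * out_b, lse


def simulated_ring_attention(q, k, v, n_dev):
    """Non-causal ring attention with n_dev simulated devices in one process."""
    qs, ks, vs = (np.array_split(x, n_dev) for x in (q, k, v))
    held = list(range(n_dev))  # held[r] = index of the K/V block device r holds now
    outs = [None] * n_dev
    lses = [None] * n_dev
    for step in range(n_dev):
        for r in range(n_dev):
            o, l = attention_block(qs[r], ks[held[r]], vs[held[r]])
            if step == 0:
                outs[r], lses[r] = o, l
            else:
                outs[r], lses[r] = merge(outs[r], lses[r], o, l)
        # Rotate: every device passes its block to rank r+1, so r receives from r-1.
        held = [held[(r - 1) % n_dev] for r in range(n_dev)]
    return np.concatenate(outs)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
reference, _ = attention_block(q, k, v)  # ordinary full attention
ring = simulated_ring_attention(q, k, v, n_dev=4)
```

Each simulated device only ever touches a 4-token query shard and one 4-token K/V block at a time, which is where the per-device memory saving comes from.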
Comparison with alternatives
Flash Attention: single device, IO-aware, memory-efficient. Ring composes on top.
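Sequence Parallel (Megatron): a similar split along the sequence dimension, but applied only to the layernorm and dropout regions; attention itself is not split along the sequence.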
Context Parallel (Megatron 2024): industrial Ring Attention variant.
Striped Attention (2023): improved load balance for causal masks.
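To make the load-balance point concrete, the sketch below counts the unmasked (query, key) pairs each device handles at every ring step, once with contiguous shards and once with a striped (round-robin) token layout. The helper names, the 32-token sequence, and the 4-device ring are hypothetical choices for illustration.

```python
def causal_work(q_tokens, kv_tokens):
    """Number of (query, key) pairs allowed by a causal mask (key position <= query position)."""
    return sum(1 for qi in q_tokens for kj in kv_tokens if kj <= qi)


def per_step_work(partition, n_dev):
    """work[step][rank] = unmasked pairs that rank computes at that ring step.

    partition[r] lists the token positions owned by device r; at step s, rank r
    is assumed to hold the K/V shard that started on rank (r - s) % n_dev.
    """
    return [
        [causal_work(partition[r], partition[(r - s) % n_dev]) for r in range(n_dev)]
        for s in range(n_dev)
    ]


def contiguous(seq_len, n_dev):
    """Device r owns one contiguous chunk of tokens."""
    size = seq_len // n_dev
    return [list(range(r * size, (r + 1) * size)) for r in range(n_dev)]


def striped(seq_len, n_dev):
    """Device r owns tokens r, r + n_dev, r + 2 * n_dev, ..."""
    return [list(range(r, seq_len, n_dev)) for r in range(n_dev)]


def worst_step_cost(work):
    """Each ring step takes as long as its busiest device; sum that over steps."""
    return sum(max(row) for row in work)


N_DEV, SEQ_LEN = 4, 32
contig_work = per_step_work(contiguous(SEQ_LEN, N_DEV), N_DEV)
striped_work = per_step_work(striped(SEQ_LEN, N_DEV), N_DEV)
```

Both layouts do the same total work. With contiguous shards some devices face a fully masked block while others face a fully unmasked one, so every step is gated by the busiest device. With striping every device sees a roughly half-masked block at every step.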
Applications
Training LLMs with 1M+ token context (the scale of Gemini 1.5/2.0 and Claude Opus 4.x).
Long video understanding.
Whole-codebase code models.
Long DNA sequence models (Evo).
💻 Patterns
Conceptual Ring Loop (single block)
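import torch
import torch.distributed as dist


def ring_attention_step(q_local, kv_local, world_size):
    """Simplified single-pass illustration (non-causal).

    blockwise_attention and ring_send_recv are helpers assumed to be defined
    elsewhere; online_softmax_merge is the function from the next pattern.
    """
    out = torch.zeros_like(q_local)
    lse = torch.full(q_local.shape[:-1], -float("inf"), device=q_local.device)
    k, v = kv_local
    rank = dist.get_rank()
    # Ring neighbors: send to the next rank, receive from the previous one.
    send_rank = (rank + 1) % world_size
    recv_rank = (rank - 1) % world_size
    for step in range(world_size):
        # Partial attention of the local queries against the K/V block held now.
        partial_out, partial_lse = blockwise_attention(q_local, k, v)
        out, lse = online_softmax_merge(out, lse, partial_out, partial_lse)
        # Rotate K, V to the next neighbor. A real implementation would issue
        # this asynchronously so that it overlaps with the next block's compute.
        if step < world_size - 1:
            k, v = ring_send_recv(k, v, send_rank, recv_rank)
    return out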
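It can help to see which K/V block each rank holds at each step of the loop. The sketch below tabulates that schedule in plain Python. It assumes every rank sends to rank + 1 and receives from rank - 1; `ring_schedule` is a hypothetical helper name.

```python
def ring_schedule(world_size):
    """schedule[step][rank] = origin rank of the K/V block that `rank` holds at `step`.

    Assumes every rank sends to (rank + 1) % world_size and receives from
    (rank - 1) % world_size after each compute step.
    """
    held = list(range(world_size))  # step 0: every rank holds its own block
    schedule = []
    for _ in range(world_size):
        schedule.append(held)
        # After the exchange, rank r holds what rank r - 1 held before it.
        held = [held[(r - 1) % world_size] for r in range(world_size)]
    return schedule


schedule = ring_schedule(4)
for step, held in enumerate(schedule):
    print(f"step {step}: " + "  ".join(f"rank{r}<-kv{b}" for r, b in enumerate(held)))
```

After world_size steps every rank has seen every block exactly once. At any single step the blocks in flight form a permutation, so no block is duplicated or dropped.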
Online Softmax Merge
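import torch


def online_softmax_merge(out_a, lse_a, out_b, lse_b):
    """Numerically stable merge of two partial attention results.

    out_* are already-normalized partial outputs; lse_* are their log-sum-exp values.
    """
    # Note: if both lse values are -inf (a fully masked row), m is -inf and the
    # result is NaN; callers need to handle that case separately.
    m = torch.maximum(lse_a, lse_b)
    c_a = torch.exp(lse_a - m).unsqueeze(-1)
    c_b = torch.exp(lse_b - m).unsqueeze(-1)
    out = (c_a * out_a + c_b * out_b) / (c_a + c_b)
    new_lse = m + torch.log(torch.exp(lse_a - m) + torch.exp(lse_b - m))
    return out, new_lse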
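The merge formula follows from writing both partial results in terms of the same unnormalized sums. The notation below is introduced only for this derivation and is not from the source: A and B are the two sets of key positions, s_j is the score of key j for a fixed query, and v_j is its value vector.

```latex
\begin{aligned}
\mathrm{lse}_A &= \log \sum_{j \in A} e^{s_j},
\qquad
\mathrm{out}_A = \sum_{j \in A} e^{s_j - \mathrm{lse}_A}\, v_j
\;\;\Longrightarrow\;\;
\sum_{j \in A} e^{s_j} v_j = e^{\mathrm{lse}_A}\, \mathrm{out}_A, \\[4pt]
\mathrm{out}_{A \cup B}
  &= \frac{\sum_{j \in A} e^{s_j} v_j + \sum_{j \in B} e^{s_j} v_j}
          {e^{\mathrm{lse}_A} + e^{\mathrm{lse}_B}}
   = \frac{e^{\mathrm{lse}_A}\, \mathrm{out}_A + e^{\mathrm{lse}_B}\, \mathrm{out}_B}
          {e^{\mathrm{lse}_A} + e^{\mathrm{lse}_B}}, \\[4pt]
  &= \frac{c_A\, \mathrm{out}_A + c_B\, \mathrm{out}_B}{c_A + c_B},
\qquad
c_A = e^{\mathrm{lse}_A - m},\;
c_B = e^{\mathrm{lse}_B - m},\;
m = \max(\mathrm{lse}_A, \mathrm{lse}_B), \\[4pt]
\mathrm{lse}_{A \cup B}
  &= \log\!\left(e^{\mathrm{lse}_A} + e^{\mathrm{lse}_B}\right)
   = m + \log\left(c_A + c_B\right).
\end{aligned}
```

Dividing numerator and denominator by e^m changes nothing mathematically, but it keeps every exponent at or below zero, which is what makes the merge safe in low precision.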
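Striped Block Order
def striped_block_order(seq_len, world_size, block_size):
    """Load balancing for causal masks: interleave blocks across devices with stride world_size.

    Returns block indices grouped by device: device r gets blocks r, r + world_size, ...
    Assumes seq_len is divisible by block_size and n_blocks by world_size.
    """
    n_blocks = seq_len // block_size
    return [
        (i * world_size + r) % n_blocks
        for r in range(world_size)
        for i in range(n_blocks // world_size)
    ]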
Causal Mask Skip Optimization
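def should_compute(q_block_idx, kv_block_idx, causal=True):
    """Causal: skip K/V blocks with kv_block_idx > q_block_idx (entirely in the future)."""
    # The diagonal block (kv_block_idx == q_block_idx) still needs a mask inside the block.
    return (not causal) or kv_block_idx <= q_block_idx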
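As a rough illustration of what this check buys in a ring, the sketch below applies the same rule to every (step, rank) pair. It assumes a contiguous layout in which rank r owns query block r and blocks move from rank r to rank r + 1; `skip_table` is a hypothetical helper.

```python
def should_compute(q_block_idx, kv_block_idx, causal=True):
    """Same rule as above: under a causal mask, skip blocks that lie entirely in the future."""
    return (not causal) or kv_block_idx <= q_block_idx


def skip_table(world_size):
    """table[step][rank] is True when `rank` has real work at `step`.

    Assumes rank r owns query block r and, at step s, holds the K/V block
    that started on rank (r - s) % world_size.
    """
    return [
        [should_compute(r, (r - s) % world_size) for r in range(world_size)]
        for s in range(world_size)
    ]


table = skip_table(4)
computed = sum(sum(row) for row in table)
skipped = 4 * 4 - computed
idle_by_step = [row.count(False) for row in table]
```

With four ranks, 6 of the 16 block computations are skipped. The idle ranks still wait for the busiest rank at each step, though, so skipping alone saves FLOPs more than wall-clock time; that gap is what striped layouts aim to close.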