---
id: wiki-2026-0508-selective-state-space-models-mam
title: Selective State Space Models (Mamba)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Mamba, S6, Selective SSM, State Space Model]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [architecture, ssm, sequence-modeling, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: PyTorch / mamba-ssm
---

# Selective State Space Models (Mamba)

## In one line

> **"The hidden state is updated selectively, depending on the input."** Mamba (Gu & Dao, 2023) introduces the selective scan (S6), which removes the time-invariance limitation of S4. It gives linear-time sequence modeling with long-context efficiency that is competitive with Transformers. As of 2026, Mamba-2 and hybrid Transformer-Mamba models (Jamba, Zamba2) have entered production.

## Core ideas

### SSM basics

- Continuous: x'(t) = Ax(t) + Bu(t), y(t) = Cx(t).
- Discretized (zero-order hold, step size Δ): xₖ = Āxₖ₋₁ + B̄uₖ, yₖ = Cxₖ, with Ā = exp(ΔA) and B̄ = (ΔA)⁻¹(exp(ΔA) − I)·ΔB.
- S4: A is HiPPO-initialized and time-invariant, which allows an efficient FFT convolution.

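The zero-order-hold step above can be checked numerically on a scalar system. The sketch below is a minimal illustration, not Mamba's implementation: it assumes a scalar `a < 0`, and the names `zoh_discretize` and `run_recurrence` are made up for this note. For a constant input, the discrete recurrence should land exactly on the continuous solution at the sample times.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of the scalar SSM x' = a*x + b*u (a != 0)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b  # scalar form of (ΔA)^-1 (exp(ΔA) - I) ΔB
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, inputs):
    """Run x_k = a_bar * x_{k-1} + b_bar * u_k from x_0 = 0 and return all states."""
    x = 0.0
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states

# Illustrative values only.
a, b, delta, n_steps = -0.5, 1.0, 0.1, 50
a_bar, b_bar = zoh_discretize(a, b, delta)
states = run_recurrence(a_bar, b_bar, [1.0] * n_steps)

# Closed-form solution of x' = a*x + b at t = n_steps * delta, starting from 0.
exact = (math.exp(a * delta * n_steps) - 1.0) / a * b
```

For small Δ, `b_bar` is close to `delta * b`, which is the simplified form the toy selective scan in this note uses (`dB = delta * B`).
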
### Selective (S6)

- B, C, and Δ become functions of the input, so they change for every token.
- The FFT convolution no longer applies, so Mamba uses a hardware-aware parallel scan (kernel fusion, state kept in SRAM).
- Benefit: selective recall, copying, and induction become possible, which time-invariant S4 handles poorly.

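A scalar toy can make the role of an input-dependent Δ concrete. This is a sketch under stated assumptions, not the S6 layer: B and C are fixed to 1, the state is a single number, and `w_delta` and `bias_delta` are hypothetical parameters standing in for the learned projection that produces Δ. A large Δ lets the current input overwrite the state, while a small Δ keeps the old state and mostly ignores the input.

```python
import math

def softplus(z):
    """Smooth positive function; keeps the step size delta > 0."""
    return math.log1p(math.exp(z))

def selective_scalar_scan(inputs, a=-1.0, w_delta=4.0, bias_delta=-2.0):
    """Scalar selective recurrence where delta_t depends on the input u_t.

    Returns a list of (delta, a_bar, state) tuples, one per step.
    """
    x = 0.0
    trace = []
    for u in inputs:
        delta = softplus(w_delta * u + bias_delta)  # input-dependent step size
        a_bar = math.exp(delta * a)                 # fraction of old state kept, in (0, 1)
        x = a_bar * x + delta * u                   # simplified B_bar = delta * B, with B = 1
        trace.append((delta, a_bar, x))
    return trace

# One informative token followed by filler zeros.
trace = selective_scalar_scan([1.0, 0.0, 0.0])
```

With these values the first token gets a large step and a small `a_bar` (the state is rewritten), while the zero tokens get a small step and an `a_bar` near 0.88, so the stored value decays only slowly.
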
### vs Transformer

- Compute: O(L) vs O(L²) in sequence length L, a large advantage for long contexts.
- Memory: a constant-size state vs a KV cache that grows with L, so inference is very cheap.
- Quality: roughly on par at 7B scale; at 14B+ Transformers keep a slight edge, which makes hybrids the sweet spot.

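The memory bullet above can be turned into back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 32-layer model with `d_model=4096`, 32 KV heads of dimension 128, fp16 storage, and a per-layer Mamba state made of an SSM state (`d_inner * d_state`) plus a convolution buffer (`d_inner * d_conv`). These numbers are illustrative and do not describe any specific released model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values for every layer and every cached token (batch size 1)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

def mamba_state_bytes(n_layers, d_model, d_state=16, d_conv=4, expand=2, bytes_per_elem=2):
    """Recurrent SSM state plus conv buffer per layer; independent of sequence length."""
    d_inner = expand * d_model
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_elem

kv_32k = kv_cache_bytes(seq_len=32768, n_layers=32, n_kv_heads=32, head_dim=128)
kv_64k = kv_cache_bytes(seq_len=65536, n_layers=32, n_kv_heads=32, head_dim=128)
ssm = mamba_state_bytes(n_layers=32, d_model=4096)

print(f"KV cache @32k: {kv_32k / 2**30:.1f} GiB")
print(f"KV cache @64k: {kv_64k / 2**30:.1f} GiB")
print(f"Mamba state:   {ssm / 2**20:.1f} MiB (any length)")
```

Under these assumptions the KV cache is about 16 GiB at 32k tokens and doubles at 64k, while the recurrent state stays at about 10 MiB. Grouped-query attention or cache quantization would shrink the KV figure, so treat the ratio as an order-of-magnitude guide.
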
### Applications

1. Long-context LLMs (Codestral Mamba, Jamba 1.5, Zamba2).
2. Genomic sequences (HyenaDNA → Caduceus → Evo).
3. Audio / time series.
4. State tracking, retrieval (induction heads).

## 💻 Patterns

### Using a Mamba block (mamba-ssm)

```python
import torch
from mamba_ssm import Mamba

# The fused mamba-ssm kernels run on a CUDA device.
block = Mamba(d_model=1024, d_state=16, d_conv=4, expand=2).cuda()
x = torch.randn(2, 4096, 1024).cuda()  # (batch, length, d_model)
y = block(x)  # (2, 4096, 1024); cost grows linearly with sequence length
```

### Selective scan (toy)

```python
import torch

def selective_scan(u, delta, A, B, C):
    """Sequential reference scan (slow; for understanding only)."""
    # u: (B, L, D), delta: (B, L, D), A: (D, N), B and C: (B, L, N)
    dA = torch.exp(delta.unsqueeze(-1) * A)      # discretized A: (B, L, D, N)
    dB = delta.unsqueeze(-1) * B.unsqueeze(2)    # simplified discretized B: (B, L, D, N)
    x = torch.zeros(u.shape[0], u.shape[2], A.shape[1], device=u.device, dtype=u.dtype)
    ys = []
    for t in range(u.shape[1]):
        x = dA[:, t] * x + dB[:, t] * u[:, t].unsqueeze(-1)  # state update: (B, D, N)
        ys.append((x * C[:, t].unsqueeze(1)).sum(-1))        # readout y_t: (B, D)
    return torch.stack(ys, dim=1)                            # (B, L, D)
```

### Mamba-2 block (SSD)

```python
from mamba_ssm import Mamba2

# d_model * expand must be divisible by headdim (here 4096 / 64 = 64 heads).
block = Mamba2(d_model=2048, d_state=128, d_conv=4, expand=2, headdim=64).cuda()
```

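### Hybrid stack (Jamba-style)

```python
import torch.nn as nn
from mamba_ssm import Mamba

class HybridLayer(nn.Module):
    """One layer: attention every `attn_every` layers, Mamba otherwise."""
    def __init__(self, d, attn_every=4, idx=0):
        super().__init__()
        self.use_attn = (idx % attn_every) == 0
        self.mix = nn.MultiheadAttention(d, 8, batch_first=True) if self.use_attn else Mamba(d_model=d)
        self.ffn = SwiGLU(d)  # SwiGLU is assumed to be defined elsewhere (d -> d feed-forward)

    def forward(self, x):
        # Simplified: no causal mask or normalization; add both for real LM training.
        h = self.mix(x, x, x, need_weights=False)[0] if self.use_attn else self.mix(x)
        x = x + h               # residual around the token mixer
        return x + self.ffn(x)  # residual around the feed-forward
```
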
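To see which layers end up as attention under the `idx % attn_every == 0` rule used in `HybridLayer`, a small helper can print the layout. The function name `layer_schedule` is made up for this note, and it only mirrors the indexing rule above, not any published model's exact layout.

```python
def layer_schedule(n_layers, attn_every=4):
    """Return 'A' for attention layers and 'M' for Mamba layers, by layer index."""
    return "".join("A" if idx % attn_every == 0 else "M" for idx in range(n_layers))

schedule = layer_schedule(8)                       # one attention layer in four
sparse_schedule = layer_schedule(8, attn_every=8)  # one attention layer in eight
print(schedule, sparse_schedule)
```

A larger `attn_every` keeps more of the constant-memory benefit, at the cost of fewer layers that can do exact token-to-token lookup.
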
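### 1M context inference

```python
import torch

# Mamba has no KV cache: the recurrent state has a fixed size at any context length.
# `model.step` and `chunks_of_1M_tokens` are placeholders for a stateful streaming
# interface; they are not mamba-ssm API names.
model.eval()
with torch.no_grad():
    state = None  # the model is assumed to initialize its state on the first call
    for chunk in chunks_of_1M_tokens:
        out, state = model.step(chunk, state)  # carry the state across chunks
```
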
## Decision criteria

| Situation | Approach |
|---|---|
| Long context (>32k), inference cost critical | Mamba / Jamba |
| Need strong in-context reasoning | Transformer or hybrid |
| Genomic / audio, million-length sequences | Mamba family |
| Standard chat, 8k context | Transformer (mature tooling) |
| Edge device, low memory | Mamba (no KV cache) |

**Default**: Hybrid (Jamba/Zamba2), which combines the strengths of both.

## 🔗 Graph

- Parent: [[State-Space|State-Space-Models]] · [[Sequence-to-Sequence-Models]]
- Variants: [[S4]]
- Adjacent: [[Transformer]]

## 🤖 LLM usage

**When to use**: very long contexts, streaming, or workloads where inference cost is critical. Genomics / audio.

**When not to use**: tasks that need strong needle-in-a-haystack recall. Pure Mamba is weak there, so a hybrid is needed.

## ❌ Anti-patterns

- **Pure Mamba for retrieval**: induction works, but exact recall is weak.
- **Naive scan implementation**: without an SRAM-aware kernel it tends to run slower than attention in practice.
- **S4 (non-selective)** for LLMs: superseded by S6/Mamba.

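The naive-scan item refers to a step-by-step loop over tokens. The recurrence x_t = a_t·x_{t-1} + b_t can be parallelized because composing two steps is associative, which is the property a parallel scan relies on. The sketch below is a pure-Python illustration of that property only (it is not a fast kernel and not the mamba-ssm implementation); `combine`, `sequential_scan`, and `tree_scan` are names made up for this note.

```python
def combine(left, right):
    """Compose two affine steps x -> a*x + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    """Reference: run the recurrence one step at a time from x_0 = 0."""
    x, out = 0.0, []
    for a, b in steps:
        x = a * x + b
        out.append(x)
    return out

def tree_scan(steps):
    """Inclusive prefix scan by divide and conquer.

    The two halves are independent, so a parallel machine could process them
    at the same time; here they simply run one after the other.
    """
    if len(steps) <= 1:
        return list(steps)
    mid = len(steps) // 2
    left, right = tree_scan(steps[:mid]), tree_scan(steps[mid:])
    carry = left[-1]  # composition of every step in the left half
    return left + [combine(carry, r) for r in right]

# Per-token (a_t, b_t) pairs, e.g. a_t = exp(delta_t * A) and b_t = delta_t * B * u_t.
steps = [(0.9, 1.0), (0.5, -2.0), (0.8, 0.5), (0.1, 3.0), (0.7, 1.5)]
prefix = tree_scan(steps)
states = [b for _, b in prefix]  # with x_0 = 0, the state equals the accumulated b
reference = sequential_scan(steps)
```

Both routes give the same states up to floating-point rounding. A production kernel additionally keeps the working set in fast on-chip memory, which this sketch does not model.
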
## 🧪 Verification / duplicates

- Verified (Gu & Dao 2023 "Mamba", Mamba-2 2024, Jamba 2024).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Mamba / Mamba-2 / hybrid state as of 2026 |