Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
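Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, concept mapping (~2,400 cases)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections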

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-sparse-data-handling
title: Sparse Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Sparse Matrix, Sparse Features, Missing Data, Imputation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [sparse, data-engineering, ml, imputation, scipy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scipy/scikit-learn/PyTorch
---
# Sparse Data Handling
## In one line
> **"When zeros or missing values are the majority, switch both storage and algorithms to sparse-aware ones."** Sparse data handling is where high-cardinality categorical features (one-hot, TF-IDF, click-stream), missing-value imputation, and sparse models (L1, factorization machines) converge. As of 2026, the production stack is scipy.sparse, cuSPARSE / PyTorch sparse, and Polars LazyFrame.
## Core concepts
### Two axes of "sparse"
1. **Structurally sparse**: most cells are 0 by definition (one-hot, adjacency, term-document matrix).
2. **Missing**: NaN/None, i.e. the value was not recorded. Imputation is needed.
- These are two different problems. Do not conflate them.
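
To make the distinction concrete, here is a minimal sketch on a toy matrix (the values are made up for illustration). It assumes the convention that `0.0` means "observed and zero" while `NaN` means "not recorded". A scipy CSR matrix drops the zeros but keeps the NaN as an explicit stored entry, so the two kinds of sparsity end up in different places.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data: 0.0 = observed zero (structural), NaN = not recorded (missing).
dense = np.array([
    [0.0, 2.0, 0.0],
    [np.nan, 0.0, 1.0],
])

X = csr_matrix(dense)

# NaN is not equal to zero, so CSR keeps it as an explicit stored entry.
stored_entries = X.nnz                          # 2.0, NaN, 1.0
missing_entries = int(np.isnan(X.data).sum())   # only the NaN
structural_zeros = dense.size - stored_entries  # cells not stored at all

print(stored_entries, missing_entries, structural_zeros)
```

Structural zeros cost no storage and need no imputation; the stored NaN still has to be handled by one of the missing-value strategies.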
### Sparse formats (scipy)
- **CSR** (Compressed Sparse Row): fast row slicing, fast matrix-vector products. The default for ML.
- **CSC** (Compressed Sparse Column): fast column slicing. Suited to column-wise statistics.
- **COO** (Coordinate): build-time format. Fast to construct, slow for arithmetic and slicing.
- **DOK / LIL**: incremental construction.
- **BSR**: block-sparse. GPU friendly.
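
As a rough illustration of why CSR makes row slicing cheap, the sketch below builds the three CSR arrays by hand in plain Python. The helper names `dense_to_csr` and `csr_row` are hypothetical and exist only for this example; scipy's `csr_matrix` exposes the same arrays as `.data`, `.indices`, and `.indptr`.

```python
def dense_to_csr(rows):
    """Return (data, indices, indptr) for a dense list-of-lists matrix."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero values, row by row
                indices.append(col)   # column index of each stored value
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def csr_row(data, indices, indptr, i):
    """Return the (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


dense = [
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
print(csr_row(data, indices, indptr, 0))
```

A row slice only reads `indptr[i]` and `indptr[i + 1]` and then a contiguous run of `data`, which is why CSR suits row-oriented ML workloads, while a column slice would have to scan every row.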
### Sparse-aware models
- **Linear with L1 (Lasso)**: produces sparse coefficients. Trained with SGD or coordinate descent.
- **Logistic regression**: the liblinear and saga solvers accept sparse input.
- **Trees (XGBoost/LightGBM)**: learn a default split direction for missing values natively.
- **Factorization Machines** (libfm, xLearn): for high-cardinality categorical features.
- **HashingVectorizer**: fixed-dimension feature hashing.
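
The idea behind fixed-dimension feature hashing can be sketched in a few lines of standard-library Python. This is an illustration of the hashing trick only, not scikit-learn's implementation: it assumes MD5 as a stand-in hash (scikit-learn uses MurmurHash3), and the function name `hash_features` and the default of 16 buckets are hypothetical.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map tokens to a sparse {bucket_index: count} vector of fixed width."""
    vec = {}
    for tok in tokens:
        # Stable hash of the token (MD5 here is an illustrative choice).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "little") % n_features
        vec[idx] = vec.get(idx, 0.0) + 1.0  # colliding tokens share a bucket
    return vec


vec = hash_features(["sparse", "matrix", "sparse"])
print(vec)
```

No vocabulary is stored, so the transform is stateless and works on streams; the trade-off is that distinct tokens can collide in the same bucket, which a larger `n_features` makes less likely.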
### Missing-value strategies
1. **Drop**: when a row or column is too sparse to be useful.
2. **Constant fill**: 0 / mean / median / mode.
3. **Indicator + fill**: carry the missingness itself as a feature.
4. **KNN impute**: for small datasets.
5. **Iterative (MICE)**: chained regressions. sklearn `IterativeImputer`.
6. **Tree-native**: LightGBM/XGBoost handle NaN directly.
7. **Deep**: VAE-based imputation, GAIN.
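
Strategies 2 and 3 combine naturally. The following NumPy sketch spells out what "indicator + fill" computes, assuming a small all-numeric array; the function name `median_fill_with_indicator` and the toy values are hypothetical.

```python
import numpy as np


def median_fill_with_indicator(X):
    """Return [median-filled columns | 0/1 missingness flags] side by side."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                   # True where a value is missing
    medians = np.nanmedian(X, axis=0)    # per-column median, ignoring NaN
    filled = np.where(mask, medians, X)  # replace NaN with the column median
    return np.hstack([filled, mask.astype(float)])


X_raw = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])
X_out = median_fill_with_indicator(X_raw)
print(X_out)
```

The output has twice as many columns as the input: the first half holds the filled values, the second half records where a value was originally missing, so a downstream model can still use the missingness as a signal.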
### Applications
1. NLP (TF-IDF, BoW).
2. Recommender (user-item interaction).
3. Genomics (one-hot variant matrix).
4. Click-stream / session data.
5. Tabular ML with missing fields.
## 💻 Patterns
### scipy.sparse construction
```python
from scipy.sparse import csr_matrix, coo_matrix
import numpy as np
# COO build then convert
rows = np.array([0, 1, 2, 0])
cols = np.array([0, 2, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0])
M = coo_matrix((data, (rows, cols)), shape=(3, 4)).tocsr()
# Inspect and operate
density = M.nnz / np.prod(M.shape)  # fraction of cells that are stored
print(M.shape, M.nnz, density)
print(M[0])     # row slice
print(M @ M.T)  # sparse @ sparse
```
### sklearn sparse pipeline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", penalty="l1", C=1.0)),
])
# texts_train (list of str) and y_train (labels) are supplied by the caller
pipe.fit(texts_train, y_train)
# tfidf returns CSR; LR liblinear handles sparse natively
```
### HashingVectorizer (no vocab, online)
```python
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=2**20, alternate_sign=False)
X = hv.transform(stream_of_docs) # CSR
```
### Iterative imputation (MICE)
```python
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor
imp = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),
    max_iter=10,
    random_state=0,
)
X_filled = imp.fit_transform(X_with_nan)
```
### Missing indicator + fill
```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion
# Concatenate median-filled values with one missingness flag per column
union = FeatureUnion([
    ("fill_median", SimpleImputer(strategy="median")),
    ("ind", MissingIndicator(features="all")),
])
X = union.fit_transform(X_raw)
```
### LightGBM native NaN handling
```python
import lightgbm as lgb
# No imputation needed
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    use_missing=True,
    zero_as_missing=False,
)
model.fit(X_train_with_nan, y_train)
```
### PyTorch sparse tensor
```python
import torch
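# COO indices: first row holds row ids, second row holds column ids
i = torch.tensor([[0, 1, 2], [2, 0, 1]])
v = torch.tensor([3.0, 4.0, 5.0])
sp = torch.sparse_coo_tensor(i, v, (3, 3)).coalesce()
sp_csr = sp.to_sparse_csr()
dense_matrix = torch.randn(3, 2)  # placeholder dense operand
out = torch.sparse.mm(sp_csr, dense_matrix)  # dense result, shape (3, 2)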
```
### cuSPARSE-backed sparse @ dense (GPU)
```python
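import cupy as cp
from cupyx.scipy.sparse import csr_matrix as cp_csr
# X: a float32/float64 scipy.sparse CSR matrix on the host
X_gpu = cp_csr(X)  # copies data/indices/indptr to the GPU
w_gpu = cp.ones(X.shape[1], dtype=X.dtype)  # placeholder weight vector
y = X_gpu @ w_gpu  # sparse @ dense on the GPU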
```
### Polars lazy missing handling
```python
import polars as pl
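df = (
    pl.scan_parquet("events/*.parquet")
    .with_columns([
        # fill_null has no "median" strategy, so pass the median expression
        pl.col("price").fill_null(pl.col("price").median()).alias("price"),
        pl.col("category").fill_null("unknown"),
        # evaluated against the original column, before the fill above
        pl.col("price").is_null().cast(pl.Int8).alias("price_was_missing"),
    ])
    .collect()
)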
```
### Sparsity diagnostics
```python
import numpy as np
def sparsity_report(X):
    """Print shape, density, and sparsity of a scipy sparse matrix or dense array."""
    if hasattr(X, "nnz"):
        density = X.nnz / (X.shape[0] * X.shape[1])
    else:
        density = np.count_nonzero(X) / X.size
    print(f"shape={X.shape}, density={density:.4%}, sparse={1-density:.4%}")
```
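
Density translates directly into memory. The sketch below is a back-of-the-envelope estimate, assuming 8-byte float values and 4-byte integer indices; the function names and the example shape (1M rows, 200k columns, 0.01% density) are hypothetical, and real overhead varies by library and index dtype.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory for a dense array with one value per cell."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Memory for CSR: data + column indices per entry, plus row pointers."""
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


n_rows, n_cols = 1_000_000, 200_000
nnz = n_rows * n_cols // 10_000  # 0.01% of cells are non-zero

print(f"dense: {dense_bytes(n_rows, n_cols) / 1e9:.1f} GB")
print(f"csr:   {csr_bytes(n_rows, nnz) / 1e9:.3f} GB")
```

Under these assumptions the dense form needs about 1.6 TB while CSR needs about 0.24 GB, which is the gap that makes densifying a large sparse matrix fail.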
## Decision criteria
| Situation | Approach |
|---|---|
| Density < 5%, structurally sparse | scipy CSR + sparse-aware model |
| Missing < 5% | Drop or median fill |
| Missing 5-30% | Indicator + iterative impute |
| Missing > 30% on key col | Drop column or model NaN-native (LightGBM) |
| Tabular w/ NaN | LightGBM/XGBoost (native) |
| Online stream | HashingVectorizer + SGDClassifier |
| GPU @ scale | cuSPARSE / torch.sparse |
**Default**: scipy CSR + sklearn pipeline; for tabular data with missing values, LightGBM native handling.
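
The missing-rate rows of the table can be encoded as a small helper. This is only a sketch of the rule of thumb, not a validated policy: the function name `suggest_missing_strategy` is hypothetical, the thresholds are copied from the table, and it assumes the ">30%" row (stated for key columns) is applied to any column.

```python
def suggest_missing_strategy(missing_fraction):
    """Map a column's missing fraction to the approach listed in the table."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if missing_fraction < 0.05:
        return "drop rows or median fill"
    if missing_fraction <= 0.30:
        return "indicator + iterative impute"
    # Assumption: treat every column above 30% like a key column.
    return "drop column or use a NaN-native model (LightGBM)"


for frac in (0.01, 0.12, 0.45):
    print(f"{frac:.0%} missing -> {suggest_missing_strategy(frac)}")
```

In practice the missingness mechanism (MCAR/MAR/MNAR) matters as much as the rate, so a helper like this is a starting point for review rather than a decision.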
## 🔗 Graph
- Parent: [[Data-Engineering]] · [[Feature Engineering|Feature-Engineering]]
- Variants: [[Sparse-Matrix]] · [[Imputation]]
- Applications: [[TF-IDF]] · [[One-Hot-Encoding]] · [[Recommender-Systems]]
- Adjacent: [[Lasso]] · [[LightGBM]]
## 🤖 LLM usage
**When to use**: explaining the rationale for an imputation strategy, scaffolding a sparse pipeline, explaining missingness mechanisms (MCAR/MAR/MNAR).
**When not to use**: the numerical imputation itself (use IterativeImputer/MICE), distribution tests (use statsmodels).
## ❌ Anti-patterns
- **Sparse → dense conversion**: instant OOM. Do not call `X.toarray()` when N×D > 10^9.
- **Mean-filling a skewed distribution**: the long tail distorts the mean. Default to the median.
- **Dropping NaN rows and losing 30% of the data**: loss of statistical power. Use indicator + imputation instead.
- **Imputation leakage**: fitting the imputer on train and test together leaks test information. Fit on train only, then transform both.
- **Ignoring MNAR**: when values are missing not at random, imputation is biased. The missingness indicator is critical.
- **Sparse + StandardScaler(with_mean=True)**: centering would densify the matrix (scikit-learn rejects it on sparse input). Use `with_mean=False`.
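
The leakage anti-pattern is easy to reproduce. The sketch below assumes a single numeric feature with made-up values; the class name `TrainOnlyMedianImputer` is hypothetical and mimics the fit/transform split that scikit-learn imputers follow.

```python
import numpy as np


class TrainOnlyMedianImputer:
    """Learn column medians on the training split and reuse them everywhere."""

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        self.medians_ = np.nanmedian(X_train, axis=0)  # statistics from train only
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), self.medians_, X)


X_train = np.array([[1.0], [2.0], [np.nan], [9.0]])
X_test = np.array([[np.nan], [100.0]])

imp = TrainOnlyMedianImputer().fit(X_train)
X_test_filled = imp.transform(X_test)

# The leaky variant: the fill value is computed with the test rows included.
leaky_median = float(np.nanmedian(np.vstack([X_train, X_test])))

print(X_test_filled.ravel(), leaky_median)
```

Here the train-only median is 2.0, while the median computed over train and test together is 5.5: the test outlier would shift the value used to fill the training data, which is exactly the information leak to avoid.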
## 🧪 Verification / duplicates
- Verified (scipy.sparse docs 1.14; sklearn IterativeImputer; LightGBM missing-value docs; Little & Rubin "Statistical Analysis with Missing Data" 3rd ed).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (CSR/MICE/LightGBM/cuSPARSE patterns + missingness taxonomy) |