---
id: wiki-2026-0508-sparse-data-handling
title: Sparse Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Sparse Matrix, Sparse Features, Missing Data, Imputation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [sparse, data-engineering, ml, imputation, scipy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scipy/scikit-learn/PyTorch
---

# Sparse Data Handling

## One-line summary
> **"When zeros or missing values are the majority, switch both storage and algorithms to sparse-aware ones."** Sparse data handling is where high-cardinality categorical features (one-hot, TF-IDF, click-stream), missing-value imputation, and sparse models (L1, factorization machines) converge. As of 2026, the production stack is scipy.sparse + cuSPARSE / PyTorch sparse + Polars LazyFrame.

## Core concepts

### Two axes of "sparse"
1. **Structurally sparse**: most cells are 0 by definition (one-hot, adjacency, term-document matrix).
2. **Missing**: NaN/None, i.e. not recorded. Imputation is needed.
- These are two different problems. Do not conflate them.

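The difference between the two axes can be seen in a minimal sketch. The arrays `one_hot` and `readings` are made-up illustrative data, not from any real dataset: in the first, zeros are real values that a sparse format simply does not store; in the second, NaN marks a value that was never recorded and is distinct from a true 0.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Structurally sparse: one-hot rows. Zeros are genuine values, just not stored.
one_hot = csr_matrix(np.array([[0.0, 1.0, 0.0],
                               [1.0, 0.0, 0.0]]))

# Missing: NaN means "not recorded", which is not the same thing as 0.
readings = np.array([[0.0, np.nan, 2.5],
                     [1.0, 3.0, np.nan]])

stored = one_hot.nnz                          # explicitly stored entries
n_missing = int(np.isnan(readings).sum())     # cells that need imputation
n_true_zero = int((readings == 0).sum())      # observed zeros (NaN == 0 is False)

print(f"stored={stored}, missing={n_missing}, true zeros={n_true_zero}")
```

Filling the NaN cells with 0 would make the two cases indistinguishable, which is why the note treats them as separate problems.
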
### Sparse formats (scipy)
- **CSR** (Compressed Sparse Row): fast row slicing and matrix-vector products. The default for ML.
- **CSC** (Compressed Sparse Column): fast column slicing. Suited to column-wise statistics.
- **COO** (Coordinate): build-time format. Fast to construct, slow to operate on, so convert before computing.
- **DOK / LIL**: incremental construction.
- **BSR** (Block Sparse Row): block-sparse. GPU friendly.

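To make the CSR/CSC trade-off concrete, here is a small sketch that prints the internal arrays scipy keeps for each format. The `dense` matrix is an arbitrary toy example; `indptr`, `indices`, and `data` are the standard attributes of scipy's compressed formats.

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[1.0, 0.0, 0.0, 4.0],
                  [0.0, 0.0, 2.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0]])

coo = coo_matrix(dense)   # (row, col, value) triplets: cheap to build
csr = coo.tocsr()         # compressed by row
csc = coo.tocsc()         # compressed by column
csr.sort_indices()
csc.sort_indices()

# CSR: row r owns data[indptr[r]:indptr[r + 1]], so a row slice is one contiguous read.
print("CSR indptr :", csr.indptr.tolist())
print("CSR indices:", csr.indices.tolist())
print("CSR data   :", csr.data.tolist())

# CSC: the same idea per column, so column slices are the cheap direction.
print("CSC indptr :", csc.indptr.tolist())

row1 = csr[1].toarray().ravel().tolist()      # tiny matrix, densifying is safe here
col3 = csc[:, 3].toarray().ravel().tolist()
print(row1, col3)
```
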
### Sparse-aware models
- **Linear with L1 (Lasso)**: produces sparse coefficients. Fit with SGD or coordinate descent.
- **Logistic regression**: the liblinear and saga solvers handle sparse input.
- **Trees (XGBoost/LightGBM)**: natively learn a default split direction for missing values.
- **Factorization Machines** (libfm, xLearn): for high-cardinality categorical features.
- **HashingVectorizer**: fixed-dimension feature hashing (a featurizer that feeds the models above).

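As an illustration of the "sparse output" property of L1, the sketch below fits scikit-learn's `Lasso` on a synthetic CSR matrix. The data, the three "true" weights, and `alpha=0.01` are all arbitrary assumptions chosen for the demo, not tuned values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 50

# Synthetic sparse design matrix: roughly 5% of the cells are nonzero.
mask = rng.random((n_samples, n_features)) < 0.05
X = csr_matrix(rng.random((n_samples, n_features)) * mask)

# Only three features actually drive the target (hypothetical ground truth).
true_w = np.zeros(n_features)
true_w[[3, 17, 42]] = [5.0, -3.0, 2.0]
y = X @ true_w

model = Lasso(alpha=0.01).fit(X, y)   # accepts the CSR input without densifying
n_nonzero = int(np.count_nonzero(model.coef_))
print(f"{n_nonzero} of {n_features} coefficients are nonzero")
```
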
### Missing-value strategies
1. **Drop**: when a row or column is too sparse to be useful.
2. **Constant fill**: 0 / mean / median / mode.
3. **Indicator + fill**: carry the missingness itself as a feature.
4. **KNN impute**: for small data.
5. **Iterative (MICE)**: chained regression. sklearn `IterativeImputer`.
6. **Tree-native**: LightGBM/XGBoost handle NaN directly.
7. **Deep**: VAE imputation, GAIN.

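Strategy 4 can be sketched with scikit-learn's `KNNImputer`. The 4×3 array and `n_neighbors=2` are illustrative assumptions; on real data the neighbor count would need tuning.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each NaN is replaced by the mean of that feature over the 2 nearest rows,
# with distances computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

observed = ~np.isnan(X)   # mask of cells that were present originally
print(X_filled)
```

Observed cells pass through unchanged and only the NaN cells are filled. The pairwise distance computation is why the list above recommends this approach for small data.
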
### Applications
1. NLP (TF-IDF, BoW).
2. Recommender (user-item interaction).
3. Genomics (one-hot variant matrix).
4. Click-stream / session data.
5. Tabular ML with missing fields.

## 💻 Patterns

### scipy.sparse construction
```python
from scipy.sparse import coo_matrix
import numpy as np

# Build in COO (cheap to construct), then convert to CSR for computation
rows = np.array([0, 1, 2, 0])
cols = np.array([0, 2, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0])
M = coo_matrix((data, (rows, cols)), shape=(3, 4)).tocsr()

# Inspect, slice, multiply
density = M.nnz / np.prod(M.shape)  # stored entries / total cells
print(M.shape, M.nnz, density)
print(M[0])       # row slice
print(M @ M.T)    # sparse @ sparse
```

### sklearn sparse pipeline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", penalty="l1", C=1.0)),
])
# texts_train / y_train: your training documents and labels (assumed to exist)
pipe.fit(texts_train, y_train)
# tfidf returns CSR; LR liblinear handles sparse natively
```

### HashingVectorizer (no vocab, online)
```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary to fit, so transform() can be called on each batch
hv = HashingVectorizer(n_features=2**20, alternate_sign=False)
# stream_of_docs: any iterable of raw text documents (assumed to exist)
X = hv.transform(stream_of_docs)  # CSR
```

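Because the hasher is stateless, it pairs naturally with an incrementally trained model. The sketch below feeds mini-batches to `SGDClassifier.partial_fit`. The two tiny batches, the labels, and `n_features=2**18` are made-up illustrative values.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hv = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])   # every class must be declared up front for partial_fit

# Hypothetical mini-batches standing in for a document stream.
batches = [
    (["cheap pills buy now", "team meeting at noon"], [1, 0]),
    (["win money fast", "lunch with the team"], [1, 0]),
]
for docs, labels in batches:
    X_batch = hv.transform(docs)               # CSR, no fitted vocabulary needed
    clf.partial_fit(X_batch, labels, classes=classes)

X_new = hv.transform(["buy cheap pills"])
pred = clf.predict(X_new)
print(X_new.shape, pred)
```
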
### Iterative imputation (MICE)
```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

imp = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),
    max_iter=10,
    random_state=0,
)
# X_with_nan: numeric feature matrix containing NaN (assumed to exist)
X_filled = imp.fit_transform(X_with_nan)
```

### Missing indicator + fill
```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Output = [median-filled features | one boolean missingness flag per feature]
union = FeatureUnion([
    ("fill_median", SimpleImputer(strategy="median")),
    ("ind", MissingIndicator(features="all")),
])
# X_raw: numeric feature matrix containing NaN (assumed to exist)
X = union.fit_transform(X_raw)
# Alternative: SimpleImputer(strategy="median", add_indicator=True)
```

### LightGBM native NaN handling
```python
import lightgbm as lgb

# No imputation needed: NaN is routed to a learned side of each split
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    use_missing=True,
    zero_as_missing=False,  # keep 0 as a real value, not as missing
)
# X_train_with_nan / y_train: training data with NaN left in place (assumed to exist)
model.fit(X_train_with_nan, y_train)
```

### PyTorch sparse tensor
```python
import torch

# COO indices: first row holds row ids, second row holds column ids
i = torch.tensor([[0, 1, 2], [2, 0, 1]])
v = torch.tensor([3.0, 4.0, 5.0])
sp = torch.sparse_coo_tensor(i, v, (3, 3)).coalesce()
sp_csr = sp.to_sparse_csr()

dense_matrix = torch.randn(3, 2)             # illustrative dense right-hand side
out = torch.sparse.mm(sp_csr, dense_matrix)  # sparse @ dense -> dense (3, 2)
```

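### cuSPARSE-backed sparse @ dense (GPU)
```python
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as cp_csr

# X: an existing scipy.sparse CSR matrix on the host (float32/float64), assumed to exist
# w: a dense NumPy weight vector of length X.shape[1], assumed to exist
X_gpu = cp_csr(X)       # copies data / indices / indptr to the GPU
w_gpu = cp.asarray(w)
y = X_gpu @ w_gpu       # sparse @ dense on the GPU
```
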
### Polars lazy missing handling
```python
import polars as pl

df = (
    pl.scan_parquet("events/*.parquet")
    .with_columns([
        # fill_null(strategy=...) has no "median" option; fill with the median expression instead
        pl.col("price").fill_null(pl.col("price").median()).alias("price"),
        pl.col("category").fill_null("unknown"),
        # expressions in one with_columns all read the input frame, so this sees the original nulls
        pl.col("price").is_null().cast(pl.Int8).alias("price_was_missing"),
    ])
    .collect()
)
```

### Sparsity diagnostics
```python
import numpy as np

def sparsity_report(X):
    """Print shape, density, and sparsity for a scipy sparse matrix or a dense array."""
    if hasattr(X, "nnz"):
        # scipy sparse: nnz counts explicitly stored entries
        density = X.nnz / (X.shape[0] * X.shape[1])
    else:
        density = np.count_nonzero(X) / X.size
    print(f"shape={X.shape}, density={density:.4%}, sparse={1 - density:.4%}")
```

## Decision criteria
| Situation | Approach |
|---|---|
| Density < 5%, structurally sparse | scipy CSR + sparse-aware model |
| Missing < 5% | Drop or median fill |
| Missing 5-30% | Indicator + iterative impute |
| Missing > 30% on key col | Drop the column or use a NaN-native model (LightGBM) |
| Tabular w/ NaN | LightGBM/XGBoost (native) |
| Online stream | HashingVectorizer + SGDClassifier |
| GPU @ scale | cuSPARSE / torch.sparse |

**Default**: scipy CSR + sklearn pipeline; for tabular data with missing values, LightGBM native handling.

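The numeric thresholds in the table can be written down as a small helper. This is only a sketch: `choose_approach` is a hypothetical function name, and the cut-offs are the rules of thumb from the table, not hard limits.

```python
def choose_approach(density=None, missing_fraction=None):
    """Map the table's rule-of-thumb thresholds to a suggestion string."""
    # Structurally sparse matrix: density below 5%
    if density is not None and density < 0.05:
        return "scipy CSR + sparse-aware model"
    if missing_fraction is None:
        return "no rule applies"
    if missing_fraction < 0.05:
        return "drop or median fill"
    if missing_fraction <= 0.30:
        return "indicator + iterative impute"
    return "drop column or NaN-native model (LightGBM)"


print(choose_approach(density=0.01))
print(choose_approach(missing_fraction=0.12))
print(choose_approach(missing_fraction=0.45))
```
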
## 🔗 Graph
- Parent: [[Data-Engineering]] · [[Feature Engineering|Feature-Engineering]]
- Variants: [[Sparse-Matrix]] · [[Imputation]]
- Applications: [[TF-IDF]] · [[One-Hot-Encoding]] · [[Recommender-Systems]]
- Adjacent: [[Lasso]] · [[LightGBM]]

## 🤖 LLM usage
**When**: explaining the rationale for an imputation strategy, scaffolding a sparse pipeline, explaining missingness mechanisms (MCAR/MAR/MNAR).
**When not**: computing imputed values numerically (use IterativeImputer/MICE), running distribution tests (use statsmodels).

## ❌ Anti-patterns
- **Sparse → dense conversion**: instant OOM. Ban `X.toarray()` when N×D > 10^9.
- **Mean-filling a skewed distribution**: the long tail distorts the mean. Default to the median.
- **Dropping NaN rows and losing 30% of the data**: loss of statistical power. Use indicator + imputation instead.
- **Imputation leakage**: fitting on train and test together leaks information. Fit on train only, then transform both.
- **Ignoring MNAR**: when data is missing not at random, imputation is biased. The missingness indicator is critical.
- **Sparse + StandardScaler(with_mean=True)**: centering would densify the matrix, and scikit-learn raises an error for sparse input. Use `with_mean=False`.

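Two of the fixes above, train-only fitting and `with_mean=False`, can be sketched as follows. The arrays are tiny made-up examples chosen so the train median is easy to check by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [np.nan], [100.0]])
X_test = np.array([[np.nan], [3.0]])

# Leakage fix: the median is learned from the training split only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)    # reuses the train median

# Scaling fix: skip centering so the sparse matrix stays sparse.
X_sparse = csr_matrix(np.array([[0.0, 2.0], [0.0, 0.0], [4.0, 0.0]]))
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(imputer.statistics_, X_test_filled.ravel(), issparse(X_scaled))
```

Putting the imputer and scaler inside a `Pipeline` gives the same train-only guarantee automatically during cross-validation.
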
## 🧪 Verification / duplicates
- Verified (scipy.sparse docs 1.14; sklearn IterativeImputer; LightGBM missing-value docs; Little & Rubin "Statistical Analysis with Missing Data" 3rd ed).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (CSR/MICE/LightGBM/cuSPARSE patterns + missingness taxonomy) |