"매 zero/missing 매 majority 일 때 — storage + algorithm 의 sparse-aware 의 switch". Sparse data handling 매 high-cardinality categorical (one-hot, TF-IDF, click-stream) + missing-value imputation + sparse model (L1, factorization machine) 의 수렴. 매 2026 매 scipy.sparse + cuSPARSE / PyTorch sparse + Polars LazyFrame 매 production stack.
Key points
Two axes of "sparse"
Structurally sparse: most cells are 0 by definition (one-hot, adjacency, term-document matrix).
Missing: NaN/None, i.e. not recorded. Needs imputation or a model that handles NaN natively.
These are two different problems. Do not conflate them.
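The difference between the two axes can be made concrete with a small sketch. The arrays and variable names below are made up for illustration; the point is that a structural zero is simply not stored, while a NaN is a stored value that still needs a decision.

```python
import numpy as np
from scipy import sparse

# Structurally sparse: one-hot encoding of category ids.
# A zero here means "not this category", which is real information.
category_ids = np.array([2, 0, 1, 2])
n_rows, n_categories = len(category_ids), 3
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), category_ids)),
    shape=(n_rows, n_categories),
)

# Missing: NaN means "not recorded", which is different from a recorded 0.
measurements = np.array([[0.0, np.nan], [1.5, 0.0]])
missing_mask = np.isnan(measurements)

# Converting to sparse drops only true zeros; NaN stays as an explicit entry.
measurements_sparse = sparse.csr_matrix(measurements)

print(one_hot.nnz, int(missing_mask.sum()), measurements_sparse.nnz)
```

Because NaN survives as a stored entry, a sparse container alone does not resolve missingness; it only avoids storing the structural zeros.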
Sparse formats (scipy)
CSR (Compressed Sparse Row): fast row slicing and fast matrix-vector products. The default for ML.
CSC (Compressed Sparse Column): fast column slicing. Suited to column-wise statistics.
COO (coordinate): build-time format. Fast to construct, slow for operations.
DOK / LIL: incremental construction.
BSR (Block Sparse Row): block-sparse. GPU friendly.
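A common workflow implied by the list above is to build in COO and then convert once to CSR or CSC, depending on the access pattern. The following is a minimal sketch with a made-up 3×3 matrix, not a benchmark.

```python
import numpy as np
from scipy import sparse

# Build phase: COO takes (value, (row, col)) triplets directly.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
coo = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Compute phase: convert once, then reuse.
csr = coo.tocsr()  # row slicing and matrix-vector products
csc = coo.tocsc()  # column slicing and column-wise statistics

first_row = csr[0]                          # cheap row slice in CSR
matvec = csr @ np.ones(3)                   # sparse matrix-vector product
col_sums = np.asarray(csc.sum(axis=0)).ravel()  # column-wise statistic

print(matvec, col_sums)
```

Converting once and keeping the result avoids repeated format changes, each of which copies the index arrays.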
Sparse-aware models
Linear with L1 (Lasso): yields sparse coefficients. Fit with SGD or coordinate descent.
Logistic regression: the liblinear and saga solvers handle sparse input.
Trees (XGBoost/LightGBM): learn a native split direction for missing values.
Factorization Machines (libfm, xLearn): for high-cardinality categoricals.
HashingVectorizer: feature hashing into a fixed dimension.
Missing-value strategies
Drop: when a row or column is too sparse to be useful.
Constant fill: 0 / mean / median / mode.
Indicator + fill: carry missingness itself as a feature.
KNN impute: for small data.
Iterative (MICE): chained regressions. sklearn IterativeImputer.
Tree-native: LightGBM/XGBoost handle NaN directly.
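```python
import lightgbm as lgb

# No imputation needed: LightGBM handles NaN natively.
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    use_missing=True,       # treat NaN as missing
    zero_as_missing=False,  # keep zeros as real values, not as missing
)
# X_train_with_nan and y_train are assumed to be defined elsewhere.
model.fit(X_train_with_nan, y_train)
```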
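For models without native NaN handling, the "indicator + fill" strategy can be sketched with scikit-learn's SimpleImputer. The arrays below are made up for illustration; the imputer is fitted on the training split only and then applied to both splits.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: column 0 has missing values, column 1 is complete.
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [5.0, 40.0],
])
X_test = np.array([
    [np.nan, 15.0],
    [2.0, 25.0],
])

# Median fill plus a binary "was missing" column for each feature
# that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)  # statistics learned on train only
X_test_imp = imputer.transform(X_test)        # same statistics reused on test

print(X_test_imp)
```

The appended indicator column lets a downstream model learn from the fact that a value was missing, separately from the filled-in value.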
When to use: rationale for an imputation strategy, sparse pipeline scaffolding, explaining the missingness mechanism (MCAR/MAR/MNAR).
When not to use: numerical imputation itself (use IterativeImputer/MICE), distribution tests (use statsmodels).
❌ Anti-patterns
Sparse → dense conversion: instant OOM. Ban X.toarray() when N×D > 10^9.
Mean-fill on a skewed distribution: the long tail distorts the mean. Default to the median.
Dropping NaN rows and losing 30% of the data: loss of statistical power. Use indicator + impute instead.
Imputation leakage: fitting the imputer on train+test together leaks. Fit on train only, then transform both.
Ignoring MNAR: when data are missing not at random, imputation is biased. The indicator is critical.
Sparse + StandardScaler(with_mean=True): centering would densify the matrix (scikit-learn raises a ValueError on sparse input). Use with_mean=False.
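Two of these anti-patterns can be checked cheaply before they cause damage. The sketch below uses a randomly generated matrix as a stand-in for real data: it estimates the dense footprint without calling toarray(), and it scales with with_mean=False so the result stays sparse.

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Random 1%-dense matrix standing in for a real feature matrix.
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

# Estimate memory before any conversion, without materializing the dense array.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# Variance scaling only: zeros stay zeros, so the output remains sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)

# Centering would turn every zero into a nonzero; scikit-learn refuses it.
try:
    StandardScaler(with_mean=True).fit(X)
    centering_refused = False
except ValueError:
    centering_refused = True

print(dense_bytes, sparse_bytes, sparse.issparse(X_scaled), centering_refused)
```

A size check like `dense_bytes` is a reasonable guard to place in front of any code path that might call toarray().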
🧪 Verification / duplicates
Verified (scipy.sparse docs 1.14; sklearn IterativeImputer; LightGBM missing-value docs; Little & Rubin "Statistical Analysis with Missing Data" 3rd ed).
Confidence: A.
🕓 Changelog
Date: Change
2026-05-08: Phase 1
2026-05-10: Manual cleanup, full content (CSR/MICE/LightGBM/cuSPARSE patterns + missingness taxonomy)