---
id: wiki-2026-0508-sparse-data-handling
title: Sparse Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Sparse Matrix, Sparse Features, Missing Data, Imputation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [sparse, data-engineering, ml, imputation, scipy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scipy/scikit-learn/PyTorch
---

# Sparse Data Handling

## One-line summary
> **"When zeros or missing values are the majority, switch both storage and algorithms to sparse-aware ones."** Sparse data handling is where high-cardinality categorical features (one-hot, TF-IDF, click-stream), missing-value imputation, and sparse models (L1, factorization machines) converge. As of 2026, a typical production stack is scipy.sparse + cuSPARSE / PyTorch sparse + Polars LazyFrame.

## Key points

### Two axes of "sparse"
1. **Structurally sparse**: most cells are 0 by definition (one-hot, adjacency, term-document matrix).
2. **Missing**: NaN/None, i.e. not recorded. Needs imputation or a NaN-aware model.
- These are two different problems. Do not conflate them.

### Sparse formats (scipy)
- **CSR** (Compressed Sparse Row): fast row slicing and matrix-vector products. The default for ML.
- **CSC** (Compressed Sparse Column): fast column slicing. Suited to column-wise statistics.
- **COO** (Coordinate): build-time format. Fast to construct, slow for operations; convert to CSR/CSC before computing.
- **DOK / LIL**: incremental construction.
- **BSR**: block-sparse. GPU friendly.

### Sparse-aware models
- **Linear with L1 (Lasso)**: produces sparse coefficients. Trained with SGD or coordinate descent.
- **Logistic regression**: the liblinear and saga solvers handle sparse input.
- **Trees (XGBoost/LightGBM)**: learn a native split direction for missing values.
- **Factorization Machines** (libfm, xLearn): for high-cardinality categoricals.
- **HashingVectorizer**: fixed-dimension feature hashing.

### Missing-value strategies
1. **Drop**: when a row or column is too sparse to be useful.
2. **Constant fill**: 0 / mean / median / mode.
3. **Indicator + fill**: carry the missingness itself as a feature.
4. **KNN impute**: for small data.
5. **Iterative (MICE)**: chained regression. sklearn `IterativeImputer`.
6. **Tree-native**: LightGBM/XGBoost handle NaN directly.
7. **Deep**: VAE-based imputation, GAIN.

### Applications
1. NLP (TF-IDF, BoW).
2. Recommender (user-item interaction).
3. Genomics (one-hot variant matrix).
4. Click-stream / session data.
5. Tabular ML with missing fields.

## 💻 Patterns

### scipy.sparse construction
```python
import numpy as np
from scipy.sparse import coo_matrix

# Build in COO (cheap to construct), then convert to CSR for computation
rows = np.array([0, 1, 2, 0])
cols = np.array([0, 2, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0])
M = coo_matrix((data, (rows, cols)), shape=(3, 4)).tocsr()

# Inspect / slice / multiply
density = M.nnz / np.prod(M.shape)  # scipy matrices expose nnz, not a density attribute
print(M.shape, M.nnz, density)
print(M[0])      # row slice
print(M @ M.T)   # sparse @ sparse
```

### sklearn sparse pipeline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", penalty="l1", C=1.0)),
])
# texts_train: list of str, y_train: labels (both defined elsewhere)
pipe.fit(texts_train, y_train)
# tfidf returns CSR; LR liblinear handles sparse natively
```

### HashingVectorizer (no vocab, online)
```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no fit step and no stored vocabulary
hv = HashingVectorizer(n_features=2**20, alternate_sign=False)
X = hv.transform(stream_of_docs)  # CSR; stream_of_docs is an iterable of str
```

### Iterative imputation (MICE)
```python
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

imp = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),
    max_iter=10,
    random_state=0,
)
X_filled = imp.fit_transform(X_with_nan)
```

### Missing indicator + fill
```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Output = median-filled columns followed by one boolean indicator per column
union = FeatureUnion([
    ("fill_median", SimpleImputer(strategy="median")),
    ("ind", MissingIndicator(features="all")),
])
X = union.fit_transform(X_raw)
```

### LightGBM native NaN handling
```python
import lightgbm as lgb

# No imputation needed
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    use_missing=True,
    zero_as_missing=False,
)
model.fit(X_train_with_nan, y_train)
```

### PyTorch sparse tensor
```python
import torch

i = torch.tensor([[0, 1, 2], [2, 0, 1]])  # row indices, column indices
v = torch.tensor([3.0, 4.0, 5.0])
sp = torch.sparse_coo_tensor(i, v, (3, 3)).coalesce()
sp_csr = sp.to_sparse_csr()

dense_matrix = torch.randn(3, 4)  # example dense right-hand side
out = torch.sparse.mm(sp_csr, dense_matrix)
```

### cuSPARSE-backed sparse @ dense (GPU)
```python
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as cp_csr

# X: scipy.sparse CSR matrix, w: NumPy vector (both defined elsewhere)
X_gpu = cp_csr(X)        # the constructor accepts a scipy.sparse matrix
w_gpu = cp.asarray(w)
y = X_gpu @ w_gpu
```

### Polars lazy missing handling
```python
import polars as pl

df = (
    pl.scan_parquet("events/*.parquet")
    .with_columns([
        # fill_null(strategy=...) has no "median" option, so pass the median expression
        pl.col("price").fill_null(pl.col("price").median()),
        pl.col("category").fill_null("unknown"),
        # Expressions in one with_columns all read the input frame,
        # so this still sees the original nulls
        pl.col("price").is_null().cast(pl.Int8).alias("price_was_missing"),
    ])
    .collect()
)
```

### Sparsity diagnostics
```python
import numpy as np

def sparsity_report(X):
    if hasattr(X, "nnz"):  # scipy sparse matrix
        density = X.nnz / (X.shape[0] * X.shape[1])
    else:                  # dense NumPy array
        density = np.count_nonzero(X) / X.size
    print(f"shape={X.shape}, density={density:.4%}, sparse={1-density:.4%}")
```

## Decision criteria

| Situation | Approach |
|---|---|
| Density < 5%, structurally sparse | scipy CSR + sparse-aware model |
| Missing < 5% | Drop or median fill |
| Missing 5-30% | Indicator + iterative impute |
| Missing > 30% on key col | Drop column or model NaN-native (LightGBM) |
| Tabular w/ NaN | LightGBM/XGBoost (native) |
| Online stream | HashingVectorizer + SGDClassifier |
| GPU @ scale | cuSPARSE / torch.sparse |

**Default**: scipy CSR + sklearn pipeline; for tabular data with missing values, LightGBM native handling.

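The missing-fraction rows of the decision table above can be turned into a small helper, as a minimal sketch. The function names `recommend_missing_strategy` and `missing_report` are hypothetical, the handling of the exact 5% and 30% boundaries is an assumption, and the sketch applies the "> 30%" rule to every column although the table states it only for key columns.

```python
import numpy as np


def recommend_missing_strategy(missing_fraction):
    """Map a column's missing fraction to the approach in the decision table."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be in [0, 1]")
    if missing_fraction < 0.05:
        return "drop rows or median fill"
    if missing_fraction <= 0.30:  # assumption: 30% itself falls in the middle band
        return "indicator + iterative impute"
    return "drop column or NaN-native model"


def missing_report(X):
    """Return (column index, missing fraction, recommendation) per column.

    X is assumed to be a dense 2-D float array that marks missing values as NaN.
    """
    fractions = np.isnan(X).mean(axis=0)
    return [
        (j, float(f), recommend_missing_strategy(float(f)))
        for j, f in enumerate(fractions)
    ]


# Demo: 20 rows; column 0 is complete, column 1 has 4 NaNs, column 2 has 10 NaNs
X_demo = np.zeros((20, 3))
X_demo[:4, 1] = np.nan
X_demo[:10, 2] = np.nan
for col, frac, advice in missing_report(X_demo):
    print(f"col={col} missing={frac:.0%} -> {advice}")
```

The thresholds are starting points, not rules: a column that is missing not at random may deserve an indicator feature even when its missing fraction is below 5%.
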
## 🔗 Graph
- Parent: [[Data-Engineering]] · [[Feature Engineering|Feature-Engineering]]
- Variants: [[Sparse-Matrix]] · [[Imputation]]
- Applications: [[TF-IDF]] · [[One-Hot-Encoding]] · [[Recommender-Systems]]
- Adjacent: [[Lasso]] · [[LightGBM]]

## 🤖 LLM usage
**When**: explaining the rationale for an imputation strategy, scaffolding a sparse pipeline, explaining missingness mechanisms (MCAR/MAR/MNAR).

**When not**: the numerical imputation itself (use IterativeImputer/MICE), distribution tests (use statsmodels).

## ❌ Anti-patterns
- **Sparse → dense conversion**: instant OOM. Ban `X.toarray()` when N×D > 10^9.
- **Mean-fill on a skewed distribution**: the long tail distorts the mean. Default to the median.
- **Dropping NaN rows and losing 30% of the data**: loss of statistical power. Use indicator + imputation instead.
- **Imputation leakage**: fitting the imputer on train and test together leaks information. Fit on train only, then transform both.
- **Ignoring MNAR**: when values are missing not at random, imputation is biased. The missingness indicator is critical.
- **Sparse + StandardScaler(with_mean=True)**: centering would densify the matrix, and scikit-learn raises an error on sparse input. Use `with_mean=False`.

## 🧪 Verification / duplicates
- Verified (scipy.sparse docs 1.14; sklearn IterativeImputer; LightGBM missing-value docs; Little & Rubin "Statistical Analysis with Missing Data" 3rd ed).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (CSR/MICE/LightGBM/cuSPARSE patterns + missingness taxonomy) |