id: wiki-2026-0508-sparse-data-handling
title: Sparse Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Sparse Matrix, Sparse Features, Missing Data, Imputation
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: sparse, data-engineering, ml, imputation, scipy
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: scipy/scikit-learn/PyTorch

Sparse Data Handling

In one line

"When zeros or missing values make up the majority of the data, switch both storage and algorithms to sparse-aware ones." Sparse data handling is where high-cardinality categorical features (one-hot, TF-IDF, click-stream), missing-value imputation, and sparse models (L1, factorization machines) converge. As of 2026, the production stack is scipy.sparse + cuSPARSE / PyTorch sparse + Polars LazyFrame.

Key points

Two axes of "sparse"

  1. Structurally sparse: most cells are 0 by definition (one-hot, adjacency, term-document matrix).
  2. Missing: NaN/None, meaning the value was not recorded. Requires imputation or a model that handles NaN natively.
  • These are two different problems. Do not conflate them.
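
The distinction between the two axes can be made concrete with a small sketch. The toy matrix below is an illustrative assumption, not data from this page: zeros are structural and are dropped by the CSR conversion, while NaN survives as an explicitly stored entry that still needs handling.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix (illustrative): 0.0 is a structural zero,
# np.nan marks a value that was never recorded.
dense = np.array([
    [1.0,    0.0, np.nan],
    [0.0,    0.0, 2.0],
    [np.nan, 3.0, 0.0],
])

# Converting to CSR drops the zeros; NaN is not equal to zero,
# so it is kept as an explicit stored entry.
X = csr_matrix(dense)

n_cells = X.shape[0] * X.shape[1]
n_stored = X.nnz                             # stored entries, NaN included
n_missing = int(np.isnan(X.data).sum())      # missing values among stored entries
n_structural_zero = n_cells - n_stored       # implicit zeros

print(f"stored={n_stored}, missing={n_missing}, structural_zero={n_structural_zero}")
```

A sparse format therefore solves only the first problem; the NaN entries still have to go through one of the missing-value strategies.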

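Sparse formats (scipy)

  • CSR (Compressed Sparse Row): fast row slicing and fast matrix-vector products. The default for ML.
  • CSC: fast column slicing. Suited to column-wise statistics.
  • COO: a build-time format. Fast to construct, but it has no slicing and slow arithmetic; convert to CSR/CSC before computing.
  • DOK / LIL: incremental construction, one element or one row at a time.
  • BSR: block-sparse, for matrices made of dense sub-blocks. GPU-friendly.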
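
As a minimal sketch of how these formats relate, the example below builds a small matrix incrementally in LIL, converts it, and inspects the three arrays that make up CSR. The matrix values are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import lil_matrix

# LIL supports cheap element-wise assignment, so use it for incremental builds.
A = lil_matrix((3, 4), dtype=np.float64)
A[0, 0] = 1.0
A[0, 3] = 4.0
A[1, 2] = 2.0
A[2, 1] = 3.0

# Convert once the structure is final: CSR for row access, CSC for column access.
csr = A.tocsr()
csc = A.tocsc()

# CSR stores three arrays: row r owns data[indptr[r]:indptr[r + 1]],
# and indices holds the column of each stored value.
print("indptr :", csr.indptr)
print("indices:", csr.indices)
print("data   :", csr.data)

row0 = csr[0].toarray().ravel()      # row slice, cheap in CSR
col3 = csc[:, 3].toarray().ravel()   # column slice, cheap in CSC
print(row0, col3)
```

Converting between formats copies data, so the usual pattern is to build once in COO/LIL/DOK and then stay in CSR or CSC for the rest of the pipeline.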

Sparse-aware models

  • Linear with L1 (Lasso): produces sparse coefficients. Fit with SGD or coordinate descent.
  • Logistic regression: the liblinear and saga solvers accept sparse input.
  • Trees (XGBoost/LightGBM): learn a default split direction for missing values natively.
  • Factorization Machines (libfm, xLearn): for high-cardinality categoricals.
  • HashingVectorizer: feature hashing into a fixed dimension (a feature transform rather than a model; it feeds the models above).
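
To illustrate the "sparse coefficients" claim for L1-regularized linear models, here is a minimal sketch using scikit-learn's `Lasso` on a sparse input. The synthetic data, the three informative features, and `alpha=0.01` are all assumptions chosen for the demonstration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 50

# Synthetic design matrix with roughly 10% non-zero cells, stored as CSR.
values = rng.random((n_samples, n_features))
mask = rng.random((n_samples, n_features)) < 0.1
X = csr_matrix(values * mask)

# Only the first three features carry signal (assumption for the demo).
w_true = np.zeros(n_features)
w_true[:3] = [5.0, -4.0, 3.0]
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

# Lasso accepts the CSR matrix directly; the L1 penalty drives
# uninformative coefficients to exactly zero.
model = Lasso(alpha=0.01)
model.fit(X, y)

n_nonzero = int(np.count_nonzero(model.coef_))
print(f"non-zero coefficients: {n_nonzero} of {n_features}")
```

The input stays sparse throughout, and the fitted model itself is sparse, which keeps both training and inference cost proportional to the number of non-zeros.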

Missing-value strategies

  1. Drop: when a row or column is too sparse to be useful.
  2. Constant fill: 0 / mean / median / mode.
  3. Indicator + fill: carry the missingness itself as a feature.
  4. KNN impute: for small datasets.
  5. Iterative (MICE): chained regressions. sklearn IterativeImputer.
  6. Tree-native: LightGBM/XGBoost handle NaN directly.
  7. Deep: VAE-based imputation, GAIN.
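
Strategy 4 is the only entry in this list without a pattern elsewhere on the page, so here is a minimal sketch using scikit-learn's `KNNImputer`. The four-row array and `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative dataset; np.nan marks missing cells.
X = np.array([
    [1.0,    2.0],
    [2.0,    np.nan],
    [3.0,    6.0],
    [np.nan, 8.0],
])

# Each missing cell is filled with the mean of that feature over the
# nearest rows, where distance is computed on the jointly observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Distances are computed between rows, so cost grows quickly with the number of samples, which is why the list above restricts this strategy to small data.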

Applications

  1. NLP (TF-IDF, BoW).
  2. Recommender (user-item interaction).
  3. Genomics (one-hot variant matrix).
  4. Click-stream / session data.
  5. Tabular ML w/ missing fields.

💻 Patterns

scipy.sparse construction

from scipy.sparse import csr_matrix, coo_matrix
import numpy as np

# COO build then convert
rows = np.array([0, 1, 2, 0])
cols = np.array([0, 2, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0])
M = coo_matrix((data, (rows, cols)), shape=(3, 4)).tocsr()

# Slice / op
print(M.shape, M.nnz, M.nnz / np.prod(M.shape))  # shape, stored entries, density
print(M[0])             # row slice
print(M @ M.T)          # sparse @ sparse

sklearn sparse pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1,2))),
    ("clf", LogisticRegression(solver="liblinear", penalty="l1", C=1.0)),
])
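# texts_train / y_train are placeholders for your own documents and labels
pipe.fit(texts_train, y_train)
# TfidfVectorizer returns CSR; the liblinear solver consumes sparse input without densifying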

HashingVectorizer (no vocab, online)

from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=2**20, alternate_sign=False)
X = hv.transform(stream_of_docs)  # CSR
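
Because `HashingVectorizer` is stateless, it pairs naturally with an estimator that supports `partial_fit`. The sketch below shows that online pattern with `SGDClassifier`; the mini-batches are hypothetical toy data, and `n_features=2**18` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical stream of (texts, labels) mini-batches: 1 = spam, 0 = not spam.
batches = [
    (["cheap pills buy now", "meeting at noon tomorrow"], [1, 0]),
    (["buy cheap watches now", "lunch meeting moved to friday"], [1, 0]),
    (["limited offer buy now", "project meeting notes attached"], [1, 0]),
]

# No vocabulary to fit: transform() can be called on each batch independently.
hv = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

for texts, labels in batches:
    X_batch = hv.transform(texts)  # CSR with a fixed number of columns
    clf.partial_fit(X_batch, labels, classes=all_classes)

pred = clf.predict(hv.transform(["buy now", "meeting tomorrow"]))
print(pred)
```

Memory use is bounded by the batch size and the fixed hash dimension, at the price of occasional hash collisions and no way to map columns back to tokens.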

Iterative imputation (MICE)

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

imp = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),
    max_iter=10,
    random_state=0,
)
X_filled = imp.fit_transform(X_with_nan)

Missing indicator + fill

from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Median-filled values and 0/1 missingness flags, concatenated column-wise
union = FeatureUnion([
    ("fill_median", SimpleImputer(strategy="median")),
    ("ind",         MissingIndicator(features="all")),
])
X = union.fit_transform(X_raw)  # X_raw: placeholder numeric array containing NaN

LightGBM native NaN handling

import lightgbm as lgb
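# No imputation needed: each split learns which side NaN values should go to
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    use_missing=True,       # LightGBM default, shown for clarity
    zero_as_missing=False,  # keep 0 as a real value; only NaN counts as missing
)
model.fit(X_train_with_nan, y_train)  # placeholders for your own data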

PyTorch sparse tensor

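import torch

# COO indices: first row holds row ids, second row holds column ids
i = torch.tensor([[0, 1, 2], [2, 0, 1]])
v = torch.tensor([3.0, 4.0, 5.0])
sp = torch.sparse_coo_tensor(i, v, (3, 3)).coalesce()
sp_csr = sp.to_sparse_csr()
dense_matrix = torch.randn(3, 2)  # placeholder dense right-hand side
out = torch.sparse.mm(sp_csr, dense_matrix)  # sparse @ dense -> dense (3, 2)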

cuSPARSE-backed sparse @ dense (GPU)

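import cupy as cp
from cupyx.scipy.sparse import csr_matrix as cp_csr

# X: an existing scipy.sparse CSR matrix on the host; w: a NumPy weight vector (placeholders)
X_gpu = cp_csr(X)      # the constructor accepts a SciPy sparse matrix and copies it to the GPU
w_gpu = cp.asarray(w)
y = X_gpu @ w_gpu      # sparse @ dense on the GPU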

Polars lazy missing handling

import polars as pl

df = (
    pl.scan_parquet("events/*.parquet")
      .with_columns([
          pl.col("price").fill_null(pl.col("price").median()).alias("price"),  # fill_null has no "median" strategy
          pl.col("category").fill_null("unknown"),
          pl.col("price").is_null().cast(pl.Int8).alias("price_was_missing"),
      ])
      .collect()
)

Sparsity diagnostics

def sparsity_report(X):
    if hasattr(X, "nnz"):
        density = X.nnz / (X.shape[0] * X.shape[1])
    else:
        import numpy as np
        density = np.count_nonzero(X) / X.size
    print(f"shape={X.shape}, density={density:.4%}, sparse={1-density:.4%}")
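
Alongside the density report, it can help to estimate what a dense copy would cost before calling `toarray()`. The sketch below is one way to do that; `dense_nbytes`, `sparse_nbytes`, `safe_toarray`, and the 1 GiB default budget are hypothetical helpers and values, not part of scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix


def dense_nbytes(X):
    """Bytes that X.toarray() would allocate."""
    return int(X.shape[0]) * int(X.shape[1]) * X.dtype.itemsize


def sparse_nbytes(X):
    """Bytes currently used by the three CSR/CSC arrays."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


def safe_toarray(X, max_bytes=2**30):
    """Densify only if the result fits the budget (hypothetical 1 GiB default)."""
    needed = dense_nbytes(X)
    if needed > max_bytes:
        raise MemoryError(f"dense copy needs {needed:,} bytes, budget is {max_bytes:,}")
    return X.toarray()


# 100,000 x 100,000 float64 matrix with only three stored values.
big = csr_matrix((np.ones(3), ([0, 1, 2], [0, 1, 2])), shape=(100_000, 100_000))
print(f"sparse: {sparse_nbytes(big):,} bytes, dense: {dense_nbytes(big):,} bytes")

small = csr_matrix(np.eye(3))
print(safe_toarray(small))
```

Here the dense copy would need 80 GB against well under a megabyte in CSR, which is the gap the guard is meant to catch.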

Decision criteria

| Situation | Approach |
| --- | --- |
| Density < 5%, structurally sparse | scipy CSR + sparse-aware model |
| Missing < 5% | Drop or median fill |
| Missing 5-30% | Indicator + iterative impute |
| Missing > 30% on key col | Drop column or model NaN-native (LightGBM) |
| Tabular w/ NaN | LightGBM/XGBoost (native) |
| Online stream | HashingVectorizer + SGDClassifier |
| GPU @ scale | cuSPARSE / torch.sparse |

Default: scipy CSR + sklearn pipeline; for tabular data with missing values, use LightGBM's native handling.
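
The numeric rows of the decision criteria can be expressed as a small helper. This is only a sketch of the thresholds listed above; `suggest_approach` is a hypothetical function, and treating exactly 5% and 30% as belonging to the middle band is an assumption, since the criteria do not say which side the boundaries fall on.

```python
def suggest_approach(density=None, missing_frac=None):
    """Map the numeric decision thresholds to recommendation strings.

    density: fraction of non-zero cells (structural sparsity), or None.
    missing_frac: fraction of NaN/None cells in a column, or None.
    """
    suggestions = []
    if density is not None and density < 0.05:
        suggestions.append("scipy CSR + sparse-aware model")
    if missing_frac is not None and missing_frac > 0:
        if missing_frac < 0.05:
            suggestions.append("drop or median fill")
        elif missing_frac <= 0.30:
            suggestions.append("indicator + iterative impute")
        else:
            # The criteria state this case for key columns only.
            suggestions.append("drop column or NaN-native model (LightGBM)")
    return suggestions


print(suggest_approach(density=0.01))
print(suggest_approach(missing_frac=0.12))
print(suggest_approach(density=0.02, missing_frac=0.45))
```

The two inputs are evaluated independently because structural sparsity and missingness are separate problems; a dataset can trigger both recommendations at once.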

🔗 Graph

🤖 LLM usage

When to use: explaining the rationale for an imputation strategy, scaffolding a sparse pipeline, explaining missingness mechanisms (MCAR/MAR/MNAR). When not to: the numerical imputation itself (use IterativeImputer/MICE) or distribution tests (use statsmodels).

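Anti-patterns

  • Sparse → dense conversion: instant OOM. Ban X.toarray() when N×D > 10^9 (about 8 GB as float64).
  • Mean-filling a skewed distribution: a long tail distorts the mean. Default to the median.
  • Dropping NaN rows and losing 30% of the data: loss of statistical power. Use indicator + impute instead.
  • Imputation leakage: fitting the imputer on train and test together leaks test information. Fit on train only, then transform both.
  • Ignoring MNAR: when values are missing not at random, imputation is biased. The missingness indicator is critical.
  • Sparse + StandardScaler(with_mean=True): centering would densify the matrix (scikit-learn raises an error on sparse input). Use with_mean=False.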
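
Two of these anti-patterns have a direct code-level fix, sketched below under assumed synthetic data: the imputer is fitted on the training split only and then applied to both splits, and the scaler is configured with `with_mean=False` so the sparse matrix stays sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Leakage-safe imputation on skewed synthetic data with about 20% missing cells.
X = rng.lognormal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_filled = imputer.fit_transform(X_train)  # medians come from train only
X_test_filled = imputer.transform(X_test)        # test reuses the train medians

# Sparse-safe scaling: divide by the standard deviation, skip centering.
X_sparse = csr_matrix(rng.random((50, 20)) * (rng.random((50, 20)) < 0.1))
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print(imputer.statistics_)
print(issparse(X_scaled), X_scaled.nnz == X_sparse.nnz)
```

Wrapping the imputer and model in a single `Pipeline` gives the same guarantee automatically during cross-validation, because each fold refits the imputer on its own training portion.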

🧪 Verification / duplicates

  • Verified against: scipy.sparse docs 1.14; sklearn IterativeImputer; LightGBM missing-value docs; Little & Rubin, "Statistical Analysis with Missing Data", 3rd ed.
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (CSR/MICE/LightGBM/cuSPARSE patterns + missingness taxonomy) |