id: wiki-2026-0508-exploratory-data-analysis
title: Exploratory Data Analysis (EDA)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: EDA, data exploration, Tukey, pandas-profiling, sweetviz, descriptive analytics
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: data-science, eda, statistics, pandas, visualization, tukey, profiling
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: pandas / matplotlib / seaborn / ydata-profiling

Exploratory Data Analysis (EDA)

One-line summary

"Understand the data before you model it." The approach goes back to Tukey (1977). EDA covers distributions, missing values, outliers, correlations, and leakage. Modern additions: ydata-profiling (automated reports), LLM-aided EDA, and interactive Plotly charts.

Key points

Steps

  1. Schema: dtypes, shape.
  2. Univariate: distributions, missing values.
  3. Bivariate: correlation, scatter plots.
  4. Outliers: IQR, z-score.
  5. Target: class balance (classification), target distribution (regression).
  6. Leakage: features that directly encode the target.
  7. Time: trend, seasonality.

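Modern tools

  • ydata-profiling (formerly pandas-profiling).
  • sweetviz.
  • DataPrep.
  • Polars (often cited as ~10x faster than pandas; the actual speedup depends on the workload).
  • DuckDB (SQL queries directly over pandas DataFrames).
  • Plotly Express.
  • LLM-aided EDA (Claude / ChatGPT).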

💻 Patterns

Quick scan

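import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)                    # (rows, columns)
print(df.dtypes)                   # column types
print(df.describe(include='all'))  # summary stats for numeric and non-numeric columns
print(df.isna().sum())             # missing values per column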

Auto-profile (ydata-profiling)

from ydata_profiling import ProfileReport
report = ProfileReport(df, title='EDA', explorative=True)
report.to_file('eda.html')

Missing pattern

import missingno as msno
msno.matrix(df)
msno.heatmap(df)  # correlation of missingness between columns
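
When missingno is not available, the same numbers can be approximated with pandas alone. This is a minimal sketch on a made-up frame (`toy` and its columns are illustrative): it computes the share of missing values per column and the correlation between missingness indicators, which is roughly what the heatmap visualizes.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is missing exactly where 'a' is missing; 'c' is missing elsewhere.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan if i in (1, 3) else float(i) for i in range(6)],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

missing_share = toy.isna().mean()              # fraction of missing values per column
nullity_corr = toy.isna().astype(int).corr()   # correlation between missingness indicators

print(missing_share)
print(nullity_corr.round(2))
```

A nullity correlation near 1 suggests two columns go missing together (often the same upstream cause), which is worth knowing before choosing an imputation strategy.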

Univariate (numeric)

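import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(df['amount'], kde=True)
plt.figure()  # new figure so the boxplot does not draw over the histogram
sns.boxplot(x=df['amount'])
print(df['amount'].skew(), df['amount'].kurt())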

Univariate (categorical)

df['category'].value_counts(normalize=True).plot(kind='bar')

Outlier (IQR)

def iqr_outliers(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]
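
A short usage sketch for the IQR rule, with the function repeated so the block runs on its own. The `amounts` series is made-up data.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences).
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Made-up data: seven typical values and one extreme value.
amounts = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
flagged = iqr_outliers(amounts)
print(flagged)
```

Raising `k` (for example to 3) widens the fences and flags only more extreme values, which is a common choice for heavy-tailed data.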

Outlier (z-score)

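import numpy as np
from scipy.stats import zscore
z = np.abs(zscore(df.select_dtypes('number'), nan_policy='omit'))  # |z| catches both tails; NaNs are skipped
df[(z > 3).any(axis=1)]  # rows with at least one |z| > 3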
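
A z-score filter needs the absolute value to catch outliers in both tails. The sketch below uses plain NumPy on made-up data, with the population standard deviation (`ddof=0`, the same default `scipy.stats.zscore` uses), and compares a one-sided test with a two-sided one.

```python
import numpy as np

# Made-up data: 30 typical values and one extreme low value.
values = np.array([9.0, 11.0] * 15 + [-50.0])

# Standard score with population standard deviation (ddof=0).
z = (values - values.mean()) / values.std()

one_sided = values[z > 3]            # misses the low-side outlier
two_sided = values[np.abs(z) > 3]    # catches outliers in both tails

print(one_sided, two_sided)
```

Note that the outlier itself inflates the mean and standard deviation, so on small samples the z-score rule can be less sensitive than the IQR rule.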

Correlation

import seaborn as sns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')

Mutual information (non-linear)

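from sklearn.feature_selection import mutual_info_regression
# X: numeric feature DataFrame without NaNs; y: continuous target
# (for a class label, use mutual_info_classif instead)
mi = mutual_info_regression(X, y)
pd.Series(mi, index=X.columns).sort_values().plot(kind='barh')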
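
To see why mutual information complements the correlation heatmap, here is a minimal sketch on synthetic data (the column names `quadratic` and `noise` are made up). The target depends on one feature through a square, so Pearson correlation is close to zero while mutual information is clearly positive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000

# Synthetic features: one drives the target non-linearly, one is pure noise.
X = pd.DataFrame({
    "quadratic": rng.uniform(-1, 1, size=n),
    "noise": rng.normal(size=n),
})
y = X["quadratic"] ** 2 + rng.normal(scale=0.05, size=n)

pearson = X.corrwith(y)  # linear association only
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)

print(pd.DataFrame({"pearson": pearson, "mutual_info": mi}).round(3))
```

The mutual information estimate is based on nearest neighbors, so its values vary somewhat with sample size and `random_state`; it is best read as a ranking rather than an exact quantity.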

Pairplot

sns.pairplot(df, hue='target', diag_kind='kde')

Time series quick

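from statsmodels.tsa.seasonal import seasonal_decompose
ts = df.set_index('date').sort_index()  # assumes 'date' is already a datetime column
ts.resample('W').mean(numeric_only=True).plot()  # weekly mean of numeric columns
seasonal_decompose(ts['target'].dropna(), period=7).plot()  # period=7: weekly cycle in daily data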

Class imbalance

print(df['target'].value_counts(normalize=True))
sns.countplot(x='target', data=df)
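
Class balance matters because it sets the baseline any model must beat. A minimal sketch on a made-up 95/5 target: a predictor that always outputs the majority class looks strong on accuracy and weak on balanced accuracy (the mean of per-class recall).

```python
import pandas as pd

# Made-up target: 95 negatives, 5 positives.
target = pd.Series([0] * 95 + [1] * 5, name="target")

shares = target.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = float(shares.max())   # accuracy of always predicting the majority class

# Per-class recall of the majority-class predictor, then their mean.
pred = pd.Series([majority_class] * len(target))
recall_per_class = {c: float((pred[target == c] == c).mean()) for c in shares.index}
balanced_accuracy = sum(recall_per_class.values()) / len(recall_per_class)

print(baseline_accuracy, balanced_accuracy)
```

If the class shares look like this during EDA, plain accuracy is a poor headline metric; balanced accuracy, recall, or precision-recall curves are usually more informative.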

Leakage detection

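from sklearn.linear_model import LogisticRegression
# Single-feature probe: one column that almost perfectly predicts the target is suspicious.
# Accuracy is misleading on imbalanced targets; compare against the majority-class rate.
for col in df.select_dtypes('number').columns:  # LogisticRegression needs numeric input
    if col == 'target': continue
    X_col = df[[col]].fillna(0)
    score = LogisticRegression(max_iter=1000).fit(X_col, df['target']).score(X_col, df['target'])
    if score > 0.95: print(f'⚠️ leakage suspected: {col} → target ({score:.2f})')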
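
A variant of the same idea, sketched under stated assumptions: instead of fitting a model per column, it scores each numeric feature directly with ROC AUC, which is less sensitive to class imbalance than accuracy. This swaps in a different probe from the logistic-regression loop; `leakage_scan`, the 0.95 threshold, and the synthetic columns are illustrative, and the target is assumed to be binary.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def leakage_scan(df, target, threshold=0.95):
    """Return numeric columns whose single-feature AUC against a binary target is suspiciously high."""
    suspects = {}
    for col in df.select_dtypes("number").columns:
        if col == target:
            continue
        x = df[col].fillna(df[col].median())
        auc = roc_auc_score(df[target], x)
        auc = max(auc, 1 - auc)  # direction of the relationship does not matter
        if auc >= threshold:
            suspects[col] = round(float(auc), 3)
    return suspects

# Synthetic data: 'leaked' is the target plus tiny noise, 'honest' is only weakly related.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=1000)
toy = pd.DataFrame({
    "honest": 0.5 * y + rng.normal(size=1000),
    "leaked": y + rng.normal(scale=0.01, size=1000),
    "target": y,
})

suspects = leakage_scan(toy, "target")
print(suspects)
```

A flagged column is only a prompt for investigation: check how and when the column is produced before dropping it, since a legitimately strong feature can also score high.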

Plotly interactive

import plotly.express as px
px.scatter_matrix(df, dimensions=['a', 'b', 'c'], color='target').show()

LLM-aided EDA

def llm_eda(df, llm):
    # `llm` is a placeholder: any client object exposing generate(prompt) -> str
    schema = df.head().to_string() + '\n' + df.describe().to_string()
    prompt = f"""You are a data scientist. Given this data summary:
{schema}

Suggest:
1. 5 hypotheses to test
2. Likely data quality issues
3. Feature engineering ideas"""
    return llm.generate(prompt)
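
To try the prompt construction without calling a real model, a stub client is enough. In this sketch `EchoLLM` is a hypothetical stand-in that only records the prompt; a real client would need an adapter exposing the same `generate(prompt)` method. The function is repeated so the block runs on its own.

```python
import pandas as pd

class EchoLLM:
    """Hypothetical stand-in for a real LLM client; it only records the prompt."""
    def __init__(self):
        self.last_prompt = None

    def generate(self, prompt):
        self.last_prompt = prompt
        return "stub response"

def llm_eda(df, llm):
    # Summarize the frame as text, then ask for hypotheses and quality issues.
    schema = df.head().to_string() + "\n" + df.describe().to_string()
    prompt = (
        "You are a data scientist. Given this data summary:\n"
        f"{schema}\n\n"
        "Suggest:\n"
        "1. 5 hypotheses to test\n"
        "2. Likely data quality issues\n"
        "3. Feature engineering ideas"
    )
    return llm.generate(prompt)

# Made-up data for illustration only.
toy = pd.DataFrame({"amount": [10.0, 12.5, 99.0], "qty": [1, 2, 3]})
stub = EchoLLM()
answer = llm_eda(toy, stub)
print(stub.last_prompt)
```

Because `df.head()` puts raw rows into the prompt, it is worth inspecting the prompt this way before sending sensitive data to an external service.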

High-cardinality categorical

def high_cardinality(df, threshold=50):
    return [c for c in df.select_dtypes('object').columns if df[c].nunique() > threshold]
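
A usage sketch on made-up data, with the function repeated so the block runs on its own. The columns are built with an explicit `object` dtype because the helper selects on `object`; frames whose text columns use a dedicated string dtype would need the selection adjusted.

```python
import pandas as pd

def high_cardinality(df, threshold=50):
    # Object-dtype columns with more distinct values than the threshold.
    return [c for c in df.select_dtypes("object").columns if df[c].nunique() > threshold]

# Made-up data: 'user_id' is unique per row, 'country' has two values.
toy = pd.DataFrame({
    "user_id": pd.Series([f"u{i}" for i in range(100)], dtype=object),
    "country": pd.Series(["KR", "US"] * 50, dtype=object),
    "amount": range(100),
})

flagged = high_cardinality(toy)
print(flagged)
```

Columns flagged this way are usually identifiers or free text, and one-hot encoding them would blow up the feature count.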

Datetime feature peek

def datetime_summary(s):
    return {
        'min': s.min(), 'max': s.max(),
        'gaps': s.sort_values().diff().describe(),
        'weekday_dist': s.dt.dayofweek.value_counts().to_dict(),
    }
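
A usage sketch on four made-up timestamps, with the function repeated so the block runs on its own. It shows the shape of the returned summary: range, gap statistics, and weekday counts (0 = Monday).

```python
import pandas as pd

def datetime_summary(s):
    return {
        "min": s.min(), "max": s.max(),
        "gaps": s.sort_values().diff().describe(),               # spacing between consecutive timestamps
        "weekday_dist": s.dt.dayofweek.value_counts().to_dict(), # 0 = Monday ... 6 = Sunday
    }

# Made-up timestamps, deliberately out of order.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-08", "2024-01-05"]))
summary = datetime_summary(dates)

print(summary["min"], summary["max"])
print(summary["gaps"])
print(summary["weekday_dist"])
```

Large values in the gap statistics are a quick hint of missing periods, which matters before resampling or seasonal decomposition.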

Polars (faster)

import polars as pl
df = pl.read_csv('big.csv')
print(df.describe())
df.group_by('cat').agg(pl.col('amount').mean()).head()

Decision criteria

| Situation | Approach |
| --- | --- |
| Quick check | describe + missing + value_counts |
| Auto-report | ydata-profiling |
| Big data | Polars / DuckDB |
| Interactive | Plotly Express |
| ML prep | Add leakage + correlation checks |
| LLM-aided | Send the schema, ask for suggested hypotheses |

Default: schema + missing + describe + correlation, a quick ydata-profiling diagnosis, and a leakage scan.
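
The pandas-only part of that default can be bundled into one helper. This is a minimal sketch, not a fixed recipe: `quick_eda`, its return keys, and the toy frame are illustrative names, and the profiling report and leakage scan would still run separately.

```python
import numpy as np
import pandas as pd

def quick_eda(df, target=None):
    """Collect the default first-pass checks into one dict."""
    summary = {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_share": df.isna().mean().round(3).to_dict(),
        "describe": df.describe(include="all"),
        "corr": df.corr(numeric_only=True),
    }
    if target is not None:
        # Class shares (or value shares) of the target column.
        summary["target_balance"] = df[target].value_counts(normalize=True).to_dict()
    return summary

# Made-up data for illustration only.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, np.nan, 40.0],
    "qty": [1, 2, 3, 4],
    "target": [0, 1, 0, 1],
})
report = quick_eda(toy, target="target")
print(report["shape"], report["missing_share"], report["target_balance"])
```

Returning a dict rather than printing keeps the results easy to log or compare between dataset versions.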

🔗 Graph

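🤖 LLM usage

When to use: at the start of every ML / DS project. When not to: the schema and data are already well understood.

Anti-patterns

  • Skipping EDA: garbage in, garbage out for the model.
  • Auto-only: automated reports miss domain context.
  • No leakage check: scores look high but are fake.
  • Plotting everything: the signal drowns in noise.
  • Ignoring class imbalance: leads to choosing the wrong metric.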

🧪 Verification / duplicates

  • Verified against Tukey (1977) and Wickham, R for Data Science.
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-04-26 | EDA auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: EDA + ydata / IQR / MI / leakage / LLM code |