| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-exploratory-data-analysis | Exploratory Data Analysis (EDA) | 10_Wiki/Topics | verified | self | | none | A | 0.98 | applied | | | 2026-05-10 | pending | |
Exploratory Data Analysis (EDA)
One-line summary
"Understand the data before you model it." The practice goes back to Tukey (1977). Check distributions, missing values, outliers, correlations, and leakage. Modern additions: ydata-profiling (automated reports), LLM-aided EDA, and interactive Plotly charts.
Key points
Steps
- Schema: dtypes and shape.
- Univariate: distributions and missing values.
- Bivariate: correlations and scatter plots.
- Outliers: IQR and z-score.
- Target: class balance, or the target distribution for regression.
- Leakage: features that encode the target.
- Time: trend and seasonality.
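Modern tools
- ydata-profiling (formerly pandas-profiling).
- sweetviz.
- DataPrep.
- Polars (often much faster than pandas on large data; this note cites roughly 10x).
- DuckDB (SQL queries directly over pandas DataFrames).
- Plotly Express.
- LLM-aided EDA (Claude / ChatGPT).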
💻 Patterns
Quick scan
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes, df.describe(include='all'), df.isna().sum())
Auto-profile (ydata-profiling)
from ydata_profiling import ProfileReport
report = ProfileReport(df, title='EDA', explorative=True)
report.to_file('eda.html')
Missing pattern
import missingno as msno
msno.matrix(df)
msno.heatmap(df)  # correlation of missingness between columns
Univariate (numeric)
import seaborn as sns
sns.histplot(df['amount'], kde=True)
sns.boxplot(x=df['amount'])
print(df['amount'].skew(), df['amount'].kurt())
Univariate (categorical)
df['category'].value_counts(normalize=True).plot(kind='bar')
Outlier (IQR)
def iqr_outliers(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]
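To make the fence rule concrete, here is a small worked example. It is a sketch that assumes pandas' default linear-interpolation quantiles, and the sample values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Return the values of s outside Tukey's fences [q1 - k*iqr, q3 + k*iqr]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Illustrative data: four ordinary values and one extreme one.
amounts = pd.Series([1, 2, 3, 4, 100])

# q1 = 2 and q3 = 4, so iqr = 2 and the fences are [-1, 7].
outliers = iqr_outliers(amounts)
print(outliers.tolist())  # [100]
```

Raising `k` (3.0 is a common choice for "far out" points) widens the fences and flags fewer rows.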
Outlier (z-score)
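import numpy as np
from scipy.stats import zscore
z = np.abs(zscore(df.select_dtypes('number'), nan_policy='omit'))  # |z| catches low-side outliers; NaNs are skipped
df[(z > 3).any(axis=1)]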
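One caveat is worth checking before relying on a fixed cutoff of 3. With the population standard deviation (`ddof=0`, the default for both `scipy.stats.zscore` and `numpy.std`), Samuelson's inequality caps every |z| at sqrt(n - 1), so the cutoff can never fire on ten or fewer rows. The sketch below uses made-up values and plain NumPy to show an obvious outlier going unflagged.

```python
import numpy as np

# Illustrative sample: eight ordinary values and one obvious outlier.
x = np.array([1.0] * 8 + [1000.0])

# Plain z-score with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std()
max_abs_z = np.abs(z).max()

# Samuelson's inequality bounds |z| by sqrt(n - 1), here sqrt(8), about 2.83.
bound = np.sqrt(len(x) - 1)
flagged = bool((np.abs(z) > 3).any())
print(round(float(max_abs_z), 2), round(float(bound), 2), flagged)
```

On small samples the IQR rule is usually the safer default, because a single extreme value inflates the standard deviation it is being measured against.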
Correlation
import seaborn as sns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
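Pearson correlation, the default for `df.corr`, measures only linear association, so a near-zero cell in the heatmap does not mean two columns are unrelated. A minimal sketch with synthetic data (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data: y depends on x exactly, but not linearly.
x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({"x": x, "y": x ** 2, "y_linear": 2 * x + 1})

corr = demo.corr()
r_quadratic = corr.loc["x", "y"]      # close to 0 despite perfect dependence
r_linear = corr.loc["x", "y_linear"]  # close to 1
print(f"{abs(r_quadratic):.3f} {r_linear:.3f}")
```

This is the reason to pair the heatmap with scatter plots or a dependence measure that is not limited to linear relationships.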
Mutual information (non-linear)
from sklearn.feature_selection import mutual_info_regression
# X: numeric feature DataFrame, y: continuous target (both assumed to be defined already)
mi = mutual_info_regression(X, y)
pd.Series(mi, index=X.columns).sort_values().plot(kind='barh')
Pairplot
sns.pairplot(df, hue='target', diag_kind='kde')
Time series quick
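ts = df.set_index('date')  # assumes 'date' is already a datetime column
ts.resample('W').mean(numeric_only=True).plot()
from statsmodels.tsa.seasonal import seasonal_decompose
seasonal_decompose(ts['target'], period=7).plot()  # period=7 assumes daily data with a weekly cycle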
Class imbalance
print(df['target'].value_counts(normalize=True))
sns.countplot(x='target', data=df)
Leakage detection
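from sklearn.linear_model import LogisticRegression
# Fit a one-feature model per numeric column; near-perfect accuracy hints at leakage.
# Compare against the majority-class rate: on imbalanced targets 0.95 can be trivial.
y = df['target']
for col in df.select_dtypes('number').columns:
    if col == 'target': continue
    X_col = df[[col]].fillna(0)
    score = LogisticRegression().fit(X_col, y).score(X_col, y)
    if score > 0.95: print(f'⚠️ leakage suspected: {col} → target ({score:.2f})')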
Plotly interactive
import plotly.express as px
px.scatter_matrix(df, dimensions=['a', 'b', 'c'], color='target').show()
LLM-aided EDA
def llm_eda(df, llm):
    # llm: any client object exposing generate(prompt) -> str (hypothetical interface)
    schema = df.head().to_string() + '\n' + df.describe().to_string()
    prompt = f"""You are a data scientist. Given this data summary:
{schema}
Suggest:
1. 5 hypotheses to test
2. Likely data quality issues
3. Feature engineering ideas"""
    return llm.generate(prompt)
High-cardinality categorical
def high_cardinality(df, threshold=50):
    return [c for c in df.select_dtypes('object').columns if df[c].nunique() > threshold]
Datetime feature peek
def datetime_summary(s):
    return {
        'min': s.min(), 'max': s.max(),
        'gaps': s.sort_values().diff().describe(),
        'weekday_dist': s.dt.dayofweek.value_counts().to_dict(),
    }
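A short usage sketch shows what the summary returns. The timestamps are made up, and the function is repeated so the block runs on its own.

```python
import pandas as pd

def datetime_summary(s):
    """Range, gap statistics, and weekday counts for a datetime Series."""
    return {
        "min": s.min(), "max": s.max(),
        "gaps": s.sort_values().diff().describe(),
        "weekday_dist": s.dt.dayofweek.value_counts().to_dict(),
    }

# Illustrative timestamps: Monday, Tuesday, then a jump to Friday.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]))
summary = datetime_summary(dates)

print(summary["gaps"]["max"])   # largest gap between consecutive timestamps
print(summary["weekday_dist"])  # keys run from 0 (Monday) to 6 (Sunday)
```

A maximum gap much larger than the median gap usually points to missing collection periods, and a lopsided weekday distribution suggests the data was only recorded on certain days.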
Polars (faster)
import polars as pl
df = pl.read_csv('big.csv')
print(df.describe())
df.group_by('cat').agg(pl.col('amount').mean()).head()
Decision criteria
| Situation | Approach |
|---|---|
| Quick check | describe + missing + value_counts |
| Auto-report | ydata-profiling |
| Big data | Polars / DuckDB |
| Interactive | Plotly Express |
| ML prep | Quick check plus leakage scan and correlation |
| LLM-aided | Schema summary → suggested hypotheses |
Default: schema + missing + describe + correlation, a quick ydata-profiling diagnostic, and a leakage scan.
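The pandas-only part of that default can be bundled into one helper. This is a minimal sketch: `quick_eda` is a hypothetical name, the sample frame is made up, and the profiling report and leakage scan are left out because they need an extra dependency and a known target column.

```python
import numpy as np
import pandas as pd

def quick_eda(df):
    """Bundle the default first-pass checks into one dict (hypothetical helper)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_ratio": df.isna().mean().to_dict(),
        "describe": df.describe(include="all"),
        "corr": df.corr(numeric_only=True),
    }

# Made-up frame for illustration only.
df = pd.DataFrame({
    "amount": [10.0, 12.0, np.nan, 15.0],
    "qty": [1, 2, 3, 4],
    "category": ["a", "b", "a", None],
})
report = quick_eda(df)
print(report["shape"], report["missing_ratio"])
```

Returning a dict rather than printing keeps the results easy to inspect in a notebook or to pass into a later step.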
🔗 Graph
- Parent: Data-Science · Statistics
- Applications: Feature Engineering · Data-Cleaning
- Adjacent: Tukey
🤖 Using LLMs
When: at the start of every ML / DS project. When not: the schema is already well known.
❌ Anti-patterns
- Skipping EDA: garbage in, garbage out for the model.
- Auto-reports only: domain context gets missed.
- No leakage check: scores look high but are fake.
- Plotting everything: noise drowns the signal.
- Ignoring class imbalance: leads to choosing the wrong metric.
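The last point is easy to verify with arithmetic. In the sketch below (made-up labels, standard library only), a predictor that always outputs the majority class reaches 95% accuracy while finding none of the positives. The same effect means the 0.95 accuracy threshold in the leakage scan should be read against the majority-class rate.

```python
from collections import Counter

# Made-up labels for illustration: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of rows with this true label that were predicted correctly."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits)

# Balanced accuracy is the mean of per-class recall.
balanced_accuracy = (recall(0) + recall(1)) / 2
print(accuracy, balanced_accuracy)  # 0.95 0.5
```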
🧪 Verification / Duplicates
- Verified (Tukey 1977; Wickham, R for Data Science).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | EDA auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: EDA plus ydata / IQR / MI / leakage / LLM code |