---
id: wiki-2026-0508-exploratory-data-analysis
title: Exploratory Data Analysis (EDA)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [EDA, data exploration, Tukey, pandas-profiling, sweetviz, descriptive analytics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: [data-science, eda, statistics, pandas, visualization, tukey, profiling]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: pandas / matplotlib / seaborn / ydata-profiling
---

# Exploratory Data Analysis (EDA)

## One-line summary
> **"Understand the data before you model it."** Tukey (1977). Check distributions, missing values, outliers, correlations, and leakage. Modern additions: ydata-profiling (automated reports), LLM-aided EDA, and interactive Plotly charts.

## Core

### Steps
1. **Schema**: dtypes and shape.
2. **Univariate**: distributions and missing values.
3. **Bivariate**: correlations and scatter plots.
4. **Outliers**: IQR, z-score.
5. **Target**: class balance, or the distribution of a regression target.
6. **Leakage**: features that encode the target.
7. **Time**: trend and seasonality.
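
The first two steps can be folded into one per-column table. This is a minimal sketch, assuming pandas; the helper name `column_overview` and the toy DataFrame are illustrative and not part of any library.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing share, and distinct-value count (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    }).sort_values("missing_pct", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "amount": [10.0, None, 30.0, 40.0],
    "category": ["a", "b", "b", None],
    "id": [1, 2, 3, 4],
})
overview = column_overview(toy)
print(overview)
```

Sorting by `missing_pct` puts the columns that need attention first; a column whose `n_unique` equals the row count is usually an identifier rather than a feature.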

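### Modern tools
- **ydata-profiling** (formerly pandas-profiling).
- **sweetviz**.
- **DataPrep**.
- **Polars** (often much faster than pandas on large data; the speedup depends on the workload).
- **DuckDB** (SQL queries directly over pandas DataFrames).
- **Plotly Express**.
- **LLM-EDA** (Claude / ChatGPT).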

## 💻 Patterns

### Quick scan
```python
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape, df.dtypes, df.describe(include='all'), df.isna().sum())
```

### Auto-profile (ydata-profiling)
```python
from ydata_profiling import ProfileReport
report = ProfileReport(df, title='EDA', explorative=True)
report.to_file('eda.html')
```

### Missing pattern
```python
import missingno as msno
msno.matrix(df)
msno.heatmap(df)  # correlation between the missingness of column pairs
```
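
When missingno is not installed, a similar check can be sketched in plain pandas: correlate the missingness indicators of each column, which is roughly the quantity `msno.heatmap` visualizes. The toy DataFrame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is missing exactly where 'a' is missing, 'c' is missing elsewhere
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],
})

mask = toy.isna()                       # True where a value is missing
mask = mask.loc[:, mask.any()]          # keep only columns that have gaps
nullity_corr = mask.astype(int).corr()  # correlation between missingness indicators
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing together, which often points to a shared upstream cause such as a failed join or an optional form section.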

### Univariate (numeric)
```python
import seaborn as sns
sns.histplot(df['amount'], kde=True)
sns.boxplot(x=df['amount'])
print(df['amount'].skew(), df['amount'].kurt())
```

### Univariate (categorical)
```python
df['category'].value_counts(normalize=True).plot(kind='bar')
```

### Outlier (IQR)
```python
def iqr_outliers(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]
```
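
A small worked example makes the fences concrete. This sketch assumes pandas' default linear-interpolation quantiles; `iqr_fences` is a hypothetical helper that returns the bounds instead of the filtered values.

```python
import pandas as pd


def iqr_fences(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


toy = pd.Series([1, 2, 3, 4, 100])
low, high = iqr_fences(toy)
outliers = toy[(toy < low) | (toy > high)]
print(low, high, outliers.tolist())
```

Here Q1 = 2 and Q3 = 4, so the IQR is 2 and the fences are 2 - 3 = -1 and 4 + 3 = 7; only the value 100 falls outside them.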

### Outlier (z-score)
```python
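import numpy as np
from scipy.stats import zscore
z = np.abs(zscore(df.select_dtypes('number'), nan_policy='omit'))  # |z| also catches low-side outliers; NaNs are skipped
df[(z > 3).any(axis=1)]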
```

### Correlation
```python
import seaborn as sns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
```
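
With many columns a heatmap gets hard to read, so it can help to list only the strongest pairs. The sketch below assumes pandas and NumPy; `top_correlated_pairs` is a hypothetical helper, and Spearman is used here as one reasonable choice because it also picks up monotonic non-linear relationships.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, n: int = 5, method: str = "spearman") -> pd.Series:
    """Largest absolute pairwise correlations, each pair listed once (hypothetical helper)."""
    corr = df.corr(method=method, numeric_only=True).abs()
    # Keep the strict upper triangle to drop self-correlations and mirrored duplicates
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)


# Toy data: 'y' is a monotonic but non-linear function of 'x'
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125], "z": [2, 1, 4, 3, 5]})
pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

Pairs with an absolute correlation close to 1 are candidates for redundancy, and a feature that correlates that strongly with the target deserves a leakage check.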

### Mutual information (non-linear)
```python
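from sklearn.feature_selection import mutual_info_regression
# X: numeric features without NaNs; y: continuous target (use mutual_info_classif for class labels)
X = df.select_dtypes('number').drop(columns='target', errors='ignore').dropna()
y = df.loc[X.index, 'target']
mi = mutual_info_regression(X, y, random_state=0)
pd.Series(mi, index=X.columns).sort_values().plot(kind='barh')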
```

### Pairplot
```python
sns.pairplot(df, hue='target', diag_kind='kde')
```

### Time series quick
```python
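from statsmodels.tsa.seasonal import seasonal_decompose
ts = df.assign(date=pd.to_datetime(df['date'])).set_index('date').sort_index()
ts.resample('W').mean(numeric_only=True).plot()  # weekly trend; resample needs a DatetimeIndex
# period=7 assumes daily rows with weekly seasonality; the series must not contain NaNs
seasonal_decompose(ts['target'].dropna(), period=7).plot()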
```

### Class imbalance
```python
print(df['target'].value_counts(normalize=True))
sns.countplot(x='target', data=df)
```
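
The class shares also give the accuracy a trivial model would reach, which is the number any real model has to beat. A minimal sketch with made-up labels (95 negatives, 5 positives):

```python
import pandas as pd

# Toy labels for illustration: 95 negatives, 5 positives
target = pd.Series([0] * 95 + [1] * 5, name="target")

shares = target.value_counts(normalize=True)
majority_baseline = shares.max()               # accuracy of always predicting the majority class
imbalance_ratio = shares.max() / shares.min()  # majority share divided by minority share

print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
```

With a 19:1 split, always predicting the majority class already scores 0.95 accuracy, so metrics such as precision, recall, or balanced accuracy are usually more informative than raw accuracy.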

### Leakage detection
```python
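from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Single-feature models: one column that predicts the target almost perfectly is suspicious
for col in df.select_dtypes('number').columns.drop('target', errors='ignore'):
    X_col = df[[col]].fillna(0)
    # Cross-validated balanced accuracy, so a majority-class guess on imbalanced data does not score high
    score = cross_val_score(LogisticRegression(max_iter=1000), X_col, df['target'], cv=5, scoring='balanced_accuracy').mean()
    if score > 0.95: print(f'⚠️ leakage suspected: {col} → target ({score:.2f})')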
```
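
Single-feature models only cover numeric columns. A complementary check, sketched here under the assumption of a low-cardinality target, flags any column whose values each map to exactly one target value. The helper `deterministic_columns`, the `min_support` cutoff, and the toy data are all illustrative.

```python
import pandas as pd


def deterministic_columns(df: pd.DataFrame, target: str = "target", min_support: int = 2) -> list[str]:
    """Columns whose every value maps to exactly one target value (hypothetical helper)."""
    flagged = []
    for col in df.columns.drop(target):
        # Skip ID-like columns: with fewer than min_support rows per value they map trivially
        if df[col].nunique() > len(df) / min_support:
            continue
        per_value = df.groupby(col)[target].nunique()
        if len(per_value) > 1 and (per_value == 1).all():
            flagged.append(col)
    return flagged


# Toy data: 'refund_issued' is only known after the outcome, so it mirrors the target
toy = pd.DataFrame({
    "refund_issued": ["y", "y", "n", "n", "n", "y"],
    "region": ["eu", "us", "eu", "us", "eu", "us"],
    "row_id": [1, 2, 3, 4, 5, 6],
    "target": [1, 1, 0, 0, 0, 1],
})
print(deterministic_columns(toy))
```

A flagged column is not proof of leakage; it is a prompt to ask whether that value would actually be available at prediction time.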

### Plotly interactive
```python
import plotly.express as px
px.scatter_matrix(df, dimensions=['a', 'b', 'c'], color='target').show()
```

### LLM-aided EDA
```python
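def llm_eda(df, llm):
    # `llm` is a placeholder for any client object exposing generate(prompt) -> str
    # Note: df.head() sends raw rows to the model; drop sensitive columns first
    schema = df.head().to_string() + '\n' + df.describe().to_string()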
    prompt = f"""You are a data scientist. Given this data summary:
{schema}

Suggest:
1. 5 hypotheses to test
2. Likely data quality issues
3. Feature engineering ideas"""
    return llm.generate(prompt)
```

### High-cardinality categorical
```python
def high_cardinality(df, threshold=50):
    # Include 'category' columns as well as plain object (string) columns
    return [c for c in df.select_dtypes(include=['object', 'category']).columns if df[c].nunique() > threshold]
```

### Datetime feature peek
```python
def datetime_summary(s):
    return {
        'min': s.min(), 'max': s.max(),
        'gaps': s.sort_values().diff().describe(),
        'weekday_dist': s.dt.dayofweek.value_counts().to_dict(),
    }
```
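
The `gaps` statistic can be turned into a concrete list of breaks in the timeline. This sketch assumes a roughly regular series and treats the most common spacing as the expected frequency; the timestamps are made up.

```python
import pandas as pd

# Toy daily timestamps with two days missing (2024-01-04 and 2024-01-05)
dates = pd.Series(pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-06", "2024-01-07"]
))

dates = dates.sort_values()
gaps = dates.diff()                   # spacing between consecutive timestamps
expected_step = gaps.mode().iloc[0]   # most common spacing, assumed to be the true frequency
breaks = dates[gaps > expected_step]  # timestamps that follow a longer-than-usual gap
print(expected_step, breaks.tolist())
```

Each entry in `breaks` marks the first timestamp after a hole, which is where resampling or seasonal decomposition would otherwise silently operate on missing periods.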

### Polars (faster)
```python
import polars as pl
df = pl.read_csv('big.csv')
print(df.describe())
df.group_by('cat').agg(pl.col('amount').mean()).head()
```

## Decision criteria
| Situation | Approach |
|---|---|
| Quick check | describe + missing + value_counts |
| Auto-report | ydata-profiling |
| Big data | Polars / DuckDB |
| Interactive | Plotly Express |
| ML prep | + leakage + correlation |
| LLM aided | Schema → suggest hypotheses |

**Default**: schema + missing + describe + correlation, a quick ydata-profiling diagnostic, and a leakage scan.

## 🔗 Graph
- Parent: [[Data-Science]] · [[Statistics]]
- Applications: [[Feature Engineering]] · [[Data-Cleaning]]
- Adjacent: [[Tukey]]

## 🤖 LLM usage
**When**: at the start of every ML / DS project.
**When not**: the schema and data are already well understood.

## ❌ Anti-patterns
- **Skip EDA**: garbage in, garbage out for the model.
- **Auto-only**: automated reports miss domain context.
- **No leakage check**: inflated, fake-high scores.
- **Plot everything**: noise drowns out the signal.
- **Ignore class imbalance**: leads to choosing the wrong metric.

## 🧪 Verification / Duplicates
- Verified (Tukey 1977, Wickham R for DS).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | EDA auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: EDA + ydata / IQR / MI / leakage / LLM code |