---
id: wiki-2026-0508-exploratory-data-analysis
title: Exploratory Data Analysis (EDA)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [EDA, data exploration, Tukey, pandas-profiling, sweetviz, descriptive analytics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: [data-science, eda, statistics, pandas, visualization, tukey, profiling]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: pandas / matplotlib / seaborn / ydata-profiling
---

# Exploratory Data Analysis (EDA)

## In one line
> **"Understand the data before you model it."** The idea goes back to Tukey (1977). Check distributions, missing values, outliers, correlations, and leakage. Modern practice adds automated profiling (ydata-profiling), LLM-aided EDA, and interactive Plotly charts.

## Core

### Steps
1. **Schema**: dtypes, shape.
2. **Univariate**: distributions, missing values.
3. **Bivariate**: correlations, scatter plots.
4. **Outliers**: IQR, z-score.
5. **Target**: class balance, or the distribution of a regression target.
6. **Leakage**: features that reveal the target directly.
7. **Time**: trend, seasonality.

### Modern tools
- **ydata-profiling** (formerly pandas-profiling).
- **sweetviz**.
- **DataPrep**.
- **Polars** (often much faster than pandas on large data; the speedup depends on the workload).
- **DuckDB** (SQL queries directly over pandas DataFrames).
- **Plotly Express**.
- **LLM-EDA** (Claude / ChatGPT).

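The first steps listed under "Steps" (schema, missing values, outliers, target balance) can be gathered in one pass. The following is a minimal sketch that assumes only pandas; the helper name `eda_overview`, the `toy` frame, and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd


def eda_overview(df: pd.DataFrame, target: str, k: float = 1.5) -> dict:
    """Collect a few first-pass EDA facts (hypothetical helper, not a library API)."""
    # Numeric features only; the target is summarised separately below.
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")

    # Step 1, schema: shape and dtypes.
    schema = {"shape": df.shape, "dtypes": df.dtypes.astype(str).to_dict()}

    # Step 2, univariate: fraction of missing values per column.
    missing = df.isna().mean().to_dict()

    # Step 4, outliers: count values outside the IQR fences per numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    outliers = outside.sum().to_dict()

    # Step 5, target: relative class frequencies.
    balance = df[target].value_counts(normalize=True).to_dict()

    return {
        "schema": schema,
        "missing": missing,
        "outliers": outliers,
        "class_balance": balance,
    }


# Made-up data, purely for illustration.
toy = pd.DataFrame({
    "amount": [10.0, 12.0, 11.0, 13.0, 500.0, None],
    "category": ["a", "b", "a", "a", "b", "a"],
    "target": [0, 0, 0, 1, 1, 0],
})
summary = eda_overview(toy, target="target")
print(summary["outliers"], summary["class_balance"])
```

Returning a plain dict keeps the result easy to log or diff between dataset versions. Bivariate, leakage, and time checks are left out here because they usually need plots or a model.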
## 💻 Patterns

### Quick scan
```python
import pandas as pd

df = pd.read_csv('data.csv')
print(df.shape)
print(df.dtypes)
print(df.describe(include='all'))
print(df.isna().sum())  # missing values per column
```

### Auto-profile (ydata-profiling)
```python
from ydata_profiling import ProfileReport

report = ProfileReport(df, title='EDA', explorative=True)
report.to_file('eda.html')
```

### Missing pattern
```python
import missingno as msno

msno.matrix(df)   # where values are missing, row by row
msno.heatmap(df)  # correlation between columns' missingness
```

### Univariate (numeric)
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['amount'], kde=True)
plt.figure()  # new figure so the boxplot does not overlay the histogram
sns.boxplot(x=df['amount'])
print(df['amount'].skew(), df['amount'].kurt())
```

### Univariate (categorical)
```python
df['category'].value_counts(normalize=True).plot(kind='bar')
```

### Outlier (IQR)
```python
def iqr_outliers(s, k=1.5):
    # Values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]
```

### Outlier (z-score)
```python
import numpy as np
from scipy.stats import zscore

num = df.select_dtypes('number').to_numpy(dtype=float)
z = np.abs(zscore(num, nan_policy='omit'))  # absolute value catches low outliers too
df[(z > 3).any(axis=1)]  # rows with any numeric column beyond 3 standard deviations
```

### Correlation
```python
import seaborn as sns

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
```

### Mutual information (non-linear)
```python
from sklearn.feature_selection import mutual_info_regression

# X: numeric feature DataFrame without NaN; y: continuous target
mi = mutual_info_regression(X, y)
pd.Series(mi, index=X.columns).sort_values().plot(kind='barh')
```

### Pairplot
```python
sns.pairplot(df, hue='target', diag_kind='kde')
```

### Time series quick
```python
from statsmodels.tsa.seasonal import seasonal_decompose

ts = df.assign(date=pd.to_datetime(df['date'])).set_index('date').sort_index()
ts.resample('W').mean(numeric_only=True).plot()
# period=7 assumes daily rows with weekly seasonality and no missing values
seasonal_decompose(ts['target'], period=7).plot()
```

### Class imbalance
```python
print(df['target'].value_counts(normalize=True))
sns.countplot(x='target', data=df)
```

### Leakage detection
```python
from sklearn.linear_model import LogisticRegression

# Single-feature screen: one numeric column that predicts the target almost
# perfectly is suspicious. Accuracy is in-sample, so compare it with the
# majority-class rate when the target is imbalanced.
y = df['target']
for col in df.select_dtypes('number').columns.drop('target', errors='ignore'):
    x = df[[col]].fillna(0)
    score = LogisticRegression(max_iter=1000).fit(x, y).score(x, y)
    if score > 0.95:
        print(f'⚠️ leakage suspected: {col} → target ({score:.2f})')
```

### Plotly interactive
```python
import plotly.express as px

px.scatter_matrix(df, dimensions=['a', 'b', 'c'], color='target').show()
```

### LLM-aided EDA
```python
def llm_eda(df, llm):
    # `llm` is any client object exposing generate(prompt) -> str
    schema = df.head().to_string() + '\n' + df.describe().to_string()
    prompt = f"""You are a data scientist. Given this data summary:
{schema}

Suggest:
1. 5 hypotheses to test
2. Likely data quality issues
3. Feature engineering ideas"""
    return llm.generate(prompt)
```

### High-cardinality categorical
```python
def high_cardinality(df, threshold=50):
    # Object columns with more than `threshold` distinct values
    return [c for c in df.select_dtypes('object').columns
            if df[c].nunique() > threshold]
```

### Datetime feature peek
```python
def datetime_summary(s):
    # `s` must be a datetime64 Series
    return {
        'min': s.min(),
        'max': s.max(),
        'gaps': s.sort_values().diff().describe(),
        'weekday_dist': s.dt.dayofweek.value_counts().to_dict(),
    }
```

### Polars (faster)
```python
import polars as pl

df = pl.read_csv('big.csv')
print(df.describe())
df.group_by('cat').agg(pl.col('amount').mean()).head()
```

## Decision criteria

| Situation | Approach |
|---|---|
| Quick check | describe + missing + value_counts |
| Auto-report | ydata-profiling |
| Big data | Polars / DuckDB |
| Interactive | Plotly Express |
| ML prep | Add leakage and correlation checks |
| LLM-aided | Pass the schema, ask for hypotheses |

**Default**: schema + missing + describe + correlation, a quick ydata-profiling report for diagnosis, and a leakage scan.

## 🔗 Graph
- Parent: [[Data-Science]] · [[Statistics]]
- Applications: [[Feature Engineering]] · [[Data-Cleaning]]
- Adjacent: [[Tukey]]

## 🤖 LLM usage
**When**: at the start of every ML / DS project.
**When not**: the schema and data are already well understood.

## ❌ Anti-patterns
- **Skip EDA**: garbage in, garbage out for the model.
- **Auto-only**: automated reports miss domain context.
- **No leakage check**: inflated scores that do not hold up.
- **Plot everything**: noise.
- **Ignore class imbalance**: leads to choosing the wrong metric.

## 🧪 Verification / Duplicates
- Verified against Tukey (1977) and Wickham, *R for Data Science*.
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | EDA auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: EDA plus ydata / IQR / MI / leakage / LLM code |