| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-factor-analysis | Factor Analysis | 10_Wiki/Topics | verified | self | | none | A | 0.95 | applied | | | 2026-05-10 | pending | |
Factor Analysis
In one line
"Explain the observed variables through latent factors." EFA (exploratory) discovers the structure; CFA (confirmatory) tests a hypothesized one. Unlike PCA, FA decomposes each variable into a latent part plus an error term. Famous examples: Spearman's g, the Big Five.
Core concepts
Model
X = FΛᵀ + ε
- X: observed variables (n×p).
- F: latent factors (n×k).
- Λ: loadings (p×k).
- ε: unique error (n×p).
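To make the shapes concrete, the model can be simulated with NumPy. This is a minimal sketch, assuming uncorrelated unit-variance factors and a hypothetical 6-item, 2-factor loading matrix (the numbers are illustrative, not from any dataset). With one row per observation, the data are generated as FΛᵀ + ε, and under these assumptions the implied covariance is ΛΛᵀ + Ψ, where Ψ is the diagonal matrix of unique variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20_000, 6, 2

# Hypothetical loading matrix (p x k): items 1-3 load on factor 1, items 4-6 on factor 2.
Lambda = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],
])
# Unique variances chosen so that every observed variable has unit variance.
psi = 1.0 - (Lambda ** 2).sum(axis=1)

F = rng.standard_normal((n, k))                    # latent factors (n x k)
eps = rng.standard_normal((n, p)) * np.sqrt(psi)   # unique errors (n x p)
X = F @ Lambda.T + eps                             # observed data (n x p)

implied_cov = Lambda @ Lambda.T + np.diag(psi)     # model-implied covariance
sample_cov = np.cov(X, rowvar=False)
max_abs_diff = float(np.abs(sample_cov - implied_cov).max())
print(X.shape, round(max_abs_diff, 3))
```

The sample covariance should sit close to the implied one; the gap shrinks as n grows.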
PCA vs FA
- PCA: maximizes explained variance; each component is a linear combination of the observed variables.
- FA: explains the covariance among the variables; each variable is modeled as latent factors plus error.
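The difference shows up in the diagonal of the correlation matrix. The sketch below assumes a hypothetical one-factor population with equal loadings of 0.6 and uses the true communalities, which in practice are unknown and have to be estimated. PCA decomposes the full matrix (1s on the diagonal), while the factor-analytic view decomposes the reduced matrix (communalities on the diagonal).

```python
import numpy as np

p, lam = 6, 0.6
# Population correlation matrix of a one-factor model with equal loadings (hypothetical values).
R = np.full((p, p), lam ** 2)
np.fill_diagonal(R, 1.0)

def first_loadings(M):
    """Loadings of the leading eigen-component: eigenvector scaled by sqrt(eigenvalue)."""
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    return np.abs(vecs[:, -1]) * np.sqrt(vals[-1])

pca_loadings = first_loadings(R)              # PCA: total variance, 1s on the diagonal

R_reduced = R.copy()
np.fill_diagonal(R_reduced, lam ** 2)         # FA: shared variance only (communalities)
fa_loadings = first_loadings(R_reduced)

print(pca_loadings.round(3))
print(fa_loadings.round(3))
```

In this setup the PCA loadings come out larger than 0.6 because the component also absorbs unique variance, while the reduced matrix returns the generating loadings.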
EFA vs CFA
- EFA: the number of factors is unknown in advance.
- CFA: confirms a hypothesized structure (a special case of SEM).
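One way to see the difference is to count parameters. The sketch below compares degrees of freedom for a hypothetical 9-item, 3-factor case; the CFA layout (three items per factor, one loading per factor fixed to 1, no cross-loadings, all factor variances and covariances free) is an assumption made for illustration.

```python
def efa_df(p: int, k: int) -> int:
    """Degrees of freedom of an exploratory k-factor model for p observed variables."""
    return ((p - k) ** 2 - (p + k)) // 2

def cfa_df(p: int, n_free_params: int) -> int:
    """Degrees of freedom of a CFA: unique covariance moments minus free parameters."""
    return p * (p + 1) // 2 - n_free_params

p, k = 9, 3
free_loadings = p - k                    # one loading per factor is fixed to 1 for scaling
unique_variances = p
factor_variances = k
factor_covariances = k * (k - 1) // 2
n_free = free_loadings + unique_variances + factor_variances + factor_covariances

print(efa_df(p, k), cfa_df(p, n_free))   # prints "12 24"
```

The CFA leaves more degrees of freedom because the hypothesized zero cross-loadings are restrictions the data can contradict.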
Steps (EFA)
- KMO + Bartlett: check factorability.
- Number of factors: scree plot, parallel analysis, MAP (minimum average partial).
- Extract: PAF (principal axis factoring), ML (maximum likelihood).
- Rotate: varimax (orthogonal), oblimin (oblique).
- Interpret.
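The rotation step can be sketched in plain NumPy. This is a minimal version of the common SVD-based varimax iteration, without Kaiser normalization; it is not the `factor_analyzer` implementation, and it can need a few hundred iterations on very clean structures. The example loadings are hypothetical: a simple structure deliberately mixed by a 30° rotation.

```python
import numpy as np

def varimax(loadings, max_iter=1000, tol=1e-12):
    """Orthogonally rotate a (p x k) loading matrix toward the varimax optimum."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion for the current rotation.
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                    # nearest orthogonal matrix
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion < criterion * (1 + tol):
            break                            # no meaningful improvement left
        criterion = new_criterion
    return loadings @ rotation, rotation

simple = np.array([[0.8, 0.0], [0.8, 0.0], [0.8, 0.0],
                   [0.0, 0.7], [0.0, 0.7], [0.0, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix                     # every item now loads on both factors

rotated, rotation = varimax(unrotated)
print(np.abs(rotated).round(3))
```

Because the rotation is orthogonal, each item's communality (row sum of squared loadings) is unchanged; only the distribution of loadings across factors moves.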
Applications
- Psychometrics: Big Five.
- Marketing: brand perception.
- Finance: risk factors.
- Bioinformatics: gene expression.
- NLP: word factors.
💻 Patterns
Factorability check (Python)
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity
# df: pandas DataFrame of item responses (rows = respondents, columns = items)
chi_sq, p = calculate_bartlett_sphericity(df)
print(f'Bartlett: chi2={chi_sq:.2f}, p={p:.4f}')  # p < 0.05: correlations differ from identity
kmo_all, kmo_model = calculate_kmo(df)
print(f'KMO: {kmo_model:.2f}')  # > 0.6 acceptable, > 0.8 great
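For intuition about what these two statistics measure, here is a NumPy sketch of the textbook formulas: overall KMO compares squared correlations with squared partial correlations, and Bartlett's statistic is a scaled log-determinant of the correlation matrix. It is not the `factor_analyzer` source, it omits the p-value (which needs a chi-square distribution function), and the simulated one-factor data are hypothetical.

```python
import numpy as np

def kmo_and_bartlett(data):
    """Return (overall KMO, Bartlett chi-square, degrees of freedom) for an (n x p) array."""
    n, p = data.shape
    R = np.corrcoef(data, rowvar=False)
    R_inv = np.linalg.inv(R)
    scale = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(scale, scale)      # partial correlations (off-diagonal)
    off = ~np.eye(p, dtype=bool)                   # mask selecting off-diagonal cells
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    kmo = r2 / (r2 + a2)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) // 2
    return float(kmo), float(chi2), dof

# Hypothetical data: 6 items driven by one factor with loading 0.7.
rng = np.random.default_rng(1)
n, p = 500, 6
factor = rng.standard_normal((n, 1))
data = 0.7 * factor + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal((n, p))

kmo, chi2, dof = kmo_and_bartlett(data)
print(round(kmo, 2), round(chi2, 1), dof)
```

A high KMO means the partial correlations are small relative to the raw correlations, which is what a common-factor structure produces.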
Scree plot + Kaiser criterion
import numpy as np
import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer(rotation=None)
fa.fit(df)
ev, v = fa.get_eigenvalues()  # ev: eigenvalues of the correlation matrix
plt.plot(range(1, len(ev) + 1), ev, 'o-')
plt.axhline(1, color='red', ls='--')  # Kaiser criterion: eigenvalue > 1
plt.title('Scree')
plt.show()
EFA (varimax rotation)
import pandas as pd
fa = FactorAnalyzer(n_factors=5, rotation='varimax').fit(df)
loadings = pd.DataFrame(fa.loadings_, index=df.columns, columns=[f'F{i+1}' for i in range(5)])
print(loadings.round(2))
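Once a loading matrix is available, communalities and explained variance follow from simple arithmetic. The sketch below uses a small hypothetical loading matrix and assumes standardized items and uncorrelated (orthogonally rotated) factors.

```python
import numpy as np

# Hypothetical rotated loading matrix: 4 items (rows) x 2 factors (columns).
loadings = np.array([
    [0.80, 0.10],
    [0.70, 0.20],
    [0.10, 0.60],
    [0.30, 0.30],
])

communality = (loadings ** 2).sum(axis=1)    # item variance explained by the factors
uniqueness = 1.0 - communality               # remaining unique variance
ss_loadings = (loadings ** 2).sum(axis=0)    # variance explained per factor
prop_var = ss_loadings / loadings.shape[0]   # share of total item variance

print(communality.round(2))   # [0.65 0.53 0.37 0.18]
print(prop_var.round(4))      # [0.3075 0.125 ]
```

The last item, with communality 0.18, is barely explained by either factor and would be a candidate for review.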
Interpretation (high-loading items)
def interpret_factors(loadings, threshold=0.4):
    """Print the items whose absolute loading on each factor exceeds the threshold."""
    for col in loadings.columns:
        items = loadings[loadings[col].abs() > threshold].index.tolist()
        print(f'{col}: {items}')
CFA (lavaan-style in semopy)
from semopy import Model
desc = """
Conscientiousness =~ orderly + reliable + careful
Openness =~ creative + curious + imaginative
Extraversion =~ sociable + assertive + energetic
Conscientiousness ~~ Openness
"""
model = Model(desc)
model.fit(df)
print(model.inspect())
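Factor quality (loading magnitude)
def factor_quality(loadings):
    """Summarize a loadings DataFrame (items x factors)."""
    return {
        'avg_loading': loadings.abs().mean(),  # mean |loading| per factor
        'cross_loadings': (loadings.abs() > 0.4).sum(axis=1).gt(1).sum(),  # items loading > 0.4 on 2+ factors
        'low_communality': (loadings.abs().pow(2).sum(axis=1) < 0.3).sum(),  # items with communality < 0.3
    }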
Reliability (Cronbach α)
def cronbach_alpha(items):
    """Internal consistency of the items (DataFrame columns) belonging to one factor."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
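A worked example of the same formula using only the standard library, with a small hypothetical response table (5 respondents, 3 items on a 1-5 scale):

```python
from statistics import variance

# Hypothetical responses: rows are respondents, columns are items.
responses = [
    [2, 3, 2],
    [3, 3, 4],
    [4, 4, 4],
    [4, 5, 5],
    [5, 5, 5],
]
k = len(responses[0])
item_vars = [variance(col) for col in zip(*responses)]   # sample variance per item
total_var = variance([sum(row) for row in responses])    # variance of the sum score
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))   # 0.947
```

Here the item variances sum to 3.8 and the sum-score variance is 10.3, so α = 1.5 × (1 − 3.8 / 10.3) ≈ 0.947.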
Big Five inventory
BIG_FIVE_ITEMS = {
    'Openness': ['imaginative', 'curious', 'creative', 'broad_interest'],
    'Conscientiousness': ['organized', 'thorough', 'reliable', 'efficient'],
    'Extraversion': ['outgoing', 'energetic', 'assertive', 'talkative'],
    'Agreeableness': ['kind', 'trusting', 'cooperative', 'forgiving'],
    'Neuroticism': ['anxious', 'moody', 'stress', 'worry'],
}
Number of factors (parallel analysis)
import numpy as np

def parallel_analysis(df, n_iter=100):
    """Count eigenvalues above the 95th percentile of eigenvalues from random normal data."""
    n, p = df.shape
    rand_eigs = []
    for _ in range(n_iter):
        rand = np.random.normal(0, 1, (n, p))
        ev = np.linalg.eigvalsh(np.corrcoef(rand.T))[::-1]  # descending order
        rand_eigs.append(ev)
    threshold = np.percentile(rand_eigs, 95, axis=0)
    actual = np.linalg.eigvalsh(np.corrcoef(df.T))[::-1]
    return int(np.sum(actual > threshold))
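A permutation-based variant is sketched below: instead of drawing normal noise, it shuffles each column independently, which keeps every item's marginal distribution but destroys the correlations. The function name, defaults, and simulated two-factor data are assumptions for illustration; pass a NumPy array (for a DataFrame, `df.to_numpy()`).

```python
import numpy as np

def permutation_parallel_analysis(X, n_iter=200, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those of column-wise permuted data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    actual = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    null_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        # Shuffle every column on its own: same marginals, no shared structure.
        shuffled = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        null_eigs[i] = np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False))[::-1]
    threshold = np.percentile(null_eigs, percentile, axis=0)
    exceeds = actual > threshold
    # Stop at the first eigenvalue that fails to beat its threshold.
    return p if exceeds.all() else int(np.argmin(exceeds))

# Hypothetical data: 8 items, two uncorrelated factors, loadings of 0.7.
rng = np.random.default_rng(42)
n = 500
Lambda = np.zeros((8, 2))
Lambda[:4, 0] = 0.7
Lambda[4:, 1] = 0.7
F = rng.standard_normal((n, 2))
X = F @ Lambda.T + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal((n, 8))

n_factors = permutation_parallel_analysis(X)
print(n_factors)
```

Counting only the leading run of eigenvalues avoids retaining a late eigenvalue that happens to clear its threshold by chance.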
MIMIC / SEM
desc = """
# measurement model
Latent =~ x1 + x2 + x3
# structural model
Latent ~ age + sex
"""
Factor scores (after fit)
factor_scores = fa.transform(df)
df['factor_1'] = factor_scores[:, 0]
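One common way to compute such scores is the regression (Thurstone) method: standardize the items and weight them by R⁻¹Λ. The sketch below assumes uncorrelated factors and a hypothetical one-factor simulation; whether `fa.transform` uses exactly this estimator is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, lam = 2000, 0.7
loadings = np.full((6, 1), lam)                     # hypothetical loadings (p x k)
true_f = rng.standard_normal((n, 1))
X = true_f @ loadings.T + np.sqrt(1 - lam ** 2) * rng.standard_normal((n, 6))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize the items
R = np.corrcoef(Z, rowvar=False)
weights = np.linalg.solve(R, loadings)              # R^{-1} Λ, shape (p x k)
scores = Z @ weights                                # estimated factor scores (n x k)

# Correlation between estimated and true factor values (only known in a simulation).
validity = float(np.corrcoef(scores[:, 0], true_f[:, 0])[0, 1])
print(round(validity, 2))
```

The scores are estimates, not the latent values themselves; with six items at loading 0.7 the correlation with the true factor is high but still below 1.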
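Bayesian FA (PyMC)
import pymc as pm
# X: observed (n×p) array; k: number of factors. Both are assumed to be defined already.
n, p = X.shape
with pm.Model() as bfa:
    L = pm.Normal('L', 0, 1, shape=(p, k))        # loadings
    F = pm.Normal('F', 0, 1, shape=(n, k))        # latent factor scores
    sigma = pm.HalfNormal('sigma', 1, shape=p)    # unique error SD per item
    pm.Normal('x', mu=F @ L.T, sigma=sigma, observed=X)
    trace = pm.sample()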
Decision criteria
| Situation | Approach |
|---|---|
| Discover structure | EFA + parallel analysis |
| Test hypothesis | CFA (semopy / lavaan) |
| Pure dim reduction | PCA |
| Latent + measurement error | FA |
| Psychometrics | EFA → CFA |
| Causal latent | SEM (MIMIC) |
Default: EFA → number of factors (parallel analysis) → oblimin rotation → CFA to confirm the hypothesis, plus a reliability check.
🔗 Graph
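🤖 LLM usage
When to use: questionnaires, latent constructs. When not to use: pure dimensionality reduction (use PCA).
❌ Anti-patterns
- Confusing PCA with FA: they are different models.
- No factorability check: garbage in, garbage out.
- Extracting too many factors: the extra ones fit noise.
- Interpreting without rotation: the loadings are hard to interpret.
- No reliability check: the factors cannot be trusted.
🧪 Verification / Duplicates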
- Verified (Spearman 1904, Thurstone, Costa & McCrae Big Five).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | STAT-FACTOR auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: EFA / CFA + KMO / scree / varimax / Cronbach code |