---
id: wiki-2026-0508-pmi-technique
title: PMI Technique (Pointwise Mutual Information)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PMI, Pointwise Mutual Information, PPMI, Word Association]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, pmi, statistics, collocation, embeddings, information-theory]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: numpy/scikit-learn }
---

# PMI Technique (Pointwise Mutual Information)

## One-liner

Pointwise mutual information measures, as a log ratio, how much more often two events co-occur than they would if they were independent. It is the basis for collocation and association measures in NLP.

## Key points

- Definition: `PMI(x, y) = log( P(x, y) / (P(x) P(y)) )`.
- > 0: positive association, = 0: independence, < 0: negative association (rare and noisy in practice).
- **PPMI** = max(PMI, 0): clips negative values for stability.
- **NPMI** = `PMI / -log P(x, y)` ∈ [-1, 1], which reduces frequency bias.
- **k-shift PMI**: SGNS (word2vec) implicitly factorizes `PMI - log k` (Levy & Goldberg).
- Uses: collocation extraction, topic evaluation (Coherence_NPMI), word embedding baseline (SVD on PPMI), feature selection, RAG retrieval re-ranking.
- Drawback: low-frequency pairs inflate PMI, so a frequency threshold, a shift, or NPMI is needed.

## 💻 Patterns

```python
# 1. Direct PMI computation (co-occurrence counts)
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the cat purred the dog ran".split()
window = 2
pair_c, word_c = Counter(), Counter()
for i, w in enumerate(corpus):
    word_c[w] += 1
    # Count every neighbour within `window` positions on either side.
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            pair_c[(w, corpus[j])] += 1
total_pairs = sum(pair_c.values())
total_words = sum(word_c.values())

def pmi(x, y):
    p_xy = pair_c[(x, y)] / total_pairs
    p_x = word_c[x] / total_words
    p_y = word_c[y] / total_words
    if p_xy == 0:
        return float("-inf")  # never co-occurred; avoids log2(0)
    return np.log2(p_xy / (p_x * p_y))

print(f"PMI(cat, sat) = {pmi('cat', 'sat'):.3f}")
```

```python
# 2. PPMI matrix (full vocabulary), sparse
# Reuses pair_c and word_c from pattern 1.
import numpy as np
from scipy.sparse import csr_matrix

def build_ppmi(pair_c, word_c, vocab):
    idx = {w: i for i, w in enumerate(vocab)}
    rows, cols, data = [], [], []
    N = sum(pair_c.values())
    Nw = sum(word_c.values())
    for (a, b), c in pair_c.items():
        p_ab = c / N
        p_a, p_b = word_c[a] / Nw, word_c[b] / Nw
        v = np.log2(p_ab / (p_a * p_b))
        if v > 0:  # keep only positive PMI; everything else stays an implicit 0
            rows.append(idx[a])
            cols.append(idx[b])
            data.append(v)
    return csr_matrix((data, (rows, cols)), shape=(len(vocab), len(vocab)))

vocab = sorted(word_c)
M = build_ppmi(pair_c, word_c, vocab)
```

```python
# 3. NPMI (normalized)
# Reuses pair_c, word_c, total_pairs, total_words from pattern 1.
import numpy as np

def npmi(x, y):
    p_xy = pair_c[(x, y)] / total_pairs
    p_x = word_c[x] / total_words
    p_y = word_c[y] / total_words
    if p_xy == 0:
        return -1  # lower bound: the pair never co-occurs
    return np.log2(p_xy / (p_x * p_y)) / -np.log2(p_xy)
```

```python
# 4. SVD on PPMI → low-rank word embeddings (count-based)
# Reuses M from pattern 2.
from scipy.sparse.linalg import svds
import numpy as np

# svds requires k < min(M.shape), so cap k for small vocabularies.
k = min(100, min(M.shape) - 1)
U, s, Vt = svds(M.astype(float), k=k)
emb = U * np.sqrt(s)  # one static embedding (up to 100-d) per word
# Nearest-word search can then use cosine similarity.
```

```python
# 5. gensim Phrases: bigram collocation by NPMI
from gensim.models.phrases import Phrases, Phraser

sents = [
    ["new", "york", "city"],
    ["machine", "learning", "is", "fun"],
    # ... more tokenized sentences
]
bigram = Phrases(sents, min_count=5, threshold=0.5, scoring="npmi")  # threshold ∈ [-1, 1]
phraser = Phraser(bigram)
# With a corpus where "new york" clears min_count and threshold,
# this prints ['new_york', 'is', 'big'].
print(phraser[["new", "york", "is", "big"]])
```

```python
# 6. Topic coherence (NPMI-based): topic model quality
# top_words_per_topic, tokenized_corpus, and dictionary are assumed to exist already.
from gensim.models import CoherenceModel

cm = CoherenceModel(topics=top_words_per_topic, texts=tokenized_corpus,
                    dictionary=dictionary, coherence="c_npmi")
print("c_npmi:", cm.get_coherence())
```

```python
# 7. PMI for feature selection (text classification)
# df is assumed to hold one row per token occurrence with "word" and "label" columns.
import numpy as np

def pmi_feature(word, label, df):
    p_wl = ((df["word"] == word) & (df["label"] == label)).mean()
    p_w = (df["word"] == word).mean()
    p_l = (df["label"] == label).mean()
    if p_wl == 0:
        return 0  # no co-occurrence: treated as no signal
    return np.log2(p_wl / (p_w * p_l))

# The top-PMI words per label are the strongest signal features.
```

```python
# 8. Shifted PMI (equivalence with word2vec SGNS)
# Reuses pair_c, word_c, total_pairs, total_words from pattern 1.
import numpy as np

def spmi(x, y, k=5):
    p_xy = pair_c[(x, y)] / total_pairs
    p_x = word_c[x] / total_words
    p_y = word_c[y] / total_words
    if p_xy == 0:
        return float("-inf")  # never co-occurred; avoids log2(0)
    return np.log2(p_xy / (p_x * p_y)) - np.log2(k)

# Levy & Goldberg 2014: SGNS ≈ matrix factorization of shifted PMI
```

## Decision criteria

| Goal | Recommended |
|---|---|
| Collocation extraction | NPMI + frequency threshold (min_count) |
| Topic model quality evaluation | c_npmi |
| Static word embedding (small data) | SVD on PPMI |
| Feature selection (classification) | PMI(word, class) |
| Link to word2vec theory | shifted PMI (k=5~15) |
| Modern semantic search | sentence embedding (BGE/E5), with PMI as a supplement |

## 🔗 Graph

- Related: `[[Word-Embeddings]]`, `[[Information_Theory|Information-Theory]]`, `[[TF-IDF]]`

## 🤖 LLM applications

- Measuring LLM output diversity: evaluate repetitiveness from the NPMI distribution of generated token pairs.
- Supplementing the lexical overlap score with PMI between the keywords of RAG candidate chunks and the query keywords.

## ❌ Anti-patterns

- Computing PMI directly on low-frequency pairs (e.g. a single occurrence), which yields artificially large values.
- Mixing log bases (natural log vs log2), which makes values incomparable.
- Feeding raw PMI into SVD without the PPMI clip, so the factorization learns negative noise.
- Defaulting to c_v for topic coherence while ignoring c_npmi, which this note treats as the closer match to human judgment.

## 🧪 Verification

- Check that known collocation pairs ("New York", "machine learning") rank near the top by NPMI.
- Analogies (king - man + woman ≈ queen) partially work with PPMI-SVD embeddings.
- A c_npmi value in the 0.1~0.3 range indicates typical topic quality.

## 🕓 Changelog

- 2026-05-08 Phase 1: first draft.
- 2026-05-10 Manual cleanup: 8 patterns; expanded NPMI, SGNS shift, and coherence.
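
## Worked check

The low-frequency anti-pattern and the NPMI bounds from the key points can be checked with plain arithmetic. The sketch below is illustrative only: the counts are invented, and the helper names `pmi_from_counts` and `npmi_from_counts` are hypothetical, not part of any library. It suggests that a pair of words that each occur once and co-occur once outscores a well-attested collocation under PMI, and still reaches the NPMI ceiling, which is why the decision table pairs NPMI with a frequency threshold.

```python
import math

def pmi_from_counts(c_xy, c_x, c_y, n):
    """PMI in bits from raw counts; n is the total number of observations."""
    if c_xy == 0:
        return float("-inf")
    return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

def npmi_from_counts(c_xy, c_x, c_y, n):
    """NPMI in [-1, 1]; returns -1 when the pair never co-occurs."""
    if c_xy == 0:
        return -1.0
    return pmi_from_counts(c_xy, c_x, c_y, n) / -math.log2(c_xy / n)

n = 1_000_000                      # invented corpus size
rare = (1, 1, 1)                   # two one-off words that co-occur once
frequent = (500, 1000, 2000)       # a well-attested collocation
independent = (2, 1000, 2000)      # co-occurs exactly as often as chance predicts

pmi_rare = pmi_from_counts(*rare, n)
pmi_frequent = pmi_from_counts(*frequent, n)
pmi_independent = pmi_from_counts(*independent, n)
npmi_rare = npmi_from_counts(*rare, n)
npmi_frequent = npmi_from_counts(*frequent, n)

# A min_count filter (hypothetical value) removes the one-off pair before scoring.
min_count = 5
kept = [name for name, counts in [("rare", rare), ("frequent", frequent)]
        if counts[0] >= min_count]

print(f"PMI  rare={pmi_rare:.2f}  frequent={pmi_frequent:.2f}  independent={pmi_independent:.2f}")
print(f"NPMI rare={npmi_rare:.2f}  frequent={npmi_frequent:.2f}")
print("kept after min_count:", kept)
```

With these invented counts the one-off pair scores roughly 19.9 bits against roughly 8.0 bits for the frequent pair, so ranking by raw PMI alone would put the noise first. The `min_count` filter is the same idea as the `min_count` argument in pattern 5.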