id: wiki-2026-0508-computational-linguistics
title: Computational Linguistics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: computational linguistics, NLP roots, syntax, semantics, pragmatics, formal grammar, Chomsky
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: linguistics, nlp, syntax, semantics, parsing, llm, chomsky, formal-grammar
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: Python (language); spaCy / NLTK / Stanza / Transformers (framework)

Computational Linguistics

In one line

"A mathematical model of language." Computational linguistics is the academic root of NLP and covers syntax, semantics, and pragmatics, plus morphology and phonology. LLMs now dominate practice, but linguistic understanding remains relevant for evaluation, hallucination analysis, and multilingual work.

Core layers

Phonology / Phonetics

  • Sound systems.
  • IPA, phonemes.

Morphology

  • Word structure.
  • Inflection, derivation.
  • Agglutinative (Korean, Turkish) vs analytic (Mandarin).
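
To make the agglutinative case concrete, here is a minimal sketch of right-to-left suffix stripping on one Turkish word. The suffix inventory `SUFFIXES` and the helper `segment` are hypothetical and hand-picked for this single example; a real analyzer would use a morphological lexicon and handle vowel harmony and ambiguity.

```python
# Toy right-to-left suffix stripping for an agglutinative word.
# SUFFIXES is a hypothetical, hand-picked inventory for this one example,
# not a real morphological lexicon.
SUFFIXES = ["ler", "imiz", "de"]  # plural, 1st-person-plural possessive, locative


def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Peel known suffixes off the end of `word` until none match."""
    morphemes = []
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            # Keep at least `min_stem` characters so the stem never vanishes.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                morphemes.insert(0, suffix)
                stem = stem[: -len(suffix)]
                changed = True
                break
    return [stem] + morphemes


print(segment("evlerimizde"))  # ['ev', 'ler', 'imiz', 'de'] ("in our houses")
```

One surface word carries what English spreads over three, which is why whitespace tokenization alone tends to work poorly for such languages.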

Syntax

  • Sentence structure.
  • Parsers, grammars.

Semantics

  • Meaning.
  • Word senses, predicate-argument structure.

Pragmatics

  • Context, intent.
  • Implicature, speech acts.

Discourse

  • Multi-sentence structure, coherence.

Sociolinguistics

  • Register, dialect.

Method history

Symbolic / Rule-based (1950s-80s)

  • Chomsky transformational grammar.
  • HPSG, LFG, CCG.
  • Expert systems.
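
As an illustration of the rule-based approach, the sketch below recognizes sentences with a hand-written context-free grammar using the CYK algorithm. The grammar (`BINARY_RULES`, `LEXICAL_RULES`) is a hypothetical toy in Chomsky normal form, not taken from any of the formalisms listed above.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (CNF).
BINARY_RULES = {  # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL_RULES = {  # word -> set of preterminals
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}


def cyk_recognize(words, start="S"):
    """Return True if the grammar derives `words` from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL_RULES.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right child
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


print(cyk_recognize("the dog chased the cat".split()))  # True
print(cyk_recognize("dog the chased cat the".split()))  # False
```

Every rule is written by hand, so coverage grows only as fast as the grammar does; that limitation is part of what motivated the statistical methods that followed.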

Statistical (1990s-2010s)

  • Hidden Markov Model (POS).
  • PCFG (probabilistic CFG).
  • IBM alignment models for machine translation.
  • BLEU metric.
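
The HMM approach to POS tagging can be sketched with Viterbi decoding. All probabilities below (`START`, `TRANS`, `EMIT`) are hypothetical numbers chosen by hand; a real tagger would estimate them from an annotated corpus and work in log space.

```python
# Hypothetical toy parameters; a real tagger estimates these from a treebank.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.4, "barks": 0.1, "walk": 0.2},
    "VERB": {"barks": 0.5, "walk": 0.3},
}
UNSEEN = 1e-6  # crude floor for word/tag pairs missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[t][tag] = (probability of the best path ending in tag, backpointer)
    best = [{tag: (START[tag] * EMIT[tag].get(words[0], UNSEEN), None) for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prob, prev = max((best[-1][p][0] * TRANS[p][tag], p) for p in TAGS)
            column[tag] = (prob * EMIT[tag].get(word, UNSEEN), prev)
        best.append(column)
    # Start from the best final tag and follow backpointers to the front.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Note that "barks" is ambiguous between NOUN and VERB in the emission table; the transition probabilities after a noun are what tip the decision.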

Neural (2010s-2020s)

  • Word2Vec, GloVe.
  • LSTM seq2seq.
  • BERT, GPT.
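
What embeddings such as Word2Vec and GloVe make possible is similarity as geometry. The sketch below uses hypothetical 3-dimensional vectors written by hand; learned vectors have hundreds of dimensions, but the cosine computation is the same.

```python
import math

# Hypothetical hand-written vectors for illustration only;
# real Word2Vec or GloVe vectors are learned from corpora.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other vocabulary word whose vector is closest by cosine."""
    others = (w for w in VECTORS if w != word)
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


print(most_similar("king"))  # queen
```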

LLM (2022+)

  • Implicit linguistic knowledge.
  • Emergent abilities.
  • Zero-shot multilingual transfer.

Tasks

  • POS tagging: noun, verb, ...
  • Parsing: dependency, constituent.
  • NER: named entity.
  • Coreference resolution.
  • Word Sense Disambiguation.
  • Machine Translation.
  • Sentiment analysis.
  • Summarization.
  • Question answering (QA).
  • Dialogue.
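
Machine translation output is commonly scored with BLEU, listed under the statistical era above. The following is a simplified sketch (single reference, uniform weights, no smoothing, bigrams by default), not the full corpus-level metric; `simple_bleu` is a name introduced here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Modified precision: each candidate n-gram is clipped to its reference count.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
print(simple_bleu(reference, reference))  # 1.0
print(round(simple_bleu("the cat sat on the mat".split(), reference), 3))  # 0.707
```

The second score is the geometric mean of unigram precision 5/6 and bigram precision 3/5. Production evaluations usually rely on a maintained implementation rather than a hand-rolled one, since tokenization and smoothing choices change the number.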

Modern relevance

  • LLM eval: targeted tests of specific linguistic phenomena (BLiMP).
  • Multilingual NLP: typology-aware modeling.
  • Hallucination analysis: mismatches between syntax and semantics.
  • Low-resource language.
  • Code-switching.

Well-known resources

  • WordNet: lexical database.
  • FrameNet: semantic frames.
  • PropBank / Penn Treebank.
  • Universal Dependencies.
  • CommonCrawl + OSCAR.
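
Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns. The sketch below reads a hand-written fragment with the standard library only; the sentence and the helper `parse_conllu` are illustrative assumptions, and real projects would more likely use a dedicated reader.

```python
# A hand-written CoNLL-U fragment (10 tab-separated columns per token line).
CONLLU = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return one list of token dicts per sentence.

    Skips comment lines, multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


for tok in parse_conllu(CONLLU)[0]:
    print(tok["id"], tok["form"], tok["upos"], tok["deprel"], tok["head"])
```

A head value of 0 marks the root of the sentence, matching the convention used by the Stanza example later in this page.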

💻 Patterns

POS tagging (spaCy)

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The quick brown fox jumps over the lazy dog')
for token in doc:
    print(f'{token.text:<10} {token.pos_:<10} {token.tag_}')

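Dependency parsing

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    # Token, its dependency label, and the head it attaches to
    print(f'{token.text:<15} {token.dep_:<10} {token.head.text}')

# Visualize the parse (starts a local web server and blocks until stopped)
displacy.serve(doc, style='dep')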

NER

import spacy
nlp = spacy.load('en_core_web_trf')  # transformer-based pipeline; needs spacy-transformers
doc = nlp('Apple is looking at buying U.K. startup for $1 billion in 2024')
for ent in doc.ents:
    print(f'{ent.text}: {ent.label_}')
# Typical output: Apple: ORG, U.K.: GPE, $1 billion: MONEY, 2024: DATE

Universal Dependencies (Stanza)

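import stanza
# One-time model download: stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('I drove to Berlin yesterday.')
for sent in doc.sentences:
    for w in sent.words:
        # w.head is a 1-based index into sent.words; 0 marks the root
        head = sent.words[w.head - 1].text if w.head > 0 else 'ROOT'
        print(f'{w.text:<10} {w.upos:<8} {w.deprel:<10} {head}')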

Constituency parsing (benepar)

import benepar, spacy
# One-time model download: benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sent in doc.sents:
    print(sent._.parse_string)
# Expected shape: (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP (VBZ jumps) ...))

Word sense disambiguation

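from nltk.wsd import lesk  # requires nltk.download('wordnet')

context = 'I went to the bank to deposit money .'
# Restrict to noun senses; without a POS the simplified Lesk may pick a verb sense
sense = lesk(context.split(), 'bank', 'n')
# NLTK's own documentation reports Synset('savings_bank.n.02') for this sentence,
# not the intuitive financial-institution sense: simplified Lesk is a weak baseline.
print(sense)
print(sense.definition())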

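Linguistic evaluation of LLMs (BLiMP)

# BLiMP: 67 minimal-pair paradigms covering 12 linguistic phenomena
def blimp_score(model, blimp_examples):
    """Fraction of pairs where the model prefers the acceptable sentence.

    model.score is a hypothetical method returning a sentence log-likelihood;
    each example is a dict with BLiMP's sentence_good / sentence_bad fields.
    """
    correct = 0
    for ex in blimp_examples:
        ll_good = model.score(ex['sentence_good'])
        ll_bad = model.score(ex['sentence_bad'])
        if ll_good > ll_bad:
            correct += 1
    return correct / len(blimp_examples)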
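
The minimal-pair idea can be exercised end to end without a neural model. In this sketch, `BigramScorer` is a hypothetical stand-in for a language model (an add-one smoothed bigram model trained on four sentences), and the two pairs are invented for illustration rather than taken from BLiMP.

```python
import math
from collections import Counter


class BigramScorer:
    """Hypothetical stand-in for an LM: add-one smoothed bigram log-probability."""

    def __init__(self, corpus):
        self.bigrams, self.unigrams = Counter(), Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def score(self, sentence):
        """Log-probability of the sentence under the smoothed bigram model."""
        tokens = ["<s>"] + sentence.split()
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )


def minimal_pair_accuracy(model, pairs):
    """Share of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    hits = sum(model.score(good) > model.score(bad) for good, bad in pairs)
    return hits / len(pairs)


corpus = ["the dogs bark", "the dog barks", "the cats sleep", "the cat sleeps"]
pairs = [("the dogs bark", "the dogs barks"), ("the cat sleeps", "the cat sleep")]
print(minimal_pair_accuracy(BigramScorer(corpus), pairs))  # 1.0
```

A bigram model only passes here because subject and verb are adjacent; agreement across intervening words is exactly the kind of phenomenon a targeted minimal-pair suite is meant to probe.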

Multilingual (XLM-R)

from transformers import pipeline
pipe = pipeline('fill-mask', model='xlm-roberta-large')

# One multilingual model, no per-language configuration
print(pipe('Hello, my name is <mask>.'))
print(pipe('Bonjour, je m\'appelle <mask>.'))
print(pipe('안녕하세요, 제 이름은 <mask>입니다.'))

Code-switching detection

def detect_codeswitch(text, langid_model):
    """Detect whether a sentence mixes more than one language.

    langid_model is any object with a predict(token) method that returns
    a language code (hypothetical interface).
    """
    tokens = text.split()
    langs = [langid_model.predict(t) for t in tokens]
    unique_langs = set(langs)
    if len(unique_langs) > 1:
        return f'Code-switching: {unique_langs}'
    return None
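
A runnable variant of the same idea, assuming a deliberately crude language identifier: `ScriptLangId` is a hypothetical stand-in that guesses from Unicode script alone, so it labels every Latin-script token "en". Tokens without letters (digits, punctuation) are marked "und" and ignored, which avoids false positives that a naive per-token check would raise.

```python
import unicodedata


class ScriptLangId:
    """Hypothetical stand-in for a language-ID model: guesses from Unicode script."""

    def predict(self, token):
        for ch in token:
            name = unicodedata.name(ch, "")
            if name.startswith("HANGUL"):
                return "ko"
            if name.startswith("LATIN"):
                return "en"  # crude: any Latin-script token is labelled "en"
        return "und"  # undetermined (digits, punctuation)


def token_languages(text, langid_model):
    """Pair each whitespace token with its predicted language code."""
    return [(tok, langid_model.predict(tok)) for tok in text.split()]


def is_code_switched(text, langid_model):
    """True if at least two determined languages occur in the text."""
    langs = {lang for _, lang in token_languages(text, langid_model)} - {"und"}
    return len(langs) > 1


model = ScriptLangId()
print(token_languages("오늘 meeting 몇 시야 ?", model))
print(is_code_switched("오늘 meeting 몇 시야 ?", model))  # True
print(is_code_switched("see you at 3", model))  # False
```

Script-based detection cannot separate languages that share a script, such as English and Spanish; a trained identifier would be needed there.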

Linguistic feature extraction (Korean morphology)

from konlpy.tag import Mecab
mecab = Mecab()

text = '나는 학교에 갔다'
print(mecab.pos(text))
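# Illustrative morpheme-level analysis:
# [('나', 'NP'), ('는', 'JX'), ('학교', 'NNG'), ('에', 'JKB'), ('가', 'VV'), ('았', 'EP'), ('다', 'EF')]
# Actual segmentation and tags depend on the dictionary; Mecab may return the fused form ('갔', 'VV+EP').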

Hallucination check (NER cross-check)

import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    """Return the set of (text, label) entity pairs spaCy finds in text."""
    return {(ent.text, ent.label_) for ent in nlp(text).ents}

def syntactic_consistency_check(generated, source_facts):
    """Flag entities in the LLM output that never appear in the source text.

    Both arguments are assumed to be plain strings.
    """
    invented = extract_entities(generated) - extract_entities(source_facts)
    if invented:
        return f'Possible hallucination: {invented}'
    return None

🤔 Decision criteria

| Application | Tool |
|---|---|
| Production NLP | spaCy / Stanza |
| Korean | Mecab / KoNLPy |
| State of the art | Transformers (HF) |
| Linguistic phenomenon eval | BLiMP / SuperGLUE |
| Multilingual | XLM-R / mBERT |
| Low-resource | Parameter-efficient fine-tuning |
| Discourse | Coreference + LLM |
| Hallucination | NER + cross-check |

Default: spaCy (production) + Transformers (SOTA).

🔗 Graph

🤖 Using LLMs

When: designing an NLP system; the linguistic side of LLM evaluation; multilingual products; hallucination analysis. When not: simple text tasks where an LLM alone is enough.

Anti-patterns

  • English-only assumption: fails on multilingual input.
  • Ignoring morphology in agglutinative languages: fails on Korean, Turkish, and Finnish.
  • Stuck in the statistical era: does not leverage LLMs.
  • LLM alone with no linguistic evaluation: misses specific phenomena.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: layers, history, and spaCy / Stanza / BLiMP / XLM-R / Korean code |