---
id: wiki-2026-0508-computational-linguistics
title: Computational Linguistics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [computational linguistics, NLP roots, syntax, semantics, pragmatics, formal grammar, Chomsky]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [linguistics, nlp, syntax, semantics, parsing, llm, chomsky, formal-grammar]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: spaCy / NLTK / Stanza / Transformers
---

# Computational Linguistics

## One-line summary

> **"Mathematical models of language."** The academic root of NLP, spanning syntax, semantics, pragmatics, morphology, and phonology. Today LLMs dominate, but linguistic understanding remains relevant (evaluation, hallucination analysis, multilingual work).

## Core layers

### Phonology / Phonetics
- Sound systems.
- IPA, phonemes.

### Morphology
- Word structure.
- Inflection, derivation.
- Agglutinative (Korean, Turkish) vs. analytic (Mandarin).

### Syntax
- Sentence structure.
- Parsers, grammars.

### Semantics
- Meaning.
- Word senses, predicate-argument structure.

### Pragmatics
- Context, intent.
- Implicature, speech acts.

### Discourse
- Multi-sentence structure, coherence.

### Sociolinguistics
- Register, dialect.

## Method history

### Symbolic / Rule-based (1950s-80s)
- Chomsky's transformational grammar.
- HPSG, LFG, CCG.
- Expert systems.

### Statistical (1990s-2010s)
- Hidden Markov Models (POS tagging).
- PCFG (probabilistic context-free grammar).
- IBM models for statistical machine translation.
- BLEU metric.

### Neural (2010s-2020s)
- Word2Vec, GloVe.
- LSTM seq2seq.
- BERT, GPT.

### LLM (2022+)
- Implicit linguistic knowledge.
- Emergent abilities.
- Multilingual zero-shot transfer.

### Tasks
- **POS tagging**: noun, verb, ...
- **Parsing**: dependency, constituent.
- **NER**: named entity.
- **Coreference resolution**.
- **Word Sense Disambiguation**.
- **Machine Translation**.
- **Sentiment**.
- **Summarization**.
- **QA**.
- **Dialogue**.

### Modern relevance
- **LLM eval**: targeted tests of specific linguistic phenomena (BLiMP).
- **Multilingual NLP**: typology-aware modeling.
- **Hallucination analysis**: mismatches in syntax / semantics.
- **Low-resource languages**.
- **Code-switching**.

### Well-known resources
- **WordNet**: lexical database.
- **FrameNet**: semantic frames.
- **PropBank** / **Penn Treebank**.
- **Universal Dependencies**.
- **CommonCrawl** + **OSCAR**.

## 💻 Patterns

### POS tagging (spaCy)
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The quick brown fox jumps over the lazy dog')
for token in doc:
    # pos_ is the coarse universal POS tag, tag_ the fine-grained (Penn Treebank) tag
    print(f'{token.text:<10} {token.pos_:<10} {token.tag_}')
```

### Dependency parsing
```python
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    # Each token points to its syntactic head via a labeled dependency arc
    print(f'{token.text:<15} {token.dep_:<10} → {token.head.text}')

# Visualize (starts a local web server and blocks until stopped)
displacy.serve(doc, style='dep')
```

### NER
```python
import spacy

nlp = spacy.load('en_core_web_trf')  # transformer-based pipeline
doc = nlp('Apple is looking at buying U.K. startup for $1 billion in 2024')
for ent in doc.ents:
    print(f'{ent.text}: {ent.label_}')
# Expected (may vary by model version): Apple: ORG, U.K.: GPE, $1 billion: MONEY, 2024: DATE
```

### Universal Dependencies (Stanza)
```python
import stanza

# Requires a one-time stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('I drove to Berlin yesterday.')
for sent in doc.sentences:
    for w in sent.words:
        # w.head is a 1-based index into sent.words; 0 marks the root
        print(f'{w.text:<10} {w.upos:<8} → {sent.words[w.head-1].text if w.head > 0 else "ROOT"}')
```

### Constituency parsing (benepar)
```python
import benepar, spacy

# Requires a one-time benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sent in doc.sents:
    print(sent._.parse_string)
# (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP (VBZ jumps) ...))
```

### Word sense disambiguation
```python
from nltk.wsd import lesk  # needs the WordNet corpus: nltk.download('wordnet')

context = 'I went to the bank to deposit money'
sense = lesk(context.split(), 'bank')
# Simplified Lesk is a weak overlap-based baseline: it does not reliably pick
# depository_financial_institution.n.01 and may return an unexpected sense
# such as savings_bank.n.02, so inspect the result.
print(sense)
if sense is not None:
    print(sense.definition())
```

### Linguistic eval of LLMs (BLiMP)
```python
# BLiMP: 67 minimal-pair paradigms (1,000 pairs each) covering 12 phenomena.
# `model.score` (sentence log-likelihood) and the example attribute names
# are placeholders, not a real library API.
def blimp_score(model, blimp_examples):
    correct = 0
    for ex in blimp_examples:
        ll_good = model.score(ex.acceptable_sentence)
        ll_bad = model.score(ex.unacceptable_sentence)
        if ll_good > ll_bad:
            correct += 1
    return correct / len(blimp_examples)
```

### Multilingual (XLM-R)
```python
from transformers import pipeline

pipe = pipeline('fill-mask', model='xlm-roberta-large')
mask = pipe.tokenizer.mask_token  # the model's mask token, read from the tokenizer

# One multilingual masked LM, no per-language fine-tuning
print(pipe(f'Hello, my name is {mask}.'))
print(pipe(f"Bonjour, je m'appelle {mask}."))
print(pipe(f'안녕하세요, 제 이름은 {mask}입니다.'))
```

### Code-switching detection
```python
def detect_codeswitch(text, langid_model):
    """Detect whether a sentence mixes multiple languages."""
    # langid_model is a placeholder: any object whose predict(token) returns a language code.
    # Whitespace tokens are a crude unit; short tokens are often misclassified.
    tokens = text.split()
    langs = [langid_model.predict(t) for t in tokens]
    unique_langs = set(langs)
    if len(unique_langs) > 1:
        return f'Code-switching: {unique_langs}'
    return None
```

### Linguistic feature extraction (Korean morphology)
```python
from konlpy.tag import Mecab

mecab = Mecab()
text = '나는 학교에 갔다'
print(mecab.pos(text))
# Underlying morphemes (illustrative):
# [('나', 'NP'), ('는', 'JX'), ('학교', 'NNG'), ('에', 'JKB'), ('가', 'VV'), ('았', 'EP'), ('다', 'EF')]
# Mecab typically reports fused surface forms instead, e.g. ('갔', 'VV+EP');
# exact output depends on the dictionary version.
```

### Hallucination check (NER cross-check)
```python
import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    """Return the set of (text, label) named entities spaCy finds in `text`."""
    return {(ent.text, ent.label_) for ent in nlp(text).ents}

def entity_consistency_check(generated, source_facts):
    """Flag entities in the LLM output that never appear in the source text."""
    invented = extract_entities(generated) - extract_entities(source_facts)
    if invented:
        return f'Possible hallucination: {invented}'
    return None
```

## 🤔 Decision criteria

| Application | Tool |
|---|---|
| Production NLP | spaCy / Stanza |
| Korean | Mecab / KoNLPy |
| State of the art | Transformers (HF) |
| Linguistic phenomenon eval | BLiMP / SuperGLUE |
| Multilingual | XLM-R / mBERT |
| Low-resource | Parameter-efficient fine-tuning |
| Discourse | Coreference + LLM |
| Hallucination | NER + cross-check |

**Default**: spaCy (production) + Transformers (SOTA).

## 🔗 Graph
- Parent: [[NLP]] · [[AI]]
- Variants: [[Syntax]] · [[Semantics]] · [[Pragmatics]]
- Applications: [[Transformer_Architecture_and_LLM_Foundations|BERT]] · [[Transformer_Architecture_and_LLM_Foundations|LLM]] · [[Bag of Words (BoW)]] · [[CLIP]]
- Adjacent: [[Articulateness]] · [[Bayesian-Brain-Hypothesis]] · [[Beckett]] (literature)

## 🤖 LLM usage

**When**: designing an NLP system; the linguistic side of LLM evaluation; multilingual products; hallucination analysis.

**When not**: simple text tasks (an LLM alone is enough).

## ❌ Anti-patterns
- **English-only assumption**: fails on multilingual input.
- **No morphology** (agglutinative): fails for Korean / Turkish / Finnish.
- **Stuck in the statistical era**: not leveraging LLMs.
- **LLM alone (no linguistic eval)**: misses specific phenomena.

## 🧪 Verification / Duplicates
- Verified (Jurafsky-Martin "Speech and Language Processing", Manning Stanford NLP).
- Trust level A.
- Related: [[NLP]] · [[Transformer_Architecture_and_LLM_Foundations|BERT]] · [[Bag of Words (BoW)]] · [[Articulateness]] · [[Bayesian-Brain-Hypothesis]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: layers + history + spaCy / Stanza / BLiMP / XLM-R / Korean code |
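
## Supplementary sketch: HMM POS tagging (Viterbi)

The method history lists Hidden Markov Models for POS tagging without showing how decoding works. The following is a minimal sketch of Viterbi decoding over a toy HMM, not a production tagger. The tag set, the probability tables (`START`, `TRANS`, `EMIT`), and the `UNSEEN` floor are all hand-picked assumptions for illustration; none of them are estimated from a corpus.

```python
import math

# Toy HMM parameters: hand-picked illustrative numbers, not corpus estimates.
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.5, "barks": 0.5},
}
UNSEEN = 1e-6  # crude probability floor for unseen (tag, word) pairs


def emit_logp(tag, word):
    """Log emission probability, falling back to the UNSEEN floor."""
    return math.log(EMIT[tag].get(word, UNSEEN))


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (log-prob of the best path ending in `tag` at position i,
    #                 previous tag on that path)
    best = [{t: (math.log(START[t]) + emit_logp(t, words[0]), None) for t in STATES}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for t in STATES:
            # Pick the predecessor tag that maximizes path score into t
            score, back = max((prev[p][0] + math.log(TRANS[p][t]), p) for p in STATES)
            column[t] = (score + emit_logp(t, word), back)
        best.append(column)
    # Backtrace from the best final tag
    tag = max(STATES, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
print(viterbi(["the", "walk"]))          # ['DET', 'NOUN']
print(viterbi(["dog", "walk"]))          # ['NOUN', 'VERB']
```

The ambiguous word "walk" comes out as a noun after a determiner and as a verb after a noun, which is the contextual disambiguation that transition probabilities provide. Working in log space avoids numeric underflow on longer sentences.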