Computational Linguistics

In one line

"The mathematical modeling of language." The academic root of NLP. It covers syntax, semantics, and pragmatics, plus morphology and phonology. Today LLMs dominate, but an understanding of linguistics is still relevant (evaluation, hallucination analysis, multilingual work).

Core layers

Phonology / Phonetics

Sound systems.
IPA, phonemes.

Morphology

Word structure.
Inflection, derivation.
Agglutinative (Korean, Turkish) vs. analytic (Mandarin) languages.
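
One rough way to make the agglutinative vs. analytic contrast concrete is the average number of morphemes per word. The sketch below computes it over a few hand-segmented words; the segmentations and the `morphemes_per_word` helper are illustrative assumptions, not the output of a morphological analyzer.

```python
# Hand-segmented toy examples (illustrative, not analyzer output).
# Each word is represented as the list of its morphemes.
SEGMENTED = {
    "Turkish": [["ev", "ler", "imiz", "den"]],      # evlerimizden, "from our houses"
    "Korean": [["학교", "에"], ["가", "았", "다"]],    # 학교에 "to school", 갔다 "went"
    "Mandarin": [["我"], ["很"], ["忙"]],            # 我很忙, "I am very busy"
}


def morphemes_per_word(words):
    """Average number of morphemes per word for a list of segmented words."""
    return sum(len(morphemes) for morphemes in words) / len(words)


ratios = {lang: morphemes_per_word(words) for lang, words in SEGMENTED.items()}
for lang, ratio in ratios.items():
    print(f"{lang:<10} {ratio:.1f} morphemes/word")
```

Higher ratios point toward agglutinative morphology and ratios near 1 toward analytic structure; with only a handful of words this is a demonstration of the idea, not a measurement.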

Syntax

Sentence structure.
Parsers, grammars.

Semantics

Meaning.
Word senses, predicate-argument structure.

Pragmatics

Context, intent.
Implicature, speech acts.

Discourse

Multi-sentence structure, coherence.

Sociolinguistics

Register, dialect.

History of methods

Symbolic / Rule-based (1950s-80s)

Chomsky's transformational grammar.
HPSG, LFG, CCG.
Expert systems.

Statistical (1990s-2010s)

Hidden Markov Models (POS tagging).
PCFG (probabilistic context-free grammar).
IBM alignment models (statistical machine translation).
BLEU metric.
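
To illustrate how an HMM tagger from this era picks tags, here is a minimal Viterbi sketch. All probabilities in `START`, `TRANS`, and `EMIT` are made-up toy values chosen for the example, not estimates from any corpus.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a first-order HMM."""
    # best[i][s] = (probability of the best path ending in state s at word i,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    tags = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        tags.append(column[tags[-1]][1])
    return list(reversed(tags))


# Toy parameters (assumed values, for illustration only).
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

tags = viterbi(["the", "dog", "barks"], STATES, START, TRANS, EMIT)
print(tags)  # ['DET', 'NOUN', 'VERB']
```

A real tagger would estimate these tables from a treebank and work in log space to avoid underflow on long sentences.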

Neural (2010s-2020s)

Word2Vec, GloVe.
LSTM seq2seq.
BERT, GPT.

LLM (2022+)

Implicit linguistic knowledge.
Emergent abilities.
Multilingual zero-shot transfer.

Tasks

POS tagging: noun, verb, ...
Parsing: dependency, constituent.
NER: named entity recognition.
Coreference resolution.
Word Sense Disambiguation.
Machine Translation.
Sentiment.
Summarization.
QA.
Dialogue.

Modern relevance

LLM eval: targeted tests of specific linguistic phenomena (BLiMP).
Multilingual NLP: typology-aware modeling.
Hallucination analysis: mismatches between syntax and semantics.
Low-resource language.
Code-switching.

Well-known resources

WordNet: lexical database.
FrameNet: semantic frames.
PropBank / Penn Treebank.
Universal Dependencies.
CommonCrawl + OSCAR.

💻 Patterns

POS tagging (spaCy)

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The quick brown fox jumps over the lazy dog')
for token in doc:
    print(f'{token.text:<10} {token.pos_:<10} {token.tag_}')

Dependency parsing

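import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    # dep_ is the relation label; head is the governing token.
    print(f'{token.text:<15} {token.dep_:<10} → {token.head.text}')

# Visualize the parse; serve() starts a local web server and blocks.
displacy.serve(doc, style='dep')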

NER

import spacy

nlp = spacy.load('en_core_web_trf')  # transformer-based pipeline
doc = nlp('Apple is looking at buying U.K. startup for $1 billion in 2024')
for ent in doc.ents:
    print(f'{ent.text}: {ent.label_}')
# Typical output: Apple: ORG, U.K.: GPE, $1 billion: MONEY, 2024: DATE

Universal Dependencies (Stanza)

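import stanza

# Run stanza.download('en') once beforehand to fetch the English models.
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('I drove to Berlin yesterday.')
for sent in doc.sentences:
    for w in sent.words:
        # w.head is a 1-based index into sent.words; 0 marks the root.
        head = sent.words[w.head - 1].text if w.head > 0 else 'ROOT'
        print(f'{w.text:<10} {w.upos:<8} → {head}')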
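
Universal Dependencies treebanks are distributed in the CoNLL-U format (ten tab-separated columns per token), so the same head lookup can be done without a parser. The reader below is a minimal sketch; the `SAMPLE` annotation is hand-written for illustration rather than copied from a treebank, and `parse_conllu` / `head_form` are hypothetical helper names.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hand-written illustrative annotation (not taken from a released treebank).
SAMPLE = "\n".join([
    "# text = I drove to Berlin yesterday.",
    "1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tdrove\tdrive\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tto\tto\tADP\t_\t_\t4\tcase\t_\t_",
    "4\tBerlin\tBerlin\tPROPN\t_\t_\t2\tobl\t_\t_",
    "5\tyesterday\tyesterday\tNOUN\t_\t_\t2\tobl:tmod\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        token = dict(zip(CONLLU_FIELDS, line.split("\t")))
        if not token["id"].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # assumes HEAD is annotated
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def head_form(sentence, token):
    """Word form of the token's head, or 'ROOT' when head is 0."""
    return "ROOT" if token["head"] == 0 else sentence[token["head"] - 1]["form"]


sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    print(f"{tok['form']:<10} {tok['upos']:<6} {tok['deprel']:<9} → {head_form(sentences[0], tok)}")
```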

Constituency parsing (benepar)

import benepar
import spacy

# Run benepar.download('benepar_en3') once beforehand to fetch the parser model.
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sent in doc.sents:
    print(sent._.parse_string)
# Expected shape: (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP (VBZ jumps) ...))

Word sense disambiguation

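from nltk.wsd import lesk

# Requires the WordNet corpus: run nltk.download('wordnet') once.
context = 'I went to the bank to deposit money'
sense = lesk(context.split(), 'bank', pos='n')  # restrict candidates to noun senses
print(sense)
print(sense.definition())
# Simplified Lesk picks the sense whose gloss overlaps most with the context,
# so the intuitive sense is not guaranteed. NLTK's own docstring example reports
# Synset('savings_bank.n.02') for a near-identical input.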

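Linguistic evaluation of LLMs (BLiMP)

# BLiMP: 67 minimal-pair paradigms (1,000 pairs each) covering 12 linguistic phenomena.
# Assumes model.score(sentence) returns a log-likelihood and that each example
# exposes acceptable_sentence / unacceptable_sentence attributes.
def blimp_score(model, blimp_examples):
    """Fraction of pairs where the acceptable sentence scores higher."""
    correct = 0
    for ex in blimp_examples:
        ll_good = model.score(ex.acceptable_sentence)
        ll_bad = model.score(ex.unacceptable_sentence)
        if ll_good > ll_bad:
            correct += 1
    return correct / len(blimp_examples)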
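
To see the minimal-pair scoring loop run end to end, the sketch below substitutes a toy add-one-smoothed bigram model for the LLM. `BigramScorer`, `MinimalPair`, the training corpus, and the two pairs are all assumptions made up for this example; real BLiMP evaluation would score the released pairs with an actual language model.

```python
import math
from collections import Counter, namedtuple

MinimalPair = namedtuple("MinimalPair", ["acceptable_sentence", "unacceptable_sentence"])


class BigramScorer:
    """Toy stand-in for an LLM: add-one-smoothed bigram log-likelihood."""

    def __init__(self, corpus):
        self.bigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            vocab.update(tokens[1:])  # every token that can be predicted
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.vocab_size = len(vocab)

    def score(self, sentence):
        """Log-probability of the sentence under the smoothed bigram model."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + self.vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )


def blimp_score(model, examples):
    """Fraction of pairs where the acceptable sentence scores higher."""
    correct = sum(
        model.score(ex.acceptable_sentence) > model.score(ex.unacceptable_sentence)
        for ex in examples
    )
    return correct / len(examples)


# Toy training corpus and subject-verb agreement pairs (illustrative only).
scorer = BigramScorer([
    "the cat sleeps", "the cats sleep",
    "the dog sleeps", "the dogs sleep",
])
pairs = [
    MinimalPair("the cat sleeps", "the cat sleep"),
    MinimalPair("the dogs sleep", "the dogs sleeps"),
]
accuracy = blimp_score(scorer, pairs)
print(accuracy)  # 1.0
```

Because both sentences in a pair differ by a single word, comparing raw log-likelihoods is fair here; pairs of unequal length would need length normalization.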

Multilingual (XLM-R)

from transformers import pipeline

pipe = pipeline('fill-mask', model='xlm-roberta-large')

# One multilingual checkpoint fills the mask in each language, with no per-language fine-tuning.
print(pipe('Hello, my name is <mask>.'))
print(pipe("Bonjour, je m'appelle <mask>."))
print(pipe('안녕하세요, 제 이름은 <mask>입니다.'))

Code-switching detection

def detect_codeswitch(text, langid_model):
    """Detect whether a sentence mixes more than one language."""
    # langid_model is any object with a predict(token) method returning a language code.
    tokens = text.split()
    langs = [langid_model.predict(t) for t in tokens]
    unique_langs = set(langs)
    if len(unique_langs) > 1:
        return f'Code-switching: {unique_langs}'
    return None
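
For a quick experiment without a trained language-ID model, the Unicode script of each token can serve as a crude proxy. The `ScriptLangID` class below is a hypothetical stand-in that only distinguishes scripts (so it cannot tell English from French), and this variant of the detector ignores tokens with no letters such as numbers and punctuation.

```python
import unicodedata


class ScriptLangID:
    """Toy language-ID stand-in: labels a token by the script of its first letter."""

    PREFIXES = {
        "HANGUL": "hangul",
        "LATIN": "latin",
        "CJK": "han",
        "HIRAGANA": "kana",
        "KATAKANA": "kana",
        "CYRILLIC": "cyrillic",
    }

    def predict(self, token):
        for ch in token:
            if not ch.isalpha():
                continue
            name = unicodedata.name(ch, "")
            for prefix, script in self.PREFIXES.items():
                if name.startswith(prefix):
                    return script
        return "und"  # undetermined: no recognizable letter


def detect_codeswitch(text, langid_model):
    """Return the sorted scripts found if more than one is present, else None."""
    labels = {langid_model.predict(t) for t in text.split()}
    labels.discard("und")
    return sorted(labels) if len(labels) > 1 else None


model = ScriptLangID()
mixed = detect_codeswitch("오늘 meeting 있어요", model)
mono = detect_codeswitch("see you at 3 pm", model)
print(mixed)  # ['hangul', 'latin']
print(mono)   # None
```

Script detection misses switches between languages that share a script, so a token-level or span-level language-ID model is still needed for pairs such as English and Spanish.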

Linguistic feature extraction (Korean morphology)

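from konlpy.tag import Mecab

mecab = Mecab()  # requires mecab-ko and its dictionary to be installed

text = '나는 학교에 갔다'  # "I went to school"
print(mecab.pos(text))
# Typical output (exact tags depend on the dictionary version; the final ending may be EC or EF):
# [('나', 'NP'), ('는', 'JX'), ('학교', 'NNG'), ('에', 'JKB'), ('갔', 'VV+EP'), ('다', 'EC')]
# MeCab-ko keeps the contracted stem + past tense as one token; underlyingly 갔 = 가 (VV) + 았 (EP).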

Hallucination via syntactic check

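import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    """Return the set of (text, label) entity pairs spaCy finds in a string."""
    return {(ent.text, ent.label_) for ent in nlp(text).ents}

def syntactic_consistency_check(generated, source_facts):
    """Flag entities in the LLM output that never appear in the source text."""
    # Assumes source_facts is a plain string; exact-match comparison will
    # also flag paraphrased or differently labeled mentions.
    invented = extract_entities(generated) - extract_entities(source_facts)
    if invented:
        return f'Possible hallucination: {invented}'
    return None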

🤔 Decision criteria

Application	Tool
Production NLP	spaCy / Stanza
Korean	Mecab / KoNLPy
State of the art	Transformers (HF)
Linguistic phenomenon eval	BLiMP / SuperGLUE
Multilingual	XLM-R / mBERT
Low-resource	Parameter-efficient FT
Discourse	Coref + LLM
Hallucination	NER + cross-check

Default: spaCy (production) + Transformers (SOTA).

🔗 Graph

Parent: NLP · AI
Variants: Syntax · Semantics · Pragmatics
Applications: Transformer_Architecture_and_LLM_Foundations · Bag of Words (BoW) · CLIP
Adjacent: Articulateness · Bayesian-Brain-Hypothesis · Beckett (literature)

🤖 LLM usage

When: designing an NLP system; the linguistic side of LLM evaluation; multilingual products; hallucination analysis. When not: simple text tasks (an LLM alone is enough).

❌ Anti-patterns

English-only assumption: fails on multilingual input.
Ignoring morphology (agglutinative languages): fails on Korean, Turkish, and Finnish.
Stuck in the statistical era: does not leverage LLMs.
LLM alone (no linguistic evaluation): misses specific phenomena.

🧪 Verification / duplicates

Verified (Jurafsky-Martin "Speech and Language Processing", Manning Stanford NLP).
Confidence: A.
Related: NLP · Transformer_Architecture_and_LLM_Foundations · Bag of Words (BoW) · Articulateness · Bayesian-Brain-Hypothesis.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: layers + history + spaCy / Stanza / BLiMP / XLM-R / Korean code
