"매 language 의 mathematical model". NLP 의 academic 의 root. 매 syntax + semantics + pragmatics + 매 morphology + phonology. 매 modern: 매 LLM 가 dominant 가, 매 linguistics 의 understanding 의 still relevant (eval, hallucination, multilingual).
Core layers
Phonology / Phonetics
Sound systems.
IPA, phonemes.
Morphology
Word structure.
Inflection, derivation.
Agglutinative (Korean, Turkish) vs. analytic (Mandarin).
Syntax
Sentence structure.
Parsers, grammars.
Semantics
Meaning.
Word senses, predicate-argument structure.
Pragmatics
Context, intent.
Implicature, speech acts.
Discourse
Multi-sentence structure, coherence.
Sociolinguistics
Register, dialect.
Method history
Symbolic / Rule-based (1950s-80s)
Chomsky transformational grammar.
HPSG, LFG, CCG.
Expert system.
Statistical (1990s-2010s)
Hidden Markov Model (POS).
PCFG (probabilistic CFG).
IBM machine translation.
BLEU metric.
Neural (2010s-2020s)
Word2Vec, GloVe.
LSTM seq2seq.
BERT, GPT.
LLM (2022+)
Implicit linguistic knowledge.
Emergent abilities.
Multilingual zero-shot transfer.
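The HMM tagging approach listed under the statistical era can be sketched with a toy Viterbi decoder. This is a minimal illustration only: the tag set and every probability below are invented numbers, not estimates from any corpus.

```python
import math

# Hypothetical toy model: all probabilities below are invented for illustration.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.7, "dog": 0.1, "park": 0.2},
}
UNSEEN = 1e-6  # small floor for words a tag never emits


def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    def emit(tag, word):
        return math.log(EMIT[tag].get(word, UNSEEN))

    # best[i][tag] = (log-prob of the best path ending in tag at position i, backpointer)
    best = [{t: (math.log(START[t]) + emit(t, words[0]), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][tag])) for p in TAGS),
                key=lambda pair: pair[1],
            )
            column[tag] = (score + emit(tag, word), prev)
        best.append(column)

    # Follow backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi("the dog walk".split()))  # ['DET', 'NOUN', 'VERB']
```

Working in log space avoids underflow on longer sentences; a real tagger would estimate the tables from a treebank and smooth unseen words more carefully.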
Tasks
POS tagging: noun, verb, ...
Parsing: dependency, constituent.
NER: named entity.
Coreference resolution.
Word Sense Disambiguation.
Machine Translation.
Sentiment.
Summarization.
QA.
Dialogue.
Modern relevance
LLM eval: targeted tests of specific linguistic phenomena (BLiMP).
Multilingual NLP: typology-aware modeling.
Hallucination analysis: mismatches between syntax and semantics.
Low-resource language.
Code-switching.
Famous resources
WordNet: lexical database.
FrameNet: semantic frames.
PropBank / Penn Treebank.
Universal Dependencies.
CommonCrawl + OSCAR.
💻 Patterns
POS tagging (spaCy)
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The quick brown fox jumps over the lazy dog')
for token in doc:
    # Coarse universal POS tag, then the fine-grained treebank tag
    print(f'{token.text:<10}{token.pos_:<10}{token.tag_}')
Dependency parsing
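import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(f'{token.text:<15}{token.dep_:<10} → {token.head.text}')

# Visualize the parse (starts a local web server)
displacy.serve(doc, style='dep')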
NER
import spacy

nlp = spacy.load('en_core_web_trf')  # transformer-based pipeline
doc = nlp('Apple is looking at buying U.K. startup for $1 billion in 2024')
for ent in doc.ents:
    print(f'{ent.text}: {ent.label_}')
# Typical output: Apple: ORG, U.K.: GPE, $1 billion: MONEY, 2024: DATE
Universal Dependencies (Stanza)
import stanza

nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('I drove to Berlin yesterday.')
for sent in doc.sentences:
    for w in sent.words:
        # w.head is a 1-based index into sent.words; 0 marks the root
        head = sent.words[w.head - 1].text if w.head > 0 else 'ROOT'
        print(f'{w.text:<10}{w.upos:<8} → {head}')
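Universal Dependencies treebanks are distributed in the tab-separated CoNLL-U format, so the same head lookup can be sketched without any parser. The reader below is a minimal sketch using only the standard library; the sample sentence and its annotation are hand-written for illustration, not taken from a released treebank, and the `Word` class and `read_conllu` function are hypothetical names.

```python
from dataclasses import dataclass

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample for illustration; not taken from a released treebank.
SAMPLE = """\
# text = I drove to Berlin.
1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\tdrove\tdrive\tVERB\tVBD\t_\t0\troot\t_\t_
3\tto\tto\tADP\tIN\t_\t4\tcase\t_\t_
4\tBerlin\tBerlin\tPROPN\tNNP\t_\t2\tobl\t_\t_
5\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""


@dataclass
class Word:
    id: int
    form: str
    upos: str
    head: int
    deprel: str


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (each a list of Word)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        if not row["id"].isdigit():  # skip multiword-token ranges (1-2) and empty nodes (1.1)
            continue
        current.append(
            Word(int(row["id"]), row["form"], row["upos"], int(row["head"]), row["deprel"])
        )
    if current:
        sentences.append(current)
    return sentences


for sent in read_conllu(SAMPLE):
    for w in sent:
        head = sent[w.head - 1].form if w.head > 0 else "ROOT"
        print(f"{w.form:<10}{w.upos:<8}{w.deprel:<8} -> {head}")
```

As in the Stanza example, head indices are 1-based and 0 marks the root, so the lookup subtracts one.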
Constituency parsing (benepar)
import benepar
import spacy

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sent in doc.sents:
    print(sent._.parse_string)
    # (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP (VBZ jumps) ...))
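A bracketed parse string like the one printed above is easier to work with once it is turned into a tree. The following is a minimal standard-library sketch, assuming Penn-style bracketing; `parse_brackets`, `leaves`, and `spans` are hypothetical helper names, and the input string is hand-written rather than real benepar output.

```python
import re


def parse_brackets(s):
    """Parse a Penn-style bracketed string into nested (label, children) tuples.

    Leaves are plain strings (the words).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()


def leaves(tree):
    """Yield the words of a tree from left to right."""
    for child in tree[1]:
        if isinstance(child, str):
            yield child
        else:
            yield from leaves(child)


def spans(tree, label):
    """Yield the word sequence of every constituent with the given label."""
    if tree[0] == label:
        yield " ".join(leaves(tree))
    for child in tree[1]:
        if not isinstance(child, str):
            yield from spans(child, label)


# Hand-written bracketing for illustration, not actual benepar output.
tree = parse_brackets(
    "(S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
    "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))"
)
print(list(spans(tree, "NP")))  # ['The quick brown fox', 'the lazy dog']
```

Extracting all NP spans this way is a common first step for chunking or for checking which phrases a model treats as units.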
Word sense disambiguation
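from nltk.wsd import lesk

# Requires the WordNet corpus: nltk.download('wordnet')
context = 'I went to the bank to deposit money'
sense = lesk(context.split(), 'bank')
print(sense)
# Lesk is a simple gloss-overlap heuristic, so the result is not guaranteed to be
# Synset('depository_financial_institution.n.01'); passing pos='n' restricts
# the candidates to noun senses.
print(sense.definition())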
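The idea behind Lesk is to pick the sense whose dictionary gloss overlaps most with the context. The sketch below shows that idea on a tiny hand-made sense inventory; the sense labels and glosses are paraphrased for illustration and are not WordNet entries, and `simple_lesk` is a hypothetical name, not the NLTK function.

```python
# Hypothetical mini sense inventory; the glosses are paraphrased for illustration
# and are not WordNet entries.
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits of money and makes loans",
        "bank.river": "sloping land beside a river or other body of water",
    }
}
STOPWORDS = {"a", "an", "and", "the", "of", "to", "or", "that", "i", "in", "on"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simple_lesk(context, word):
    """Pick the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    scores = {
        sense: len(ctx & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get), scores


print(simple_lesk("I went to the bank to deposit money", "bank"))
print(simple_lesk("We sat on the river bank beside the water", "bank"))
```

In the first sentence only "money" matches, because "deposit" and "deposits" are compared as raw strings. That gap is one reason practical variants lemmatize both the context and the glosses before counting overlap.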
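Linguistic evaluation of LLMs (BLiMP)
# BLiMP: 67 minimal-pair paradigms covering 12 linguistic phenomena
def blimp_score(model, blimp_examples):
    """Fraction of pairs where the acceptable sentence scores higher.

    model.score is assumed to return a sentence log-likelihood.
    """
    correct = 0
    for ex in blimp_examples:
        ll_good = model.score(ex.acceptable_sentence)
        ll_bad = model.score(ex.unacceptable_sentence)
        if ll_good > ll_bad:
            correct += 1
    return correct / len(blimp_examples)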
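To make the scoring loop concrete, here is a self-contained sketch in which an add-one-smoothed bigram model stands in for the language model. Everything here is an assumption for illustration: the `BigramLM` and `MinimalPair` classes are hypothetical, and the training sentences and test pairs are made up rather than drawn from BLiMP.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class MinimalPair:
    acceptable_sentence: str
    unacceptable_sentence: str


class BigramLM:
    """Add-one-smoothed bigram model, standing in for a real language model."""

    def __init__(self, corpus):
        self.unigrams, self.bigrams = Counter(), Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens[:-1])  # counts of bigram left-contexts
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(set(self.unigrams) | {"</s>"})

    def score(self, sentence):
        """Return the log-probability of the sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )


def blimp_score(model, pairs):
    """Fraction of pairs where the acceptable sentence gets the higher score."""
    correct = sum(
        model.score(p.acceptable_sentence) > model.score(p.unacceptable_sentence)
        for p in pairs
    )
    return correct / len(pairs)


# Made-up training sentences and test pairs, for illustration only.
corpus = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]
pairs = [
    MinimalPair("the dog barks", "the dog bark"),
    MinimalPair("the cats sleep", "the cats sleeps"),
]
lm = BigramLM(corpus)
print(blimp_score(lm, pairs))  # 1.0
```

Both sentences in each pair have the same number of tokens, which keeps the comparison from being skewed by length. With a neural LM, `score` would typically sum token log-probabilities in the same way.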
Multilingual (XLM-R)
from transformers import pipeline

pipe = pipeline('fill-mask', model='xlm-roberta-large')
# Zero-shot multilingual: one model, no per-language setup
print(pipe('Hello, my name is <mask>.'))
print(pipe("Bonjour, je m'appelle <mask>."))
print(pipe('안녕하세요, 제 이름은 <mask>입니다.'))
Code-switching detection
def detect_codeswitch(text, langid_model):
    """Detect whether a sentence mixes multiple languages."""
    tokens = text.split()
    # langid_model.predict is assumed to return a language code for one token
    langs = [langid_model.predict(t) for t in tokens]
    unique_langs = set(langs)
    if len(unique_langs) > 1:
        return f'Code-switching: {unique_langs}'
    return None
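The detector above needs some object with a `predict` method. As a runnable sketch, the toy model below guesses a label from the Unicode script of each token. `ScriptLangID` is a hypothetical stand-in, not a real language-ID library, and it cannot separate languages that share a script (for example English and French).

```python
import unicodedata


class ScriptLangID:
    """Toy stand-in for a language-ID model: guesses from the Unicode script only."""

    def predict(self, token):
        for ch in token:
            name = unicodedata.name(ch, "")
            if name.startswith("HANGUL"):
                return "ko"
            if name.startswith(("HIRAGANA", "KATAKANA")):
                return "ja"
            if name.startswith("LATIN"):
                return "latin"
        return "und"  # digits, punctuation, or an unhandled script


def detect_codeswitch(text, langid_model):
    """Return the set of detected languages if more than one occurs, else None."""
    langs = {langid_model.predict(t) for t in text.split()}
    langs.discard("und")  # ignore tokens with no usable script signal
    return langs if len(langs) > 1 else None


print(detect_codeswitch("오늘 meeting 몇 시에 start 해요?", ScriptLangID()))
print(detect_codeswitch("What time does the meeting start?", ScriptLangID()))
```

Discarding undetermined tokens keeps numbers and punctuation from being counted as a second language. Per-token prediction is noisy for short words, so real systems usually add context or smoothing over neighboring tokens.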
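Hallucination check (entity consistency)
def syntactic_consistency_check(generated, source_facts):
    """Flag entities in the generated text that do not appear in the source."""
    # nlp is a loaded spaCy pipeline; extract_entities is a helper not shown here,
    # assumed to return a set of (text, label) pairs
    gen_doc = nlp(generated)
    gen_entities = {(ent.text, ent.label_) for ent in gen_doc.ents}
    source_entities = extract_entities(source_facts)
    invented = gen_entities - source_entities
    if invented:
        return f'Possible hallucination: {invented}'
    return None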