---
id: wiki-2026-0508-morphological-and-syntactic-anal
title: Morphological and Syntactic Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Morphological Analysis, Syntactic Parsing, POS+Parsing, 형태소·구문 분석]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [nlp, morphology, syntax, parsing, dependency-parsing, korean-nlp, spacy, konlpy]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: spacy-konlpy-stanza }
---

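## One-Line Summary

Morphological analysis breaks words into smaller meaningful units (morphemes, lemmas, stems, affixes); syntactic analysis determines the grammatical relations between tokens (constituency or dependency) and represents sentence structure as a tree or graph. Both are foundational NLP tasks.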

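## Key Points

### Morphological Analysis
- **Tokenization**: splits text at word boundaries (whitespace, punctuation).
- **Lemmatization**: reduces a word to its dictionary form, the lemma ("running" → "run").
- **Stemming**: strips affixes by rule (Porter, Snowball); the result need not be a dictionary form ("running" → "run", "studies" → "studi").
- **POS tagging**: assigns a part of speech such as noun, verb, adjective, or particle.
- **Morphological features**: tense, number, case, person (the UD feature scheme).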
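UD morphological features are usually serialized as a `Name=Value|Name=Value` string (the FEATS column of CoNLL-U). As a minimal sketch, the helper below turns such a string into a dictionary; `parse_feats` is a hypothetical name chosen for illustration, not part of any library.

```python
from typing import Dict, List, Optional


def parse_feats(feats: Optional[str]) -> Dict[str, List[str]]:
    """Parse a CoNLL-U FEATS value into {feature name: [values]}."""
    # "_" is the CoNLL-U placeholder for "no features".
    if not feats or feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        # A feature can carry several comma-separated values.
        parsed[name] = value.split(",")
    return parsed


feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(feats["Tense"], feats["Number"])
```

Keeping values as lists avoids a special case for multi-valued features such as `PronType=Int,Rel`.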

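### By Language Type
- **Fusional (English, German)**: words inflect through endings, so the lemmatizer is the key component.
- **Agglutinative (Korean, Japanese, Turkish)**: a stem plus multiple affixes, so a morphological analyzer is essential.
- **Isolating (Chinese)**: little morphological change, so word segmentation is the key step.
- **Polysynthetic (Inuit languages)**: a single word can carry the content of a whole sentence, which makes analysis very difficult.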

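### Korean Specifics
- An eojeol (space-delimited unit) is not a word: "먹었습니다" = 먹/VV + 었/EP + 습니다/EF.
- Morphological analyzers: KoNLPy (Mecab, Komoran, Kkma, Okt, Hannanum), khaiii, kiwi.
- Ambiguity resolution: "감기" can be the noun "a cold" or a nominalized form of the verb 감다 ("winding"); the right analysis depends on context.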

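### Syntactic Analysis
- **Constituency parsing**: constituent trees (NP, VP, PP), based on CFG/PCFG.
- **Dependency parsing**: a graph of head ← dependent relations. Universal Dependencies (UD) is the standard annotation scheme.
- **Transition-based parser**: shift-reduce systems (arc-standard, arc-eager); MaltParser is the classic example.
- **Graph-based parser**: MST (Chu-Liu/Edmonds), Eisner algorithm; the biaffine parser also belongs here.
- **Neural parser**: Stanza, spaCy, Trankit. These use BiLSTM or Transformer encoders; Stanza scores arcs with biaffine attention, while spaCy's parser is transition-based.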
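To make the transition-based idea concrete, here is a minimal sketch of the arc-standard system. It only applies a transition sequence that is supplied by hand; a real parser would predict each action with a classifier. The function name `arc_standard` and the example sentence are illustrative assumptions.

```python
def arc_standard(words, transitions):
    """Apply SHIFT / LEFT-ARC / RIGHT-ARC and return each word's head (0 = ROOT)."""
    stack = [0]                                  # index 0 is the artificial ROOT
    buffer = list(range(1, len(words) + 1))      # word indices are 1-based
    heads = {}
    for action in transitions:
        if action == "SHIFT":
            # Move the next buffer item onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # Second-from-top becomes a dependent of the top item.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "RIGHT-ARC":
            # Top item becomes a dependent of the item below it.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {action}")
    return [heads[i] for i in range(1, len(words) + 1)]


words = ["She", "read", "books"]
# Hand-written action sequence for this sentence (an oracle would derive it from a gold tree).
transitions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
heads = arc_standard(words, transitions)
print(list(zip(words, heads)))
```

Each word is shifted once and attached once, so a sentence of n words takes 2n transitions, which is why transition-based parsers run in linear time.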

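### Current Position (2024-26)
- LLMs (GPT, Claude) handle many syntax-sensitive tasks well without an explicit parsing step.
- Explicit parsing is still useful for information extraction, grammar checking, and linguistic research.
- Stanza, spaCy 3.x, and UDPipe 2 are widely used toolkits.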

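### Applications
- Preprocessing for information extraction, NER, and relation extraction.
- Grammar checking (Grammarly).
- Syntactic transfer in machine translation.
- Morpheme-level indexing in search engines (Elasticsearch nori and Mecab for Korean).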
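For the search-indexing use case, a common step is to keep only content morphemes and drop particles and endings. The sketch below assumes `(surface, tag)` pairs in the shape that KoNLPy's `pos()` returns, with Sejong-style tags; the pairs are written by hand so the example runs without MeCab installed, and `index_terms` is a hypothetical helper.

```python
# Tag prefixes treated as content morphemes: nouns (NN*), verbs (VV), adjectives (VA).
CONTENT_TAG_PREFIXES = ("NN", "VV", "VA")


def index_terms(tagged):
    """Keep surface forms whose POS tag starts with a content-tag prefix."""
    return [form for form, tag in tagged if tag.startswith(CONTENT_TAG_PREFIXES)]


# Hand-written analysis of "아버지가 방에 들어가신다" (illustrative input).
tagged = [
    ("아버지", "NNG"), ("가", "JKS"),
    ("방", "NNG"), ("에", "JKB"),
    ("들어가", "VV"), ("신다", "EP+EF"),
]
terms = index_terms(tagged)
print(terms)
```

Indexing `들어가` rather than the full eojeol `들어가신다` lets a query with a different ending still match.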

## 💻 Patterns

```python
# 1. spaCy: English morphology + dependency parsing in one pass
# One-time setup: python -m spacy download en_core_web_trf
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("The quick brown foxes were jumping over the lazy dogs.")
for tok in doc:
    print(f"{tok.text:12} lemma={tok.lemma_:8} pos={tok.pos_:6} "
          f"dep={tok.dep_:10} head={tok.head.text}")
```

```python
# 2. NLTK: Porter / Snowball stemmers
from nltk.stem import PorterStemmer, SnowballStemmer
ps = PorterStemmer()
print(ps.stem("running"))  # 'run'
print(ps.stem("studies"))  # 'studi'  <- a stem, not a lemma
sb = SnowballStemmer("english")
print(sb.stem("studies"))  # Snowball (Porter2) exposes the same stem() interface
```

```python
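# 3. KoNLPy: Korean morphological analysis
from konlpy.tag import Mecab  # Okt and Komoran expose the same pos() interface
mecab = Mecab()  # requires a local MeCab + mecab-ko-dic installation
print(mecab.pos("아버지가 방에 들어가신다."))
# Typical output (may vary with the dictionary version):
# [('아버지','NNG'),('가','JKS'),('방','NNG'),('에','JKB'),
#  ('들어가','VV'),('신다','EP+EF'),('.','SF')]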
```

```python
# 4. kiwi (balances speed and accuracy for Korean, as of 2024)
from kiwipiepy import Kiwi
kiwi = Kiwi()
result = kiwi.tokenize("나는 학교에 갑니다.")
for t in result:
    print(t.form, t.tag, t.start, t.len)
```

```python
# 5. Stanza: neural pipeline for 70+ languages
import stanza
stanza.download("ko")
nlp = stanza.Pipeline("ko", processors="tokenize,pos,lemma,depparse")
doc = nlp("나는 책을 읽었다.")
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.feats, word.head, word.deprel)
```

```python
# 6. Constituency parsing: Berkeley Neural Parser
import benepar, spacy
# One-time setup: benepar.download("benepar_en3")
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("The quick brown fox jumps over the lazy dog.")
sent = list(doc.sents)[0]
print(sent._.parse_string)
```

```python
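# 7. Dependency visualization: displaCy
from spacy import displacy
# `doc` is a parsed spaCy Doc, e.g. the one from pattern 1
displacy.serve(doc, style="dep")  # starts a local web server; use displacy.render() in notebooks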
```

```python
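# 8. Using dependency labels: active vs. passive detection
# spaCy's English models label passive auxiliaries "auxpass"; UD treebanks use "aux:pass".
def is_passive(token):
    return any(c.dep_ in ("auxpass", "aux:pass") for c in token.children)

# `doc` is a parsed spaCy Doc, e.g. the one from pattern 1
for tok in doc:
    if tok.pos_ == "VERB" and is_passive(tok):
        print(f"Passive verb: {tok.text}")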
```

```python
# 9. Elasticsearch nori (Korean indexing)
# Index settings fragment:
#   "analyzer": {"my_nori": {"type":"custom","tokenizer":"nori_tokenizer"}}
# nori_tokenizer indexes at the morpheme level using the mecab-ko-dic dictionary
```

```python
# 10. Parsing with an LLM: structured output
import json
prompt = """Tokenize and tag (Universal Dependencies) the sentence.
Return JSON: [{"text":..., "lemma":..., "upos":..., "head":..., "deprel":...}]
Sentence: She quickly read the book yesterday."""
# Parse the Claude/GPT-4 response with json.loads
```

## Decision Criteria

| Task | Recommended tool |
|---|---|
| English production | spaCy `en_core_web_trf` |
| Multilingual (70+) | Stanza |
| Fast Korean indexing | Mecab-ko / nori |
| Korean, accuracy first | kiwi, Komoran |
| Constituency trees needed | benepar (Berkeley parser) |
| Academic/linguistic research | UD treebank + Stanza/UDPipe |
| Stemming only | NLTK Snowball |
| English lemmas only | spaCy lemmatizer (rule or lookup mode) |

Defaults: spaCy (English) / kiwi or Mecab (Korean) / Stanza (other languages).

## 🔗 Graph
- Parent: [[Natural-Language-Processing-NLP|Natural-Language-Processing]], [[Computational-Linguistics]]
- Sibling: [[Tokenization]], [[Named-Entity-Recognition-NER]]
- Child: [[Stemming]]

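## 🤖 LLM Usage
- Asking an LLM for UD-format output can give usable accuracy even zero-shot, though the output should be spot-checked.
- Structured (JSON) output fed into a spaCy pipeline allows domain adaptation without fine-tuning.
- For languages where morphological analysis is central, such as Korean, dedicated tools are still more accurate.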

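## ❌ Anti-patterns
- Showing stemmer output directly to users (non-words such as "studi").
- Applying English-style lemmatization to Korean, i.e. lemmatizing whole eojeols instead of analyzing morphemes.
- Applying plain whitespace splitting to Korean or Japanese.
- Feeding LLM parse output downstream without validation; it lacks consistency.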
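One way to guard against the last anti-pattern is a structural check before the parse is used. The sketch below assumes the JSON shape from pattern 10 and, as an extra assumption the prompt does not pin down, that `head` is a 1-based token index with 0 marking the root. `validate_parse` is a hypothetical helper, not a library function.

```python
import json

REQUIRED_KEYS = {"text", "lemma", "upos", "head", "deprel"}


def validate_parse(raw):
    """Return a list of problems found in an LLM-produced dependency parse."""
    try:
        tokens = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(tokens, list) or not tokens:
        return ["expected a non-empty JSON list"]

    errors = []
    n = len(tokens)
    for i, tok in enumerate(tokens, start=1):
        if not isinstance(tok, dict):
            errors.append(f"token {i}: not an object")
            continue
        missing = REQUIRED_KEYS - tok.keys()
        if missing:
            errors.append(f"token {i}: missing {sorted(missing)}")
        head = tok.get("head")
        # Heads must be integers in 0..n and a token cannot head itself.
        if not isinstance(head, int) or not 0 <= head <= n or head == i:
            errors.append(f"token {i}: bad head {head!r}")
    if errors:
        return errors

    heads = [tok["head"] for tok in tokens]
    if heads.count(0) != 1:
        errors.append("expected exactly one root (head == 0)")
    # Follow head links from every token; revisiting a node means a cycle.
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                errors.append(f"cycle through token {i}")
                break
            seen.add(node)
            node = heads[node - 1]
    return errors


good = json.dumps([
    {"text": "She", "lemma": "she", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"text": "read", "lemma": "read", "upos": "VERB", "head": 0, "deprel": "root"},
    {"text": "books", "lemma": "book", "upos": "NOUN", "head": 2, "deprel": "obj"},
])
print(validate_parse(good))
```

An empty list means the output is at least a well-formed tree; it says nothing about whether the labels are linguistically correct.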

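## 🧪 Verification / Duplicates
- Evaluation: UAS/LAS (dependency parsing), bracketing F1 (constituency parsing).
- Compare on the standard UD treebank datasets.
- Aliases: [[Morphological-Analysis]], [[Syntactic-Parsing]]; these can be merged into this page.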
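The dependency metrics named above are simple to compute: UAS counts tokens with the correct head, LAS counts tokens with both the correct head and the correct relation label. A minimal sketch, assuming gold and predicted parses are given as per-token `(head, deprel)` pairs over identical tokenization (`attachment_scores` is an illustrative name):

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: head matches. LAS: head and label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(tuple(g) == tuple(p) for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]   # right head, wrong label on token 3
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Published evaluations usually aggregate over all tokens in a test set rather than averaging per-sentence scores, and differ on whether punctuation is counted.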

## 🕓 Changelog
- Phase 1 (2026-05-08): initial creation.
- Manual cleanup (2026-05-10): fixed the canonical page, tidied Korean tools (kiwi, mecab), added UD-based patterns and LLM parsing.