---
id: wiki-2026-0508-deep-grammar
title: Deep Grammar
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [deep grammar, generative grammar, Chomsky hierarchy, universal grammar, syntactic structures]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [linguistics, chomsky, generative-grammar, syntax, nlp, formal-language]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: NLP / Formal Language
  applicable_to: [Linguistics, Compiler, NLP, LLM]
---

# Deep Grammar

## In one line

> **"The underlying structure of every surface sentence."** Chomsky's generative grammar: a finite set of rules produces an infinite set of sentences. Deep structure (abstract, meaning-bearing) ↔ surface structure (form). Modern view: LLMs learn grammar implicitly, with no explicit grammar.

## Core concepts

### Chomsky hierarchy

1. **Type 0** (Recursively enumerable): recognized by Turing machines.
2. **Type 1** (Context-sensitive): e.g. a^n b^n c^n; recognized by linear-bounded automata.
3. **Type 2** (Context-free): most programming-language syntax; recognized by pushdown automata.
4. **Type 3** (Regular): classical regular expressions; recognized by finite automata.

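To make the levels concrete, here is a minimal sketch of three membership tests, one per language class. The function names are illustrative assumptions, and the direct string comparisons stand in for the automata a real recognizer would use.

```python
import re

def is_regular_ab(s: str) -> bool:
    """Type 3: a*b* is regular, so a plain regular expression suffices."""
    return re.fullmatch(r"a*b*", s) is not None

def is_anbn(s: str) -> bool:
    """Type 2: a^n b^n (n >= 1) needs matched counting, which no finite automaton can do."""
    n = len(s) // 2
    return n >= 1 and len(s) == 2 * n and s == "a" * n + "b" * n

def is_anbncn(s: str) -> bool:
    """Type 1: a^n b^n c^n (n >= 1) is context-sensitive but not context-free."""
    n = len(s) // 3
    return n >= 1 and len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n

print(is_regular_ab("aaab"), is_anbn("aabb"), is_anbncn("aabbcc"))  # True True True
print(is_regular_ab("aba"), is_anbn("aab"), is_anbncn("aabbc"))     # False False False
```
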
### Deep vs surface

- **Deep structure**: the abstract underlying representation from which meaning is interpreted.
- **Surface structure**: the spoken / written form.
- **Transformation**: maps one to the other, e.g. active ↔ passive.

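As a toy illustration of one underlying structure yielding two surface forms, the sketch below uses a hypothetical `DeepStructure` record and two string-building "transformations". It is a deliberate simplification, not a model of transformational grammar proper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeepStructure:
    """Hypothetical stand-in for a deep structure: who did what to whom."""
    agent: str
    verb_past: str        # e.g. "saw"
    verb_participle: str  # e.g. "seen"
    patient: str

def to_active(d: DeepStructure) -> str:
    # Active surface form: agent in subject position.
    return f"{d.agent} {d.verb_past} {d.patient}"

def to_passive(d: DeepStructure) -> str:
    # Passive surface form: patient promoted to subject, agent moved to a by-phrase.
    return f"{d.patient} was {d.verb_participle} by {d.agent}"

d = DeepStructure("the dog", "saw", "seen", "the cat")
print(to_active(d))   # the dog saw the cat
print(to_passive(d))  # the cat was seen by the dog
```
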
### Universal grammar (UG)

- Innate language faculty (Chomsky).
- Parameter setting (e.g. head-initial vs head-final).
- Critical period for acquisition.

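The head-direction parameter can be pictured as a single switch that changes word order while the head-complement relation stays the same. The `linearize` helper below is a hypothetical illustration of that idea only. English verb phrases are head-initial, while Japanese and Korean are head-final.

```python
def linearize(head: str, complement: str, head_initial: bool) -> str:
    """Order a head and its complement according to the head-direction parameter."""
    return f"{head} {complement}" if head_initial else f"{complement} {head}"

# Same head-complement pair, two parameter settings (English glosses used for both).
print(linearize("read", "the book", head_initial=True))   # read the book
print(linearize("read", "the book", head_initial=False))  # the book read
```
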
### Modern stance

- **Pre-LLM**: explicit rules (CFG, dependency grammar).
- **Post-LLM**: implicit grammar, learned by transformers through attention.
- **Hybrid**: LLM + grammar constraints applied at decoding time.

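The hybrid approach can be sketched without any model: at each step, tokens the grammar forbids are masked out, and the best remaining token is chosen. In the sketch below, `toy_scores` is a hypothetical stand-in for model logits, and the date template is an assumed example constraint. Real libraries compile the grammar into an automaton instead.

```python
import string

TEMPLATE = "dddd-dd-dd"  # 'd' = any digit; every other character is a literal
VOCAB = list(string.digits + "-" + "abc")  # toy character-level vocabulary

def allowed_next(prefix: str) -> set[str]:
    """Characters that keep `prefix` extendable to a full match of TEMPLATE."""
    if len(prefix) >= len(TEMPLATE):
        return set()
    slot = TEMPLATE[len(prefix)]
    return set(string.digits) if slot == "d" else {slot}

def toy_scores(prefix: str) -> dict[str, float]:
    """Hypothetical logits: the 'model' prefers letters, then the digit '2'."""
    return {ch: (3.0 if ch in "abc" else 2.0 if ch == "2" else 1.0) for ch in VOCAB}

def constrained_greedy_decode() -> str:
    out = ""
    while True:
        allowed = allowed_next(out)
        if not allowed:  # template fully consumed
            return out
        scores = toy_scores(out)
        # Mask step: only grammar-legal characters compete for the argmax.
        out += max(allowed, key=lambda ch: scores[ch])

print(constrained_greedy_decode())  # 2222-22-22
```

Unconstrained greedy decoding with the same scores would emit only letters. The mask is what guarantees a well-formed result.
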
### Applications

1. **Parsing**: building syntax trees.
2. **Compiler**: BNF / EBNF grammars.
3. **NLP**: POS tagging, dependency parsing.
4. **Code completion**: grammar-guided LLMs.
5. **DSL**: ANTLR / Tree-sitter.
6. **Constrained decoding**: JSON-schema-constrained LLM output.

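The first two items meet in a recursive-descent parser, where each EBNF rule becomes one function and the return value is the syntax tree. The grammar in the comment is an assumed example, and the tuple-based tree shape is an arbitrary choice for this sketch.

```python
import re

# Assumed EBNF for illustration:
#   expr   = term   { ("+" | "-") term } ;
#   term   = factor { ("*" | "/") factor } ;
#   factor = NUMBER | "(" expr ")" ;

def tokenize(src: str) -> list[str]:
    # Note: characters outside the token set are silently skipped in this sketch.
    return re.findall(r"\d+|[-+*/()]", src)

def parse(src: str):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = (take(), node, term())  # left-associative
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = (take(), node, factor())
        return node

    def factor():
        tok = take()
        if tok == "(":
            node = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token: {tok!r}")
        return int(tok)

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return tree

print(parse("3 + 4 * 2"))  # ('+', 3, ('*', 4, 2))
```

Operator precedence falls out of the rule nesting: `term` sits below `expr`, so `*` binds tighter than `+`.
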
## 💻 Patterns

### CFG with NLTK

```python
import nltk

# Toy CFG: the PP rules let "in the park" attach to either the NP or the VP.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'park'
V -> 'saw' | 'chased'
P -> 'in' | 'with'
""")

parser = nltk.ChartParser(grammar)
# The sentence is ambiguous under this grammar, so more than one tree is printed.
for tree in parser.parse('the dog saw a cat in the park'.split()):
    tree.pretty_print()
```

### Dependency parsing (spaCy)

```python
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("The cat sat on the mat.")
for token in doc:
    # Columns: token text, dependency label, syntactic head
    print(f'{token.text:10} {token.dep_:10} {token.head.text}')
```

### Tree-sitter grammar (DSL)

```javascript
module.exports = grammar({
  name: 'mylang',
  rules: {
    source_file: $ => repeat($._statement),
    _statement: $ => choice($.assignment, $.function_call),
    assignment: $ => seq($.identifier, '=', $._expression),
    identifier: $ => /[a-zA-Z_][a-zA-Z0-9_]*/,
    // function_call and _expression are referenced above and must be defined here
    // ...
  }
});
```

### Constrained LLM decoding (grammar-guided)

```python
# Written against the outlines 0.x API; newer releases may expose a different interface.
from outlines import models, generate
from pydantic import BaseModel

model = models.transformers('gpt2')

# Regex constraint: the output must be a YYYY-MM-DD shaped string
generator = generate.regex(model, r'\d{4}-\d{2}-\d{2}')
print(generator('Date: '))

# JSON schema constraint: the output must parse into a User
class User(BaseModel):
    name: str
    age: int

gen = generate.json(model, User)
```

### PEG parser

```python
# parsimonious
from parsimonious.grammar import Grammar

# parsimonious does not skip whitespace, so it is matched explicitly via `ws`.
# The sequence alternative in `factor` is parenthesized to keep the grouping explicit.
grammar = Grammar(r"""
expr = ws term (ws ("+" / "-") ws term)* ws
term = factor (ws ("*" / "/") ws factor)*
factor = number / ("(" expr ")")
number = ~r"[0-9]+"
ws = ~r"\s*"
""")

tree = grammar.parse("3 + 4 * 2")
```

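### Chomsky-Normal-Form CYK

```python
def cyk(words, grammar):
    """Return True if `words` derives from 'S'.

    `grammar` is a list of (lhs, rhs) pairs in CNF, where rhs is a tuple:
    (terminal,) or (nonterminal, nonterminal).
    """
    n = len(words)
    if n == 0:
        return False  # a CNF grammar (without S -> epsilon) cannot derive the empty string
    # table[i][j] holds the nonterminals that derive words[i..j] (inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for lhs, rhs in grammar:
            if rhs == (w,):
                table[i][i].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point: [i..k] + [k+1..j]
                for lhs, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[i][k] and rhs[1] in table[k+1][j]:
                        table[i][j].add(lhs)
    return 'S' in table[0][n-1]
```
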
## Decision criteria

| Situation | Approach |
|---|---|
| Programming language | CFG / PEG |
| NLP parsing | Dependency (spaCy) |
| LLM output structure | Constrained decoding |
| Custom DSL | Tree-sitter |
| Compiler frontend | ANTLR / yacc |
| Linguistics research | UG / minimalist |

**Default**: in the LLM era, rely on the transformer's implicit grammar, plus constrained decoding for critical output.

## 🔗 Graph

- Variants: [[Generative-Grammar]] · [[Universal-Grammar]]
- Applications: [[Domain-Specific-Languages]] · [[NLP]]
- Adjacent: [[Transformer_Architecture_and_LLM_Foundations|LLM]]

## 🤖 LLM usage

**When to use**: syntactic analysis, grammar-guided generation, DSL design.

**When not to use**: free-form text, zero-shot LLM prompting.

## ❌ Anti-patterns

- **Over-rigid grammar**: throws away the flexibility that makes the LLM useful.
- **Ignoring ambiguity**: a sentence can have multiple parses.
- **Equating deep structure with semantics**: modern views keep the two separate.
- **No constraint at decode time**: invalid output.

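The ambiguity point can be checked mechanically by counting parses instead of stopping at the first one. The sketch below is a counting variant of CYK over a hypothetical CNF grammar with PP-attachment ambiguity. The grammar, lexicon and `count_parses` name are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical CNF grammar: a PP can attach to the NP or to the VP.
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "cats": ["NP"], "with": ["P"], "telescopes": ["NP"],
}

def count_parses(words, start="S"):
    n = len(words)
    if n == 0:
        return 0
    # counts[i][j][A] = number of distinct trees deriving words[i..j] from A
    counts = [[defaultdict(int) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for tag in LEXICON.get(w, []):
            counts[i][i][tag] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for lhs, left, right in BINARY:
                    # Every left tree pairs with every right tree.
                    counts[i][j][lhs] += counts[i][k][left] * counts[k + 1][j][right]
    return counts[0][n - 1][start]

print(count_parses("I saw cats with telescopes".split()))  # 2
```

The two parses correspond to "cats with telescopes" (NP attachment) and "saw ... with telescopes" (VP attachment).
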
## 🧪 Verification / duplicates

- Verified (Chomsky, formal language theory).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-20 | Auto-reinforced |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Chomsky hierarchy + NLTK / spaCy / tree-sitter / constrained decode code |