---
id: wiki-2026-0508-deep-grammar
title: Deep Grammar
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [deep grammar, generative grammar, Chomsky hierarchy, universal grammar, syntactic structures]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [linguistics, chomsky, generative-grammar, syntax, nlp, formal-language]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: NLP / Formal Language
  applicable_to: [Linguistics, Compiler, NLP, LLM]
---

# Deep Grammar

## One-line summary

> **"The underlying structure of every surface sentence."** Chomsky's generative grammar: a finite set of rules produces an infinite set of sentences. Deep structure (meaning) ↔ surface structure (form). Modern view: LLMs learn grammar implicitly, with no explicit grammar.

## Core concepts

### Chomsky hierarchy

1. **Type 0** (Recursively enumerable): recognized by Turing machines.
2. **Type 1** (Context-sensitive): e.g. a^n b^n c^n.
3. **Type 2** (Context-free): the syntax of most programming languages.
4. **Type 3** (Regular): regular expressions (without backreferences).

### Deep vs surface

- **Deep structure**: the meaning representation.
- **Surface structure**: the spoken / written form.
- **Transformation**: maps between the two, e.g. active ↔ passive.

### Universal grammar (UG)

- An innate language faculty (Chomsky).
- Parameter setting (head-initial vs head-final).
- Critical period for acquisition.

### Modern stance

- **Pre-LLM**: explicit rules (CFG, dependency grammar).
- **Post-LLM**: implicit grammar, learned by transformer attention.
- **Hybrid**: LLM + grammar constraints applied at decoding time.

### Applications

1. **Parsing**: syntax trees.
2. **Compiler**: BNF / EBNF.
3. **NLP**: POS tagging, dependency parsing.
4. **Code completion**: grammar-guided LLM.
5. **DSL**: ANTLR / Tree-sitter.
6. **Constrained decoding**: LLM output that conforms to a JSON schema.
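As a rough illustration of the Chomsky hierarchy levels listed above, the sketch below checks three toy languages, each needing more recognizing power than the last. The function names (`is_regular_ab`, `is_anbn`, `is_anbncn`) are made up for this example and do not come from any library. The recognizers are hand-written shortcuts, not general parsers for their grammar class.

```python
import re


def is_regular_ab(s: str) -> bool:
    """Type 3: a*b* is regular, so a plain regular expression suffices."""
    return re.fullmatch(r"a*b*", s) is not None


def is_anbn(s: str) -> bool:
    """Type 2: a^n b^n follows the context-free rule  S -> 'a' S 'b' | ''."""
    if s == "":
        return True
    # Peel one 'a' from the front and one 'b' from the back, then recurse.
    return s.startswith("a") and s.endswith("b") and is_anbn(s[1:-1])


def is_anbncn(s: str) -> bool:
    """Type 1: a^n b^n c^n is not context-free; here we simply count."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n


samples = ["aabb", "aab", "aabbcc", "aabbc"]
# Each value is (in a*b*, in a^n b^n, in a^n b^n c^n).
results = {s: (is_regular_ab(s), is_anbn(s), is_anbncn(s)) for s in samples}

for s, flags in results.items():
    print(f"{s:8} {flags}")
```

`aab` is accepted only by the regular pattern, `aabb` also by the context-free rule, and `aabbcc` only by the counting check. That matches the idea that matching counts across two, then three, blocks needs progressively more than a finite automaton.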
## 💻 Patterns

### CFG with NLTK

```python
import nltk

# Toy context-free grammar; the PP can attach to the NP or the VP,
# so the example sentence is ambiguous and yields more than one tree.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'park'
V -> 'saw' | 'chased'
P -> 'in' | 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('the dog saw a cat in the park'.split()):
    tree.pretty_print()
```

### Dependency parsing (spaCy)

```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("The cat sat on the mat.")
for token in doc:
    # token text, dependency label, and the head it attaches to
    print(f'{token.text:10} {token.dep_:10} {token.head.text}')
```

### Tree-sitter grammar (DSL)

```javascript
module.exports = grammar({
  name: 'mylang',
  rules: {
    source_file: $ => repeat($._statement),
    _statement: $ => choice($.assignment, $.function_call),
    assignment: $ => seq($.identifier, '=', $._expression),
    identifier: $ => /[a-zA-Z_][a-zA-Z0-9_]*/,
    // ...
  }
});
```

### Constrained LLM decoding (grammar-guided)

```python
# Written against the `outlines.generate` interface of Outlines 0.x;
# newer releases may expose a different API.
from outlines import models, generate
from pydantic import BaseModel

model = models.transformers('gpt2')

# Regex constraint: output must look like YYYY-MM-DD
generator = generate.regex(model, r'\d{4}-\d{2}-\d{2}')
print(generator('Date: '))


# JSON schema constraint derived from a Pydantic model
class User(BaseModel):
    name: str
    age: int


gen = generate.json(model, User)
print(gen('Describe a user as JSON: '))
```

### PEG parser

```python
# parsimonious
from parsimonious.grammar import Grammar

# `ws` absorbs optional whitespace around each factor; without it the
# spaces in the input below would make the parse fail.
grammar = Grammar(r"""
expr   = term (("+" / "-") term)*
term   = factor (("*" / "/") factor)*
factor = ws (number / ("(" expr ")")) ws
number = ~"[0-9]+"
ws     = ~"\s*"
""")

tree = grammar.parse("3 + 4 * 2")
```

### Chomsky-Normal-Form CYK

```python
def cyk(words, grammar):
    """Return True if `words` derives from 'S'.

    `grammar` is a list of (lhs, rhs) pairs in Chomsky Normal Form,
    where rhs is a tuple: (terminal,) or (nonterminal, nonterminal).
    """
    n = len(words)
    if n == 0:
        return False  # CNF without an epsilon rule cannot derive the empty string

    # table[i][j] holds the nonterminals that derive words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Spans of length 1: terminal rules
    for i, w in enumerate(words):
        for lhs, rhs in grammar:
            if rhs == (w,):
                table[i][i].add(lhs)

    # Longer spans: try every split point k with binary rules
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for lhs, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[i][k] and rhs[1] in table[k + 1][j]:
                        table[i][j].add(lhs)

    return 'S' in table[0][n - 1]
```

## Decision criteria

| Situation | Approach |
|---|---|
| Programming language | CFG / PEG |
| NLP parsing | Dependency (spaCy) |
| LLM output structure | Constrained decoding |
| Custom DSL | Tree-sitter |
| Compiler frontend | ANTLR / yacc |
| Linguistics research | UG / minimalist |

**Default**: in the LLM era, rely on implicit grammar (transformer) and add constrained decoding for critical output.

## 🔗 Graph

- Variants: [[Generative-Grammar]] · [[Universal-Grammar]]
- Applications: [[Domain-Specific-Languages]] · [[NLP]]
- Adjacent: [[Transformer_Architecture_and_LLM_Foundations|LLM]]

## 🤖 LLM usage

**When**: syntactic analysis, grammar-guided generation, DSL design.

**When not**: free-form text, zero-shot LLM use.

## ❌ Anti-patterns

- **Over-rigid grammar**: loses the LLM's advantage.
- **Ignoring ambiguity**: a sentence can have multiple parses.
- **Deep ≠ semantic**: the modern view keeps deep structure and semantics separate.
- **No constraint at decode**: leads to invalid output.

## 🧪 Verification / duplicates

- Verified (Chomsky, formal language theory).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-20 | Auto-reinforced |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Chomsky hierarchy + NLTK / spaCy / tree-sitter / constrained decode code |