---
id: AI-SPEECH-REC-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, nlp, speech-recognition, asr, signal-processing, deep-learning, audio-analysis]
last_reinforced: 2026-04-26
---

# Speech Recognition Foundations

## 📌 One-Line Insight (The Karpathy Summary)

> "Decompose the unstructured sound wave travelling through the air into precise numerical features, then combine linguistic statistics with deep learning's grasp of context to restore it as 'text', a form of knowledge." This is the foundation of automatic speech recognition (ASR), the technology that converts human speech signals into text data a computer can process.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Feature Extraction and Probabilistic Decoding". The speech signal is cut into short time frames, frequency features (e.g., MFCC) are extracted from each frame, and the features are passed through an acoustic model and a language model to derive the most probable word sequence.
- **Core technical evolution:**
  - **Classic:** A statistical approach that combines HMMs (hidden Markov models) with GMMs.
  - **End-to-End:** Deep learning (CNN, RNN, Transformer) handles everything from feature extraction to text generation in a single network (e.g., CTC, attention-based models).
  - **Pre-trained Models:** General-purpose models such as OpenAI Whisper, pre-trained on very large datasets, have emerged that are robust to noise and dialects.
- **Significance:** A gateway that connects speech, the most intuitive means of communication between humans and machines, to the digital world through AI assistants, automatic subtitle generation, real-time interpretation, and similar applications.
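
The feature-extraction half of the pattern above can be illustrated with a minimal sketch. It uses only the Python standard library and stops at the windowed power spectrum, which is the stage that comes before mel filtering and the DCT in an MFCC pipeline, so it is not a full MFCC implementation. The function names (`frame_signal`, `hamming`, `power_spectrum`, `dominant_frequency`) and the 25 ms frame / 10 ms hop sizes in the usage note are assumptions chosen for illustration, not something this note prescribes.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop_len):
    """Cut a list of samples into overlapping frames; a ragged tail is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]


def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """Windowed power spectrum of one frame using a naive O(n^2) DFT.

    Returns n // 2 + 1 values (bins 0 .. Nyquist). A real system would
    use an FFT library instead; this form is only meant to be readable.
    """
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def dominant_frequency(frame, sample_rate):
    """Frequency in Hz of the strongest bin in the frame's power spectrum."""
    spec = power_spectrum(frame)
    peak_bin = max(range(len(spec)), key=spec.__getitem__)
    return peak_bin * sample_rate / len(frame)
```

As a usage example, a 0.1 s synthetic 1000 Hz tone sampled at 8000 Hz (800 samples), framed with `frame_signal(tone, 200, 80)`, yields 8 frames of 25 ms each, and `dominant_frequency(frames[0], 8000)` returns `1000.0`. In a complete pipeline, each frame's spectrum would next be compressed into a small feature vector and handed to the acoustic model.
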

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with past data:** Earlier systems stopped at transcribing sound into text. Speech recognition has since developed into a more advanced form of speech intelligence that also recognizes the speaker's emotion and the context of the surrounding environment, and that separates multiple people speaking at once (the Cocktail Party Effect).
- **Policy change:** When building an agent's multimodal interface, the Antigravity project adopts as its standard a modern Transformer-based speech recognition engine that guarantees multilingual support and low-latency recognition.

## 🔗 Knowledge Links (Graph)

- [[Signal-Processing-Foundations|Signal-Processing-Foundations]], [[Natural-Language-Processing-NLP|Natural-Language-Processing-NLP]], [[Self-Attention-Mechanisms|Self-Attention-Mechanisms]], [[Sequence-to-Sequence-Models|Sequence-to-Sequence-Models]]
- **Raw Source:** 10_Wiki/Topics/AI/Speech-Recognition-Foundations.md