---
id: wiki-2026-0508-speech-recognition-foundations
title: Speech Recognition Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ASR, Automatic Speech Recognition, Speech-to-Text, STT]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [asr, speech, audio, whisper, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Whisper/NeMo/faster-whisper
---

# Speech Recognition Foundations

## One-Line Summary

> **"A sequence-to-sequence map from an acoustic signal to text tokens."** ASR is a stack of a frontend (mel-spectrogram), an acoustic model (CTC / attention / RNN-T), and a language model (n-gram or LM-fused). As of 2026, the production standards are Whisper-large-v3, Voxtral, Distil-Whisper, and NVIDIA Canary-1B with Parakeet-TDT.

## Core Concepts

### Frontend (signal processing)

1. **Resample**: 16 kHz mono is the standard.
2. **STFT → Mel**: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
3. **Log-mel + normalization**: the Whisper input. Whisper clamps and rescales the log-mel values rather than applying cepstral mean normalization.
4. **VAD** (Silero, WebRTC): trims silence.

### Acoustic-model paradigms

- **Hybrid HMM-DNN** (Kaldi era): deprecated for new production systems.
- **CTC** (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
- **Attention encoder-decoder** (LAS, Whisper): non-monotonic, full-context. High accuracy offline.
- **RNN-T / Transducer**: streaming and accurate. Dominant in mobile and voice-assistant systems (Parakeet-TDT, Apple Siri).

### Modern open ASR (2026)

- **Whisper-large-v3** (OpenAI 2023): 99 languages, multilingual transcription plus translation.
- **Distil-Whisper-v3.5**: 6× faster, 49% smaller, within about 1% WER of the teacher model.
- **Voxtral** (Mistral 2025): multilingual ASR with LLM-grade understanding.
- **NVIDIA Canary-1B-Flash**: 4 languages, top of the Open ASR Leaderboard (2025).
- **Parakeet-TDT-1.1B** (NVIDIA): streaming transducer with token-and-duration (TDT) decoding.
- **SeamlessM4T v2** (Meta): bidirectional speech and text translation.

### Applications

1. Meeting transcription (Otter, Granola, Fireflies).
2. Captioning (live + offline).
3. Voice agents (Pipecat, LiveKit).
4. Call-center analytics.
5. Accessibility (OS-level live captions).

## 💻 Patterns

### Whisper inference (HuggingFace)

```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)

# Long audio is split into 30 s chunks and decoded in batches.
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```

### faster-whisper (CTranslate2)

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
# `segments` is a generator: decoding happens while iterating.
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
```

### Streaming with Parakeet-TDT (NeMo)

```python
import torch
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

# Sketch following NeMo's cache-aware streaming example script; verify the
# argument names against your installed NeMo version.
def stream_transcribe(path, model_name="nvidia/parakeet-tdt-1.1b"):
    # Cache-aware streaming needs a checkpoint trained for it; swap model_name if this one is not.
    m = nemo_asr.models.ASRModel.from_pretrained(model_name).eval()
    buf = CacheAwareStreamingAudioBuffer(model=m, online_normalization=True)
    buf.append_audio_file(path, stream_id=-1)
    ch, tm, ch_len = m.encoder.get_initial_cache_state(batch_size=1)
    hyps = pred = None
    for step, (chunk, chunk_len) in enumerate(buf):
        drop = 0 if step == 0 else m.encoder.streaming_cfg.drop_extra_pre_encoded
        with torch.inference_mode():
            pred, texts, ch, tm, ch_len, hyps = m.conformer_stream_step(
                processed_signal=chunk, processed_signal_length=chunk_len,
                cache_last_channel=ch, cache_last_time=tm, cache_last_channel_len=ch_len,
                keep_all_outputs=buf.is_buffer_empty(), previous_hypotheses=hyps,
                previous_pred_out=pred, drop_extra_pre_encoded=drop, return_transcription=True,
            )
        yield texts  # running transcript after each chunk
```

### MLX Whisper (Apple Silicon)

```python
# pip install mlx-whisper
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])
```

### Custom mel feature extraction

```python
import torch
import torchaudio

def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    # Resample to the 16 kHz rate the frontend expects.
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,
        n_fft=400,       # 25 ms window
        hop_length=160,  # 10 ms hop
        n_mels=80,       # Whisper-large-v3 expects 128
    )(wav)
    # Clamp before the log to avoid log10(0).
    return torch.log10(torch.clamp(spec, min=1e-10))
```

### CTC decode with KenLM rescoring

```python
from pyctcdecode import build_ctcdecoder

# `vocab` is the acoustic model's label list (one entry per logit column).
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5,  # language-model weight
    beta=1.5,   # length (word-insertion) bonus
)
# logits: (T, V) numpy array of per-frame scores
text = decoder.decode(logits, beam_width=100)
```

### Diarization + ASR (pyannote + Whisper)

```python
import torchaudio
from pyannote.audio import Pipeline

# Gated model: accept its terms on Hugging Face and authenticate first.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diar = pipe("call.wav", num_speakers=2)

wav, sr = torchaudio.load("call.wav")  # (channels, samples)
wav = wav.mean(dim=0).numpy()          # downmix to mono for the ASR pipeline

# `asr` is the Hugging Face pipeline built in the Whisper inference pattern above.
for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")
```

### WER evaluation

```python
from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))  # 0.25
print(cer(ref, hyp))
# Detailed I/D/S counts; process_words replaces the deprecated compute_measures.
m = process_words(ref, hyp)
print(m.insertions, m.deletions, m.substitutions)
```

## Decision Criteria

| Situation | Model |
|---|---|
| 99-language offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |

**Default**: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.

## 🔗 Graph

- Parent: [[Sequence-to-Sequence-Models|Sequence-Modeling]]
- Applications: [[Whisper]] · [[Voice-Agent]]
- Adjacent: [[VAD]] · [[TTS]]

## 🤖 LLM Usage

**When to use**: transcript post-processing (punctuation, summarization), vocabulary/prompt biasing for domain terms, hallucination filtering.

**When not to use**: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).

## ❌ Anti-Patterns

- **No VAD**: silence triggers hallucinations ("Thanks for watching!"). A known Whisper issue.
- **Wrong sample rate**: feeding 8 kHz or 44.1 kHz audio to a 16 kHz model without resampling produces garbage output.
- **`chunk_length_s` set too long**: Whisper is designed for 30 s windows. Longer chunks are truncated.
- **Greedy decoding on noisy audio**: use beam search with LM rescoring instead.
- **No diarization on multi-speaker audio**: speaker turns are lost.

## 🧪 Verification / Duplicates

- Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |
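
The frontend settings listed above (25 ms window, 10 ms hop at 16 kHz) map directly onto the `n_fft=400` and `hop_length=160` values in the mel extraction pattern. A minimal sketch of that arithmetic follows; the helper name `ms_to_samples` is hypothetical, and the frame count assumes one frame per hop with no edge padding.

```python
SAMPLE_RATE = 16_000  # Hz, the standard ASR input rate

def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000

n_fft = ms_to_samples(25)        # 25 ms analysis window
hop_length = ms_to_samples(10)   # 10 ms hop between frames
# Frames in a 30 s window, counting one frame per hop (no edge padding).
frames_per_window = (30 * SAMPLE_RATE) // hop_length

print(n_fft, hop_length, frames_per_window)  # 400 160 3000
```

The same conversion is a quick sanity check when adapting the frontend to another sample rate: the window and hop are defined in milliseconds, so the sample counts must change with the rate.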
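
The CTC entry above mentions monotonic alignment and a blank token. The sketch below shows the collapse rule that turns a per-frame argmax path into text: merge consecutive repeats, then drop blanks. The toy vocabulary and the function name `ctc_greedy_decode` are assumptions for illustration, and this is plain greedy decoding, not the beam search with LM rescoring shown in the pyctcdecode pattern.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, labels, blank_id=0):
    """Collapse a per-frame label path into text using the CTC rule."""
    # Step 1: merge runs of identical consecutive ids.
    collapsed = [label_id for label_id, _ in groupby(frame_ids)]
    # Step 2: drop blanks, which only separate repeated symbols.
    return "".join(labels[i] for i in collapsed if i != blank_id)

# Hypothetical toy vocabulary; index 0 is the blank token.
labels = ["_", "c", "a", "t", "o"]

print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 3, 0, 0], labels))  # cat
print(ctc_greedy_decode([3, 4, 0, 4], labels))                    # too
```

The second call shows why the blank exists: without it between the two `o` frames, the repeats would merge and "too" could never be emitted.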
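
The WER figure of 0.25 in the evaluation pattern comes from the standard definition, WER = (S + D + I) / N, where S, D, and I are word-level substitutions, deletions, and insertions and N is the number of reference words. A minimal sketch of that computation with a word-level edit distance follows. The function name `word_error_rate` is hypothetical, and it splits on whitespace only, with none of the text normalization a library such as jiwer can apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.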