"매 acoustic signal → text token 의 sequence-to-sequence map". ASR 매 frontend (mel-spectrogram) + acoustic model (CTC/attention/RNNT) + language model (n-gram 또는 LM-fused) 의 stack. 매 2026 매 Whisper-large-v3 / Voxtral / Distil-Whisper / NVIDIA Canary-1B + Parakeet-TDT 의 production 의 표준.
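Core concepts
Frontend (signal processing)
Resample: 16 kHz mono is the standard.
STFT → Mel: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
Log-mel + normalization: the Whisper input. Whisper applies a fixed scaling to the log-mel features rather than classic per-utterance cepstral mean normalization.
VAD (Silero, WebRTC): trims silence before decoding.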
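The frame arithmetic behind the frontend settings above can be sketched with the standard library alone. This is an illustrative sketch, not any model's actual feature extractor: the helper names are hypothetical, the frame count assumes no padding, and the mel conversion uses the HTK-style formula, which is only one of several conventions (some libraries default to the Slaney scale).

```python
import math

SAMPLE_RATE = 16_000      # Hz, mono
WIN_MS, HOP_MS = 25, 10   # STFT window and hop from the list above
N_MELS = 80

def ms_to_samples(ms, sr=SAMPLE_RATE):
    """Convert a duration in milliseconds to a sample count."""
    return sr * ms // 1000

def num_frames(n_samples, win, hop):
    """Number of full frames without padding: 1 + floor((n - win) / hop)."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

def hz_to_mel(f):
    """HTK-style mel scale (an assumption; other conventions exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_freqs(n_mels=N_MELS, f_min=0.0, f_max=SAMPLE_RATE / 2):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

win = ms_to_samples(WIN_MS)   # 400 samples
hop = ms_to_samples(HOP_MS)   # 160 samples
frames_30s = num_frames(30 * SAMPLE_RATE, win, hop)
centers = mel_center_freqs()
```

Real extractors usually pad the signal, so a 30 s clip typically yields a round 3000 frames rather than the unpadded count computed here.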
Acoustic-model paradigms
Hybrid HMM-DNN (Kaldi era): deprecated for new production systems.
CTC (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
Attention encoder-decoder (LAS, Whisper): non-monotonic, full-context. High accuracy for offline use.
RNN-T / Transducer: streaming and accurate. Dominant in mobile and voice-assistant systems (Parakeet-TDT, Apple Siri).
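To make the CTC entry concrete, here is a minimal sketch of the greedy collapse rule: merge consecutive repeated labels, then drop blanks. The per-frame labels are hand-written for illustration; in a real system they would be the argmax of the acoustic model's per-frame output, and the `_` blank symbol is an assumption of this sketch.

```python
BLANK = "_"  # assumed blank symbol for this sketch

def ctc_collapse(frame_labels, blank=BLANK):
    """Greedy CTC decode: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank token.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# The blank between the two "l" runs is what lets CTC emit a double letter.
decoded = ctc_collapse(list("hh_e_ll_lo_"))
print(decoded)
```

Because each output depends only on the current and previous frame, this rule can run incrementally, which is one reason CTC suits streaming.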
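```python
# Note: compute_measures is deprecated since jiwer 3.0 in favor of
# process_words, and newer releases may no longer ship it.
from jiwer import wer, cer, compute_measures

ref = "the quick brown fox"
hyp = "the quik brown fox"

print(wer(ref, hyp))               # 0.25 (1 substitution out of 4 words)
print(cer(ref, hyp))               # character error rate
print(compute_measures(ref, hyp))  # detailed insertion/deletion/substitution counts
```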
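For intuition about what the `wer` and `cer` calls compute, the following standard-library sketch implements the same idea: Levenshtein edit distance divided by reference length. It is not jiwer's implementation and skips jiwer's text normalization; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def word_error_rate(ref, hyp):
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def char_error_rate(ref, hyp):
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(word_error_rate(ref, hyp))  # 1 substitution / 4 words
print(char_error_rate(ref, hyp))  # 1 deletion / 19 characters (spaces counted)
```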
Decision criteria

| Situation | Model |
| --- | --- |
| 99-language offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |
Default: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.
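A minimal sketch of the batch default, assuming the third-party `faster-whisper` package is installed. The wrapper function `transcribe_batch` is hypothetical, and the `device` and `compute_type` values are assumptions that leave backend selection to the library. The import is placed inside the function so the snippet loads even where the package is missing.

```python
def transcribe_batch(audio_path, model_size="large-v3"):
    """Transcribe one audio file offline with faster-whisper."""
    # Lazy import: faster-whisper is a third-party dependency.
    from faster_whisper import WhisperModel

    # "auto" / "default" are assumptions; pick explicit values for production.
    model = WhisperModel(model_size, device="auto", compute_type="default")
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,      # beam search instead of greedy decoding
        vad_filter=True,  # drop non-speech regions before decoding
    )
    # `segments` is a generator; iterating it runs the actual decoding.
    return info.language, [(s.start, s.end, s.text.strip()) for s in segments]
```

A call such as `transcribe_batch("meeting.wav")` (file name hypothetical) would return the detected language and a list of `(start, end, text)` tuples.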
When: transcript post-processing (punctuation, summarization), vocabulary/prompt biasing for domain terms, hallucination filtering.
When not: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).
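As one possible shape for the hallucination filter mentioned above, here is a rule-based sketch that drops segments matching a blocklist of boilerplate phrases and collapses immediate repetitions. The blocklist is hypothetical apart from "Thanks for watching!", and the repetition rule is crude: it would also drop a speaker who really does say the same sentence twice in a row.

```python
import re

# Hypothetical blocklist; extend it from hallucinations seen in your own data.
BOILERPLATE = {"thanks for watching", "thank you for watching"}

def normalize(text):
    """Lowercase and strip punctuation so comparisons ignore formatting."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

def filter_hallucinations(segments, blocklist=BOILERPLATE):
    """Drop empty segments, blocklisted phrases, and immediate repeats."""
    kept = []
    prev = None
    for text in segments:
        norm = normalize(text)
        if not norm or norm in blocklist:
            continue
        if norm == prev:  # same text as the last kept segment
            continue
        kept.append(text)
        prev = norm
    return kept

raw = ["Hello there.", "Thanks for watching!", "Next topic.", "Next topic."]
clean = filter_hallucinations(raw)
print(clean)
```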
❌ Anti-patterns
No VAD: silence triggers hallucinations ("Thanks for watching!"). A known Whisper issue.
Wrong sample rate: feeding 8 kHz or 44.1 kHz audio to a model that expects 16 kHz produces garbage output.
chunk_length_s too long: Whisper is designed for 30 s windows. Longer chunks get truncated.
Greedy decoding on noisy audio: use beam search + LM rescoring instead.
No diarization for multi-speaker audio: speaker turns are lost.
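Two of the anti-patterns above (wrong sample rate, over-long chunks) can be guarded against with simple checks before decoding. The sketch below is illustrative: the function names and the 5 s overlap are assumptions, and real pipelines usually cut at VAD boundaries rather than at fixed offsets.

```python
EXPECTED_SR = 16_000  # Hz, per the frontend settings in this note

def require_16k_mono(sample_rate, channels):
    """Fail fast instead of feeding mismatched audio to the model."""
    if sample_rate != EXPECTED_SR or channels != 1:
        raise ValueError(
            f"expected {EXPECTED_SR} Hz mono, got {sample_rate} Hz, "
            f"{channels} channel(s); resample first"
        )

def chunk_spans(n_samples, sr=EXPECTED_SR, chunk_s=30.0, overlap_s=5.0):
    """Split audio into (start, end) sample spans of at most chunk_s seconds."""
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    spans = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break
        start += step
    return spans

# 70 s of 16 kHz audio becomes three overlapping windows of at most 30 s.
spans = chunk_spans(70 * EXPECTED_SR)
print(spans)
```

The overlapping regions need to be reconciled when stitching transcripts back together, for example by keeping the text from whichever chunk has the region further from its edge.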
🧪 Verification / Duplicates
Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
Confidence: A.
🕓 Changelog
| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |