---
id: wiki-2026-0508-speech-recognition-foundations
title: Speech Recognition Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ASR, Automatic Speech Recognition, Speech-to-Text, STT]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [asr, speech, audio, whisper, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Whisper/NeMo/faster-whisper
---

# Speech Recognition Foundations

## In one line
> **"A sequence-to-sequence map from an acoustic signal to text tokens."** An ASR system is a stack of a frontend (mel-spectrogram), an acoustic model (CTC / attention / RNN-T), and a language model (n-gram or LM fusion). As of 2026, the production standards are Whisper-large-v3, Voxtral, Distil-Whisper, and NVIDIA Canary-1B plus Parakeet-TDT.

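## Core concepts

### Frontend (signal processing)
1. **Resample**: 16 kHz mono is the standard.
2. **STFT → Mel**: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
3. **Log-mel + normalization**: this is the Whisper input. Whisper clamps the log-mel dynamic range and rescales it; per-utterance cepstral mean normalization is the classic alternative in Kaldi-style pipelines.
4. **VAD** (Silero, WebRTC): trims silence.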
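
The window and hop sizes above translate directly into sample counts and frame counts. A minimal sketch, assuming a centered STFT (one frame per hop plus one); the function names `frame_params` and `num_frames` are illustrative, not part of any library:

```python
def frame_params(sr: int = 16000, win_ms: float = 25.0, hop_ms: float = 10.0):
    """Convert window and hop durations (ms) into sample counts."""
    win = int(round(sr * win_ms / 1000))
    hop = int(round(sr * hop_ms / 1000))
    return win, hop


def num_frames(num_samples: int, hop: int) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return num_samples // hop + 1


win, hop = frame_params()
print(win, hop)                     # 400 160
print(num_frames(30 * 16000, hop))  # 3001
```

A 30 s clip at 16 kHz therefore yields about 3000 frames, which is the fixed input length Whisper expects (its preprocessing drops the final frame).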

### Acoustic-model paradigms
- **Hybrid HMM-DNN** (Kaldi era): no longer the choice for new production systems.
- **CTC** (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
- **Attention encoder-decoder** (LAS, Whisper): non-monotonic, full-context. High accuracy offline.
- **RNN-T / Transducer**: streaming and accurate. Dominant in mobile and voice-assistant ASR (Parakeet-TDT, Apple Siri).
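
The CTC blank-token rule can be sketched in a few lines: take the per-frame argmax IDs, merge consecutive repeats, then drop blanks. This is a minimal greedy-decoding sketch; the `vocab` mapping and the choice of ID 0 for the blank are hypothetical.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank=0):
    """Merge consecutive repeated IDs, then remove blank tokens."""
    return [k for k, _ in groupby(frame_ids) if k != blank]


# Hypothetical vocabulary; ID 0 is assumed to be the CTC blank.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax IDs
ids = ctc_greedy_collapse(frames)
print("".join(vocab[i] for i in ids))  # cat
```

The blank is what allows genuinely doubled symbols: `[1, 0, 1]` collapses to `[1, 1]`, whereas `[1, 1]` collapses to `[1]`.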

### Modern open ASR (2026)
- **Whisper-large-v3** (OpenAI 2023): 99 languages, multilingual transcription plus translation.
- **Distil-Whisper-v3.5**: about 6× faster and 49% smaller than the teacher, at a cost of roughly 1% WER.
- **Voxtral** (Mistral 2025): multilingual ASR + LLM-grade understanding.
- **NVIDIA Canary-1B-Flash**: 4 languages, top of the Open ASR Leaderboard (2025).
- **Parakeet-TDT-1.1B** (NVIDIA): transducer with token-and-duration (TDT) prediction; usable for streaming via chunked or buffered inference.
- **SeamlessM4T v2** (Meta): speech + text bidirectional translation.

### Applications
1. Meeting transcription (Otter, Granola, Fireflies).
2. Captioning (live + offline).
3. Voice agents (Pipecat, LiveKit).
4. Call-center analytics.
5. Accessibility (live caption OS-level).

## 💻 Patterns

### Whisper inference (HuggingFace)
```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```
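
How a long file is cut into overlapping 30 s windows can be sketched as follows. This is an illustration of the idea, not the pipeline's internal code: `chunk_windows` is a hypothetical helper, and the 5 s stride on each side (one sixth of the chunk) is assumed here to match the Hugging Face pipeline default.

```python
def chunk_windows(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) windows in seconds that overlap by stride_s on each side."""
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    step = chunk_s - 2 * stride_s  # new audio covered by each window
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

The overlap gives the decoder context on both sides of each cut, so words that straddle a boundary can be reconciled when the chunk transcripts are merged.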

### faster-whisper (CTranslate2)
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
```

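### Chunked (pseudo-streaming) inference with Parakeet-TDT (NeMo)
```python
import nemo.collections.asr as nemo_asr
import soundfile as sf

m = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

def transcribe_chunks(path: str, chunk_s: float = 10.0):
    """Yield text for fixed-size chunks of a 16 kHz mono file.

    parakeet-tdt-1.1b is a full-context model, so this only approximates
    streaming. True low-latency streaming needs a cache-aware checkpoint
    driven through NeMo's CacheAwareStreamingAudioBuffer.
    """
    wav, sr = sf.read(path, dtype="float32")  # assumes 16 kHz mono input
    step = int(chunk_s * sr)
    for start in range(0, len(wav), step):
        # Assumes NeMo 2.x: transcribe() accepts NumPy arrays and returns
        # Hypothesis objects; older versions return plain strings.
        hyp = m.transcribe([wav[start:start + step]])[0]
        yield getattr(hyp, "text", hyp)

for text in transcribe_chunks("call.wav"):
    print(text)
```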

### MLX Whisper (Apple Silicon)
```python
# pip install mlx-whisper
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])
```

### Custom mel feature extraction
```python
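import torch, torchaudio

def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Log-mel features: 25 ms window (400 samples), 10 ms hop (160), 80 bins."""
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,
    )(wav)
    # Clamp before the log to avoid log10(0) on digital silence.
    # Note: Whisper additionally limits the dynamic range and rescales,
    # so these features are not a drop-in Whisper input.
    return torch.log10(torch.clamp(spec, min=1e-10))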
```

### CTC decode with KenLM rescoring
```python
from pyctcdecode import build_ctcdecoder

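# vocab: list of output labels in the acoustic model's index order
# (pyctcdecode treats "" as the CTC blank).
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5, beta=1.5,  # alpha: LM weight, beta: length-score weight
)
# logits: (T, V) NumPy array of per-frame scores from a CTC model
text = decoder.decode(logits, beam_width=100)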
```

### Diarization + ASR (pyannote + Whisper)
```python
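import soundfile as sf
from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline

# Gated model: accept its terms on Hugging Face and provide an access token.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = hf_pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

wav, sr = sf.read("call.wav", dtype="float32")  # assumes a mono file
diar = pipe("call.wav", num_speakers=2)

# Slice the waveform per speaker turn and transcribe each slice.
for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")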
```

### WER evaluation
```python
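from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))         # 0.25
print(cer(ref, hyp))
# process_words replaces compute_measures, which newer jiwer releases removed.
out = process_words(ref, hyp)
print(out.substitutions, out.deletions, out.insertions)  # detailed S/D/I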
```
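
For reference, WER is the word-level edit distance divided by the number of reference words. The sketch below shows that computation in plain Python; `word_error_rate` is an illustrative name, and it skips the text normalization (casing, punctuation) that a real evaluation would apply first.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        curr = [i]
        for j, hw in enumerate(h, 1):
            cost = 0 if rw == hw else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(r)


print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.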

## Decision criteria
| Situation | Model |
|---|---|
| 99-lang offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice-agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |

**Default**: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.

## 🔗 Graph
- Parent: [[Sequence-to-Sequence-Models|Sequence-Modeling]]
- Applications: [[Whisper]] · [[Voice-Agent]]
- Adjacent: [[VAD]] · [[TTS]]

## 🤖 LLM usage
**When to use**: transcript post-processing (punctuation, summarization), vocabulary or prompt biasing for domain terms, hallucination filtering.
**When not to use**: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).

## ❌ Anti-patterns
- **No VAD**: silence triggers hallucinations ("Thanks for watching!"). A known Whisper issue.
- **Wrong sample rate**: feeding 8 kHz or 44.1 kHz audio to a 16 kHz model without resampling produces garbage output.
- **chunk_length_s too long**: Whisper is designed for 30 s windows; longer chunks get truncated.
- **Greedy decoding on noisy audio**: use beam search plus LM rescoring instead.
- **No diarization on multi-speaker audio**: speaker turns are lost.
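
Alongside VAD, silence hallucinations can be screened after decoding using per-segment confidence fields. The sketch below is a heuristic, not Whisper's own logic: the `Segment` dataclass is a hypothetical stand-in whose field names mirror those on faster-whisper segments, the thresholds are Whisper's default decoding values, and dropping segments on compression ratio is an assumption (Whisper itself uses that value to trigger a temperature fallback).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a decoded segment."""
    text: str
    no_speech_prob: float
    avg_logprob: float
    compression_ratio: float


def keep_segment(seg, no_speech_threshold=0.6, logprob_threshold=-1.0,
                 compression_ratio_threshold=2.4) -> bool:
    """Return False for segments that look like silence or repetition loops."""
    # Likely silence: the model suspects no speech AND is unsure of its text.
    if seg.no_speech_prob > no_speech_threshold and seg.avg_logprob < logprob_threshold:
        return False
    # Highly compressible text usually means a repetition loop.
    if seg.compression_ratio > compression_ratio_threshold:
        return False
    return True


segments = [
    Segment("Hello everyone.", 0.02, -0.25, 1.3),
    Segment("Thanks for watching!", 0.92, -1.40, 0.9),
    Segment("la la la la la la la la", 0.10, -0.40, 3.1),
]
print([s.text for s in segments if keep_segment(s)])  # ['Hello everyone.']
```

Thresholds like these are worth tuning on a held-out sample, since overly strict values also discard quiet but genuine speech.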

## 🧪 Verification / duplicates
- Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |