---
id: wiki-2026-0508-speech-recognition-foundations
title: Speech Recognition Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ASR, Automatic Speech Recognition, Speech-to-Text, STT]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [asr, speech, audio, whisper, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Whisper/NeMo/faster-whisper
---
# Speech Recognition Foundations
## In one line
> **"A sequence-to-sequence mapping from an acoustic signal to text tokens."** ASR is a stack of a frontend (mel-spectrogram), an acoustic model (CTC / attention / RNN-T), and a language model (n-gram or LM fusion). As of 2026, the production standards are Whisper-large-v3, Voxtral, Distil-Whisper, and NVIDIA Canary-1B with Parakeet-TDT.
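## Core concepts
### Frontend (signal processing)
1. **Resample**: 16 kHz mono is the standard.
2. **STFT → Mel**: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
3. **Log-mel + cepstral mean**: log-mel features are the Whisper input. Cepstral mean normalization is the classic per-utterance normalization in Kaldi-style frontends; Whisper instead clamps and rescales the log-mel values.
4. **VAD** (Silero, WebRTC): trims silence.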
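
The window and hop durations above translate into sample counts as follows. This is a minimal sketch of the arithmetic only; `frame_params` and `num_frames` are hypothetical helper names, and the frame count assumes a centered STFT (one frame per hop, plus one).

```python
def frame_params(sample_rate: int = 16_000, window_ms: float = 25.0, hop_ms: float = 10.0):
    """Convert window and hop durations (milliseconds) to sample counts."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return win, hop


def num_frames(num_samples: int, hop: int) -> int:
    """Number of frames produced by a centered STFT with the given hop."""
    return num_samples // hop + 1


win, hop = frame_params()
print(win, hop)                        # 400 160
print(num_frames(30 * 16_000, hop))    # 3001 frames for 30 s of audio
```

At 16 kHz, the 25 ms / 10 ms convention is where the `n_fft=400, hop_length=160` values in the mel-extraction pattern come from.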
### Acoustic-model paradigms
- **Hybrid HMM-DNN** (Kaldi era): deprecated for new production systems.
- **CTC** (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
- **Attention encoder-decoder** (LAS, Whisper): non-monotonic, full-context. High accuracy offline.
- **RNN-T / Transducer**: streaming and accurate. Dominant in mobile and voice-assistant settings (Parakeet-TDT, Apple Siri).
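
To make the CTC blank-token idea concrete, here is a minimal sketch of greedy CTC decoding: take the per-frame argmax ids, merge consecutive repeats, then drop blanks. The vocabulary, frame ids, and the choice of index 0 as the blank are illustrative assumptions, not taken from any particular model.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse per-frame ids: merge consecutive repeats, then remove blanks."""
    merged = [k for k, _ in groupby(frame_ids)]
    return [k for k in merged if k != blank]


# Hypothetical 4-symbol vocabulary; "_" stands for the CTC blank.
vocab = ["_", "c", "a", "t"]
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
ids = ctc_greedy_collapse(frames)
print("".join(vocab[i] for i in ids))  # cat
```

The blank is what lets CTC emit a doubled symbol: `[1, 0, 1]` collapses to two separate `1`s, while `[1, 1]` collapses to one.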
### Modern open ASR (2026)
- **Whisper-large-v3** (OpenAI 2023): 99 languages, multilingual transcription plus translation into English.
- **Distil-Whisper-v3.5**: 6× faster, 49% smaller, within about 1% WER of the teacher model.
- **Voxtral** (Mistral 2025): multilingual ASR + LLM-grade understanding.
- **NVIDIA Canary-1B-Flash**: 4 languages, top of the Open ASR Leaderboard (2025).
- **Parakeet-TDT-1.1B** (NVIDIA): streaming RNN-T with token-and-duration prediction.
- **SeamlessM4T v2** (Meta): bidirectional speech and text translation.
### Applications
1. Meeting transcription (Otter, Granola, Fireflies).
2. Captioning (live + offline).
3. Voice agents (Pipecat, LiveKit).
4. Call-center analytics.
5. Accessibility (OS-level live captions).
## 💻 Patterns
### Whisper inference (HuggingFace)
```python
from transformers import pipeline
import torch
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)
# Long audio is split into 30 s chunks and decoded in batches of 8.
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```
### faster-whisper (CTranslate2)
```python
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# `segments` is a lazy generator: decoding runs as the loop consumes it.
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
```
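For the captioning use case, timestamped segments like the ones printed above are commonly written out as SubRip (SRT) blocks. The following is a minimal sketch under the assumption that each segment has been reduced to a `(start, end, text)` tuple; `srt_timestamp` and `to_srt` are hypothetical helper names, not part of faster-whisper.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


print(to_srt([(0.0, 1.5, " Hello."), (1.5, 4.25, " Let's begin.")]))
```

With faster-whisper, the tuples could be built as `[(s.start, s.end, s.text) for s in segments]`.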
### Streaming with Parakeet-TDT (NeMo)
```python
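import torch
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

def stream_transcribe(path, name="nvidia/parakeet-tdt-1.1b"):
    # Sketch after NeMo's cache-aware streaming example; assumes `name` is a cache-aware checkpoint.
    m = nemo_asr.models.ASRModel.from_pretrained(name).eval()
    buf = CacheAwareStreamingAudioBuffer(model=m, online_normalization=True)
    buf.append_audio_file(path, stream_id=-1)
    ch, tm, ch_len = m.encoder.get_initial_cache_state(batch_size=1)
    hyps = preds = None
    for step, (feats, lens) in enumerate(buf):  # buffer yields (features, lengths)
        drop = 0 if step == 0 else m.encoder.streaming_cfg.drop_extra_pre_encoded
        with torch.inference_mode():
            preds, texts, ch, tm, ch_len, hyps = m.conformer_stream_step(
                processed_signal=feats, processed_signal_length=lens,
                cache_last_channel=ch, cache_last_time=tm, cache_last_channel_len=ch_len,
                keep_all_outputs=buf.is_buffer_empty(), previous_hypotheses=hyps,
                previous_pred_out=preds, drop_extra_pre_encoded=drop, return_transcription=True)
        yield texts  # partial transcript after each chunk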
```
### MLX Whisper (Apple Silicon)
```python
# pip install mlx-whisper
import mlx_whisper
result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])
```
### Custom mel feature extraction
```python
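import torch
import torchaudio

def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    # Bring the audio to the 16 kHz rate the frontend expects.
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # n_fft=400 is a 25 ms window and hop_length=160 a 10 ms hop at 16 kHz.
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,
    )(wav)
    # Clamp before the log so silent frames do not produce -inf.
    return torch.log10(torch.clamp(spec, min=1e-10))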
```
### CTC decode with KenLM rescoring
```python
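from pyctcdecode import build_ctcdecoder

# vocab: list of output labels in the acoustic model's index order (supplied by the caller).
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5,  # language-model weight
    beta=1.5,   # per-word length bonus
)
# logits: (T, V) numpy array of per-frame scores from the CTC model
text = decoder.decode(logits, beam_width=100)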
```
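The `alpha` and `beta` values above weight the language model and a word-count bonus against the acoustic score. The sketch below shows the general shallow-fusion form of that combination. It is not pyctcdecode's internal code, and the hypotheses and log-probabilities are made-up illustrative numbers.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.5):
    """Shallow fusion: acoustic score + alpha * LM score + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


# (text, acoustic log-prob, LM log-prob, word count): illustrative values only.
hyps = [
    ("the quik brown fox", -4.0, -14.0, 4),
    ("the quick brown fox", -4.2, -9.0, 4),
]
best = max(hyps, key=lambda h: fused_score(*h[1:]))
print(best[0])  # the quick brown fox
```

The misspelled hypothesis has the slightly better acoustic score, but the language-model term outweighs it, which is the effect rescoring is meant to have.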
### Diarization + ASR (pyannote + Whisper)
```python
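import torchaudio
from pyannote.audio import Pipeline

# Gated model: accept its terms on Hugging Face and authenticate first.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diar = pipe("call.wav", num_speakers=2)
# Load the same file as a mono waveform so speaker turns can be sliced out.
wav, sr = torchaudio.load("call.wav")
wav = wav.mean(dim=0).numpy()
# `asr` is the transformers pipeline from the Whisper inference pattern.
for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")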
```
### WER evaluation
```python
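from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))  # 0.25 (1 substitution over 4 reference words)
print(cer(ref, hyp))
# Detailed counts; process_words replaces the older compute_measures helper.
m = process_words(ref, hyp)
print(m.substitutions, m.deletions, m.insertions, m.hits)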
```
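For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below computes it from scratch with a rolling dynamic-programming row. `word_error_rate` is a hypothetical name, it assumes a non-empty reference, and it applies none of the text normalization that jiwer offers.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # prev[j] holds the edit distance between the reference prefix seen so far and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rw != hw),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(r)


print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.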
## Decision criteria
| Situation | Model |
|---|---|
| 99-lang offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice-agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |
**Default**: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.
## 🔗 Graph
- Parent: [[Sequence-to-Sequence-Models|Sequence-Modeling]]
- Applications: [[Whisper]] · [[Voice-Agent]]
- Adjacent: [[VAD]] · [[TTS]]
## 🤖 LLM usage
**When**: transcript post-processing (punctuation, summarization), vocabulary/prompt biasing for domain terms, hallucination filtering.
**When not**: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).
## ❌ Anti-patterns
- **No VAD**: silence triggers hallucinations ("Thanks for watching!"). A known Whisper issue.
- **Wrong sample rate**: 8 kHz or 44.1 kHz audio fed to a 16 kHz model without resampling gives garbage output.
- **chunk_length_s too long**: Whisper is designed for 30 s windows. Anything longer is truncated.
- **Greedy decode in noisy audio**: use beam search + LM rescoring.
- **No diarization in multi-speaker audio**: speaker turns are lost.
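
One way to respect the 30 s window on long recordings is to cut the audio into overlapping spans before decoding. The sketch below only computes the sample-index spans. `chunk_spans` is a hypothetical helper, and the 5 s overlap is an assumed value, not a setting taken from Whisper.

```python
def chunk_spans(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    size = int(chunk_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be shorter than chunk_s")
    spans, start = [], 0
    while True:
        end = min(start + size, num_samples)
        spans.append((start, end))
        if end >= num_samples:
            break
        start += step
    return spans


# A 70 s recording at 16 kHz becomes three windows, the last one shorter.
print(chunk_spans(70 * 16_000))
```

The overlap gives each boundary word a chance to appear whole in at least one window; merging the overlapping transcripts is a separate step that this sketch does not cover.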
## 🧪 Verification / duplicates
- Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |