---
id: wiki-2026-0508-speech-recognition-foundations
title: Speech Recognition Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ASR, Automatic Speech Recognition, Speech-to-Text, STT]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [asr, speech, audio, whisper, transformer]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Whisper/NeMo/faster-whisper
---

# Speech Recognition Foundations

## In one line

> **"A sequence-to-sequence map from an acoustic signal to text tokens."** ASR is a stack of a frontend (mel-spectrogram), an acoustic model (CTC / attention / RNN-T), and a language model (n-gram or fused LM). As of 2026, the production standards are Whisper-large-v3, Voxtral, Distil-Whisper, and NVIDIA Canary-1B plus Parakeet-TDT.

## Core concepts

### Frontend (signal processing)

1. **Resample**: 16 kHz mono is the standard.
2. **STFT → Mel**: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
3. **Log-mel + normalization**: the Whisper input. Whisper clamps and rescales the log-mel values rather than applying cepstral mean normalization.
4. **VAD** (Silero, WebRTC): trims silence.

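To make the window and hop figures concrete, the arithmetic below converts them to sample counts at 16 kHz and counts the frames in a 30 s clip. It is a minimal sketch: the helper names `frame_params` and `num_frames` are illustrative, and the count assumes one frame per hop (some STFT implementations emit one extra edge frame).

```python
def frame_params(sample_rate: int = 16_000, win_ms: float = 25.0, hop_ms: float = 10.0):
    """Convert window/hop durations in milliseconds to sample counts."""
    win = int(sample_rate * win_ms / 1000)  # samples per analysis window
    hop = int(sample_rate * hop_ms / 1000)  # samples between frame starts
    return win, hop


def num_frames(num_samples: int, hop: int) -> int:
    """Frame count at one frame per hop (a trailing partial hop is dropped)."""
    return num_samples // hop


win, hop = frame_params()
print(win, hop)                      # 400 160
print(num_frames(30 * 16_000, hop))  # 3000
```

The 400 and 160 here correspond to the `n_fft` and `hop_length` values commonly passed to mel-spectrogram transforms for 16 kHz speech.
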
### Acoustic-model paradigms

- **Hybrid HMM-DNN** (Kaldi era): deprecated for new production systems.
- **CTC** (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
- **Attention encoder-decoder** (LAS, Whisper): non-monotonic, full-context. High accuracy offline.
- **RNN-T / Transducer**: streaming and accurate. Dominant in mobile and voice-assistant systems (Parakeet-TDT, Apple Siri).

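The blank-token mechanism in CTC is easiest to see in greedy decoding: take the per-frame argmax path, merge consecutive repeats, then drop blanks. The sketch below assumes a toy vocabulary with blank id 0; `ctc_greedy_decode` is an illustrative helper, not a library function.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse a per-frame argmax path into text: merge repeats, then drop blanks."""
    collapsed = [k for k, _ in groupby(frame_ids)]  # merge consecutive repeats
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


vocab = {0: "_", 1: "c", 2: "a", 3: "t"}  # toy vocabulary; id 0 is the blank
path = [1, 1, 0, 2, 2, 2, 0, 0, 3]         # frame-level argmax ids
print(ctc_greedy_decode(path, vocab))      # cat
```

A blank between two identical ids keeps them distinct, which is how CTC emits doubled letters: `[1, 0, 1]` decodes to `cc`, while `[1, 1]` decodes to `c`.
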
### Modern open ASR (2026)

- **Whisper-large-v3** (OpenAI, 2023): 99 languages, multilingual transcription plus translation.
- **Distil-Whisper-v3.5**: 6× faster, 49% smaller, within about 1% WER of the teacher model.
- **Voxtral** (Mistral, 2025): multilingual ASR with LLM-grade understanding.
- **NVIDIA Canary-1B-Flash**: 4 languages, top of the Open ASR Leaderboard (2025).
- **Parakeet-TDT-1.1B** (NVIDIA): streaming RNN-T with token-and-duration prediction.
- **SeamlessM4T v2** (Meta): bidirectional speech and text translation.

### Applications

1. Meeting transcription (Otter, Granola, Fireflies).
2. Captioning (live and offline).
3. Voice agents (Pipecat, LiveKit).
4. Call-center analytics.
5. Accessibility (OS-level live captions).

## 💻 Patterns

### Whisper inference (HuggingFace)

```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```

### faster-whisper (CTranslate2)

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
```

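### Streaming with Parakeet-TDT (NeMo)

```python
import torch
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

# Assumption: cache-aware streaming needs a checkpoint trained for it; swap in a
# cache-aware streaming model if this one rejects the streaming calls below.
m = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b").eval()


def stream_transcribe(path: str):
    """Yield the running transcript after each streamed chunk."""
    buf = CacheAwareStreamingAudioBuffer(model=m, online_normalization=True)
    buf.append_audio_file(path, stream_id=-1)
    cache = m.encoder.get_initial_cache_state(batch_size=1)
    hyps, pred = None, None
    for chunk, chunk_len in buf:  # each item is (features, lengths)
        with torch.inference_mode():
            pred, texts, *cache, hyps = m.conformer_stream_step(
                processed_signal=chunk, processed_signal_length=chunk_len,
                cache_last_channel=cache[0], cache_last_time=cache[1],
                cache_last_channel_len=cache[2],
                keep_all_outputs=buf.is_buffer_empty(),
                previous_hypotheses=hyps, previous_pred_out=pred,
                return_transcription=True,
            )
        yield texts
```
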
### MLX Whisper (Apple Silicon)

```python
# pip install mlx-whisper
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])
```

### Custom mel feature extraction

```python
import torch
import torchaudio


def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Return an 80-bin log10 mel spectrogram of `wav`, resampled to 16 kHz."""
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,  # 25 ms window, 10 ms hop
    )(wav)
    return torch.log10(torch.clamp(spec, min=1e-10))  # clamp avoids log(0)
```

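Whisper's reference preprocessing goes one step beyond a plain log-mel: as far as the openai/whisper code shows, it clamps values to within 8 (in log10 units) of the maximum and then rescales with `(x + 4) / 4`. The pure-Python sketch below illustrates that step on a flat list of toy values; `whisper_style_normalize` is an illustrative name, and a real pipeline would apply the same arithmetic to the full spectrogram tensor.

```python
def whisper_style_normalize(log_mel, dynamic_range=8.0):
    """Clamp log10-mel values to within `dynamic_range` of the max, then rescale."""
    floor = max(log_mel) - dynamic_range
    return [(max(v, floor) + 4.0) / 4.0 for v in log_mel]


values = [-10.0, -3.0, 0.0]             # toy log10-mel energies
print(whisper_style_normalize(values))  # [-1.0, 0.25, 1.0]
```

The clamp bounds the dynamic range, so very quiet bins cannot dominate the scale of the input.
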
### CTC decode with KenLM rescoring

```python
from pyctcdecode import build_ctcdecoder

# `vocab`: list of output labels in the acoustic model's index order.
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5,  # language-model weight
    beta=1.5,   # length (word-count) bonus
)
# `logits`: NumPy array of shape (T, V) holding per-frame CTC outputs.
text = decoder.decode(logits, beam_width=100)
```

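The `alpha` and `beta` parameters are easier to reason about with the generic shallow-fusion score written out: acoustic log-probability plus `alpha` times the LM log-probability plus `beta` times the word count. The sketch below uses that generic form with hypothetical log-probabilities; it is not pyctcdecode's exact internal computation.

```python
def fused_score(acoustic_logp: float, lm_logp: float, n_words: int,
                alpha: float = 0.5, beta: float = 1.5) -> float:
    """Shallow-fusion style score: acoustic + alpha * LM + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Hypothetical (acoustic_logp, lm_logp) pairs for two competing beams
candidates = {
    "recognize speech": (-4.0, -6.0),
    "wreck a nice beach": (-3.5, -14.0),
}
best = max(
    candidates,
    key=lambda text: fused_score(*candidates[text], n_words=len(text.split())),
)
print(best)  # recognize speech
```

Here the second beam has the better acoustic score, but the LM term outweighs it. Raising `beta` favors longer hypotheses, so the two weights are usually tuned together on held-out data.
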
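### Diarization + ASR (pyannote + Whisper)

```python
import soundfile as sf
from pyannote.audio import Pipeline

# The pyannote checkpoint is gated on the Hugging Face Hub: accept its terms
# and authenticate before loading.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diar = pipe("call.wav", num_speakers=2)

# Load the same file (assumed mono) so each speaker turn can be sliced out.
wav, sr = sf.read("call.wav", dtype="float32")

# `asr` is the transformers pipeline from the Whisper inference pattern.
for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")
```
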
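An alternative to slicing audio per turn is to run ASR once with timestamps and then label each transcript chunk with the speaker whose turn overlaps it most. The sketch below shows only that assignment step on hypothetical tuples; `overlap` and `assign_speakers` are illustrative helpers, and the input shapes are assumptions rather than the exact objects returned by pyannote or transformers.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(asr_chunks, turns):
    """Label each ASR chunk with the speaker whose turn overlaps it most."""
    labeled = []
    for start, end, text in asr_chunks:
        best_turn = max(turns, key=lambda t: overlap(start, end, t[0], t[1]))
        labeled.append((best_turn[2], text))
    return labeled


# Hypothetical inputs: (start, end, text) chunks and (start, end, speaker) turns
chunks = [(0.0, 2.0, "hello there"), (2.1, 4.0, "hi, how are you")]
turns = [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.2, "SPEAKER_01")]
for speaker, text in assign_speakers(chunks, turns):
    print(f"{speaker}: {text}")
```

This keeps the ASR model's full context across turn boundaries, at the cost of mislabeling chunks that span a speaker change.
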
### WER evaluation

```python
from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))  # 0.25
print(cer(ref, hyp))
# process_words supersedes compute_measures in recent jiwer releases.
out = process_words(ref, hyp)
print(out.substitutions, out.deletions, out.insertions)  # detailed S/D/I
```

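For intuition about what the metric measures, WER is the word-level edit distance divided by the number of reference words. The dependency-free sketch below computes it with a standard dynamic-programming table; `word_error_rate` is an illustrative helper that skips the text normalization a library such as jiwer can apply, and it assumes a non-empty reference.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(r, start=1):
        cur = [i]
        for j, hyp_word in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(r)


print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

One substituted word out of four reference words gives 0.25. WER can exceed 1.0 when the hypothesis contains many insertions.
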
## Decision criteria

| Situation | Model |
|---|---|
| 99-language offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |

**Default**: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.

## 🔗 Graph

- Parent: [[Sequence-to-Sequence-Models|Sequence-Modeling]]
- Applications: [[Whisper]] · [[Voice-Agent]]
- Adjacent: [[VAD]] · [[TTS]]

## 🤖 LLM usage

**When to use**: transcript post-processing (punctuation, summarization), vocabulary or prompt biasing for domain terms, hallucination filtering.

**When not to use**: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).

## ❌ Anti-patterns

- **No VAD**: silence triggers hallucinations ("Thanks for watching!"). This is a known Whisper issue.
- **Wrong sample rate**: feeding 8 kHz or 44.1 kHz audio to a 16 kHz model without resampling produces garbage output.
- **`chunk_length_s` too long**: Whisper is designed for 30 s windows; longer chunks get truncated.
- **Greedy decoding on noisy audio**: use beam search plus LM rescoring.
- **No diarization on multi-speaker audio**: speaker turns are lost.

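One way to respect the 30 s window on long recordings is to plan overlapping chunks up front and transcribe each separately. The sketch below only computes chunk boundaries; `plan_chunks` is an illustrative helper, and the 5 s overlap is an assumed value, not a library default.

```python
def plan_chunks(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) windows of at most `chunk_s` seconds that overlap by `overlap_s`."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    step = chunk_s - overlap_s
    windows, start = [], 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # last window reaches the end of the audio
            return windows
        start += step


print(plan_chunks(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap gives each boundary word a chance to appear whole in at least one window. Merging the overlapping transcripts is a separate step that this sketch does not cover.
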
## 🧪 Verification / duplicates

- Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |