Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
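Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections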

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-speech-recognition-foundations
title: Speech Recognition Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: ASR, Automatic Speech Recognition, Speech-to-Text, STT
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: asr, speech, audio, whisper, transformer
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: Whisper/NeMo/faster-whisper

Speech Recognition Foundations

One-line summary

"A sequence-to-sequence map from an acoustic signal to text tokens." An ASR system is a stack of a frontend (mel-spectrogram), an acoustic model (CTC / attention / RNN-T), and a language model (n-gram or LM fusion). As of 2026, the production standards are Whisper-large-v3, Voxtral, Distil-Whisper, and NVIDIA Canary-1B and Parakeet-TDT.

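Core concepts

Frontend (signal processing)

  1. Resample: 16 kHz mono is the standard input.
  2. STFT → Mel: 25 ms window, 10 ms hop, 80 mel bins (400 and 160 samples at 16 kHz).
  3. Log-mel + normalization: this is the Whisper input. Whisper applies a fixed dynamic-range scaling to the log-mel spectrogram rather than cepstral mean normalization, and large-v3 uses 128 mel bins instead of 80.
  4. VAD (Silero, WebRTC): trims silence before decoding.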
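
The window and hop durations above translate into sample counts with simple arithmetic. The sketch below is illustrative only: `frame_params` and `num_frames` are hypothetical helper names, and it assumes unpadded framing, whereas many STFT implementations pad the signal, so their frame counts can differ by a frame or two.

```python
def frame_params(sample_rate=16000, win_ms=25, hop_ms=10):
    """Convert window and hop durations (ms) into sample counts."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return win, hop


def num_frames(num_samples, win, hop):
    """Count full frames that fit without padding (hypothetical helper)."""
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


win, hop = frame_params()
print(win, hop)                           # 400 160
print(num_frames(16000 * 30, win, hop))   # 2998 frames for a 30 s clip
```

At a 10 ms hop, one second of audio yields roughly 100 feature frames, which is why a 30 s window is about 3,000 frames.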

Acoustic-model paradigms

  • Hybrid HMM-DNN (Kaldi era): deprecated for new production systems.
  • CTC (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
  • Attention encoder-decoder (LAS, Whisper): non-monotonic, full-context. High accuracy for offline use.
  • RNN-T / Transducer: streaming and accurate. Dominant in mobile and voice-assistant products (Parakeet-TDT, Apple Siri).
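
To make the CTC blank-token idea concrete, here is a minimal sketch of greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The function name, the blank index of 0, and the toy vocabulary are assumptions for illustration, not taken from any particular toolkit.

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse per-frame argmax ids: merge repeats, then remove blanks."""
    out = []
    prev = None
    for t in frame_ids:
        # Emit only when the label changes and is not the blank symbol.
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


# Toy vocabulary (hypothetical): index 0 is the blank.
vocab = ["_", "c", "a", "t"]
frames = [1, 1, 0, 2, 2, 2, 0, 3]
print("".join(vocab[i] for i in ctc_greedy_collapse(frames)))  # cat
```

The blank is what lets CTC emit a genuinely doubled label: `[1, 0, 1]` collapses to two tokens, while `[1, 1]` collapses to one.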

Modern open ASR (2026)

  • Whisper-large-v3 (OpenAI 2023): 99 languages, multilingual transcription + translation.
  • Distil-Whisper-v3.5: 6× faster, 49% smaller, within about 1% WER of the teacher model.
  • Voxtral (Mistral 2025): multilingual ASR + LLM-grade understanding.
  • NVIDIA Canary-1B-Flash: 4 languages, top of the Open ASR Leaderboard (2025).
  • Parakeet-TDT-1.1B (NVIDIA): streaming RNN-T variant with token-and-duration prediction (TDT).
  • SeamlessM4T v2 (Meta): speech + text bidirectional translation.

Applications

  1. Meeting transcription (Otter, Granola, Fireflies).
  2. Captioning (live + offline).
  3. Voice agents (Pipecat, LiveKit).
  4. Call-center analytics.
  5. Accessibility (OS-level live captions).

💻 Patterns

Whisper inference (HuggingFace)

from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])

faster-whisper (CTranslate2)

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")

Streaming with Parakeet-TDT (NeMo)

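import torch
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

# Sketch modeled on NeMo's cache-aware streaming example; argument names may vary by version.
# Cache-aware streaming assumes a streaming-trained encoder, so check the model card first.
m = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b").eval()

def stream_transcribe(path):
    """Yield the running hypotheses after each buffered chunk."""
    buf = CacheAwareStreamingAudioBuffer(model=m, online_normalization=True)
    buf.append_audio_file(path, stream_id=-1)
    ch, tm, ln = m.encoder.get_initial_cache_state(batch_size=1)
    hyps = pred = None
    for chunk_audio, chunk_lengths in buf:
        with torch.inference_mode():
            pred, texts, ch, tm, ln, hyps = m.conformer_stream_step(
                processed_signal=chunk_audio, processed_signal_length=chunk_lengths,
                cache_last_channel=ch, cache_last_time=tm, cache_last_channel_len=ln,
                keep_all_outputs=buf.is_buffer_empty(),
                previous_hypotheses=hyps, previous_pred_out=pred,
                return_transcription=True,
            )
        yield texts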

MLX Whisper (Apple Silicon)

# pip install mlx-whisper
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])

Custom mel feature extraction

import torch, torchaudio

def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,
    )(wav)
    return torch.log10(torch.clamp(spec, min=1e-10))

CTC decode with KenLM rescoring

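from pyctcdecode import build_ctcdecoder

# vocab: list of output labels in the acoustic model's index order,
# supplied by the caller (the CTC blank is conventionally the empty string "").
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5,  # language-model weight
    beta=1.5,   # word-count (length) bonus
)
# logits: (T, V) numpy array of per-frame scores from the acoustic model
text = decoder.decode(logits, beam_width=100)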
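
The `alpha` and `beta` parameters control shallow fusion: the acoustic score is combined with a weighted language-model score and a per-word bonus. The sketch below shows that idea on an n-best list. The `rescore` function and all scores are made up for illustration, and the exact scoring inside pyctcdecode may differ in detail.

```python
def rescore(nbest, alpha=0.5, beta=1.5):
    """Pick the best hypothesis from (text, am_logp, lm_logp) tuples.

    fused = am_logp + alpha * lm_logp + beta * word_count
    """
    def fused(item):
        text, am_logp, lm_logp = item
        return am_logp + alpha * lm_logp + beta * len(text.split())

    return max(nbest, key=fused)[0]


# Hypothetical scores: the misspelling wins acoustically but loses under the LM.
nbest = [
    ("the quik brown fox", -4.0, -12.0),
    ("the quick brown fox", -4.2, -7.0),
]
print(rescore(nbest, alpha=0.0))  # the quik brown fox
print(rescore(nbest, alpha=0.5))  # the quick brown fox
```

Raising `alpha` trusts the language model more. Raising `beta` counteracts the LM's bias toward shorter outputs.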

Diarization + ASR (pyannote + Whisper)

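import torchaudio
from pyannote.audio import Pipeline

# Gated model: accept its terms on Hugging Face and supply an access token.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diar = pipe("call.wav", num_speakers=2)

# Load the same file as a mono float array; `asr` is the transformers
# pipeline built in the Whisper inference pattern above.
wav, sr = torchaudio.load("call.wav")
wav = wav.mean(dim=0).numpy()

for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")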

WER evaluation

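from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))         # 0.25 (1 substitution / 4 reference words)
print(cer(ref, hyp))
# process_words replaces compute_measures, which is deprecated since jiwer 3.0
m = process_words(ref, hyp)
print(m.insertions, m.deletions, m.substitutions)  # detailed I/D/S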
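
For intuition about what the WER number means, it is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal dependency-free sketch of that definition. The name `word_error_rate` is hypothetical, and it applies no text normalization, so results can differ from a configured jiwer pipeline.

```python
def word_error_rate(ref, hyp):
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between r[:i-1] and h[:j].
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            substitute = prev[j - 1] + (r[i - 1] != h[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[-1] / len(r)


print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.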

Decision criteria

| Situation | Model |
|---|---|
| 99-language offline | Whisper-large-v3 |
| Edge / 6× faster | Distil-Whisper-v3.5 |
| Apple Silicon laptop | MLX Whisper Turbo |
| Streaming voice agent | Parakeet-TDT / Canary-1B-Flash |
| Multilingual + understanding | Voxtral |
| Speech-to-speech translation | SeamlessM4T v2 |

Default: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.

🔗 Graph

🤖 LLM usage

When to use: transcript post-processing (punctuation, summarization), vocabulary/prompt biasing for domain terms, hallucination filtering. When not to use: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).

Anti-patterns

  • No VAD: silence triggers hallucinations ("Thanks for watching!"). This is a known Whisper issue.
  • Wrong sample rate: feeding 8 kHz or 44.1 kHz audio to a 16 kHz model without resampling produces garbage output.
  • chunk_length_s too long: Whisper is designed for 30 s windows. Longer chunks are truncated.
  • Greedy decoding on noisy audio: use beam search + LM rescoring instead.
  • No diarization with multiple speakers: speaker turns are lost.
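
The 30 s window constraint above means long recordings must be split before decoding. The sketch below computes overlapping chunk boundaries in samples. The `chunk_bounds` name and the 5 s overlap are assumptions for illustration, and libraries such as the Hugging Face pipeline implement their own chunking and stitching when `chunk_length_s` is set.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping fixed-length chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 70 s recording at 16 kHz splits into three windows of at most 30 s.
for start, end in chunk_bounds(70 * 16000):
    print(start / 16000, end / 16000)
```

The overlap gives the decoder context on both sides of each boundary, so words cut at a chunk edge can be recovered when the transcripts are merged.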

🧪 Verification / duplicates

  • Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns) |