Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

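10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections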

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


Speech Recognition Foundations

One-line summary

"A sequence-to-sequence map from an acoustic signal to text tokens." ASR is a stack of a frontend (mel-spectrogram) + an acoustic model (CTC/attention/RNN-T) + a language model (n-gram or LM-fused). As of 2026, Whisper-large-v3 / Voxtral / Distil-Whisper / NVIDIA Canary-1B + Parakeet-TDT are the production standards.

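Core concepts

Frontend (signal processing)

Resample: 16 kHz mono is the standard.
STFT → Mel: 25 ms window, 10 ms hop, 80 mel bins (Whisper-large-v3 uses 128).
Log-mel + normalization: the Whisper input. Whisper rescales log-mel values to a fixed range rather than applying per-utterance cepstral mean normalization.
VAD (Silero, WebRTC): trims silence.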
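
The window and hop figures above translate directly into sample counts. The following is a minimal sketch of that arithmetic; the helper names `frame_params` and `num_frames` are illustrative, and the frame count assumes a centered STFT (the default in common toolkits such as torchaudio).

```python
def frame_params(sample_rate: int = 16_000, win_ms: float = 25.0, hop_ms: float = 10.0):
    """Convert window/hop durations in milliseconds to sample counts."""
    win = int(round(sample_rate * win_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    return win, hop


def num_frames(num_samples: int, hop: int) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return num_samples // hop + 1


win, hop = frame_params()
print(win, hop)                      # 400 160
print(num_frames(30 * 16_000, hop))  # 3001
```

A 30 s clip at 16 kHz therefore yields roughly 3,000 frames, which is the input length Whisper's encoder is built around.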

Acoustic-model paradigms

Hybrid HMM-DNN (Kaldi era): deprecated for new production systems.
CTC (Connectionist Temporal Classification): monotonic alignment with a blank token. Streaming-friendly.
Attention encoder-decoder (LAS, Whisper): non-monotonic, full-context. High accuracy offline.
RNN-T / Transducer: streaming + accurate. Dominant in mobile and voice assistants (Parakeet-TDT, Apple Siri).
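
To make the CTC blank-token idea concrete, here is a minimal sketch of greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The toy vocabulary and the choice of index 0 for the blank are assumptions for illustration only.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse per-frame argmax labels into an output label sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove blanks; a blank between two identical labels keeps both.
    return [label for label in merged if label != blank]


# Hypothetical toy vocabulary: index 0 is the blank.
vocab = {1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
ids = ctc_greedy_collapse(frames)
print("".join(vocab[i] for i in ids))  # cat
```

The blank is what lets CTC emit a doubled letter: `[1, 0, 1]` decodes to two separate labels, while `[1, 1]` collapses to one.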

Modern open ASR (2026)

Whisper-large-v3 (OpenAI 2023): 99 languages, multilingual + translation.
Distil-Whisper-v3.5: 6× faster, 49% smaller, within about 1% WER of the teacher model.
Voxtral (Mistral 2025): multilingual ASR + LLM-grade understanding.
NVIDIA Canary-1B-Flash: 4 languages, top of the Open ASR Leaderboard (2025).
Parakeet-TDT-1.1B (NVIDIA): streaming RNN-T with token-and-duration prediction.
SeamlessM4T v2 (Meta): bidirectional speech and text translation.

Applications

Meeting transcription (Otter, Granola, Fireflies).
Captioning (live + offline).
Voice agents (Pipecat, LiveKit).
Call-center analytics.
Accessibility (OS-level live captions).

💻 Patterns

Whisper inference (HuggingFace)

from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda",
    return_timestamps=True,
)
out = asr("meeting.wav", chunk_length_s=30, batch_size=8)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])

faster-whisper (CTranslate2)

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    language="ko",
)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
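
For captioning, the `(start, end, text)` values printed above are usually written out as subtitles. The following is a small sketch of SRT formatting in plain Python; the helper names `srt_timestamp` and `to_srt` are illustrative, and the input is assumed to be an iterable of `(start_seconds, end_seconds, text)` tuples.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for i, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n")
    return "\n".join(cues)


print(to_srt([(0.0, 1.25, " Hello."), (1.5, 3.0, " How are you?")]))
```

With faster-whisper this could be fed as `to_srt((s.start, s.end, s.text) for s in segments)`.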

Streaming with Parakeet-TDT (NeMo)

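import torch
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

# Checkpoint name kept from the original note. Cache-aware streaming needs a model
# trained with a cache-aware (limited-context) encoder, so verify this before use.
m = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b").eval()

def stream_transcribe(path: str):
    """Yield partial transcripts chunk by chunk (sketch after NeMo's cache-aware streaming example)."""
    buf = CacheAwareStreamingAudioBuffer(model=m, online_normalization=True)
    buf.append_audio_file(path, stream_id=-1)
    cache = m.encoder.get_initial_cache_state(batch_size=1)  # (channel, time, channel_len)
    hyps, pred = None, None
    for step, (chunk, chunk_len) in enumerate(buf):
        with torch.inference_mode():
            pred, texts, *cache, hyps = m.conformer_stream_step(
                processed_signal=chunk, processed_signal_length=chunk_len,
                cache_last_channel=cache[0], cache_last_time=cache[1], cache_last_channel_len=cache[2],
                keep_all_outputs=buf.is_buffer_empty(),
                previous_hypotheses=hyps, previous_pred_out=pred,
                drop_extra_pre_encoded=0 if step == 0 else m.encoder.streaming_cfg.drop_extra_pre_encoded,
                return_transcription=True,
            )
        yield texts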

MLX Whisper (Apple Silicon)

# pip install mlx-whisper
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)
print(result["text"])

Custom mel feature extraction

import torch, torchaudio

def log_mel(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,
    )(wav)
    return torch.log10(torch.clamp(spec, min=1e-10))

CTC decode with KenLM rescoring

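from pyctcdecode import build_ctcdecoder

# vocab: list of output labels from the acoustic model, in the same order as the
# logit columns (pyctcdecode expects the CTC blank to be the empty string "").
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="lm.arpa",
    alpha=0.5, beta=1.5,  # LM weight and word-insertion bonus
)
# logits: (T, V) numpy array of per-frame scores from the acoustic model
text = decoder.decode(logits, beam_width=100)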

Diarization + ASR (pyannote + Whisper)

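import soundfile as sf
from pyannote.audio import Pipeline

# Gated model: accept the terms on Hugging Face and authenticate first.
pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diar = pipe("call.wav", num_speakers=2)

# Load the same audio as a float32 array; this sketch assumes a mono file.
wav, sr = sf.read("call.wav", dtype="float32")

# `asr` is the transformers pipeline built in the Whisper inference example.
for turn, _, speaker in diar.itertracks(yield_label=True):
    seg = wav[int(turn.start * sr):int(turn.end * sr)]
    text = asr({"raw": seg, "sampling_rate": sr})["text"]
    print(f"{speaker}: {text}")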
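
An alternative to cutting the audio per speaker turn is to transcribe once and then label each ASR segment with the speaker whose turn overlaps it most. The sketch below shows that alignment step in plain Python; the tuple layouts and the `UNKNOWN` label are assumptions for illustration, not part of pyannote or Whisper.

```python
def overlap(a_start, a_end, b_start, b_end) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(asr_segments, turns):
    """Label ASR segments with the speaker of the most-overlapping turn.

    asr_segments: list of (start, end, text)
    turns:        list of (start, end, speaker)
    """
    labeled = []
    for s_start, s_end, text in asr_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            o = overlap(s_start, s_end, t_start, t_end)
            if o > best_overlap:
                best_speaker, best_overlap = speaker, o
        labeled.append((best_speaker, text))
    return labeled


turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 10.0, "SPEAKER_01")]
segments = [(0.5, 4.0, "hello"), (4.5, 9.0, "hi there")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

This keeps the ASR model's full context for each utterance, at the cost of mislabeling segments that straddle a speaker change.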

WER evaluation

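from jiwer import wer, cer, process_words

ref = "the quick brown fox"
hyp = "the quik brown fox"
print(wer(ref, hyp))         # 0.25 (1 substitution / 4 reference words)
print(cer(ref, hyp))         # character error rate
# compute_measures is deprecated in jiwer 3.x in favor of process_words
out = process_words(ref, hyp)
print(out.substitutions, out.deletions, out.insertions)  # detailed S/D/I counts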
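
For reference, WER is the word-level edit distance divided by the number of reference words. The following is a minimal from-scratch sketch of that definition, without the text normalization options jiwer offers; the function names are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitution/insertion/deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (S + D + I) / number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


print(word_error_rate("the quick brown fox", "the quik brown fox"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.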

Decision criteria

Situation	Model
99-lang offline	Whisper-large-v3
Edge / 6× faster	Distil-Whisper-v3.5
Apple Silicon laptop	MLX Whisper Turbo
Streaming voice-agent	Parakeet-TDT / Canary-1B-Flash
Multilingual + understanding	Voxtral
Speech-to-speech translation	SeamlessM4T v2

Default: faster-whisper (large-v3) for batch; Parakeet-TDT for streaming.

🔗 Graph

Parent: Sequence-to-Sequence-Models
Applications: Whisper · Voice-Agent
Adjacent: VAD · TTS

🤖 LLM usage

When to use: transcript post-processing (punctuation, summarization), vocabulary/prompt biasing for domain terms, hallucination filtering. When not to use: raw acoustic decoding (use an ASR model), numerical WER evaluation (use jiwer).

❌ Anti-patterns

No VAD: silence triggers hallucinations ("Thanks for watching!"). A known Whisper issue.
Wrong sample rate: feeding 8 kHz or 44.1 kHz audio to a 16 kHz model → garbage output.
chunk_length_s too long: Whisper is designed around 30 s windows; anything longer is truncated.
Greedy decoding on noisy audio: use beam search + LM rescoring instead.
No diarization on multi-speaker audio: speaker turns are lost.
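
As a complement to VAD, silence hallucinations can also be filtered after decoding using the per-segment confidence fields that Whisper-style decoders report. The sketch below is one possible heuristic, not a library API: `Seg` is a hypothetical stand-in for a decoder's segment object, and the thresholds mirror Whisper's usual defaults (0.6 and -1.0) but should be tuned per domain.

```python
from dataclasses import dataclass


@dataclass
class Seg:
    """Hypothetical stand-in for a decoder segment with confidence fields."""
    text: str
    no_speech_prob: float  # model's probability that the window holds no speech
    avg_logprob: float     # mean token log-probability of the decoded text


def drop_likely_hallucinations(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Keep segments unless they look both silent and low-confidence."""
    kept = []
    for s in segments:
        looks_silent = s.no_speech_prob > no_speech_threshold
        low_confidence = s.avg_logprob < logprob_threshold
        if looks_silent and low_confidence:
            continue  # likely text invented over silence
        kept.append(s)
    return kept


segs = [
    Seg("Let's start the meeting.", no_speech_prob=0.02, avg_logprob=-0.21),
    Seg("Thanks for watching!", no_speech_prob=0.91, avg_logprob=-1.40),
]
print([s.text for s in drop_likely_hallucinations(segs)])
```

Requiring both conditions avoids discarding quiet but confidently decoded speech.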

🧪 Verification / duplicates

Verified (Whisper paper arXiv:2212.04356; HuggingFace Open ASR Leaderboard 2025; NVIDIA NeMo docs).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — full content (Whisper/Parakeet/Voxtral + faster-whisper/MLX patterns)
