---
id: wiki-2026-0508-speech-synthesis
title: Speech Synthesis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [tts, text-to-speech, voice-synthesis, neural-tts]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [tts, audio, ai, voice-cloning, generative]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: elevenlabs-openai-coqui
---

# Speech Synthesis

## In One Line

> **"Text → audio waveform, via a neural acoustic model + vocoder pipeline."** Speech synthesis (TTS) converts text into natural-sounding speech. As of 2026, ElevenLabs v3, OpenAI gpt-4o-mini-tts, and Coqui XTTS-v2 produce output that listeners often cannot tell apart from human speech. Voice cloning, emotion control, and multilingual output are production-ready.

## Core Concepts

### Evolution

- **Concatenative (1990s)**: splices recorded units from a phoneme database; sounds robotic.
- **Parametric / HMM (2000s)**: statistical generation; sounds muffled.
- **Neural (2017+)**: Tacotron 2 + WaveNet; near-human prosody.
- **End-to-end (2021+)**: VITS, NaturalSpeech; a single model.
- **Foundation TTS (2024+)**: Voicebox, NaturalSpeech 3, ElevenLabs v3; zero-shot voice cloning, emotion, style.

### Architecture

- **Acoustic model**: text → mel-spectrogram (Tacotron, FastSpeech 2, VITS).
- **Vocoder**: mel → waveform (WaveNet, HiFi-GAN, BigVGAN).
- **End-to-end**: text → waveform directly (VITS, NaturalSpeech).
- **LLM-style generative (2024+)**: AudioLM (token-based audio LM), Voicebox (non-autoregressive flow matching).

### Applications

1. **Audiobook / podcast**: ElevenLabs Studio.
2. **Voice agent**: real-time conversational AI (Vapi, Retell).
3. **Game NPC**: dynamic dialog (AI Dungeon, Inworld).
4. **Accessibility**: screen readers, dyslexia aids.
5. **Localization**: video dubbing (HeyGen, ElevenLabs dubbing).
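As a rough illustration of the acoustic-model and vocoder split listed under Architecture, the sketch below wires two toy stages together using only the standard library. Everything in it is hypothetical: `toy_acoustic_model`, `toy_vocoder`, the constants, and the character-to-bin mapping are stand-ins chosen to show the data shapes (text → mel frames → waveform samples), not how any real model computes them.

```python
import math
from typing import List

N_MELS = 80         # mel bins per frame (assumed; a common choice)
HOP_LENGTH = 256    # waveform samples produced per mel frame (assumed)
SAMPLE_RATE = 22050

MelSpectrogram = List[List[float]]  # shape: [n_frames][N_MELS]


def toy_acoustic_model(text: str, frames_per_char: int = 4) -> MelSpectrogram:
    """Stand-in for an acoustic model: text -> mel frames (not a real model)."""
    frames: MelSpectrogram = []
    for ch in text:
        peak = ord(ch) % N_MELS  # pretend each character excites one mel bin
        frame = [1.0 if b == peak else 0.0 for b in range(N_MELS)]
        frames.extend([frame] * frames_per_char)
    return frames


def toy_vocoder(mel: MelSpectrogram) -> List[float]:
    """Stand-in for a vocoder: mel frames -> waveform samples in [-1, 1]."""
    wave: List[float] = []
    for i, frame in enumerate(mel):
        peak = max(range(N_MELS), key=lambda b: frame[b])
        freq = 100.0 + 40.0 * peak  # arbitrary bin-to-Hz mapping for the demo
        for n in range(HOP_LENGTH):
            t = (i * HOP_LENGTH + n) / SAMPLE_RATE
            wave.append(math.sin(2 * math.pi * freq * t))
    return wave


def synthesize(text: str) -> List[float]:
    """Two-stage pipeline: acoustic model first, then vocoder."""
    return toy_vocoder(toy_acoustic_model(text))


mel = toy_acoustic_model("hi")
audio = toy_vocoder(mel)
```

The point to take away is the contract between stages: the vocoder only ever sees mel frames, so either stage can be swapped independently, which is what end-to-end models give up in exchange for joint training.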
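## 💻 Patterns

### ElevenLabs API (production default)

```python
from elevenlabs import ElevenLabs, VoiceSettings

client = ElevenLabs(api_key="...")
audio = client.text_to_speech.convert(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
    model_id="eleven_v3",
    text="Hello, this is the wiki cleanup batch.",
    voice_settings=VoiceSettings(stability=0.5, similarity_boost=0.75),
)
# convert() returns the audio as an iterator of byte chunks
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```

### OpenAI TTS (GPT-5 era)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream the response to disk instead of buffering the whole file in memory
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Hello world",
    instructions="Speak in a calm, encouraging tone.",
) as resp:
    resp.stream_to_file("out.mp3")
```

### Coqui XTTS-v2 (open-source, voice cloning)

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="복제된 목소리 샘플입니다.",  # Korean: "This is a cloned voice sample."
    speaker_wav="reference_6sec.wav",  # ~6 s reference clip for a zero-shot clone
    language="ko",
    file_path="cloned.wav",
)
```

### Streaming (low-latency agent)

```python
from elevenlabs import ElevenLabs, stream

client = ElevenLabs(api_key="...")
# SDK 2.x names this method stream(); 1.x releases call it convert_as_stream()
audio_stream = client.text_to_speech.stream(
    voice_id="...",
    model_id="eleven_flash_v2_5",  # ~75ms TTFB
    text="streamed reply",
)
stream(audio_stream)  # plays chunks as they arrive (requires mpv)
```

### Real-time (WebSocket, sub-200ms)

```python
import base64, json, os
import websockets

async def speak(voice_id, text_chunks):
    """Send text chunks (e.g. LLM tokens) and yield decoded audio bytes."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input"
    headers = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}
    # websockets >= 14 takes additional_headers; older releases use extra_headers
    async with websockets.connect(uri, additional_headers=headers) as ws:
        settings = {"stability": 0.5, "similarity_boost": 0.75}
        await ws.send(json.dumps({"text": " ", "voice_settings": settings}))  # opening message
        for chunk in text_chunks:
            await ws.send(json.dumps({"text": chunk}))
        await ws.send(json.dumps({"text": ""}))  # empty string ends the input
        async for message in ws:  # audio arrives as base64-encoded chunks
            data = json.loads(message)
            if data.get("audio"):
                yield base64.b64decode(data["audio"])
```

### MLX local TTS (Apple Silicon)

```python
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="local synthesis",
    model_path="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    file_prefix="local",  # output name follows the mlx-audio README (file_prefix + audio_format)
    audio_format="wav",
)
```

### SSML / prosody control

```xml
<!-- Assumed markup: standard SSML prosody and emphasis tags -->
<speak>
  <prosody rate="slow" pitch="high">Slowly, in a high tone.</prosody>
  <emphasis level="strong">This is important.</emphasis>
</speak>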
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Production voice agent | ElevenLabs Flash v2.5 (75ms TTFB) |
| Highest quality narration | ElevenLabs v3 / OpenAI gpt-4o-mini-tts |
| Voice cloning (with legal consent) | XTTS-v2 (open) / ElevenLabs Pro |
| On-device / privacy | Kokoro-82M (MLX) / Piper |
| Multilingual dub | ElevenLabs dubbing API / HeyGen |
| Cost-sensitive batch | OpenAI tts-1 / self-hosted Coqui |

**Default**: ElevenLabs Flash v2.5 (real-time) / v3 (quality batch). For on-device use, Kokoro-82M.

## 🔗 Graph

- Parent: [[Generative-AI]]
- Variant: [[Voice-Cloning]]
- Adjacent: [[ASR]] · [[Whisper]] · [[Multimodal-LLM]]

## 🤖 LLM Usage

**When**: voice agent (LLM → TTS pipeline), dynamic narration, accessibility, localization.

**When not**: critical legal/medical announcements (a human voice is needed for trust), sung performance (use a specialized model).

## ❌ Anti-patterns

- **Voice cloning without consent**: a deepfake, and a legal/ethical violation. Use ElevenLabs Voice Verification.
- **Long-form text in a single API call**: timeouts and cost spikes. Chunk by sentence and stream.
- **No SSML for prosody**: monotone delivery. Use SSML tags such as `<prosody>`.
- **Mixing sample rates**: combining 22 kHz and 44.1 kHz audio without resampling causes pitch/speed errors and distortion. Resample first.
- **Ignoring pronunciation**: proper nouns get mispronounced. Use phoneme overrides or a pronunciation lexicon.

## 🧪 Verification / Duplicates

- Verified (ElevenLabs docs 2026, OpenAI Audio API, Coqui XTTS-v2 paper, Apple MLX-Audio).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: modern TTS landscape (ElevenLabs v3, OpenAI gpt-4o-tts, XTTS-v2, Kokoro) |
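
The "chunk by sentence" advice from the anti-patterns list can be sketched as follows. This is a minimal, standard-library illustration: `chunk_by_sentence` and `max_chars` are hypothetical names, the split rule is a simple punctuation heuristic, and the right chunk size depends on the provider's own request limits.

```python
import re
from typing import Iterator, List

# Heuristic: split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_chars: int = 500) -> Iterator[str]:
    """Yield chunks made of whole sentences, each at most max_chars when possible."""
    buf: List[str] = []
    size = 0
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # Flush the buffer if adding this sentence would exceed the limit.
        if buf and size + len(sentence) + 1 > max_chars:
            yield " ".join(buf)
            buf, size = [], 0
        buf.append(sentence)
        size += len(sentence) + 1
    if buf:
        yield " ".join(buf)


chunks = list(chunk_by_sentence("First sentence. Second one! Third?", max_chars=20))
```

Each chunk can then be sent as its own TTS request and the audio streamed or concatenated in order. A single sentence longer than `max_chars` is emitted as its own oversized chunk rather than being cut mid-sentence, which keeps prosody intact at the cost of a hard size guarantee.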