"매 text → audio waveform — 매 neural acoustic + vocoder pipeline". Speech synthesis (TTS) 는 매 text 를 자연스러운 speech 로 변환 — 매 2026 의 ElevenLabs v3 / OpenAI gpt-4o-tts / Coqui XTTS-v2 가 human-indistinguishable level. 매 voice cloning, emotion control, multi-lingual 의 production-ready.
Key points
Evolution
Concatenative (1990s): splices recorded units from a phoneme database; sounds robotic.
Parametric / HMM (2000s): generates speech from a statistical model; sounds muffled.
Neural (2017+): Tacotron 2 + WaveNet; near-human prosody.
End-to-end (2021+): VITS, NaturalSpeech; a single model from text to waveform.
Foundation TTS (2023+): Voicebox, NaturalSpeech 3, ElevenLabs v3; zero-shot voice cloning with emotion and style control.
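Architecture
Acoustic model + vocoder: the acoustic model maps text → mel-spectrogram (Tacotron, FastSpeech 2), and a vocoder (e.g. WaveNet) maps the mel-spectrogram → waveform.
End-to-end: text → waveform directly (VITS, NaturalSpeech).
LLM-based (2022+): AudioLM-style models treat audio as discrete tokens and generate them with a language model. Voicebox is a related large-scale generative model, but it uses flow matching rather than token prediction.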
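The two-stage contract above can be sketched with toy stand-ins. This is only an illustration of the data flow and the frame arithmetic, not a real model: `toy_acoustic_model`, `toy_vocoder`, and the constants `SAMPLE_RATE`, `HOP_LENGTH`, and `N_MELS` are assumptions chosen for the example (the values are common in TTS setups, but nothing in this note fixes them).

```python
import math
from typing import List

SAMPLE_RATE = 22050  # assumed sample rate in Hz
HOP_LENGTH = 256     # assumed number of waveform samples per mel frame
N_MELS = 80          # assumed number of mel bins per frame

Mel = List[List[float]]  # shape: [frames][N_MELS]


def toy_acoustic_model(text: str, frames_per_char: int = 5) -> Mel:
    """Stand-in for Tacotron / FastSpeech 2: text -> mel-spectrogram frames.

    A real model predicts durations and spectral content; this one only
    produces a zero-filled array of a plausible shape.
    """
    n_frames = len(text) * frames_per_char
    return [[0.0] * N_MELS for _ in range(n_frames)]


def toy_vocoder(mel: Mel) -> List[float]:
    """Stand-in for a neural vocoder: every mel frame becomes HOP_LENGTH samples.

    It ignores the mel content and emits a 220 Hz sine of the right length.
    """
    n_samples = len(mel) * HOP_LENGTH
    return [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(n_samples)]


def synthesize(text: str) -> List[float]:
    """Two-stage pipeline: text -> mel -> waveform."""
    return toy_vocoder(toy_acoustic_model(text))


mel = toy_acoustic_model("hello")
wave_samples = synthesize("hello")
duration_s = len(wave_samples) / SAMPLE_RATE
print(len(mel), len(wave_samples), round(duration_s, 3))
```

An end-to-end model collapses `toy_acoustic_model` and `toy_vocoder` into a single trained network, so the mel-spectrogram never appears as an explicit interface between two separately trained parts.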
Applications
Audiobook / podcast: ElevenLabs Studio.
Voice agent: real-time conversational AI (Vapi, Retell).
Game NPC: dynamic dialog (AI Dungeon, Inworld).
Accessibility: screen readers, dyslexia aids.
Localization: video dubbing (HeyGen, ElevenLabs dubbing).
💻 Patterns
ElevenLabs API (production default)
    from elevenlabs import ElevenLabs

    client = ElevenLabs(api_key="...")
    audio = client.text_to_speech.convert(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_v3",
        text="안녕하세요, 매 wiki cleanup batch.",
        voice_settings={"stability": 0.5, "similarity_boost": 0.75},
    )
    # convert() yields the audio as byte chunks
    with open("out.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)
OpenAI TTS (GPT-5 era)
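    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Stream the response body to disk instead of buffering it in memory
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input="Hello world",
        instructions="Speak in a calm, encouraging tone.",
    ) as resp:
        resp.stream_to_file("out.mp3")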
Coqui XTTS-v2 (open-source, voice cloning)
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
    tts.tts_to_file(
        text="복제된 voice 의 sample",
        speaker_wav="reference_6sec.wav",  # ~6 s reference clip for zero-shot cloning
        language="ko",
        file_path="cloned.wav",
    )
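Since the clone above depends on a short reference clip, it can help to check the clip length before loading the model. The following is a minimal sketch using only the standard library. The helper names `wav_duration_seconds` and `reference_ok` and the 3 to 30 second bounds are assumptions for illustration, not limits documented by Coqui.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_ok(path: str, min_s: float = 3.0, max_s: float = 30.0) -> bool:
    """Hypothetical sanity check: is the reference clip within assumed bounds?"""
    return min_s <= wav_duration_seconds(path) <= max_s


# Demo: write 6 s of 16-bit mono silence at 16 kHz, then measure it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference_6sec.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000 * 6)

demo_duration = wav_duration_seconds(demo_path)
print(demo_duration, reference_ok(demo_path))
```

The `wave` module only reads uncompressed PCM WAV files, so an MP3 or FLAC reference would need converting first.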
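ElevenLabs WebSocket input streaming (LLM tokens → speech)
    import asyncio
    import json

    import websockets

    KEY = "..."  # ElevenLabs API key

    async def speak(vid):
        uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{vid}/stream-input"
        # websockets < 14 takes extra_headers; newer releases call it additional_headers
        async with websockets.connect(uri, extra_headers={"xi-api-key": KEY}) as ws:
            # First message: a single space plus the voice settings
            # (same values as the REST example above)
            settings = {"stability": 0.5, "similarity_boost": 0.75}
            await ws.send(json.dumps({"text": " ", "voice_settings": settings}))
            for chunk in llm_stream():  # token-by-token from the LLM; llm_stream is not defined here
                await ws.send(json.dumps({"text": chunk}))
            await ws.send(json.dumps({"text": ""}))  # empty string ends the input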
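Sending one message per LLM token works, but very small fragments may give the synthesizer little context for prosody. One option is to buffer tokens and flush at punctuation. The sketch below is a hypothetical helper, not part of any SDK: the name `phrase_chunks`, the `BOUNDARIES` set, and the 120-character cap are assumptions for illustration.

```python
from typing import Iterable, Iterator

# Characters that are treated as safe places to flush the buffer (assumed set)
BOUNDARIES = (".", "!", "?", ",", ";", ":", "\n")


def phrase_chunks(tokens: Iterable[str], max_chars: int = 120) -> Iterator[str]:
    """Group streamed tokens into phrase-sized chunks.

    A chunk is emitted when the buffer ends with a boundary character
    (ignoring trailing spaces) or reaches max_chars. Concatenating the
    chunks reproduces the original token stream exactly.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip(" ").endswith(BOUNDARIES) or len(buf) >= max_chars:
            yield buf
            buf = ""
    if buf:  # flush whatever is left when the stream ends
        yield buf


tokens = ["Hel", "lo", ",", " wor", "ld", ".", " Bye"]
chunks = list(phrase_chunks(tokens))
print(chunks)
```

In a streaming loop, the per-token send would iterate over `phrase_chunks(...)` of the token stream instead of the raw tokens. The trade-off is latency: larger chunks delay the first audio, so the cap is worth tuning per application.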