---
id: ai-voice-cloning-synthesis
title: Voice Cloning / Synthesis — ElevenLabs / OpenAI / Self-host
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, voice, tts, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [voice cloning, TTS, ElevenLabs, OpenAI TTS, Coqui, Bark, Piper, instant clone]
---

# Voice Cloning / Synthesis

> Text → human-like speech. **ElevenLabs (SOTA), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-host: Coqui / Bark / Piper**. A 30-second sample is enough for a clone (mind the ethics).

## 📖 Key concepts

- TTS: Text-to-Speech.
- Voice clone: a short sample → a personal voice.
- Latency: real-time conversation needs < 500 ms.
- Streaming: audio is produced while the text is still arriving.

## 💻 Code patterns

### ElevenLabs (best quality)

```ts
import { ElevenLabsClient } from 'elevenlabs';

const apiKey = process.env.ELEVENLABS_API_KEY;
const client = new ElevenLabsClient({ apiKey });

// Option casing (modelId vs. model_id) and some method names differ between
// SDK major versions; check them against the version you install.
const audio = await client.textToSpeech.convert('voice-id', {
  text: 'Hello world',
  modelId: 'eleven_turbo_v2_5',
  outputFormat: 'mp3_44100_128',
});

// `audio` is an AsyncIterable of binary chunks
const chunks: Buffer[] = [];
for await (const chunk of audio) chunks.push(Buffer.from(chunk));
const mp3 = Buffer.concat(chunks);
```

### Streaming (real-time)

```ts
const stream = await client.textToSpeech.convertAsStream('voice-id', {
  text: longText,
  modelId: 'eleven_flash_v2_5', // fastest
});

// Pipe to a speaker. `speaker` stands for any writable audio sink that
// accepts the chosen output format (placeholder, not defined here).
for await (const chunk of stream) {
  speaker.write(chunk);
}
```

### Voice clone (instant)

```ts
import fs from 'node:fs';

const voice = await client.voices.add({
  name: 'Alice',
  files: [fs.createReadStream('alice-sample.mp3')], // 30s+
  description: 'Alice voice clone',
});

// Use the cloned voice
const audio = await client.textToSpeech.convert(voice.voiceId, {
  text: 'Hi, this is Alice.',
});
```

### Voice design (text → voice)

```ts
// Conceptual call: confirm the exact method name in the SDK reference before use.
const voice = await client.voices.design({
  description: 'A young energetic female voice with British accent',
  text: 'Sample text to test',
});
```

→ Description only, no sample needed.

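### Sentence chunking for streaming (sketch)

The streaming concept above (audio starts while text is still arriving) usually needs the incoming text cut at sentence boundaries before each piece is handed to a TTS call. The following is a minimal sketch under that assumption. `SentenceChunker`, `push`, and `flush` are hypothetical names for illustration and are not part of any SDK mentioned here.

```typescript
// Hypothetical helper: buffers streamed text fragments (e.g. LLM tokens)
// and releases complete sentences as soon as they are available.
class SentenceChunker {
  private buffer = '';

  // Feed one fragment; returns the sentences completed by it (possibly none).
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next fragment.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call once when the text stream ends to release whatever is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest ? [rest] : [];
  }
}
```

Each returned sentence can be sent to a streaming TTS call right away, so the first audio does not wait for the full text. The boundary rule is deliberately naive: abbreviations such as "Dr. Smith" will split early, so a production version would likely need a smarter segmenter.
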
### OpenAI TTS (cheap)

```ts
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const r = await openai.audio.speech.create({
  model: 'tts-1-hd', // or tts-1
  voice: 'alloy', // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral
  input: text,
  response_format: 'mp3',
  speed: 1.0,
});
const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);
```

→ Preset voices only (the comment above lists nine). Fast and cheap. No cloning.

### gpt-4o-mini-tts (instructions, 2025+)

```ts
const r = await openai.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Welcome!',
  instructions: 'Speak in a cheerful and professional tone',
});
```

→ Instruction-following voice. Gives a small amount of control over delivery.

### Cartesia (fast, low-latency)

```ts
import { CartesiaClient } from '@cartesia/cartesia-js';

// Conceptual shape: check option and event names against the current SDK.
const cartesia = new CartesiaClient({ apiKey });
const ws = await cartesia.tts.websocket({
  containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
});

await ws.send({
  modelId: 'sonic-2',
  voice: { mode: 'id', id: 'voice-id' },
  transcript: 'Streaming text',
});

ws.onMessage((msg) => {
  if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
});
```

→ 75 ms latency. Suited to real-time agents.

### PlayHT

```ts
const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'X-User-ID': userId,
    'Content-Type': 'application/json', // the body is JSON
  },
  body: JSON.stringify({
    text,
    voice: 'voice-id',
    output_format: 'mp3',
    voice_engine: 'PlayHT2.0-turbo',
  }),
});

// Stream the response body chunk by chunk
for await (const chunk of r.body!) {
  speaker.write(chunk);
}
```

### Self-host: Coqui XTTS

```python
from TTS.api import TTS

tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')
tts.tts_to_file(
    text='Hello',
    speaker_wav='alice.wav',  # voice clone (6s+)
    language='en',
    file_path='out.wav',
)
```

→ Self-hosted. Needs a GPU.

### Self-host: Piper (fast CPU)

```bash
echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav
```

→ ONNX-based. Runs fine on CPU.

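To call the Piper CLI from a TS backend, one option is to spawn the process and write the text to its stdin, mirroring the shell pipeline. This is only a sketch: it assumes `piper` is on the `PATH`, uses just the `--model` and `--output_file` flags shown in the shell example, and `piperArgs` / `synthesizeWithPiper` are hypothetical helper names.

```typescript
import { spawn } from 'node:child_process';

// Builds the same CLI arguments as the shell example.
function piperArgs(modelPath: string, outFile: string): string[] {
  return ['--model', modelPath, '--output_file', outFile];
}

// Runs piper with `text` on stdin; resolves once the process exits cleanly.
function synthesizeWithPiper(text: string, modelPath: string, outFile: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const child = spawn('piper', piperArgs(modelPath, outFile));
    child.stdout.resume(); // drain output so the pipe never fills up
    child.stderr.resume();
    child.on('error', reject); // e.g. piper is not installed
    child.on('close', (code) => {
      if (code === 0) resolve();
      else reject(new Error(`piper exited with code ${code}`));
    });
    // Collapse newlines so the whole text is sent as a single input line.
    child.stdin.end(text.replace(/\s+/g, ' ').trim() + '\n');
  });
}
```

A call such as `await synthesizeWithPiper('Hello', 'en_US-lessac-medium.onnx', 'out.wav')` would then stand in for the shell one-liner.
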
### Bark (Suno)

```python
from bark import generate_audio, preload_models

preload_models()
audio = generate_audio('Hello, [laughs] this is Bark!')
```

→ Supports non-speech cues (laughs, sighs, music).

### Voice agent (real-time conversation)

```ts
// User audio → STT → LLM → TTS → reply audio
const stt = await whisper.transcribe(userAudio); // ~500ms
const reply = await llm.complete(stt); // ~500ms
const audio = await tts.stream(reply); // 75ms first chunk
// Total: ~1075 ms until the first audio
```

→ Latency is what matters. Stream at every stage: STT, LLM, and TTS.

### OpenAI Realtime API (all-in-one)

```ts
import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'OpenAI-Beta': 'realtime=v1', // header used with the preview (beta) API
  },
});

ws.on('open', () => {
  // Events go out as JSON strings, and only after the socket is open
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
  }));
  // User audio
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm }));
});

// Response audio (streamed automatically)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    speaker.write(Buffer.from(ev.delta, 'base64'));
  }
});
```

→ STT + LLM + TTS in one model. Lowest latency. → [[AI_Voice_Agent_Realtime]].

### Cost (approximate)

```
ElevenLabs:   $0.30 / 1K char (turbo)
OpenAI TTS:   $15 / 1M char (tts-1), $30 / 1M char (tts-1-hd)
Cartesia:     $0.20 / 1K char
PlayHT:       $0.30 / 1K char
Self-host:    GPU cost only
→ High traffic = self-host.
```

### Browser TTS (free, low quality)

```ts
const utt = new SpeechSynthesisUtterance('Hello');
// getVoices() can be empty until the `voiceschanged` event has fired
utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
speechSynthesis.speak(utt);
```

→ Uses the OS voices. Free, but the quality is low.

### Audio formats

```
MP3:      universal, small
Opus:     modern, smallest
PCM:      raw, real-time friendly
WAV:      uncompressed, large
M4A/AAC:  iOS friendly
→ Streaming = PCM / Opus. Storage = MP3 / Opus.
```

### Use cases

```
✅ Voice agent / chatbot
✅ Audiobook
✅ Accessibility
✅ Game NPC
✅ IVR (phone)
✅ Notification audio
✅ Podcast (auto-generation)
```

### Voice clone: ethics / legal

```
- User consent is required
- Voice rights of authors / actors
- Misuse (deepfake, fraud)
- Watermarking (some services)
ElevenLabs: automatic watermark + abuse detection.
```

→ Consent from the company / artist is required.

### Multi-language

```
ElevenLabs:  32 lang
OpenAI TTS:  11 lang (English is best)
Coqui XTTS:  17 lang
```

### SSML (Speech Synthesis Markup Language)

```xml
<!-- Illustrative markup: emphasis and a slower speaking rate -->
<speak>
  Hello, <emphasis level="strong">important</emphasis> news.
  <prosody rate="slow">Speaking slowly</prosody>
</speak>
```

→ Supported by only some services (Google, Azure).

### Voice activity detection (VAD)

```ts
// Detect when the user has finished speaking
import { MicVAD } from '@ricky0123/vad-web';

const vad = await MicVAD.new({
  onSpeechEnd: (audio) => {
    sendToSTT(audio); // `sendToSTT` is a placeholder for your STT call
  },
});
vad.start();
```

→ Silero / WebRTC VAD.

### Subtitle / caption (alongside TTS)

```ts
// ElevenLabs returns character-level alignment
const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });
// r.alignment = { characters, character_start_times_seconds, character_end_times_seconds }
// (REST field names; an SDK may expose them in camelCase)
```

→ Karaoke-style subtitles.

### Evaluation

```ts
// Subjective (listener ratings, aggregated as MOS, Mean Opinion Score):
// 1. Naturalness (1-5)
// 2. Clarity
// 3. Emotion accuracy
// 4. Pronunciation
// 5. Speed
// Objective:
//   Word Error Rate (transcribe the audio back and compare with the input text)
```

### Privacy

```
- A user's voice is sensitive data
- External API = the data leaves your infrastructure
- Self-host = stronger privacy
- Consider anonymization
```

## 🤔 Decision criteria

| Use | Recommendation |
|---|---|
| Best quality + clone | ElevenLabs |
| Cheap + general | OpenAI TTS |
| Real-time agent | Cartesia / OpenAI Realtime |
| Self-host | Coqui XTTS / Piper |
| Browser only | speechSynthesis |
| Multi-language | ElevenLabs |
| Game / interactive | Bark / ElevenLabs |

## ❌ Anti-patterns

- **Voice clone without consent**: ethical / legal risk.
- **Real-time + slow API**: frustrates users. Use streaming.
- **Best model everywhere**: cost. Mix models by use case.
- **No cache (re-synthesizing the same text every time)**: cost.
- **Large audio files (WAV)**: bandwidth. Use Opus / MP3.
- **Long audio without subtitles**: hurts a11y / SEO.
- **No watermark**: deepfake risk.

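The missing-cache anti-pattern above can be avoided by keying synthesized audio on voice, model, and text. The sketch below is an in-memory version under stated assumptions: `TtsCache` and the `Synthesize` signature are hypothetical, and the wrapped function stands in for whichever provider call is used. A real deployment would more likely persist the audio in object storage or Redis.

```typescript
// Hypothetical signature for the TTS call being wrapped.
type Synthesize = (voiceId: string, modelId: string, text: string) => Promise<Uint8Array>;

class TtsCache {
  private store = new Map<string, Promise<Uint8Array>>();
  private synthesize: Synthesize;

  constructor(synthesize: Synthesize) {
    this.synthesize = synthesize;
  }

  // Returns cached audio for (voice, model, text), synthesizing only on a miss.
  get(voiceId: string, modelId: string, text: string): Promise<Uint8Array> {
    const key = JSON.stringify([voiceId, modelId, text]);
    let entry = this.store.get(key);
    if (!entry) {
      entry = this.synthesize(voiceId, modelId, text);
      this.store.set(key, entry);
      // Forget failed requests so a later call can retry.
      entry.catch(() => this.store.delete(key));
    }
    return entry;
  }
}
```

Storing the promise rather than the finished buffer means concurrent requests for the same text share a single upstream call instead of each paying for synthesis.
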
## 🤖 LLM usage hints

- ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
- Real-time = streaming + low-latency model.
- Self-host = Coqui / Piper.
- Consent + watermark + abuse detection.

## 🔗 Related documents

- [[AI_Voice_Agent_Realtime]]
- [[AI_Multimodal_Vision_Patterns]]
- [[AI_LLM_Cost_Optimization]]