---
id: ai-voice-cloning-synthesis
title: Voice Cloning / Synthesis — ElevenLabs / OpenAI / Self-host
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, voice, tts, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [voice cloning, TTS, ElevenLabs, OpenAI TTS, Coqui, Bark, Piper, instant clone]
---

# Voice Cloning / Synthesis

> Text → human-like speech. **ElevenLabs (SOTA quality), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-host: Coqui / Bark / Piper**. A ~30-second sample is enough to clone a voice, so handle the ethics with care.

## 📖 Core concepts
- TTS: Text-to-Speech.
- Voice clone: short sample → personal voice.
- Latency: real-time conversation needs < 500ms to first audio.
- Streaming: audio is produced while the text is still arriving.
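
The latency target above can be checked with simple arithmetic before any provider is wired in. The following is a minimal sketch, not a measurement tool: the stage names and millisecond figures are hypothetical placeholders, and `checkFirstAudioBudget` is an illustrative helper introduced here.

```ts
interface Stage {
  name: string;
  ms: number; // estimated time this stage adds before the first audio chunk
}

interface BudgetReport {
  totalMs: number;
  withinBudget: boolean;
  slowest: string | null;
}

// Sums per-stage estimates and compares them with the real-time budget (500 ms by default).
function checkFirstAudioBudget(stages: Stage[], budgetMs = 500): BudgetReport {
  let totalMs = 0;
  let slowest: Stage | null = null;
  for (const stage of stages) {
    totalMs += stage.ms;
    if (slowest === null || stage.ms > slowest.ms) slowest = stage;
  }
  return {
    totalMs,
    withinBudget: totalMs <= budgetMs,
    slowest: slowest ? slowest.name : null,
  };
}

// Hypothetical estimates for a fully streamed pipeline.
const report = checkFirstAudioBudget([
  { name: 'stt', ms: 200 },
  { name: 'llm-first-token', ms: 200 },
  { name: 'tts-first-chunk', ms: 75 },
]);
console.log(report);
```

The `slowest` field points at the stage worth optimizing first when the budget is exceeded.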

## 💻 Code patterns

### ElevenLabs (best quality)
```ts
import { ElevenLabsClient } from 'elevenlabs';

// Option casing (modelId vs. model_id) differs between SDK releases; check the installed version
const apiKey = process.env.ELEVENLABS_API_KEY;
const client = new ElevenLabsClient({ apiKey });

const audio = await client.textToSpeech.convert('voice-id', {
  text: 'Hello world',
  modelId: 'eleven_turbo_v2_5',
  outputFormat: 'mp3_44100_128',
});

// audio is an async-iterable stream of binary chunks
const chunks: Buffer[] = [];
for await (const chunk of audio) chunks.push(Buffer.from(chunk));
const mp3 = Buffer.concat(chunks);
```

### Streaming (real-time)
```ts
const stream = await client.textToSpeech.convertAsStream('voice-id', {
  text: longText,
  modelId: 'eleven_flash_v2_5',  // fastest
});

// Pipe to speaker
for await (const chunk of stream) {
  speaker.write(chunk);
}
```

### Voice clone (instant)
```ts
import fs from 'node:fs';

const voice = await client.voices.add({
  name: 'Alice',
  files: [fs.createReadStream('alice-sample.mp3')],  // 30s+ sample
  description: 'Alice voice clone',
});

// Usage
const audio = await client.textToSpeech.convert(voice.voiceId, {
  text: 'Hi, this is Alice.',
});
```

### Voice design (text → voice)
```ts
const voice = await client.voices.design({
  description: 'A young energetic female voice with British accent',
  text: 'Sample text to test',
});
```

→ Description only; no sample needed.

### OpenAI TTS (cheap)
```ts
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

const r = await openai.audio.speech.create({
  model: 'tts-1-hd',  // or tts-1
  voice: 'alloy',  // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral
  input: text,
  response_format: 'mp3',
  speed: 1.0,
});

const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);
```

→ Preset voices only (listed above). Fast and cheap. No cloning.

### gpt-4o-mini-tts (instructions, 2025+)
```ts
const r = await openai.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Welcome!',
  instructions: 'Speak in a cheerful and professional tone',
});
```

→ Instruction-following voice. Lightweight control over tone and style.

### Cartesia (fast, low-latency)
```ts
import { CartesiaClient } from '@cartesia/cartesia-js';

const cartesia = new CartesiaClient({ apiKey });

const ws = await cartesia.tts.websocket({
  containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
});

await ws.send({
  modelId: 'sonic-2',
  voice: { mode: 'id', id: 'voice-id' },
  transcript: 'Streaming text',
});

ws.onMessage((msg) => {
  if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
});
```

→ ~75 ms latency. Suited to real-time agents.

### PlayHT
```ts
const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'X-User-ID': userId,
    'Content-Type': 'application/json',  // required for a JSON body
  },
  body: JSON.stringify({
    text,
    voice: 'voice-id',
    output_format: 'mp3',
    voice_engine: 'PlayHT2.0-turbo',
  }),
});

// Stream
for await (const chunk of r.body!) {
  speaker.write(chunk);
}
```

### Self-host — Coqui XTTS
```python
from TTS.api import TTS

tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')

tts.tts_to_file(
    text='Hello',
    speaker_wav='alice.wav',  # voice clone (6s+)
    language='en',
    file_path='out.wav',
)
```

→ Self-hosted. Requires a GPU.

### Self-host — Piper (fast CPU)
```bash
echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav
```

→ ONNX-based. Runs fine on CPU.

### Bark (Suno)
```python
from bark import generate_audio, preload_models

preload_models()
audio = generate_audio('Hello, [laughs] this is Bark!')
```

→ Supports expressive cues (laughs, sighs, music).

### Voice agent (real-time conversation)
```ts
// user audio → STT → LLM → TTS → response audio

const stt = await whisper.transcribe(userAudio);  // ~500ms
const reply = await llm.complete(stt);             // ~500ms
const audio = await tts.stream(reply);             // ~75ms to first chunk
// Total: ~1075 ms to first audio
```

→ Latency is the key constraint. Stream every stage: STT, LLM, and TTS.
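
One common way to stream between the LLM and TTS stages is to cut the LLM's token stream into sentences and hand each finished sentence to TTS immediately, instead of waiting for the full reply. The sketch below is a minimal illustration; `SentenceChunker` is a hypothetical helper, and its naive punctuation rule will also split on abbreviations such as "Dr. Smith".

```ts
// Buffers streamed tokens and emits complete sentences as soon as they are available.
class SentenceChunker {
  private buffer = '';

  // Add a token; returns any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // A sentence ends at '.', '!' or '?' followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(consumed, end).trim());
      consumed = end;
    }
    this.buffer = this.buffer.slice(consumed);
    return sentences;
  }

  // Call once the LLM stream ends to release the trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const emitted: string[] = [];
for (const token of ['Hello wor', 'ld. How are', ' you?']) {
  emitted.push(...chunker.push(token)); // each emitted sentence would go to TTS here
}
emitted.push(...chunker.flush());
console.log(emitted);
```

With this shape, TTS for the first sentence can start while the LLM is still generating the second.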

### OpenAI Realtime API (all-in-one)
```ts
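import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  // The OpenAI-Beta header was required while the Realtime API was in beta
  headers: { Authorization: `Bearer ${apiKey}`, 'OpenAI-Beta': 'realtime=v1' },
});

ws.on('open', () => {
  // ws.send() takes a string or binary payload, so events must be serialized
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
  }));

  // User audio (base64-encoded PCM)
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm }));
});

// Response audio (streamed automatically)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    speaker.write(Buffer.from(ev.delta, 'base64'));
  }
});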
```

→ STT + LLM + TTS in a single model. Lowest latency.

→ [[AI_Voice_Agent_Realtime]].

### Cost (approximate)
```
ElevenLabs:  $0.30  / 1K char (turbo)
OpenAI TTS:  $0.015 / 1K char (tts-1); $0.030 / 1K char (tts-1-hd)
Cartesia:    $0.20  / 1K char
PlayHT:      $0.30  / 1K char
Self-host:   GPU cost only

→ High traffic = self-host.
```
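
The per-character rates above translate into a monthly bill with one multiplication, and the self-host break-even point with one division. A minimal sketch follows; the function names are illustrative, the rates are the approximate figures from the table, and the $500/month GPU cost is a hypothetical placeholder.

```ts
// Cost of synthesizing `chars` characters at a given USD rate per 1,000 characters.
function estimateCostUsd(chars: number, usdPer1kChars: number): number {
  return (chars / 1000) * usdPer1kChars;
}

// Monthly character volume above which a fixed self-hosting cost beats a per-character API rate.
function breakEvenChars(selfHostUsdPerMonth: number, usdPer1kChars: number): number {
  return (selfHostUsdPerMonth / usdPer1kChars) * 1000;
}

const monthlyChars = 1_000_000;
const at030 = estimateCostUsd(monthlyChars, 0.30);   // the $0.30 / 1K char rate
const at0015 = estimateCostUsd(monthlyChars, 0.015); // $15 / 1M char expressed per 1K
const threshold = breakEvenChars(500, 0.30);         // hypothetical $500/month GPU
console.log({ at030, at0015, threshold });
```

The break-even figure ignores engineering and operations time, so treat it as a lower bound on the volume at which self-hosting pays off.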

### Browser TTS (free, low quality)
```ts
const utt = new SpeechSynthesisUtterance('Hello');
// getVoices() can return [] until the 'voiceschanged' event fires
utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
speechSynthesis.speak(utt);
```

→ Uses the OS voices. Free, but lower quality.

### Audio formats
```
MP3:     Universal, small
Opus:    Modern, smallest
PCM:     Raw, real-time friendly
WAV:     Uncompressed, large
M4A/AAC: iOS friendly

→ Streaming = PCM / Opus.
   Storage = MP3 / Opus.
```
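
Raw PCM from a streaming API has no header, so most players cannot open it directly. Wrapping it in the standard 44-byte RIFF/WAVE header is enough to make it playable. The sketch below assumes little-endian integer PCM (for example `pcm_s16le`); `pcmToWav` is an illustrative helper, not part of any SDK.

```ts
// Prepends a 44-byte RIFF/WAVE header to raw little-endian integer PCM.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate: number,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // remaining file size after this field
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = integer PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true);     // PCM payload size in bytes
  out.set(pcm, 44);
  return out;
}

// One second of 44.1 kHz mono 16-bit silence.
const silence = new Uint8Array(44100 * 2);
const wav = pcmToWav(silence, 44100);
console.log(wav.length);
```

The `byteRate` line also shows why WAV is large: 44.1 kHz mono 16-bit audio costs 88,200 bytes per second before any compression.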

### Use cases
```
✅ Voice agent / chatbot
✅ Audiobook
✅ Accessibility
✅ Game NPC
✅ IVR (phone)
✅ Notification audio
✅ Podcast (auto-generation)
```

### Voice clone — ethics / legal
```
- User consent is mandatory
- Voice rights of authors / actors
- Misuse (deepfake, fraud)
- Watermarking (some services)

ElevenLabs: automatic watermark + abuse detection.
```

→ Company / artist consent is mandatory.

### Multi-language
```
ElevenLabs: 32 lang
OpenAI TTS: 11 lang (English is best)
Coqui XTTS: 17 lang
```

### SSML (Speech Synthesis Markup Language)
```xml
<speak>
  Hello, <break time="500ms"/> 
  <emphasis level="strong">important</emphasis> news.
  <prosody rate="slow">Speaking slowly</prosody>
</speak>
```

→ Supported by only some services (Google, Azure).

### Voice activity detection (VAD)
```ts
// Detect when the user stops speaking
import { MicVAD } from '@ricky0123/vad-web';

const vad = await MicVAD.new({
  onSpeechEnd: (audio) => {
    // audio: Float32Array of 16 kHz samples
    sendToSTT(audio);
  },
});

vad.start();
```

→ Silero / WebRTC VAD.

### Subtitle / caption (alongside TTS)
```ts
// ElevenLabs returns alignment
const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });

// r.alignment = { characters, character_start_times_seconds, character_end_times_seconds }
// (REST field names; SDK casing may differ by version)
```

→ Karaoke-style subtitle.
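
Character-level alignment is usually too fine for captions, so a typical first step is to group characters into words with start and end times. The sketch below takes the three parallel arrays as plain parameters rather than assuming exact response field names; `wordsFromAlignment` and `WordTiming` are illustrative names.

```ts
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Groups per-character timings into per-word timings, splitting on whitespace.
function wordsFromAlignment(
  characters: string[],
  startTimes: number[],
  endTimes: number[],
): WordTiming[] {
  const words: WordTiming[] = [];
  let current = '';
  let start = 0;
  let end = 0;
  for (let i = 0; i < characters.length; i++) {
    const ch = characters[i];
    if (/\s/.test(ch)) {
      // Whitespace closes the word in progress, if any.
      if (current) words.push({ word: current, start, end });
      current = '';
      continue;
    }
    if (!current) start = startTimes[i]; // first character of a new word
    current += ch;
    end = endTimes[i];
  }
  if (current) words.push({ word: current, start, end });
  return words;
}

const timings = wordsFromAlignment(
  ['H', 'i', ' ', 'y', 'o'],
  [0, 0.1, 0.2, 0.3, 0.4],
  [0.1, 0.2, 0.3, 0.4, 0.5],
);
console.log(timings);
```

Each `WordTiming` can then be highlighted while the playback position lies between `start` and `end`.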

### Evaluation
```ts
// Subjective (human listeners):
// 1. Naturalness (1-5)
// 2. Clarity
// 3. Emotion accuracy
// 4. Pronunciation
// 5. Speed
// MOS (Mean Opinion Score) = mean of the 1-5 listener ratings

// Objective:
// Word Error Rate (transcribe the audio back with STT, compare to the input text)
```
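
The Word Error Rate check can be sketched as a word-level edit distance between the text sent to TTS and the transcript that an STT model produces from the resulting audio. This is a minimal sketch with simple normalization (lowercasing, stripping common punctuation); `wordErrorRate` is an illustrative helper, and the STT step itself is assumed to happen elsewhere.

```ts
// WER = (substitutions + deletions + insertions) / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const normalize = (s: string): string[] =>
    s.toLowerCase().replace(/[.,!?;:"()]/g, '').split(/\s+/).filter(Boolean);
  const ref = normalize(reference);
  const hyp = normalize(hypothesis);
  // WER is undefined for an empty reference; report Infinity unless both are empty.
  if (ref.length === 0) return hyp.length === 0 ? 0 : Infinity;

  // Word-level Levenshtein distance, keeping only the previous row.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = cur[j - 1] + 1;
      cur[j] = Math.min(substitution, deletion, insertion);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution (sat → sit) plus one deletion (the) over six reference words.
const wer = wordErrorRate('The cat sat on the mat.', 'the cat sit on mat');
console.log(wer);
```

A rising WER on a fixed set of test sentences is a cheap regression signal when switching voices or models.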

### Privacy
```
- A user's voice is sensitive data
- External APIs mean the data leaves your infrastructure
- Self-hosting gives stronger privacy
- Consider anonymization
```

## 🤔 Decision criteria
| Use case | Recommendation |
|---|---|
| Best quality + clone | ElevenLabs |
| Cheap + general | OpenAI TTS |
| Real-time agent | Cartesia / OpenAI Realtime |
| Self-host | Coqui XTTS / Piper |
| Browser only | speechSynthesis |
| Multi-language | ElevenLabs |
| Game / interactive | Bark / ElevenLabs |

## ❌ Anti-patterns
- **Voice cloning without consent**: ethical and legal exposure.
- **Real-time use with a slow API**: users get frustrated. Stream instead.
- **Best model everywhere**: costly. Mix tiers.
- **No cache (re-synthesizing the same text every time)**: wasted spend.
- **Large audio files (WAV)**: wastes bandwidth. Use Opus / MP3.
- **Long audio without subtitles**: hurts a11y / SEO.
- **No watermark**: deepfake risk.
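
For the caching item in the list above, one possible approach is to key stored audio on a hash of everything that affects the output. The sketch below is a minimal in-memory version for Node; `TtsRequest`, `ttsCacheKey`, and `TtsCache` are illustrative names, and collapsing whitespace before hashing assumes whitespace does not change the synthesized audio for your provider.

```ts
import { createHash } from 'node:crypto';

interface TtsRequest {
  provider: string;
  model: string;
  voiceId: string;
  format: string;
  text: string;
}

// Stable key: identical provider/model/voice/format/text always map to the same hash.
function ttsCacheKey(req: TtsRequest): string {
  const text = req.text.trim().replace(/\s+/g, ' '); // assumption: whitespace is not significant
  const payload = JSON.stringify([req.provider, req.model, req.voiceId, req.format, text]);
  return createHash('sha256').update(payload).digest('hex');
}

// In-memory cache; a real deployment would more likely use object storage or a CDN.
class TtsCache {
  private store = new Map<string, Uint8Array>();

  async getOrSynthesize(
    req: TtsRequest,
    synthesize: (req: TtsRequest) => Promise<Uint8Array>,
  ): Promise<Uint8Array> {
    const key = ttsCacheKey(req);
    const hit = this.store.get(key);
    if (hit) return hit; // no API call, no cost
    const audio = await synthesize(req);
    this.store.set(key, audio);
    return audio;
  }
}

const base: TtsRequest = {
  provider: 'example',
  model: 'example-model',
  voiceId: 'voice-id',
  format: 'mp3',
  text: 'Welcome back!',
};
console.log(ttsCacheKey(base));
```

Including the model and format in the key avoids serving stale audio after a model upgrade or a format change.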

## 🤖 LLM usage hints
- ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
- Real-time = streaming + low-latency model.
- Self-host = Coqui / Piper.
- Consent + watermark + abuse detection.

## 🔗 Related documents
- [[AI_Voice_Agent_Realtime]]
- [[AI_Multimodal_Vision_Patterns]]
- [[AI_LLM_Cost_Optimization]]