2nd/10_Wiki/Topics/Coding/AI_Voice_Cloning_Synthesis.md
2026-05-09 22:47:42 +09:00


id: ai-voice-cloning-synthesis
title: Voice Cloning / Synthesis: ElevenLabs / OpenAI / Self-host
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, voice, tts, vibe-coding
tech_stack: language = TS / Python; applicable_to = Backend
applied_in:
aliases: voice cloning, TTS, ElevenLabs, OpenAI TTS, Coqui, Bark, Piper, instant clone

Voice Cloning / Synthesis

Text → human-like speech. ElevenLabs (state of the art), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-hosted options: Coqui / Bark / Piper. A 30-second sample is enough to clone a voice, so handle the ethics with care.

📖 Core concepts

  • TTS: Text-to-Speech.
  • Voice clone: a short sample → a personal voice.
  • Latency: real-time conversation needs < 500 ms.
  • Streaming: audio is generated while the text is still arriving.

💻 Code patterns

ElevenLabs (best quality)

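import { ElevenLabsClient } from 'elevenlabs';

// Assumes the key is exported as ELEVENLABS_API_KEY.
const apiKey = process.env.ELEVENLABS_API_KEY;
const client = new ElevenLabsClient({ apiKey });

// Option names are camelCase here; some SDK releases expect snake_case
// (model_id, output_format), so check the installed version.
const audio = await client.textToSpeech.convert('voice-id', {
  text: 'Hello world',
  modelId: 'eleven_turbo_v2_5',
  outputFormat: 'mp3_44100_128',
});

// The result is an async-iterable stream of binary chunks; collect it into one MP3 buffer.
const chunks: Buffer[] = [];
for await (const chunk of audio) chunks.push(Buffer.from(chunk));
const mp3 = Buffer.concat(chunks);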

Streaming (real-time)

const stream = await client.textToSpeech.convertAsStream('voice-id', {
  text: longText,
  modelId: 'eleven_flash_v2_5',  // fastest model
});

// Pipe to a writable audio sink (`speaker` is a placeholder)
for await (const chunk of stream) {
  speaker.write(chunk);
}

Voice clone (instant)

import fs from 'node:fs';

const voice = await client.voices.add({
  name: 'Alice',
  files: [fs.createReadStream('alice-sample.mp3')],  // 30 s or more of speech
  description: 'Alice voice clone',
});

// Use the cloned voice
const audio = await client.textToSpeech.convert(voice.voiceId, {
  text: 'Hi, this is Alice.',
});

Voice design (text → voice)

const voice = await client.voices.design({
  description: 'A young energetic female voice with British accent',
  text: 'Sample text to test',
});

→ Description only, no audio sample needed.

OpenAI TTS (cheap)

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

const r = await openai.audio.speech.create({
  model: 'tts-1-hd',  // or tts-1
  voice: 'alloy',  // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral
  input: text,
  response_format: 'mp3',
  speed: 1.0,
});

const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);

→ Preset voices only (the nine listed above). Fast and cheap. No cloning.

gpt-4o-mini-tts (instructions, 2025+)

const r = await openai.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Welcome!',
  instructions: 'Speak in a cheerful and professional tone',
});

→ Instruction-following voice. Light control over tone and style.

Cartesia (fast, low-latency)

import { CartesiaClient } from '@cartesia/cartesia-js';

const cartesia = new CartesiaClient({ apiKey });

const ws = await cartesia.tts.websocket({
  containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
});

await ws.send({
  modelId: 'sonic-2',
  voice: { mode: 'id', id: 'voice-id' },
  transcript: 'Streaming text',
});

ws.onMessage((msg) => {
  if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
});

→ ~75 ms latency. Suited to real-time agents.

PlayHT

const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'X-User-ID': userId,
    'Content-Type': 'application/json',  // the body is JSON
  },
  body: JSON.stringify({
    text,
    voice: 'voice-id',
    output_format: 'mp3',
    voice_engine: 'PlayHT2.0-turbo',
  }),
});

// Stream
for await (const chunk of r.body!) {
  speaker.write(chunk);
}

Self-host — Coqui XTTS

from TTS.api import TTS

tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')

tts.tts_to_file(
    text='Hello',
    speaker_wav='alice.wav',  # voice clone (6s+)
    language='en',
    file_path='out.wav',
)

→ Self-hosted. Needs a GPU.

Self-host — Piper (fast CPU)

echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav

→ ONNX-based. Runs fine on CPU.

Bark (Suno)

from bark import generate_audio, preload_models

preload_models()
audio = generate_audio('Hello, [laughs] this is Bark!')

→ Can render non-speech cues (laughs, sighs, music).

Voice agent (real-time conversation)

// User audio → STT → LLM → TTS → reply audio
// (`whisper`, `llm`, and `tts` are placeholder clients)

const transcript = await whisper.transcribe(userAudio);  // ~500 ms
const reply = await llm.complete(transcript);            // ~500 ms
const audio = await tts.stream(reply);                   // ~75 ms to first chunk
// Total: ~1075 ms to first audio

→ Latency is what matters. Stream every stage: STT, LLM, and TTS.
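
One way to stream between the LLM and TTS stages is to forward each sentence as soon as it is complete, instead of waiting for the whole reply. The sketch below is a minimal illustration of that idea; `SentenceChunker` is a hypothetical helper (not part of any SDK), and the sentence boundary rule is deliberately naive.

```typescript
// Hypothetical helper: buffers streamed LLM text and releases whole sentences,
// so TTS can start speaking before the LLM has finished its reply.
class SentenceChunker {
  private buffer = '';

  // Feed one streamed fragment; returns any sentences completed by it.
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    // Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    // Abbreviations such as "Dr. Smith" will be split too early.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call once the LLM stream ends to release the trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each string returned by `push` can be sent to a streaming TTS call right away, so the first audio depends on the first sentence rather than the full reply.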

OpenAI Realtime API (all-in-one)

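import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'OpenAI-Beta': 'realtime=v1',  // expected by the preview API
  },
});

// Events are JSON text frames: wait for the socket to open and serialize before sending.
ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
  }));

  // User audio (base64-encoded PCM)
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm }));
});

// Reply audio (streamed automatically)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    speaker.write(Buffer.from(ev.delta, 'base64'));
  }
});

→ STT + LLM + TTS in a single model. Lowest latency of the options here.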

See also: AI_Voice_Agent_Realtime.

Cost (approximate)

ElevenLabs:  $0.30 / 1K char (turbo)
OpenAI TTS:  $15 / 1M char (tts-1), $30 / 1M char (tts-1-hd)
Cartesia:    $0.20 / 1K char
PlayHT:      $0.30 / 1K char
Self-host:   GPU cost only

→ High traffic = self-host.
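
To see when self-hosting starts to pay off, the rough prices above can be turned into a small estimator. This is a sketch only: the rates are the approximate figures from this note (not current price lists), the provider keys are made up for illustration, and the self-hosting bill is assumed to be a fixed monthly amount.

```typescript
// Price expressed as USD per N characters, mirroring the rough table above.
interface Rate {
  usd: number;
  perChars: number;
}

// Approximate figures copied from this note; check current pricing before relying on them.
const RATES: Record<string, Rate> = {
  elevenlabs: { usd: 0.3, perChars: 1000 },
  openai: { usd: 15, perChars: 1000000 },
  cartesia: { usd: 0.2, perChars: 1000 },
  playht: { usd: 0.3, perChars: 1000 },
};

// Estimated monthly API cost in USD for a given character volume.
function monthlyCost(provider: string, charsPerMonth: number): number {
  const rate = RATES[provider];
  if (rate === undefined) throw new Error(`unknown provider: ${provider}`);
  return (charsPerMonth / rate.perChars) * rate.usd;
}

// Character volume above which a fixed self-hosting bill is cheaper than the API.
function breakEvenChars(provider: string, selfHostUsdPerMonth: number): number {
  const rate = RATES[provider];
  if (rate === undefined) throw new Error(`unknown provider: ${provider}`);
  return (selfHostUsdPerMonth / rate.usd) * rate.perChars;
}
```

For example, `breakEvenChars('openai', 300)` gives the monthly character count at which a hypothetical $300 GPU bill matches the API spend.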

Browser TTS (free, low quality)

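const utt = new SpeechSynthesisUtterance('Hello');
// getVoices() can be empty until the 'voiceschanged' event fires; null falls back to the default voice.
utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
speechSynthesis.speak(utt);

→ Uses the OS voices. Free, but low quality.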

Audio formats

MP3:     universal, small
Opus:    modern, smallest
PCM:     raw, real-time friendly
WAV:     uncompressed, large
M4A/AAC: iOS friendly

→ Streaming = PCM / Opus.
   Storage = MP3 / Opus.
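
Raw PCM from a streaming API has no header, so most players cannot open it until it is wrapped in a container. The sketch below shows one way to prepend a standard 44-byte WAV header, assuming 16-bit little-endian samples (the `pcm_s16le` encoding used in the Cartesia snippet); `pcm16ToWav` is a hypothetical helper name.

```typescript
// Wraps raw little-endian 16-bit PCM in a 44-byte RIFF/WAVE header.
function pcm16ToWav(pcm: Uint8Array, sampleRate: number, channels: number): Uint8Array {
  const bytesPerSample = 2;
  const blockAlign = channels * bytesPerSample; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;     // bytes per second of audio
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true);     // size of the sample data
  out.set(pcm, 44);
  return out;
}
```

The byte rate also shows why WAV is large: one second of 44.1 kHz mono 16-bit audio is 88,200 bytes before any compression.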

Use cases

✅ Voice agent / chatbot
✅ Audiobook
✅ Accessibility
✅ Game NPC
✅ IVR (phone)
✅ Notification audio
✅ Podcast (auto-generation)

Ethics

- User consent is mandatory
- Voice rights of authors / actors
- Misuse (deepfakes, fraud)
- Watermarking (offered by some services)

ElevenLabs: automatic watermarking + abuse detection.

→ Consent from the company / artist is required.

Multi-language

ElevenLabs: 32 lang
OpenAI TTS: 11 lang (English is strongest)
Coqui XTTS: 17 lang

SSML (Speech Synthesis Markup Language)

<speak>
  Hello, <break time="500ms"/> 
  <emphasis level="strong">important</emphasis> news.
  <prosody rate="slow">Speaking slowly</prosody>
</speak>

→ Supported by only some services (Google, Azure).
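
When SSML is assembled from user-supplied text, the text has to be XML-escaped or a stray `&` or `<` will break the markup. A minimal sketch, using only the `<speak>` and `<break>` tags shown above; `escapeXml` and `toSsml` are hypothetical helper names.

```typescript
// Escapes the five XML special characters so user text cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Joins sentences with a fixed pause between them and wraps the result in <speak>.
function toSsml(sentences: string[], pauseMs: number): string {
  const body = sentences.map(escapeXml).join(`<break time="${pauseMs}ms"/>`);
  return `<speak>${body}</speak>`;
}
```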

Voice activity detection (VAD)

// Detect when the user stops speaking
import { MicVAD } from '@ricky0123/vad-web';

const vad = await MicVAD.new({
  onSpeechEnd: (audio) => {
    // audio holds the samples of the finished utterance
    sendToSTT(audio);
  },
});

vad.start();

→ Silero / WebRTC VAD.

Subtitle / caption (alongside TTS)

// ElevenLabs returns alignment
const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });

// r.alignment = { characters, character_start_times_seconds, character_end_times_seconds }
// (REST field names; the SDK may expose camelCase equivalents)

→ Karaoke-style subtitle.
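
Character-level timestamps are usually too fine for captions, so a common step is to merge them into word-level timings. The sketch below assumes the alignment has already been copied into three parallel arrays; the `CharAlignment` shape and its field names are illustrative, not the SDK's own types.

```typescript
// Character-level alignment as three parallel arrays (field names are illustrative).
interface CharAlignment {
  characters: string[];
  startTimes: number[]; // seconds
  endTimes: number[];   // seconds
}

interface WordTiming {
  word: string;
  start: number;
  end: number;
}

// Groups consecutive non-whitespace characters into words with start/end times.
function toWordTimings(a: CharAlignment): WordTiming[] {
  const words: WordTiming[] = [];
  let current: WordTiming | null = null;
  for (let i = 0; i < a.characters.length; i++) {
    const ch = a.characters[i];
    if (/\s/.test(ch)) {
      // Whitespace closes the word in progress, if any.
      if (current !== null) words.push(current);
      current = null;
    } else if (current === null) {
      current = { word: ch, start: a.startTimes[i], end: a.endTimes[i] };
    } else {
      current.word += ch;
      current.end = a.endTimes[i];
    }
  }
  if (current !== null) words.push(current);
  return words;
}
```

A player can then highlight the word whose `start` and `end` bracket the current playback time.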

Evaluation

// Subjective:
// 1. Naturalness (1-5)
// 2. Clarity
// 3. Emotion accuracy
// 4. Pronunciation
// 5. Speed

// Objective:
// MOS score (Mean Opinion Score)
// Word Error Rate (transcribe back)
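
The "transcribe back" check can be scored with a word-level edit distance: synthesize the text, run the audio through STT, and compare the transcript with the original. A minimal sketch of the scoring step only, assuming whitespace-separated languages and a simple punctuation filter; `tokenize` and `wordErrorRate` are hypothetical helper names.

```typescript
// Lowercases, strips common punctuation, and splits on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"]/g, '')
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word error rate = word-level edit distance / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev holds the previous row of the edit-distance table.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from the transcript
      const insertion = curr[j - 1] + 1; // extra word in the transcript
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

A rate near 0 suggests the synthesized speech is intelligible to the STT model; it says nothing about naturalness or emotion, which still need the subjective ratings.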

Privacy

- User voice = sensitive data
- External API = data leaves your system
- Self-host = stronger privacy
- Consider anonymization

🤔 Decision criteria

Use case:              Recommendation
Best quality + clone:  ElevenLabs
Cheap + general:       OpenAI TTS
Real-time agent:       Cartesia / OpenAI Realtime
Self-host:             Coqui XTTS / Piper
Browser only:          speechSynthesis
Multi-language:        ElevenLabs
Game / interactive:    Bark / ElevenLabs

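Anti-patterns

  • Voice clone without consent: ethical / legal risk.
  • Real-time with a slow API: users get frustrated. Use streaming.
  • Best model everywhere: costly. Mix tiers.
  • No cache (re-synthesizing the same text every time): wasted spend.
  • Large audio files (WAV): wasted bandwidth. Use Opus / MP3.
  • Long audio without subtitles: hurts a11y / SEO.
  • No watermark: deepfake risk.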
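
The missing-cache anti-pattern can be addressed with a thin wrapper around whichever TTS call is in use. The sketch below is a minimal in-memory version under stated assumptions: `TtsRequest`, `Synthesize`, and `withCache` are hypothetical names, and a production setup would more likely persist audio to disk or object storage.

```typescript
interface TtsRequest {
  voiceId: string;
  modelId: string;
  text: string;
}

type Synthesize = (req: TtsRequest) => Promise<Uint8Array>;

// Wraps any synthesize function with an in-memory cache keyed by voice, model, and text.
function withCache(synthesize: Synthesize): Synthesize {
  const cache = new Map<string, Promise<Uint8Array>>();
  return (req) => {
    const key = JSON.stringify([req.voiceId, req.modelId, req.text]);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    // Store the promise itself so concurrent identical requests share one API call.
    const pending = synthesize(req);
    cache.set(key, pending);
    // Drop failed requests so a later call can retry.
    pending.catch(() => cache.delete(key));
    return pending;
  };
}
```

Fixed prompts such as greetings, IVR menus, and notification sounds benefit most, since their text never changes between calls.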

🤖 LLM usage hints

  • ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
  • Real-time = streaming + low-latency model.
  • Self-host = Coqui / Piper.
  • Consent + watermark + abuse detection.

🔗 Related documents