| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-voice-cloning-synthesis | Voice Cloning / Synthesis: ElevenLabs / OpenAI / Self-host | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Voice Cloning / Synthesis
Text → human-like speech. ElevenLabs (state of the art), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-host: Coqui / Bark / Piper. A 30-second sample is enough to clone a voice (mind the ethics).
📖 Key concepts
- TTS: Text-to-Speech.
- Voice clone: a short sample → a personal voice.
- Latency: real-time conversation needs < 500 ms.
- Streaming: audio is produced while the text is still arriving.
💻 Code patterns
ElevenLabs (best quality)
import { ElevenLabsClient } from 'elevenlabs';
const client = new ElevenLabsClient({ apiKey });
const audio = await client.textToSpeech.convert('voice-id', {
text: 'Hello world',
modelId: 'eleven_turbo_v2_5',
outputFormat: 'mp3_44100_128',
});
// audio = AsyncIterable<Buffer>
const chunks: Buffer[] = [];
for await (const chunk of audio) chunks.push(chunk);
const mp3 = Buffer.concat(chunks);
Streaming (real-time)
const stream = await client.textToSpeech.convertAsStream('voice-id', {
text: longText,
modelId: 'eleven_flash_v2_5', // fastest model
});
// Pipe to speaker
for await (const chunk of stream) {
speaker.write(chunk);
}
Voice clone (instant)
const voice = await client.voices.add({
name: 'Alice',
files: [fs.createReadStream('alice-sample.mp3')], // 30s+
description: 'Alice voice clone',
});
// Use the cloned voice
const audio = await client.textToSpeech.convert(voice.voiceId, {
text: 'Hi, this is Alice.',
});
Voice design (text → voice)
const voice = await client.voices.design({
description: 'A young energetic female voice with British accent',
text: 'Sample text to test',
});
→ Description only, no sample needed.
OpenAI TTS (cheap)
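import fs from 'node:fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const r = await openai.audio.speech.create({
model: 'tts-1-hd', // or tts-1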
voice: 'alloy', // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral
input: text,
response_format: 'mp3',
speed: 1.0,
});
const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);
→ Preset voices only. Fast + cheap. No cloning.
gpt-4o-mini-tts (instructions, 2025+)
const r = await openai.audio.speech.create({
model: 'gpt-4o-mini-tts',
voice: 'coral',
input: 'Welcome!',
instructions: 'Speak in a cheerful and professional tone',
});
→ Instruction-following voice. A small degree of control over tone.
Cartesia (fast, low-latency)
import { CartesiaClient } from '@cartesia/cartesia-js';
const cartesia = new CartesiaClient({ apiKey });
const ws = await cartesia.tts.websocket({
containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
});
await ws.send({
modelId: 'sonic-2',
voice: { mode: 'id', id: 'voice-id' },
transcript: 'Streaming text',
});
ws.onMessage((msg) => {
if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
});
→ 75ms latency. Real-time agent.
PlayHT
const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
Authorization: `Bearer ${apiKey}`,
'X-User-ID': userId,
'Content-Type': 'application/json', // the body below is JSON
},
body: JSON.stringify({
text,
voice: 'voice-id',
output_format: 'mp3',
voice_engine: 'PlayHT2.0-turbo',
}),
});
// Stream
for await (const chunk of r.body!) {
speaker.write(chunk);
}
Self-host — Coqui XTTS
from TTS.api import TTS
tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')
tts.tts_to_file(
text='Hello',
speaker_wav='alice.wav', # voice clone (6s+)
language='en',
file_path='out.wav',
)
→ Self-hosted. Needs a GPU.
Self-host — Piper (fast CPU)
echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav
→ ONNX-based. Runs fine on CPU.
Bark (Suno)
from bark import generate_audio, preload_models
preload_models()
audio = generate_audio('Hello, [laughs] this is Bark!')
→ Can render expressive cues (laughs, sighs, music).
Voice agent (real-time conversation)
// User audio → STT → LLM → TTS → reply audio
// (whisper, llm, and tts stand in for whichever clients you use)
const transcript = await whisper.transcribe(userAudio); // ~500ms
const reply = await llm.complete(transcript); // ~500ms
const audio = await tts.stream(reply); // ~75ms to first chunk
// Total: ~1075 ms until the first audio
→ Latency is what matters. Stream every stage: STT, LLM, and TTS.
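One way to overlap the LLM and TTS stages is to cut the LLM token stream into sentences and hand each sentence to TTS as soon as it is complete, instead of waiting for the full reply. The sketch below is a minimal illustration under assumptions: tokens arrive as plain strings, a sentence ends at `.`, `!`, or `?` followed by whitespace, and `SentenceChunker` / `chunkTokens` are hypothetical helpers, not part of any SDK mentioned here.

```typescript
// Hypothetical helper: buffers streamed LLM tokens and emits complete sentences.
class SentenceChunker {
  private buffer = '';

  // Feed one token; returns the sentences that this token completed (often none).
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Simplified boundary rule: ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the LLM stream ends to get whatever text is left over.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest.length > 0 ? rest : null;
  }
}

// Convenience wrapper for a finished token list (useful for testing).
function chunkTokens(tokens: string[]): string[] {
  const chunker = new SentenceChunker();
  const out: string[] = [];
  for (const t of tokens) out.push(...chunker.push(t));
  const rest = chunker.flush();
  if (rest !== null) out.push(rest);
  return out;
}
```

In a real agent each sentence returned by `push` would be sent to the TTS stream immediately, so the first audio can start while the LLM is still generating.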
OpenAI Realtime API (all-in-one)
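import WebSocket from 'ws';
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'OpenAI-Beta': 'realtime=v1', // beta header expected with the preview models
  },
});
// ws.send() needs a string or Buffer, so serialize each event to JSON
const send = (event: object) => ws.send(JSON.stringify(event));
ws.on('open', () => {
  // Configure the session once the socket is open
  send({
    type: 'session.update',
    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
  });
  // User audio (base64-encoded PCM)
  send({ type: 'input_audio_buffer.append', audio: base64Pcm });
});
// Reply audio (streamed automatically)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    speaker.write(Buffer.from(ev.delta, 'base64'));
  }
});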
→ STT + LLM + TTS in a single model. Lowest latency.
Cost (approximate)
ElevenLabs: $0.30 / 1K char (turbo)
OpenAI TTS: $15 / 1M char (tts-1), $30 / 1M char (tts-1-hd)
Cartesia: $0.20 / 1K char
PlayHT: $0.30 / 1K char
Self-host: GPU cost only
→ High traffic = self-host.
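To see where self-hosting starts to pay off, the rough figures above can be turned into a small estimator. This is a sketch only: the prices are the approximate numbers from the list (normalized to USD per 1K characters), the GPU bill is a hypothetical input you supply, and `monthlyCostUsd` / `breakEvenChars` are illustrative names.

```typescript
// Approximate prices from the list above, in USD per 1K characters.
const PRICE_PER_1K_CHARS: Record<string, number> = {
  elevenlabs: 0.3,
  openai: 0.015, // $15 per 1M characters
  cartesia: 0.2,
  playht: 0.3,
};

// Looks up a provider's price and fails loudly on an unknown name.
function priceFor(provider: string): number {
  const price = PRICE_PER_1K_CHARS[provider];
  if (price === undefined) throw new Error(`Unknown provider: ${provider}`);
  return price;
}

// Estimated monthly API bill for a given character volume.
function monthlyCostUsd(provider: string, charsPerMonth: number): number {
  return (charsPerMonth / 1000) * priceFor(provider);
}

// Character volume at which a fixed monthly GPU bill equals the API bill.
// gpuUsdPerMonth is an assumed input, not a figure from this note.
function breakEvenChars(provider: string, gpuUsdPerMonth: number): number {
  return (gpuUsdPerMonth / priceFor(provider)) * 1000;
}
```

For example, with an assumed $300/month GPU, `breakEvenChars('elevenlabs', 300)` comes out at about one million characters per month; below that volume the API is cheaper on these numbers.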
Browser TTS (free, low quality)
const utt = new SpeechSynthesisUtterance('Hello');
// getVoices() can be empty until the 'voiceschanged' event has fired
utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
speechSynthesis.speak(utt);
→ Uses the OS voices. Free, but lower quality.
Audio formats
MP3: universal, small
Opus: modern, smallest
PCM: raw, real-time friendly
WAV: uncompressed, large
M4A/AAC: iOS friendly
→ Streaming = PCM / Opus.
Storage = MP3 / Opus.
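Raw PCM from a streaming API has no header, so most players cannot open it directly. Wrapping it in a WAV container is enough when you want to save a stream for debugging. The sketch below assumes signed 16-bit little-endian samples (the `pcm_s16le` encoding); `pcmToWav` is an illustrative helper name.

```typescript
// Wraps raw PCM (signed 16-bit little-endian) in a standard 44-byte WAV header.
function pcmToWav(pcm: Uint8Array, sampleRate: number, channels: number): Uint8Array {
  const bytesPerSample = 2; // 16-bit samples
  const blockAlign = channels * bytesPerSample;
  const header = new DataView(new ArrayBuffer(44));
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) header.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  header.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  header.setUint32(16, 16, true); // size of the fmt chunk
  header.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  header.setUint16(22, channels, true);
  header.setUint32(24, sampleRate, true);
  header.setUint32(28, sampleRate * blockAlign, true); // bytes per second
  header.setUint16(32, blockAlign, true);
  header.setUint16(34, 16, true); // bits per sample
  writeAscii(36, 'data');
  header.setUint32(40, pcm.length, true); // size of the sample data

  const wav = new Uint8Array(44 + pcm.length);
  wav.set(new Uint8Array(header.buffer), 0);
  wav.set(pcm, 44);
  return wav;
}
```

The byte-rate field also shows why WAV is large: 44.1 kHz mono 16-bit audio is 88,200 bytes per second, versus about 16,000 bytes per second for 128 kbps MP3.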
Use cases
✅ Voice agent / chatbot
✅ Audiobook
✅ Accessibility
✅ Game NPC
✅ IVR (phone)
✅ Notification audio
✅ Podcast (auto-generation)
Voice clone — ethics / legal
- User consent is required
- Voice rights of authors / actors
- Misuse (deepfake, fraud)
- Watermarking (some services)
ElevenLabs: automatic watermark + abuse detection.
→ Company / artist consent is required.
Multi-language
ElevenLabs: 32 lang
OpenAI TTS: 50+ lang (English is best)
Coqui XTTS: 17 lang
SSML (Speech Synthesis Markup Language)
<speak>
Hello, <break time="500ms"/>
<emphasis level="strong">important</emphasis> news.
<prosody rate="slow">Speaking slowly</prosody>
</speak>
→ Supported by only some services (Google, Azure).
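When SSML is assembled from user-supplied text, the text has to be XML-escaped or a stray `&` or `<` will break the markup. A minimal builder sketch follows; `escapeXml`, `SsmlPart`, and `buildSsml` are hypothetical names, and only the `break` and `emphasis` elements from the sample above are covered.

```typescript
// Escapes the five XML special characters so user text cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// The pieces this sketch knows how to render.
type SsmlPart =
  | { kind: 'text'; text: string }
  | { kind: 'break'; ms: number }
  | { kind: 'emphasis'; text: string; level: 'strong' | 'moderate' | 'reduced' };

// Renders the parts in order and wraps them in a <speak> root element.
function buildSsml(parts: SsmlPart[]): string {
  const body = parts
    .map((p): string => {
      switch (p.kind) {
        case 'text':
          return escapeXml(p.text);
        case 'break':
          return `<break time="${p.ms}ms"/>`;
        case 'emphasis':
          return `<emphasis level="${p.level}">${escapeXml(p.text)}</emphasis>`;
      }
    })
    .join('');
  return `<speak>${body}</speak>`;
}
```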
Voice activity detection (VAD)
// Detect when the user stops speaking
import { MicVAD } from '@ricky0123/vad-web';
const vad = await MicVAD.new({
onSpeechEnd: (audio) => {
sendToSTT(audio);
},
});
vad.start();
→ Silero / WebRTC VAD.
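To make the end-of-speech idea concrete, here is a deliberately naive energy-threshold detector. It is not Silero or WebRTC VAD (both are far more robust to noise); it only illustrates the "speech, then N quiet frames" logic. The threshold and hangover values are assumptions you would tune, and `EndOfSpeechDetector` is a hypothetical class.

```typescript
// Root-mean-square energy of one audio frame (samples in the range [-1, 1]).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

// Naive detector: speech has ended after `hangoverFrames` consecutive quiet
// frames that follow at least one loud frame.
class EndOfSpeechDetector {
  private readonly threshold: number;
  private readonly hangoverFrames: number;
  private speaking = false;
  private silentFrames = 0;

  constructor(threshold: number, hangoverFrames: number) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Returns true exactly once, on the frame where speech is judged to have ended.
  push(frame: Float32Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.speaking = true;
      this.silentFrames = 0;
      return false;
    }
    if (!this.speaking) return false; // silence before any speech is ignored
    this.silentFrames += 1;
    if (this.silentFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.silentFrames = 0;
      return true;
    }
    return false;
  }
}
```

The hangover count is what keeps a short pause inside a sentence from being treated as the end of the turn.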
Subtitle / caption (alongside TTS)
// ElevenLabs returns alignment
const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });
// r.alignment = { characters, character_start_times, character_end_times }
→ Karaoke-style subtitle.
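Character-level alignment is usually too fine for display, so a common step is to group it into word timings. The sketch below takes the three alignment arrays as plain parameters (characters, per-character start times, per-character end times) rather than assuming the exact response field names. `wordsFromAlignment` and `WordTiming` are hypothetical names.

```typescript
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

// Groups per-character timings into per-word timings, splitting on whitespace.
function wordsFromAlignment(chars: string[], starts: number[], ends: number[]): WordTiming[] {
  const words: WordTiming[] = [];
  let current = '';
  let wordStart = 0;
  let wordEnd = 0;
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (/\s/.test(ch)) {
      // Whitespace closes the word in progress, if there is one.
      if (current !== '') words.push({ word: current, start: wordStart, end: wordEnd });
      current = '';
      continue;
    }
    if (current === '') wordStart = starts[i]; // first character of a new word
    current += ch;
    wordEnd = ends[i];
  }
  if (current !== '') words.push({ word: current, start: wordStart, end: wordEnd });
  return words;
}
```

A player can then highlight the word whose `start` and `end` bracket the current playback time.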
Evaluation
// Subjective:
// 1. Naturalness (1-5)
// 2. Clarity
// 3. Emotion accuracy
// 4. Pronunciation
// 5. Speed
// Objective:
// MOS score (Mean Opinion Score)
// Word Error Rate (transcribe back)
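The "transcribe back" check can be automated: run the generated audio through an STT model, then compare the transcript with the input text using Word Error Rate. A minimal sketch of the metric itself follows, assuming whitespace tokenization and case-insensitive matching (real evaluations usually also normalize punctuation and numbers). `wordErrorRate` is an illustrative name.

```typescript
// Word Error Rate = word-level edit distance / number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // Levenshtein distance over words, keeping only the previous row.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from the hypothesis
      const insertion = curr[j - 1] + 1; // extra word in the hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Here `reference` is the text sent to TTS and `hypothesis` is the STT transcript of the resulting audio; lower is better.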
Privacy
- A user's voice is sensitive data
- External API = data leaves your infrastructure
- Self-host = stronger privacy
- Consider anonymization
🤔 Decision criteria
| Use case | Recommendation |
|---|---|
| Best quality + clone | ElevenLabs |
| Cheap + general | OpenAI TTS |
| Real-time agent | Cartesia / OpenAI Realtime |
| Self-host | Coqui XTTS / Piper |
| Browser only | speechSynthesis |
| Multi-language | ElevenLabs |
| Game / interactive | Bark / ElevenLabs |
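❌ Anti-patterns
- Voice clone without consent: ethical / legal risk.
- Real-time with a slow API: frustrates users. Stream instead.
- Best model everywhere: costly. Mix models by use case.
- No cache (synthesizing the same text every time): wasted cost.
- Large audio files (WAV): wastes bandwidth. Use Opus / MP3.
- Long audio without subtitles: hurts a11y / SEO.
- No watermark: deepfake risk.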
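For the caching point, a small in-memory wrapper is often enough to avoid paying twice for the same prompt. This is a sketch under assumptions: `Synthesize` is a hypothetical signature standing in for whichever TTS call you use, and a production cache would likely live on disk or in object storage with an eviction policy.

```typescript
// Hypothetical signature for any TTS call that returns encoded audio bytes.
type Synthesize = (text: string, voice: string, model: string) => Promise<Uint8Array>;

// Wraps a synthesize function so identical requests share one result.
function withCache(synthesize: Synthesize): Synthesize {
  const cache = new Map<string, Promise<Uint8Array>>();
  return (text, voice, model) => {
    // Voice and model are part of the key: the same text sounds different per voice.
    const key = JSON.stringify([model, voice, text]);
    let hit = cache.get(key);
    if (hit === undefined) {
      hit = synthesize(text, voice, model);
      // Do not keep failed requests in the cache, so a retry can succeed.
      hit.catch(() => cache.delete(key));
      cache.set(key, hit);
    }
    return hit;
  };
}
```

Caching the promise rather than the finished bytes means two concurrent requests for the same text trigger only one API call.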
🤖 LLM usage hints
- ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
- Real-time = streaming + low-latency model.
- Self-host = Coqui / Piper.
- Consent + watermark + abuse detection.