---
id: ai-voice-cloning-synthesis
title: Voice Cloning / Synthesis - ElevenLabs / OpenAI / Self-host
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, voice, tts, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [voice cloning, TTS, ElevenLabs, OpenAI TTS, Coqui, Bark, Piper, instant clone]
---

# Voice Cloning / Synthesis

> Text → human-like speech. **ElevenLabs (SOTA quality), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-host: Coqui / Bark / Piper.** A sample of about 30 seconds is enough to clone a voice (handle with ethical care).

## 📖 Core Concepts
- TTS: Text-to-Speech.
- Voice clone: a short sample → a personal voice.
- Latency: real-time conversation needs < 500 ms.
- Streaming: audio is generated while the text is still arriving.

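Streaming synthesis usually means cutting the incoming text into speakable units before each one goes to the TTS engine. The following is a minimal sketch, assuming sentence-sized chunks are acceptable; `SentenceChunker` is a hypothetical helper and not part of any SDK mentioned in this note.

```typescript
// Hypothetical helper: buffers streamed text (for example LLM tokens) and
// emits complete sentences so synthesis can begin before the reply is done.
class SentenceChunker {
  private buffer = "";

  // Append a text fragment and return any sentences it completed.
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Requiring the
    // whitespace avoids splitting inside numbers such as 3.14.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(start, end).trim());
      start = boundary.lastIndex; // skip the whitespace after the mark
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the text stream ends; returns the unterminated remainder.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Each returned sentence can be sent to a streaming TTS call right away. A sentence that ends exactly at the end of a fragment is held back until more text or `flush()` arrives, which trades a little latency for fewer wrong splits.
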
## 💻 Code Patterns

### ElevenLabs (best quality)
```ts
import { ElevenLabsClient } from 'elevenlabs';

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

// Option names follow this note. Casing (modelId vs model_id) depends on
// the SDK version, so check the one you have installed.
const audio = await client.textToSpeech.convert('voice-id', {
  text: 'Hello world',
  modelId: 'eleven_turbo_v2_5',
  outputFormat: 'mp3_44100_128',
});

// The response is a stream of binary chunks; collect them into one buffer.
const chunks: Buffer[] = [];
for await (const chunk of audio) chunks.push(Buffer.from(chunk));
const mp3 = Buffer.concat(chunks);
```

### Streaming (real-time)
```ts
// Assumes `client` from the previous example and `speaker`, a writable
// playback sink that accepts the requested output format.
const stream = await client.textToSpeech.convertAsStream('voice-id', {
  text: longText,
  modelId: 'eleven_flash_v2_5', // lowest-latency model
});

// Write each chunk to the sink as soon as it arrives.
for await (const chunk of stream) {
  speaker.write(chunk);
}
```

### Voice clone (instant)
```ts
import fs from 'node:fs';

// Instant clone from a short sample (30 s or more, per this note).
const voice = await client.voices.add({
  name: 'Alice',
  files: [fs.createReadStream('alice-sample.mp3')],
  description: 'Alice voice clone',
});

// Use the new voice. The ID field may be voice_id in older SDK versions.
const audio = await client.textToSpeech.convert(voice.voiceId, {
  text: 'Hi, this is Alice.',
});
```

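### Voice design (text → voice)
```ts
// Method name as given in this note. Recent SDK versions appear to expose
// voice design under a separate text-to-voice namespace, so verify the
// exact call against the SDK reference before relying on it.
const voice = await client.voices.design({
  description: 'A young energetic female voice with British accent',
  text: 'Sample text to test',
});
```

→ A description is enough; no audio sample is needed.
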
### OpenAI TTS (cheap)
```ts
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const r = await openai.audio.speech.create({
  model: 'tts-1-hd', // or tts-1
  voice: 'alloy', // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral
  input: text,
  response_format: 'mp3',
  speed: 1.0,
});

const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);
```

→ A fixed set of built-in voices (the original six plus later additions such as ash, sage, and coral). Fast and cheap. No voice cloning.

### gpt-4o-mini-tts (instructions, 2025+)
```ts
const r = await openai.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Welcome!',
  instructions: 'Speak in a cheerful and professional tone',
});
```

→ Instruction-following voice: lightweight control over tone and style.

### Cartesia (fast, low-latency)
```ts
import { CartesiaClient } from '@cartesia/cartesia-js';

const cartesia = new CartesiaClient({ apiKey: process.env.CARTESIA_API_KEY });

// Option and event names follow this note; they have changed between SDK
// versions, so verify them against the installed release.
const ws = await cartesia.tts.websocket({
  containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
});

await ws.send({
  modelId: 'sonic-2',
  voice: { mode: 'id', id: 'voice-id' },
  transcript: 'Streaming text',
});

// `speaker` is a placeholder for a PCM playback sink.
ws.onMessage((msg) => {
  if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
});
```

→ About 75 ms to first audio (figure as quoted in this note). Suited to real-time agents.

### PlayHT
```ts
const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'X-User-ID': userId,
    'Content-Type': 'application/json', // the body below is JSON
  },
  body: JSON.stringify({
    text,
    voice: 'voice-id',
    output_format: 'mp3',
    voice_engine: 'PlayHT2.0-turbo',
  }),
});

// Stream the response body chunk by chunk (Node 18+).
// `speaker` must be able to play MP3, or decode before writing.
for await (const chunk of r.body!) {
  speaker.write(chunk);
}
```

### Self-host: Coqui XTTS
```python
from TTS.api import TTS

# Load the multilingual XTTS v2 model onto the GPU.
tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')

tts.tts_to_file(
    text='Hello',
    speaker_wav='alice.wav',  # reference sample for cloning (6 s or more)
    language='en',
    file_path='out.wav',
)
```

→ Self-hosted. Needs a GPU.

### Self-host: Piper (fast CPU)
```bash
echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav
```

→ ONNX-based. Runs fine on CPU.

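### Bark (Suno)
```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads and caches model weights on first run
audio = generate_audio('Hello, [laughs] this is Bark!')
write_wav('out.wav', SAMPLE_RATE, audio)  # audio is a NumPy array of samples
```

→ Supports expressive cues in the prompt (laughs, sighs, music).
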
### Voice agent (real-time conversation)
```ts
// User audio → STT → LLM → TTS → reply audio
// (whisper, llm, and tts are placeholder clients)

const transcript = await whisper.transcribe(userAudio); // ~500 ms
const reply = await llm.complete(transcript);           // ~500 ms
const audio = await tts.stream(reply);                  // ~75 ms to first chunk
// Total: ~1075 ms to first audio
```

→ Latency is what matters. Stream every stage: STT, LLM, and TTS.

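The latency arithmetic for this pipeline can be sketched as a small model. This is a simplified illustration, not a measurement: it assumes each stage can be described by a total time and a time to first output, and the 150 ms LLM first-output figure and the TTS total are assumed values chosen only for the example.

```typescript
// Timing of one pipeline stage, in milliseconds.
interface StageLatency {
  name: string;
  totalMs: number;       // time until the stage has fully finished
  firstOutputMs: number; // time until the stage emits its first usable output
}

// Sequential pipeline: each stage waits for the previous one to finish.
// For the last stage only the first audio chunk matters.
function timeToFirstAudioSequential(stages: StageLatency[]): number {
  return stages.reduce((sum, stage, i) => {
    const isLast = i === stages.length - 1;
    return sum + (isLast ? stage.firstOutputMs : stage.totalMs);
  }, 0);
}

// Streamed pipeline: each stage starts as soon as the previous one emits
// its first output, so only the first-output times add up.
function timeToFirstAudioStreamed(stages: StageLatency[]): number {
  return stages.reduce((sum, stage) => sum + stage.firstOutputMs, 0);
}

const pipeline: StageLatency[] = [
  { name: "stt", totalMs: 500, firstOutputMs: 500 },
  { name: "llm", totalMs: 500, firstOutputMs: 150 }, // assumed time to first sentence
  { name: "tts", totalMs: 1000, firstOutputMs: 75 }, // totalMs is an assumed value
];

const sequentialMs = timeToFirstAudioSequential(pipeline); // 500 + 500 + 75
const streamedMs = timeToFirstAudioStreamed(pipeline);     // 500 + 150 + 75
```

Under these assumptions, streaming the LLM output into TTS cuts time to first audio from 1075 ms to 725 ms; streaming STT as well would reduce the first term further.
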
### OpenAI Realtime API (all-in-one)
```ts
import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v1', // expected by the preview API
  },
});

// Send only after the socket is open; payloads must be JSON strings.
ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
  }));

  // User audio (base64-encoded PCM)
  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm }));
});

// Reply audio (streamed automatically)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    speaker.write(Buffer.from(ev.delta, 'base64'));
  }
});
```

→ STT + LLM + TTS in a single model. Lowest latency of the options here.

→ See [[AI_Voice_Agent_Realtime]].

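The `input_audio_buffer.append` event carries base64-encoded PCM, while browser and many capture APIs deliver float samples. A minimal conversion sketch follows, assuming the session uses 16-bit little-endian PCM; `floatTo16BitPcm` is a hypothetical helper name, and the expected sample rate should be checked against the API reference.

```typescript
// Convert float samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Scale the halves separately: -1 maps to -32768 and 1 maps to 32767.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(value), true); // true = little-endian
  }
  return out;
}
```

The resulting bytes still need base64 encoding before they go into the event, for example with `Buffer.from(bytes).toString('base64')` in Node.
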
### Cost (approximate)
```
ElevenLabs:  $0.30 / 1K char (turbo)
OpenAI TTS:  $15 / 1M char (tts-1), $30 / 1M char (tts-1-hd)
Cartesia:    $0.20 / 1K char
PlayHT:      $0.30 / 1K char
Self-host:   GPU cost only

→ High volume favors self-hosting.
```

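To compare these rates at a given volume, it helps to normalize them to one unit. The sketch below uses the rough per-character prices from this section (OpenAI TTS taken as $15 per 1M characters); the provider keys and the flat self-hosting bill are assumptions for illustration, and real prices change over time.

```typescript
// Approximate USD per 1,000 characters, taken from the rough figures above.
const USD_PER_1K_CHARS = {
  elevenlabsTurbo: 0.3,
  openaiTts: 0.015, // $15 per 1M characters
  cartesia: 0.2,
  playht: 0.3,
} as const;

type Provider = keyof typeof USD_PER_1K_CHARS;

// Estimated API bill for a month of synthesis.
function monthlyCostUsd(provider: Provider, charsPerMonth: number): number {
  return (charsPerMonth / 1000) * USD_PER_1K_CHARS[provider];
}

// Monthly character volume at which a flat self-hosting bill
// (for example one rented GPU) costs the same as the API.
function breakEvenChars(provider: Provider, selfHostUsdPerMonth: number): number {
  return (selfHostUsdPerMonth / USD_PER_1K_CHARS[provider]) * 1000;
}
```

For example, `monthlyCostUsd("elevenlabsTurbo", 1000000)` comes to about $300, while the same volume at the OpenAI rate is about $15, so the point where self-hosting pays off depends heavily on which API it replaces.
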
### Browser TTS (free, low quality)
```ts
const utt = new SpeechSynthesisUtterance('Hello');
// getVoices() can be empty until the voiceschanged event fires, and find()
// may not match; null falls back to the default voice.
utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
speechSynthesis.speak(utt);
```

→ Uses the operating system's voices. Free, but quality is low.

### Audio formats
```
MP3:      universal, small
Opus:     modern, smallest
PCM:      raw, real-time friendly
WAV:      uncompressed, large
M4A/AAC:  iOS friendly

→ Streaming = PCM / Opus.
  Storage = MP3 / Opus.
```

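The size gap between these formats follows from simple arithmetic: 16-bit mono PCM at 44.1 kHz is 44,100 × 2 = 88,200 bytes per second, while 128 kbps MP3 is 16,000 bytes per second, roughly 5.5 times smaller. Raw PCM also has no header, so players cannot open it directly. The sketch below wraps 16-bit PCM in a minimal 44-byte WAV header; `pcm16ToWav` is a hypothetical helper name.

```typescript
// Wrap raw 16-bit little-endian PCM in a minimal 44-byte WAV (RIFF) header.
function pcm16ToWav(pcm: Uint8Array, sampleRate: number, channels: number): Uint8Array {
  const bytesPerSample = 2;
  const blockAlign = channels * bytesPerSample; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;     // bytes per second
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk for PCM
  view.setUint16(20, 1, true);              // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the audio payload
  wav.set(pcm, 44);
  return wav;
}
```

This is useful when a streaming API returns raw PCM for playback and the same audio should also be saved for later listening.
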
### Use cases
```
✅ Voice agent / chatbot
✅ Audiobook
✅ Accessibility
✅ Game NPC
✅ IVR (phone)
✅ Notification audio
✅ Podcast (auto-generation)
```

### Voice clone: ethics / legal
```
- User consent is mandatory
- Voice rights of authors / actors
- Misuse (deepfake, fraud)
- Watermarking (some services)

ElevenLabs: automatic watermark + abuse detection.
```

→ Consent from the company / artist is mandatory.

### Multi-language
```
ElevenLabs:  32 lang
OpenAI TTS:  11 lang (English is strongest)
Coqui XTTS:  17 lang
```

### SSML (Speech Synthesis Markup Language)
```xml
<speak>
  Hello, <break time="500ms"/>
  <emphasis level="strong">important</emphasis> news.
  <prosody rate="slow">Speaking slowly</prosody>
</speak>
```

→ Supported by only some services (Google, Azure).

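When SSML is assembled from user-supplied text, the text has to be XML-escaped or a stray `<` or `&` will break the markup. A minimal sketch of a builder follows, covering only the three elements shown in this section; `SsmlPart` and `buildSsml` are hypothetical names, and which elements a given service accepts should be checked in its documentation.

```typescript
// Escape the five XML special characters so text cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

type SsmlPart =
  | { kind: "text"; text: string }
  | { kind: "break"; ms: number }
  | { kind: "emphasis"; text: string; level: "strong" | "moderate" | "reduced" }
  | { kind: "prosody"; text: string; rate: "x-slow" | "slow" | "medium" | "fast" | "x-fast" };

// Render one part as an SSML fragment.
function renderPart(part: SsmlPart): string {
  switch (part.kind) {
    case "text":
      return escapeXml(part.text);
    case "break":
      return `<break time="${part.ms}ms"/>`;
    case "emphasis":
      return `<emphasis level="${part.level}">${escapeXml(part.text)}</emphasis>`;
    case "prosody":
      return `<prosody rate="${part.rate}">${escapeXml(part.text)}</prosody>`;
  }
}

// Join all parts inside a single <speak> root element.
function buildSsml(parts: SsmlPart[]): string {
  return `<speak>${parts.map(renderPart).join("")}</speak>`;
}
```

Keeping the parts as typed data rather than string concatenation means attribute values come from a closed set and only the free text needs escaping.
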
### Voice activity detection (VAD)
```ts
// Detect when the user has finished speaking
import { MicVAD } from '@ricky0123/vad-web';

const vad = await MicVAD.new({
  onSpeechEnd: (audio) => {
    // audio holds the samples of the finished utterance
    sendToSTT(audio); // placeholder for your STT call
  },
});

vad.start();
```

→ Common engines: Silero VAD / WebRTC VAD.

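To make the end-of-speech idea concrete, here is a deliberately simple energy-threshold endpointer. It is not Silero or WebRTC VAD, which use trained models or more robust signal features; the threshold and hangover values are assumptions for illustration, and `EnergyEndpointer` is a hypothetical name.

```typescript
// Root-mean-square energy of one audio frame (samples in [-1, 1]).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] ?? 0;
    sum += s * s;
  }
  return Math.sqrt(sum / frame.length);
}

class EnergyEndpointer {
  private speaking = false;
  private silentFrames = 0;
  private threshold: number;      // RMS level treated as speech (assumed value)
  private hangoverFrames: number; // quiet frames needed to end an utterance

  constructor(threshold = 0.02, hangoverFrames = 10) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Feed one frame; returns true exactly when an utterance has just ended.
  process(frame: Float32Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.speaking = true;
      this.silentFrames = 0;
      return false;
    }
    if (!this.speaking) return false; // silence before any speech
    this.silentFrames++;
    if (this.silentFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.silentFrames = 0;
      return true;
    }
    return false;
  }
}
```

The hangover count keeps short pauses inside a sentence from ending the turn. Energy thresholds are easily fooled by background noise, which is why model-based VADs are the usual choice in production.
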
### Subtitle / caption (alongside TTS)
```ts
// ElevenLabs returns character-level alignment together with the audio
const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });

// r.alignment = { characters, character_start_times_seconds, character_end_times_seconds }
// (field casing depends on the SDK version)
```

→ Enables karaoke-style subtitles.

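Character-level timings are usually grouped into words before they are rendered as captions. A minimal sketch follows, assuming the alignment has been mapped into the `CharAlignment` shape defined here; that interface and its field names are hypothetical and chosen for the example, not the SDK's own types.

```typescript
// Shape assumed for illustration; map the API response into it first.
interface CharAlignment {
  characters: string[];
  startTimes: number[]; // seconds, one per character
  endTimes: number[];   // seconds, one per character
}

interface WordCue {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Group per-character timings into per-word cues, splitting on whitespace.
function toWordCues(alignment: CharAlignment): WordCue[] {
  const cues: WordCue[] = [];
  let word = "";
  let start = 0;
  let end = 0;
  alignment.characters.forEach((ch, i) => {
    if (/\s/.test(ch)) {
      // Whitespace closes the current word, if any.
      if (word) cues.push({ word, start, end });
      word = "";
      return;
    }
    if (!word) start = alignment.startTimes[i] ?? 0; // first character of a word
    end = alignment.endTimes[i] ?? start;
    word += ch;
  });
  if (word) cues.push({ word, start, end });
  return cues;
}
```

Each cue can then drive a highlight in the UI or be written out as one caption entry in a format such as WebVTT.
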
### Evaluation
```ts
// Subjective (listener ratings):
// 1. Naturalness (1-5)
// 2. Clarity
// 3. Emotion accuracy
// 4. Pronunciation
// 5. Speed
// MOS (Mean Opinion Score) = the average of such listener ratings

// Objective:
// Word Error Rate (transcribe the audio back and compare with the input)
```

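Word Error Rate is the word-level edit distance between the reference text and the transcript of the synthesized audio, divided by the number of reference words. A minimal sketch follows, assuming English text and a simple normalization (lowercase, punctuation stripped); the transcript itself would come from whatever STT system is used.

```typescript
// Lowercase, drop punctuation, and split on whitespace (English-only sketch).
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s']/g, "")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] = edit distance between the reference words seen so far
  // and the first j hypothesis words (Levenshtein, row by row).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = (prev[j - 1] ?? 0) + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = (prev[j] ?? 0) + 1;
      const insertion = (curr[j - 1] ?? 0) + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return (prev[hyp.length] ?? 0) / ref.length;
}
```

A low WER indicates the speech is intelligible to the recognizer; it says nothing about naturalness or emotion, which still need listener ratings.
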
### Privacy
```
- User voice = sensitive data
- External API = data leaves your system
- Self-host = stronger privacy
- Consider anonymization
```

## 🤔 Decision Criteria
| Use case | Recommendation |
|---|---|
| Best quality + clone | ElevenLabs |
| Cheap + general | OpenAI TTS |
| Real-time agent | Cartesia / OpenAI Realtime |
| Self-host | Coqui XTTS / Piper |
| Browser only | speechSynthesis |
| Multi-language | ElevenLabs |
| Game / interactive | Bark / ElevenLabs |

## ❌ Anti-patterns
- **Voice clone without consent**: ethical and legal exposure.
- **Real-time use on a slow API**: users feel the lag. Stream instead.
- **Best model everywhere**: cost. Mix models by use case.
- **No cache (same text synthesized every time)**: wasted spend.
- **Large audio files (WAV)**: bandwidth. Use Opus / MP3.
- **Long audio without subtitles**: hurts a11y / SEO.
- **No watermark**: deepfake risk.

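The missing-cache anti-pattern can be addressed with a small in-memory cache keyed on everything that affects the audio. The sketch below is one possible shape, assuming audio fits in memory and a least-recently-used eviction policy is acceptable; `TtsRequest` and `AudioCache` are hypothetical names, and a production setup would more likely store files or objects keyed by a hash.

```typescript
interface TtsRequest {
  provider: string;
  model: string;
  voiceId: string;
  text: string;
}

// Build a key from every field that changes the audio.
function cacheKey(req: TtsRequest): string {
  // Normalize whitespace so trivially different inputs share one entry.
  const text = req.text.trim().replace(/\s+/g, " ");
  return JSON.stringify([req.provider, req.model, req.voiceId, text]);
}

class AudioCache {
  private entries = new Map<string, Uint8Array>();
  private maxEntries: number;

  constructor(maxEntries = 500) {
    this.maxEntries = maxEntries;
  }

  get(req: TtsRequest): Uint8Array | undefined {
    const key = cacheKey(req);
    const hit = this.entries.get(key);
    if (hit !== undefined) {
      // Re-insert to mark as most recently used (Map keeps insertion order).
      this.entries.delete(key);
      this.entries.set(key, hit);
    }
    return hit;
  }

  set(req: TtsRequest, audio: Uint8Array): void {
    const key = cacheKey(req);
    this.entries.delete(key);
    this.entries.set(key, audio);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Typical use is to call `get` before every synthesis request and `set` after a successful one, which pays off most for fixed prompts such as greetings and notifications.
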
## 🤖 Hints for LLM Use
- ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
- Real-time = streaming + low-latency model.
- Self-host = Coqui / Piper.
- Consent + watermark + abuse detection.

## 🔗 Related Documents
- [[AI_Voice_Agent_Realtime]]
- [[AI_Multimodal_Vision_Patterns]]
- [[AI_LLM_Cost_Optimization]]