[G1-Sync] Manual knowledge update

2026-05-09 22:47:42 +09:00
parent 93ec7e9056
commit 21ac3ed255
56 changed files with 22043 additions and 43 deletions
@@ -0,0 +1,380 @@
+---
+id: ai-voice-cloning-synthesis
+title: Voice Cloning / Synthesis — ElevenLabs / OpenAI / Self-host
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [ai, voice, tts, vibe-coding]
+tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
+applied_in: []
+aliases: [voice cloning, TTS, ElevenLabs, OpenAI TTS, Coqui, Bark, Piper, instant clone]
+---
+
+# Voice Cloning / Synthesis
+
+> Text → human-like speech. **ElevenLabs (SOTA), OpenAI TTS (cheap), Cartesia / PlayHT (fast). Self-host: Coqui / Bark / Piper**. A 30-second sample is enough to clone a voice (mind the ethics).
+
+## 📖 Core Concepts
+- TTS: Text-to-Speech.
+- Voice clone: short sample → personal voice.
+- Latency: real-time conversation = < 500ms.
+- Streaming: audio is generated while the text is still arriving.
+
+## 💻 Code Patterns
+
+### ElevenLabs (best quality)
+```ts
+import { ElevenLabsClient } from 'elevenlabs';
+
+const apiKey = process.env.ELEVENLABS_API_KEY;  // assumed env var name
+const client = new ElevenLabsClient({ apiKey });
+
+const audio = await client.textToSpeech.convert('voice-id', {
+  text: 'Hello world',
+  modelId: 'eleven_turbo_v2_5',
+  outputFormat: 'mp3_44100_128',
+});
+
+// audio = AsyncIterable<Buffer>
+const chunks: Buffer[] = [];
+for await (const chunk of audio) chunks.push(chunk);
+const mp3 = Buffer.concat(chunks);
+```
+
+### Streaming (real-time)
+```ts
+const stream = await client.textToSpeech.convertAsStream('voice-id', {
+  text: longText,
+  modelId: 'eleven_flash_v2_5',  // fastest
+});
+
+// Pipe to speaker
+for await (const chunk of stream) {
+  speaker.write(chunk);
+}
+```
+
+### Voice clone (instant)
+```ts
+import fs from 'node:fs';
+
+const voice = await client.voices.add({
+  name: 'Alice',
+  files: [fs.createReadStream('alice-sample.mp3')],  // 30s+
+  description: 'Alice voice clone',
+});
+
+// Use the cloned voice
+const audio = await client.textToSpeech.convert(voice.voiceId, {
+  text: 'Hi, this is Alice.',
+});
+```
+
+### Voice design (text → voice)
+```ts
+const voice = await client.voices.design({
+  description: 'A young energetic female voice with British accent',
+  text: 'Sample text to test',
+});
+```
+
+→ Description only, no sample needed.
+
+### OpenAI TTS (cheap)
+```ts
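+import fs from 'node:fs';
+import OpenAI from 'openai';
+
+const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment
+
+const r = await openai.audio.speech.create({
+  model: 'tts-1-hd',  // or tts-1
+  voice: 'alloy',  // alloy / echo / fable / onyx / nova / shimmer / ash / sage / coral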
+  input: text,
+  response_format: 'mp3',
+  speed: 1.0,
+});
+
+const buf = Buffer.from(await r.arrayBuffer());
+fs.writeFileSync('out.mp3', buf);
+```
+
+→ Nine built-in voices (listed above). Fast and cheap. No voice cloning.
+
+### gpt-4o-mini-tts (instructions, 2025+)
+```ts
+const r = await openai.audio.speech.create({
+  model: 'gpt-4o-mini-tts',
+  voice: 'coral',
+  input: 'Welcome!',
+  instructions: 'Speak in a cheerful and professional tone',
+});
+```
+
+→ Instruction-following voice, with light control over tone and delivery.
+
+### Cartesia (fast, low-latency)
+```ts
+import { CartesiaClient } from '@cartesia/cartesia-js';
+
+const cartesia = new CartesiaClient({ apiKey });
+
+const ws = await cartesia.tts.websocket({
+  containerSettings: { container: 'raw', encoding: 'pcm_s16le', sample_rate: 44100 },
+});
+
+await ws.send({
+  modelId: 'sonic-2',
+  voice: { mode: 'id', id: 'voice-id' },
+  transcript: 'Streaming text',
+});
+
+ws.onMessage((msg) => {
+  if (msg.type === 'chunk') speaker.write(Buffer.from(msg.data, 'base64'));
+});
+```
+
+→ ~75 ms latency. Suited to real-time agents.
+
+### PlayHT
+```ts
+const r = await fetch('https://api.play.ht/api/v2/tts/stream', {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${apiKey}`,
+    'X-User-ID': userId,
+    'Content-Type': 'application/json',  // the body is JSON
+  },
+  body: JSON.stringify({
+    text,
+    voice: 'voice-id',
+    output_format: 'mp3',
+    voice_engine: 'PlayHT2.0-turbo',
+  }),
+});
+
+// Stream
+for await (const chunk of r.body!) {
+  speaker.write(chunk);
+}
+```
+
+### Self-host — Coqui XTTS
+```python
+from TTS.api import TTS
+
+tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda')
+
+tts.tts_to_file(
+    text='Hello',
+    speaker_wav='alice.wav',  # voice clone (6s+)
+    language='en',
+    file_path='out.wav',
+)
+```
+
+→ Self-hosted. Needs a GPU.
+
+### Self-host — Piper (fast CPU)
+```bash
+echo 'Hello' | piper --model en_US-lessac-medium.onnx --output_file out.wav
+```
+
+→ ONNX-based. Runs fine on CPU.
+
+### Bark (Suno)
+```python
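+from bark import SAMPLE_RATE, generate_audio, preload_models
+from scipy.io.wavfile import write as write_wav
+
+preload_models()  # downloads and caches the model weights on first run
+audio = generate_audio('Hello, [laughs] this is Bark!')  # NumPy array of samples
+write_wav('out.wav', SAMPLE_RATE, audio)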
+```
+
+→ Supports expressive cues (laughs, sigh, music).
+
+### Voice agent (real-time conversation)
+```ts
+// User audio → STT → LLM → TTS → response audio
+
+const stt = await whisper.transcribe(userAudio);  // ~500ms
+const reply = await llm.complete(stt);       // ~500ms
+const audio = await tts.stream(reply);       // 75ms first chunk
+// Total: ~1075 ms to first audio
+```
+
+→ Latency is what matters. Stream every stage: STT, LLM, and TTS.
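
The latency arithmetic in the snippet above can be turned into a small budget calculator. This is a minimal sketch, not a benchmark: `Stage`, `time_to_first_audio`, and `PIPELINE` are hypothetical names, the 500 ms / 500 ms / 75 ms figures come from the comments above, and the LLM first-token latency (150 ms) and TTS total time (900 ms) are assumed values for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    total_ms: float         # time until the stage has fully finished
    first_output_ms: float  # time until the stage emits its first usable chunk


def time_to_first_audio(stages: list[Stage], streamed: bool) -> float:
    """Estimate ms from end of user speech to the first audio chunk."""
    if not stages:
        return 0.0
    if streamed:
        # Each stage hands off as soon as its first chunk is ready.
        return sum(s.first_output_ms for s in stages)
    # Non-streamed: every upstream stage must finish before the next one starts;
    # only the final (TTS) stage is counted up to its first chunk.
    *upstream, last = stages
    return sum(s.total_ms for s in upstream) + last.first_output_ms


PIPELINE = [
    Stage("stt", total_ms=500, first_output_ms=500),
    Stage("llm", total_ms=500, first_output_ms=150),  # 150 ms is an assumption
    Stage("tts", total_ms=900, first_output_ms=75),   # 900 ms is an assumption
]

sequential_ms = time_to_first_audio(PIPELINE, streamed=False)
streamed_ms = time_to_first_audio(PIPELINE, streamed=True)
print(f"sequential: {sequential_ms} ms, streamed: {streamed_ms} ms")
```

With these assumed numbers, streaming the LLM output into TTS removes most of the LLM's generation time from the wait, which is why the STT stage tends to become the next thing worth streaming.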
+
+### OpenAI Realtime API (all-in-one)
+```ts
+import WebSocket from 'ws';
+
+const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
+  headers: { Authorization: `Bearer ${apiKey}`, 'OpenAI-Beta': 'realtime=v1' },
+});
+
+ws.on('open', () => {
+  // Frames must be JSON strings, and can only be sent once the socket is open
+  ws.send(JSON.stringify({
+    type: 'session.update',
+    session: { voice: 'alloy', turn_detection: { type: 'server_vad' } },
+  }));
+
+  // User audio (base64-encoded PCM)
+  ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64Pcm }));
+});
+
+// Response audio (streamed automatically)
+ws.on('message', (msg) => {
+  const ev = JSON.parse(msg.toString());
+  if (ev.type === 'response.audio.delta') {
+    speaker.write(Buffer.from(ev.delta, 'base64'));
+  }
+});
+```
+
+→ STT + LLM + TTS in a single model. Lowest latency of the options here.
+
+→ [[AI_Voice_Agent_Realtime]].
+
+### Cost (approximate)
+```
+ElevenLabs:  $0.30 / 1K char (turbo)
+OpenAI TTS:  $15 / 1M char (tts-1), $30 / 1M char (tts-1-hd)
+Cartesia:    $0.20 / 1K char
+PlayHT:      $0.30 / 1K char
+Self-host:   GPU cost only
+
+→ Big traffic = self-host.
+```
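
To see how per-character rates turn into a monthly bill, here is a rough estimator. It is a sketch under stated assumptions: the rates are the approximate figures from the table above (real pricing varies by plan and changes over time), and `RATE_PER_1K_CHARS_USD` and `monthly_cost_usd` are hypothetical names.

```python
# Approximate USD per 1,000 characters, copied from the table above.
# Treat these as placeholders and replace them with current pricing.
RATE_PER_1K_CHARS_USD = {
    "elevenlabs": 0.30,
    "openai": 0.015,  # $15 per 1M characters
    "cartesia": 0.20,
    "playht": 0.30,
}


def monthly_cost_usd(
    provider: str,
    chars_per_request: int,
    requests_per_month: int,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate monthly spend; cached requests are assumed to cost nothing."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_chars = chars_per_request * requests_per_month * (1.0 - cache_hit_rate)
    return billable_chars / 1000 * RATE_PER_1K_CHARS_USD[provider]


for name in RATE_PER_1K_CHARS_USD:
    cost = monthly_cost_usd(name, chars_per_request=200, requests_per_month=100_000)
    print(f"{name:<11} ${cost:,.2f} / month")
```

Comparing the result against the monthly price of a GPU instance gives a first estimate of the traffic level at which self-hosting starts to pay off.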
+
+### Browser TTS (free, low quality)
+```ts
+const utt = new SpeechSynthesisUtterance('Hello');
+// getVoices() can be empty until the 'voiceschanged' event fires
+utt.voice = speechSynthesis.getVoices().find(v => v.name.includes('Samantha')) ?? null;
+speechSynthesis.speak(utt);
+```
+
+→ Uses the OS voices. Free, but low quality.
+
+### Audio formats
+```
+MP3:    universal, small
+Opus:   modern, smallest
+PCM:    raw, real-time friendly
+WAV:    uncompressed, large
+M4A/AAC: iOS friendly
+
+→ Streaming = PCM / Opus.
+   Storage = MP3 / Opus.
+```
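
The relationship between PCM and WAV is worth making concrete: a WAV file is raw PCM samples behind a small header. The sketch below wraps streamed `pcm_s16le` bytes into a playable WAV using only the standard library. `pcm_to_wav` and `sine_pcm` are hypothetical helper names, and the sine tone merely stands in for audio received from a TTS stream.

```python
import io
import math
import struct
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 44100, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def sine_pcm(freq_hz: float = 440.0, seconds: float = 0.1, sample_rate: int = 44100) -> bytes:
    """Generate a short test tone as 16-bit PCM (placeholder for real TTS output)."""
    n = int(seconds * sample_rate)
    samples = [
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


pcm = sine_pcm()
wav_bytes = pcm_to_wav(pcm)
print(len(pcm), len(wav_bytes))
```

Because the container adds only a fixed-size header, PCM is the cheaper choice on the wire for streaming, and wrapping it afterwards is enough when a file is needed for debugging.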
+
+### Use cases
+```
+✅ Voice agent / chatbot
+✅ Audiobook
+✅ Accessibility
+✅ Game NPC
+✅ IVR (phone)
+✅ Notification audio
+✅ Podcast (auto-generation)
+```
+
+### Voice clone — ethics / legal
+```
+- User consent is required
+- Voice rights of authors / actors
+- Misuse (deepfake, fraud)
+- Watermarking (offered by some services)
+
+ElevenLabs: automatic watermark + abuse detection.
+```
+
+→ Consent from the company / artist is required.
+
+### Multi-language
+```
+ElevenLabs: 32 lang
+OpenAI TTS: 11 lang (English is best)
+Coqui XTTS: 17 lang
+```
+
+### SSML (Speech Synthesis Markup Language)
+```xml
+<speak>
+  Hello, <break time="500ms"/> 
+  <emphasis level="strong">important</emphasis> news.
+  <prosody rate="slow">Speaking slowly</prosody>
+</speak>
+```
+
+→ Supported by only some services (Google, Azure).
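
When SSML is built from user-supplied text, the text has to be XML-escaped or a stray `&` or `<` will break the markup. The following is a minimal sketch using only the standard library; the helper names (`ssml_text`, `ssml_break`, `ssml_emphasis`, `ssml_speak`) are hypothetical, and which tags a given provider actually honors should be checked against its documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_text(text: str) -> str:
    """Escape plain text so it is safe inside SSML."""
    return escape(text)


def ssml_break(ms: int) -> str:
    """Insert a pause of the given length in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def ssml_emphasis(text: str, level: str = "strong") -> str:
    """Wrap text in an <emphasis> element."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def ssml_speak(*parts: str) -> str:
    """Join already-escaped fragments under a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


doc = ssml_speak(
    ssml_text("Q&A: "),
    ssml_break(500),
    ssml_emphasis("important"),
    ssml_text(" news."),
)
print(doc)
```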
+
+### Voice activity detection (VAD)
+```ts
+// Detect when the user stops speaking
+import { MicVAD } from '@ricky0123/vad-web';
+
+const vad = await MicVAD.new({
+  onSpeechEnd: (audio) => {  // Float32Array of samples at 16 kHz
+    sendToSTT(audio);
+  },
+});
+
+vad.start();
+```
+
+→ Common engines: Silero VAD / WebRTC VAD.
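
The end-of-speech logic can be illustrated with a much simpler detector than Silero or WebRTC VAD. The sketch below uses a plain energy threshold with a "hangover" of quiet frames. It is a deliberately simplified stand-in, not how those engines work internally; `frame_rms`, `find_speech_end`, the threshold of 500, and the 10-frame hangover are all assumptions chosen for illustration.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit little-endian PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def find_speech_end(frames, threshold: float = 500.0, hangover_frames: int = 10):
    """Return the index of the last speech frame once `hangover_frames`
    consecutive quiet frames follow it, or None if speech has not ended."""
    last_speech = None
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            last_speech = i
            quiet_run = 0
        elif last_speech is not None:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return last_speech
    return None


# Synthetic input: 3 quiet frames, 5 loud frames, then 10 quiet frames.
quiet = bytes(320)                           # 160 zero samples
loud = struct.pack("<160h", *([3000] * 160))
frames = [quiet] * 3 + [loud] * 5 + [quiet] * 10
print(find_speech_end(frames))
```

The hangover is the main tuning knob: too short and the agent interrupts mid-sentence pauses, too long and every turn gains that much extra latency.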
+
+### Subtitle / caption (alongside TTS)
+```ts
+// ElevenLabs returns alignment
+const r = await client.textToSpeech.convertWithTimestamps('voice-id', { text });
+
+// r.alignment = { characters, character_start_times_seconds, character_end_times_seconds }
+```
+
+→ Karaoke-style subtitle.
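
Character-level timings are usually too fine for captions, so a common step is to group them into word cues. This is a sketch under assumptions: it takes three parallel lists (characters, start times, end times in seconds) shaped like the alignment described above, and `words_from_alignment` is a hypothetical helper, not part of any SDK.

```python
def words_from_alignment(characters, starts, ends):
    """Group per-character timings into (word, start_s, end_s) cues."""
    if not (len(characters) == len(starts) == len(ends)):
        raise ValueError("alignment lists must have the same length")
    cues = []
    word, word_start, word_end = "", 0.0, 0.0
    for ch, start, end in zip(characters, starts, ends):
        if ch.isspace():
            # Whitespace closes the current word, if any.
            if word:
                cues.append((word, word_start, word_end))
                word = ""
            continue
        if not word:
            word_start = start
        word += ch
        word_end = end
    if word:
        cues.append((word, word_start, word_end))
    return cues


cues = words_from_alignment(
    ["H", "i", " ", "y", "o"],
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4, 0.5],
)
print(cues)
```

Each cue can then drive a highlight in the UI, or be formatted into SRT/WebVTT entries for stored audio.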
+
+### Evaluation
+```ts
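+// Subjective (listener ratings):
+// 1. Naturalness (1-5)
+// 2. Clarity
+// 3. Emotion accuracy
+// 4. Pronunciation
+// 5. Speed
+
+// Aggregate / objective:
+// MOS (Mean Opinion Score): the mean of the listener ratings above
+// Word Error Rate (transcribe the audio back and compare with the input text)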
+```
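
Both summary metrics are simple to compute once the raw data exists. The sketch below assumes listener ratings on a 1 to 5 scale and that a separate STT pass has already produced a transcript of the synthesized audio; `mean_opinion_score` and `word_error_rate` are hypothetical helper names, and WER here is the standard word-level edit distance divided by the reference length.

```python
from statistics import fmean


def mean_opinion_score(ratings: list[int]) -> float:
    """Average of listener ratings on a 1-5 scale."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be non-empty and within 1..5")
    return fmean(ratings)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(mean_opinion_score([4, 5, 3, 4]))
print(word_error_rate("the quick brown fox", "the quick fox"))
```

A WER measured this way mixes TTS pronunciation errors with STT recognition errors, so it is most useful for comparing voices or models under the same transcriber rather than as an absolute score.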
+
+### Privacy
+```
+- User voice = sensitive data
+- External API = data leaves your system
+- Self-host = stronger privacy
+- Consider anonymization
+```
+
+## 🤔 Decision Criteria
+| Use case | Recommendation |
+|---|---|
+| Best quality + clone | ElevenLabs |
+| Cheap + general | OpenAI TTS |
+| Real-time agent | Cartesia / OpenAI Realtime |
+| Self-host | Coqui XTTS / Piper |
+| Browser only | speechSynthesis |
+| Multi-language | ElevenLabs |
+| Game / interactive | Bark / ElevenLabs |
+
+## ❌ Anti-patterns
+- **Voice cloning without consent**: ethical and legal risk.
+- **Real-time with a slow API**: users get frustrated. Use streaming.
+- **Best model everywhere**: costly. Mix models by use case.
+- **No cache (re-synthesizing the same text every time)**: wasted cost.
+- **Large audio files (WAV)**: wastes bandwidth. Use Opus / MP3.
+- **Long audio without subtitles**: hurts a11y / SEO.
+- **No watermark**: deepfake risk.
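
The caching item in the list above comes down to a stable key over everything that affects the audio. Here is a minimal in-memory sketch: `cache_key` and `TTSCache` are hypothetical names, the `synthesize` callable stands in for whichever provider call is used, and a production setup would more likely persist entries to object storage or Redis.

```python
import hashlib
import json
from typing import Callable


def cache_key(text: str, voice_id: str, model_id: str, output_format: str) -> str:
    """Stable hash over every input that changes the resulting audio."""
    payload = json.dumps(
        {"text": text, "voice": voice_id, "model": model_id, "format": output_format},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTSCache:
    """In-memory cache in front of a synthesize(text, voice, model, format) call."""

    def __init__(self, synthesize: Callable[[str, str, str, str], bytes]):
        self._synthesize = synthesize
        self._store: dict[str, bytes] = {}
        self.misses = 0

    def get(self, text: str, voice_id: str, model_id: str, output_format: str = "mp3") -> bytes:
        key = cache_key(text, voice_id, model_id, output_format)
        if key not in self._store:
            self.misses += 1  # only a miss triggers a billable call
            self._store[key] = self._synthesize(text, voice_id, model_id, output_format)
        return self._store[key]


def fake_synthesize(text: str, voice_id: str, model_id: str, output_format: str) -> bytes:
    # Placeholder for a real provider call; returns deterministic bytes.
    return f"{voice_id}:{text}".encode("utf-8")


cache = TTSCache(fake_synthesize)
cache.get("Welcome back!", "voice-a", "model-x")
cache.get("Welcome back!", "voice-a", "model-x")  # served from cache
print(cache.misses)
```

Fixed strings such as greetings, IVR prompts, and notification audio tend to benefit most, since the same text is requested repeatedly with the same voice.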
+
+## 🤖 LLM Usage Hints
+- ElevenLabs = quality. OpenAI = cheap. Cartesia = speed.
+- Real-time = streaming + low-latency model.
+- Self-host = Coqui / Piper.
+- Consent + watermark + abuse detection.
+
+## 🔗 Related Documents
+- [[AI_Voice_Agent_Realtime]]
+- [[AI_Multimodal_Vision_Patterns]]
+- [[AI_LLM_Cost_Optimization]]