[G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,214 @@
+---
+id: ai-voice-agent-realtime
+title: Voice Agent - Realtime API / Bidirectional Voice
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [ai, voice, realtime, vibe-coding]
+tech_stack: { language: "TS / WebRTC / WebSocket", applicable_to: ["Backend", "Frontend"] }
+applied_in: []
+aliases: [voice agent, OpenAI Realtime, Pipecat, LiveKit, VAD, interruption]
+---
+
+# Voice Agent
+
+> User speech → LLM response → speech. **OpenAI Realtime API / Pipecat / LiveKit Agents** are the standard options. **Latency is the key metric** (under 500 ms feels natural). Also required: VAD, interruption handling, and back-channeling.
+
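+## 📖 Core Concepts
+- VAD (Voice Activity Detection): detects whether the user is currently speaking.
+- Turn-taking: recognizing when the user has finished speaking.
+- Interruption: when the user barges in, the model stops talking.
+- Latency budget: speech → text → LLM → text → speech, typically under 1 s end to end.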
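+
+The latency budget above can be checked by summing per-stage estimates. The following is a minimal sketch; the stage names and millisecond figures are illustrative assumptions, not measurements, and `totalLatencyMs` / `withinBudget` are hypothetical helpers.
+
+```ts
+// Hypothetical per-stage latency estimates in milliseconds (assumed values).
+interface StageLatency {
+  stage: string;
+  ms: number;
+}
+
+const stages: StageLatency[] = [
+  { stage: 'VAD end-of-speech detection', ms: 200 },
+  { stage: 'speech-to-text', ms: 150 },
+  { stage: 'LLM first token', ms: 300 },
+  { stage: 'text-to-speech first audio', ms: 150 },
+  { stage: 'network round trip', ms: 100 },
+];
+
+// Sum all stage estimates.
+function totalLatencyMs(items: StageLatency[]): number {
+  return items.reduce((sum, s) => sum + s.ms, 0);
+}
+
+// Compare the total against a budget, e.g. 1000 ms for the "under 1 s" target.
+function withinBudget(items: StageLatency[], budgetMs: number): boolean {
+  return totalLatencyMs(items) <= budgetMs;
+}
+
+console.log(totalLatencyMs(stages), withinBudget(stages, 1000));
+```
+
+Tracking the budget per stage makes it easier to see which component to optimize first when the total drifts past the target.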
+
+## 💻 Code Patterns
+
+### OpenAI Realtime (WebSocket)
+```ts
+import WebSocket from 'ws';
+
+// Assumption: the API key is read from the environment.
+const apiKey = process.env.OPENAI_API_KEY;
+
+const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
+  headers: {
+    'Authorization': `Bearer ${apiKey}`,
+    'OpenAI-Beta': 'realtime=v1',
+  },
+});
+
+// Configure the session once, right after the socket opens.
+ws.on('open', () => {
+  ws.send(JSON.stringify({
+    type: 'session.update',
+    session: {
+      modalities: ['text', 'audio'],
+      instructions: 'You are a helpful voice assistant. Be concise.',
+      voice: 'alloy',
+      input_audio_format: 'pcm16',
+      output_audio_format: 'pcm16',
+      input_audio_transcription: { model: 'whisper-1' },
+      turn_detection: { type: 'server_vad', threshold: 0.5, silence_duration_ms: 500 },
+      tools: [{
+        type: 'function', name: 'search', description: '...',
+        parameters: { /* JSON Schema for the tool arguments goes here */ },
+      }],
+    },
+  }));
+});
+
+// Send user audio chunks. `audioInput` is assumed to be a stream of raw PCM16 Buffers.
+audioInput.on('data', (pcm) => {
+  ws.send(JSON.stringify({
+    type: 'input_audio_buffer.append',
+    audio: pcm.toString('base64'),
+  }));
+});
+
+// Receive responses. `speaker` is assumed to be a writable PCM16 audio sink,
+// and `handleFunctionCall` is defined elsewhere.
+ws.on('message', (msg) => {
+  const ev = JSON.parse(msg.toString());
+  if (ev.type === 'response.audio.delta') {
+    const audio = Buffer.from(ev.delta, 'base64');
+    speaker.write(audio);
+  }
+  if (ev.type === 'response.function_call_arguments.done') {
+    handleFunctionCall(ev);
+  }
+});
+```
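+
+The `input_audio_buffer.append` event above carries base64-encoded PCM16, while many capture APIs (for example the Web Audio API) produce Float32 samples in the range [-1, 1]. A minimal conversion sketch follows; `floatTo16BitPCM` and `toAppendEvent` are hypothetical helper names, not part of any SDK.
+
+```ts
+// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM.
+function floatTo16BitPCM(samples: Float32Array): Buffer {
+  const buf = Buffer.alloc(samples.length * 2);
+  samples.forEach((value, i) => {
+    // Clamp first so out-of-range input cannot overflow the 16-bit range.
+    const s = Math.max(-1, Math.min(1, value));
+    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
+    buf.writeInt16LE(scaled, i * 2);
+  });
+  return buf;
+}
+
+// Wrap the encoded audio in the event shape used in the example above.
+function toAppendEvent(samples: Float32Array): string {
+  return JSON.stringify({
+    type: 'input_audio_buffer.append',
+    audio: floatTo16BitPCM(samples).toString('base64'),
+  });
+}
+```
+
+Resampling to the rate the service expects is a separate step and is not handled here.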
+
+### WebRTC (browser, lower latency)
+```ts
+const pc = new RTCPeerConnection();
+const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
+stream.getTracks().forEach(t => pc.addTrack(t, stream));
+
+const remoteAudio = new Audio();
+pc.ontrack = (e) => { remoteAudio.srcObject = e.streams[0]; remoteAudio.play(); };
+
+const offer = await pc.createOffer();
+await pc.setLocalDescription(offer);
+
+// Send the SDP offer to OpenAI Realtime and receive the SDP answer
+const r = await fetch('https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
+  method: 'POST', headers: { Authorization: `Bearer ${ephemeralKey}`, 'Content-Type': 'application/sdp' },
+  body: offer.sdp,
+});
+const answer = await r.text();
+await pc.setRemoteDescription({ type: 'answer', sdp: answer });
+```
+
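+### Ephemeral key (short-lived token for clients)
+```ts
+// Server side: mint a short-lived client secret using the real API key
+const r = await fetch('https://api.openai.com/v1/realtime/sessions', {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${apiKey}`,
+    'Content-Type': 'application/json', // required because the body is JSON
+  },
+  body: JSON.stringify({ model: 'gpt-4o-realtime-preview' }),
+});
+const { client_secret } = await r.json();
+// Send client_secret.value to the client (valid for about 1 minute)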
+```
+
+### Pipecat (Python framework)
+```python
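+import asyncio
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineTask
+from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService
+from pipecat.transports.network.websocket_server import WebsocketServerParams, WebsocketServerTransport
+
+async def main():
+    # Transport arguments vary across Pipecat versions; check the docs for yours.
+    transport = WebsocketServerTransport(params=WebsocketServerParams())
+    llm = OpenAIRealtimeBetaLLMService(api_key=API_KEY)  # API_KEY is defined elsewhere
+    pipeline = Pipeline([transport.input(), llm, transport.output()])
+    # The runner executes a PipelineTask that wraps the pipeline.
+    await PipelineRunner().run(PipelineTask(pipeline))
+
+asyncio.run(main())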
+```
+
+### Interruption
+```ts
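+// When the user starts speaking, cancel the in-flight model response.
+// Audio already buffered for local playback has to be stopped separately.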
+ws.on('message', (msg) => {
+  const ev = JSON.parse(msg.toString());
+  if (ev.type === 'input_audio_buffer.speech_started') {
+    ws.send(JSON.stringify({ type: 'response.cancel' }));
+  }
+});
+```
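+
+Cancelling the response on the server does not stop audio that the client has already buffered for playback. One way to handle this is a client-side queue that is flushed when `input_audio_buffer.speech_started` arrives. This is a minimal sketch; `PlaybackQueue` and `handleEvent` are hypothetical names, not part of any SDK.
+
+```ts
+// Holds decoded audio chunks that have not been played yet.
+class PlaybackQueue {
+  private chunks: Buffer[] = [];
+
+  enqueue(chunk: Buffer): void {
+    this.chunks.push(chunk);
+  }
+
+  // Next chunk to hand to the audio device, or undefined when the queue is empty.
+  next(): Buffer | undefined {
+    return this.chunks.shift();
+  }
+
+  // Drop everything not yet played; returns the number of dropped chunks.
+  flush(): number {
+    const dropped = this.chunks.length;
+    this.chunks = [];
+    return dropped;
+  }
+
+  get pending(): number {
+    return this.chunks.length;
+  }
+}
+
+// Route server events: queue audio deltas, flush when the user starts speaking.
+function handleEvent(ev: { type: string; delta?: string }, queue: PlaybackQueue): void {
+  if (ev.type === 'response.audio.delta' && ev.delta) {
+    queue.enqueue(Buffer.from(ev.delta, 'base64'));
+  } else if (ev.type === 'input_audio_buffer.speech_started') {
+    queue.flush();
+  }
+}
+```
+
+A playback loop would call `next()` at the device's pace, so a flush takes effect at the next chunk boundary.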
+
+### Tool use
+```ts
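+// Event received from the server when the model finishes a function call:
+const ev = {
+  type: 'response.function_call_arguments.done',
+  call_id: '...',
+  arguments: '{"query":"weather"}',
+};
+
+// After running the tool (`result` holds its return value), send the output back
+ws.send(JSON.stringify({
+  type: 'conversation.item.create',
+  item: {
+    type: 'function_call_output',
+    call_id: ev.call_id,
+    output: JSON.stringify(result),
+  },
+}));
+// Ask the model to continue with a new response that uses the tool output
+ws.send(JSON.stringify({ type: 'response.create' }));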
+```
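+
+The round trip above can be wrapped in a small dispatcher that parses the arguments, runs the matching handler, and builds the two events to send back. This is a sketch under assumptions: the `handlers` registry and its stubbed `search` handler are hypothetical, and the tool name is assumed to be supplied by the caller alongside `call_id` and `arguments`.
+
+```ts
+type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;
+
+// Hypothetical registry; the `search` handler is a stub for illustration.
+const handlers: Record<string, ToolHandler> = {
+  search: async (args) => ({ results: [`stub result for ${String(args.query)}`] }),
+};
+
+interface FunctionCall {
+  name: string;      // assumed to be known to the caller
+  call_id: string;
+  arguments: string; // JSON-encoded arguments from the model
+}
+
+// Run the tool and build the events to send over the socket, in order.
+async function buildToolReply(call: FunctionCall): Promise<object[]> {
+  let output: unknown;
+  try {
+    const handler = handlers[call.name];
+    if (!handler) throw new Error(`unknown tool: ${call.name}`);
+    output = await handler(JSON.parse(call.arguments));
+  } catch (err) {
+    // Report failures to the model rather than leaving the call unanswered.
+    output = { error: err instanceof Error ? err.message : String(err) };
+  }
+  return [
+    {
+      type: 'conversation.item.create',
+      item: { type: 'function_call_output', call_id: call.call_id, output: JSON.stringify(output) },
+    },
+    { type: 'response.create' },
+  ];
+}
+```
+
+Each returned event would then be passed through `JSON.stringify` and sent with `ws.send`. Returning an error payload keeps the conversation moving when a tool fails or the arguments are malformed.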
+
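+### Audio quality / latency tips
+- Use mono 16-bit PCM at the sample rate the service expects (16 kHz is common for speech; OpenAI Realtime's `pcm16` format expects 24 kHz).
+- Enable echo cancellation (built into browsers).
+- Server VAD vs. client VAD: choose based on the deployment environment.
+- Prefer WebRTC over WebSocket for lower latency.
+- Apply suppression for background music and noise.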
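+### Cost
+```
+Audio input:  $0.10 / minute (approximate)
+Audio output: $0.20 / minute (approximate)
+→ 5-minute call = $1.50 (5 × ($0.10 + $0.20), assuming both directions are billed for the full call)
+```
+
+A cascaded pipeline (Whisper + GPT + TTS) can be cheaper in some cases, at the cost of higher latency.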
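+
+The arithmetic above is easy to wrap in a helper for per-call cost tracking. A minimal sketch, assuming the approximate per-minute rates quoted above; `estimateCallCostUsd` is a hypothetical helper name, and real pricing should be taken from the provider's current price list.
+
+```ts
+// Approximate per-minute rates in USD, taken from the figures above.
+const INPUT_PER_MIN = 0.10;
+const OUTPUT_PER_MIN = 0.20;
+
+// Estimate the cost of a call from the minutes of input and output audio.
+function estimateCallCostUsd(inputMinutes: number, outputMinutes: number): number {
+  const cost = inputMinutes * INPUT_PER_MIN + outputMinutes * OUTPUT_PER_MIN;
+  return Math.round(cost * 100) / 100; // round to whole cents
+}
+
+console.log(estimateCallCostUsd(5, 5));   // 1.5
+console.log(estimateCallCostUsd(60, 60)); // 18
+```
+
+Passing input and output minutes separately allows for calls where the user and the agent speak for different amounts of time.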
+
+### LiveKit Agents (alternative)
+```python
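+# Assumes the livekit-agents 0.x VoiceAssistant API; newer releases rename these classes.
+from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
+from livekit.agents.voice_assistant import VoiceAssistant
+from livekit.plugins import openai, silero
+
+async def entrypoint(ctx: JobContext):
+    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
+    assistant = VoiceAssistant(
+        vad=silero.VAD.load(),
+        stt=openai.STT(),
+        llm=openai.LLM(),
+        tts=openai.TTS(),
+    )
+    assistant.start(ctx.room)
+
+if __name__ == "__main__":
+    # The entrypoint is registered with the worker rather than via a decorator.
+    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))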
+```
+
+## 🤔 Decision Criteria
+| Situation | Recommendation |
+|---|---|
+| Fast prototype | OpenAI Realtime |
+| Robust production framework | Pipecat / LiveKit Agents |
+| Low cost / self-hosted stack | Whisper + LLM + ElevenLabs TTS |
+| Telephony integration | Twilio + Pipecat |
+| Very low latency (gaming) | Self-hosted stack + edge deployment |
+
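+## ❌ Anti-patterns
+- **Server passes the raw API key to the client**: the key leaks. Issue an ephemeral key instead.
+- **No interruption handling**: the conversation feels awkward.
+- **VAD threshold too sensitive**: the agent cuts off its own response.
+- **Sending long instructions on every turn**: adds latency. Set them once per session.
+- **Synchronous tool execution that hangs for 5 seconds**: the user hears only silence. Acknowledge immediately, then deliver the result.
+- **Accepting the next response before the current audio output finishes**: the audio overlaps.
+- **No cost monitoring**: a 1-hour call costs on the order of $20.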
+
+## 🤖 LLM Usage Hints
+- WebRTC has lower latency than WebSocket.
+- The three essentials: server VAD, interruption handling, and ephemeral keys.
+- Pipecat is a production-grade framework.
+
+## 🔗 Related Documents
+- [[AI_Multimodal_Vision_Patterns]]
+- [[AI_Streaming_LLM_Response]]
+- [[Backend_WebSocket_Scaling]]