---
id: ai-voice-agent-realtime
title: Voice Agent - Realtime API / Bidirectional Voice
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, voice, realtime, vibe-coding]
tech_stack: { language: "TS / WebRTC / WebSocket", applicable_to: ["Backend", "Frontend"] }
applied_in: []
aliases: [voice agent, OpenAI Realtime, Pipecat, LiveKit, VAD, interruption]
---

# Voice Agent

> User speech → LLM response → speech. **OpenAI Realtime API / Pipecat / LiveKit Agents** are the standard options. **Latency is the key** (under 500 ms feels natural). VAD + interruption + back-channel.

## 📖 Core Concepts

- VAD (Voice Activity Detection): detects whether the user is speaking.
- Turn-taking: recognizing when the user has finished speaking.
- Interruption: the user barges in → the model stops.
- Latency budget: speech → text → LLM → text → speech, typically under 1 s.

## 💻 Code Patterns

### OpenAI Realtime (WebSocket)

```ts
import WebSocket from 'ws';

// Assumption: the API key is read from the environment.
const apiKey = process.env.OPENAI_API_KEY;

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  // Configure the session once, right after connecting
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful voice assistant. Be concise.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      input_audio_transcription: { model: 'whisper-1' },
      turn_detection: { type: 'server_vad', threshold: 0.5, silence_duration_ms: 500 },
      tools: [{
        type: 'function',
        name: 'search',
        description: '...',
        parameters: { /* JSON Schema for the arguments (elided) */ },
      }],
    },
  }));
});

// audioInput, speaker and handleFunctionCall are app-provided:
// a microphone PCM stream, a playback sink and a tool dispatcher.

// Send each chunk of user audio
audioInput.on('data', (pcm: Buffer) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm.toString('base64'),
  }));
});

// Receive the response
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    // Audio arrives as base64-encoded PCM chunks
    const audio = Buffer.from(ev.delta, 'base64');
    speaker.write(audio);
  }
  if (ev.type === 'response.function_call_arguments.done') {
    handleFunctionCall(ev);
  }
});
```

### WebRTC (browser, better latency)

```ts
const pc = new RTCPeerConnection();

// Capture the microphone and send it to the peer connection
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

// Play the model's audio track as it arrives
const remoteAudio = new Audio();
pc.ontrack = (e) => {
  remoteAudio.srcObject = e.streams[0];
  remoteAudio.play();
};

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// Send the SDP offer to OpenAI Realtime and receive the answer
const r = await fetch('https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${ephemeralKey}`,
    'Content-Type': 'application/sdp',
  },
  body: offer.sdp,
});
const answer = await r.text();
await pc.setRemoteDescription({ type: 'answer', sdp: answer });
```

### Ephemeral key (short-lived token for the client)

```ts
// Server side
const r = await fetch('https://api.openai.com/v1/realtime/sessions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json', // required for the JSON body
  },
  body: JSON.stringify({ model: 'gpt-4o-realtime-preview' }),
});
const { client_secret } = await r.json();
// Send client_secret.value to the client (valid for about 1 minute)
```

### Pipecat (Python framework)

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService
from pipecat.transports.network.websocket_server import WebsocketServerTransport


async def main(transport: WebsocketServerTransport, api_key: str) -> None:
    # The caller builds the transport; its constructor arguments vary by Pipecat version.
    llm = OpenAIRealtimeBetaLLMService(api_key=api_key)
    pipeline = Pipeline([transport.input(), llm, transport.output()])
    # PipelineRunner runs a PipelineTask that wraps the pipeline.
    await PipelineRunner().run(PipelineTask(pipeline))
```

### Interruption

```ts
// User starts speaking → cancel the model's in-progress response
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'input_audio_buffer.speech_started') {
    ws.send(JSON.stringify({ type: 'response.cancel' }));
    // Also stop or flush local playback here; already-buffered audio would keep playing.
  }
});
```

### Tool use

```ts
// Event received when the model has finished producing a function call:
// { type: 'response.function_call_arguments.done', call_id: '...', arguments: '{"query":"weather"}' }

// After running the tool, return its output and ask for a new response
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'function_call_output',
    call_id: '...', // same call_id as in the event above
    output: JSON.stringify(result),
  },
}));
ws.send(JSON.stringify({ type: 'response.create' }));
```

### Audio quality / latency tips

- PCM16 mono. OpenAI Realtime's `pcm16` format expects 24 kHz; 16 kHz is common for separate STT/VAD stacks.
- Echo cancellation (browser native).
- Server VAD vs client VAD: depends on the environment.
- WebRTC > WebSocket (latency).
- Background music / noise → suppression.

### Cost

```
Audio input:  $0.10 / minute (approx.)
Audio output: $0.20 / minute
→ 5-minute call = $1.50 (if input and output audio both run the full 5 minutes)
```

A chained STT → LLM → TTS pipeline (Whisper + GPT + TTS) can be cheaper in some cases, at the cost of latency.

### LiveKit Agents (alternative)

```python
# Sketch against the livekit-agents 0.x VoiceAssistant API; newer releases restructure it.
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero


async def entrypoint(ctx: JobContext):
    # Join the room and subscribe to audio tracks only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    agent = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    agent.start(ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

## 🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| Quick prototype | OpenAI Realtime |
| Production-grade framework | Pipecat / LiveKit Agents |
| Low cost / own stack | Whisper + LLM + ElevenLabs TTS |
| Telephony integration | Twilio + Pipecat |
| Very low latency (gaming) | Own stack + edge |

## ❌ Anti-patterns

- **Server passes the API key straight to the client**: the key leaks. Use an ephemeral key.
- **No interruption handling**: the conversation feels awkward.
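- **VAD threshold too sensitive**: the agent cuts off its own response.
- **Long instructions resent every turn**: adds latency. Set them once per session.
- **Synchronous tool execution with a 5-second hang**: the user hears silence. Acknowledge immediately, then deliver the result.
- **Accepting the next response before audio output finishes**: the audio overlaps.
- **No cost monitoring**: a 1-hour call can cost about $18 or more at the rates above.

## 🤖 LLM Usage Hints

- WebRTC beats WebSocket on latency.
- The three essentials: server VAD + interruption + ephemeral key.
- Pipecat is a production framework.

## 🔗 Related Documents

- [[AI_Multimodal_Vision_Patterns]]
- [[AI_Streaming_LLM_Response]]
- [[Backend_WebSocket_Scaling]]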
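
## 🧮 Cost Estimate Sketch

The cost-monitoring anti-pattern is easier to avoid with a running estimate per call. The following is a minimal sketch, assuming the rough per-minute figures quoted in the cost section ($0.10 input, $0.20 output); those rates are illustrative and not current pricing, and `CallUsage`, `estimateCallCostUsd` and `isOverBudget` are hypothetical names, not part of any SDK.

```typescript
// Assumed rates in US cents per minute, taken from the rough figures in this note.
// Illustrative only; check current pricing before relying on them.
const INPUT_CENTS_PER_MIN = 10;
const OUTPUT_CENTS_PER_MIN = 20;

interface CallUsage {
  inputMinutes: number;  // minutes of user audio sent to the model
  outputMinutes: number; // minutes of model audio played back
}

// Hypothetical helper: estimated cost of one call in USD.
function estimateCallCostUsd(usage: CallUsage): number {
  const cents =
    usage.inputMinutes * INPUT_CENTS_PER_MIN +
    usage.outputMinutes * OUTPUT_CENTS_PER_MIN;
  return cents / 100;
}

// Hypothetical helper: true once the running estimate reaches the budget.
function isOverBudget(usage: CallUsage, budgetUsd: number): boolean {
  return estimateCallCostUsd(usage) >= budgetUsd;
}

console.log(estimateCallCostUsd({ inputMinutes: 5, outputMinutes: 5 }));   // 1.5
console.log(estimateCallCostUsd({ inputMinutes: 60, outputMinutes: 60 })); // 18
```

Rates are kept as integer cents so that the totals for whole-minute inputs stay exact. In a live session the minute counters would be updated as audio is sent and received, and `isOverBudget` could trigger a warning or end the call.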