2nd/10_Wiki/Topics/Coding/AI_Voice_Agent_Realtime.md
2026-05-09 21:08:02 +09:00

id: ai-voice-agent-realtime
title: Voice Agent - Realtime API / Bidirectional Voice
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, voice, realtime, vibe-coding
tech_stack:
  language: TS / WebRTC / WebSocket
  applicable_to: Backend, Frontend
applied_in:
aliases: voice agent, OpenAI Realtime, Pipecat, LiveKit, VAD, interruption

Voice Agent

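User speech → LLM response → synthesized speech. OpenAI Realtime API, Pipecat, and LiveKit Agents are the standard options. Latency is the key metric (under 500 ms feels natural). The core mechanisms are VAD, interruption handling, and back-channeling.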

📖 Core Concepts

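  • VAD (Voice Activity Detection): detects whether the user is speaking.
  • Turn-taking: recognizes when the user has finished speaking.
  • Interruption: the user barges in → the model stops talking.
  • Latency budget: speech → text → LLM → text → speech, typically under 1 s end to end.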
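
The VAD and turn-taking ideas above can be sketched with a simple energy threshold plus a silence timer. This is a minimal illustration, not how any of the named services detect speech: `rms`, `TurnDetector`, and the default thresholds are hypothetical names and values, and the RMS threshold here is on a different scale from a server-side VAD `threshold` setting.

```typescript
/** Root-mean-square level of a PCM16 frame, normalized to 0..1. */
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

type TurnEvent = "speech_started" | "speech_stopped" | null;

/** Hypothetical energy-based VAD with silence-based end-of-turn detection. */
class TurnDetector {
  private speaking = false;
  private silenceMs = 0;
  private readonly threshold: number;
  private readonly silenceDurationMs: number;

  // Defaults are illustrative, not tuned values.
  constructor(threshold = 0.05, silenceDurationMs = 500) {
    this.threshold = threshold;
    this.silenceDurationMs = silenceDurationMs;
  }

  /** Feed one frame; returns an event only when the speaking state changes. */
  push(frame: Int16Array, frameMs: number): TurnEvent {
    const isSpeech = rms(frame) >= this.threshold;
    if (isSpeech) {
      this.silenceMs = 0; // any speech resets the silence timer
      if (!this.speaking) {
        this.speaking = true;
        return "speech_started";
      }
      return null;
    }
    if (this.speaking) {
      this.silenceMs += frameMs;
      if (this.silenceMs >= this.silenceDurationMs) {
        this.speaking = false;
        this.silenceMs = 0;
        return "speech_stopped"; // enough silence: the turn is over
      }
    }
    return null;
  }
}
```

A lower `silenceDurationMs` makes the agent answer sooner but risks cutting the user off mid-pause; a higher value does the opposite.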

💻 Code Patterns

OpenAI Realtime (WebSocket)

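import WebSocket from 'ws';

// Assumption: the API key is supplied via the OPENAI_API_KEY environment variable.
const apiKey = process.env.OPENAI_API_KEY;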

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful voice assistant. Be concise.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      input_audio_transcription: { model: 'whisper-1' },
      turn_detection: { type: 'server_vad', threshold: 0.5, silence_duration_ms: 500 },
      tools: [{
        type: 'function', name: 'search', description: '...',
        parameters: { /* ... (JSON Schema elided) */ },
      }],
    },
  }));
});

// Send user audio chunks (audioInput: a PCM16 microphone stream defined elsewhere)
audioInput.on('data', (pcm) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm.toString('base64'),
  }));
});

// Receive responses (speaker: a writable audio output defined elsewhere)
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'response.audio.delta') {
    const audio = Buffer.from(ev.delta, 'base64');
    speaker.write(audio);
  }
  if (ev.type === 'response.function_call_arguments.done') {
    handleFunctionCall(ev);
  }
});
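
The snippet above assumes `audioInput` already yields PCM16 bytes. When the capture source produces Float32 samples instead (as Web Audio does), a conversion step along these lines is needed first. `floatToPcm16` is a hypothetical helper name, and this is a sketch rather than part of any SDK.

```typescript
/** Convert Float32 samples in [-1, 1] to little-endian 16-bit PCM bytes. */
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale each half separately so -1 maps to -32768 and 1 maps to 32767.
    const value = clamped < 0 ? clamped * 32768 : clamped * 32767;
    view.setInt16(i * 2, Math.round(value), true); // true = little-endian
  }
  return out;
}
```

In Node, `Buffer.from(floatToPcm16(samples)).toString('base64')` then gives the string for the `audio` field of `input_audio_buffer.append`.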

WebRTC (browser, lower latency)

const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

const remoteAudio = new Audio();
pc.ontrack = (e) => { remoteAudio.srcObject = e.streams[0]; remoteAudio.play(); };

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// Send the SDP offer to OpenAI Realtime and receive the answer
const r = await fetch('https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  method: 'POST', headers: { Authorization: `Bearer ${ephemeralKey}`, 'Content-Type': 'application/sdp' },
  body: offer.sdp,
});
const answer = await r.text();
await pc.setRemoteDescription({ type: 'answer', sdp: answer });

Ephemeral key (short-lived token for the client)

// Server side
const r = await fetch('https://api.openai.com/v1/realtime/sessions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json', // needed for the JSON body
  },
  body: JSON.stringify({ model: 'gpt-4o-realtime-preview' }),
});
const { client_secret } = await r.json();
// Send client_secret.value to the client (valid for 1 minute)
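
Because the key is so short-lived, the client may want to check it before starting the WebRTC handshake. The sketch below assumes the `client_secret` object also carries an `expires_at` field in Unix seconds; treat that field name, the `ClientSecret` type, and `isUsable` as assumptions to verify against the actual response.

```typescript
// Assumed shape of client_secret; expires_at (Unix seconds) is an assumption.
interface ClientSecret {
  value: string;
  expires_at: number;
}

/** True if the secret stays valid for at least `marginMs` more milliseconds. */
function isUsable(secret: ClientSecret, nowMs: number, marginMs = 5000): boolean {
  return secret.expires_at * 1000 - nowMs >= marginMs;
}
```

A caller would pass `Date.now()` as `nowMs` and request a fresh key from the server when this returns false.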

Pipecat (Python framework)

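from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService
from pipecat.transports.network.websocket_server import WebsocketServerTransport

async def main(transport: WebsocketServerTransport):
    # `transport` is built by the caller; its constructor arguments vary by Pipecat version.
    llm = OpenAIRealtimeBetaLLMService(api_key=API_KEY)
    pipeline = Pipeline([transport.input(), llm, transport.output()])
    # PipelineRunner runs a PipelineTask that wraps the pipeline.
    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))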

Interruption

// User starts speaking → cancel the model's response
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString());
  if (ev.type === 'input_audio_buffer.speech_started') {
    ws.send(JSON.stringify({ type: 'response.cancel' }));
  }
});
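
Cancelling the response stops generation, but audio that has already arrived and is waiting to play would still be heard. A client-side queue that is flushed on the same event is one way to handle this. `PlaybackQueue` and `handleServerEvent` are hypothetical names, and the handler takes already-decoded audio bytes rather than the raw event.

```typescript
/** Minimal playback queue: audio chunks wait here until the device is ready. */
class PlaybackQueue {
  private chunks: Uint8Array[] = [];

  enqueue(chunk: Uint8Array): void {
    this.chunks.push(chunk);
  }

  /** Next chunk to hand to the audio device, or undefined when empty. */
  next(): Uint8Array | undefined {
    return this.chunks.shift();
  }

  /** Drop everything not yet played; returns how many chunks were discarded. */
  flush(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get pending(): number {
    return this.chunks.length;
  }
}

/** Hypothetical router: queue audio deltas, flush when the user barges in. */
function handleServerEvent(type: string, queue: PlaybackQueue, decodedAudio?: Uint8Array): void {
  if (type === "response.audio.delta" && decodedAudio) queue.enqueue(decodedAudio);
  if (type === "input_audio_buffer.speech_started") queue.flush();
}
```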

Tool use

{
  type: 'response.function_call_arguments.done',
  call_id: '...',
  arguments: '{"query":"weather"}'
}

// After executing the tool
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'function_call_output',
    call_id: '...',
    output: JSON.stringify(result),
  },
}));
ws.send(JSON.stringify({ type: 'response.create' }));
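
The step between receiving the arguments and sending `function_call_output` can be sketched as a small dispatcher. `ToolHandler`, `ToolOutputEvent`, and `buildToolOutput` are hypothetical names; the tool name is passed in separately because this sketch does not assume which event field carries it.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;

interface ToolOutputEvent {
  type: "conversation.item.create";
  item: { type: "function_call_output"; call_id: string; output: string };
}

/** Run the named tool on JSON-encoded arguments and wrap the result as an event. */
function buildToolOutput(
  toolName: string,
  callId: string,
  rawArguments: string,
  tools: Record<string, ToolHandler>,
): ToolOutputEvent {
  let result: unknown;
  try {
    const handler = tools[toolName];
    if (!handler) throw new Error(`unknown tool: ${toolName}`);
    // `arguments` arrives as a JSON string, so it has to be parsed first.
    result = handler(JSON.parse(rawArguments) as Record<string, unknown>);
  } catch (err) {
    // Report failures back to the model instead of leaving the call unanswered.
    result = { error: err instanceof Error ? err.message : String(err) };
  }
  return {
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: callId, output: JSON.stringify(result) },
  };
}
```

The returned object can be passed to `ws.send(JSON.stringify(...))`, followed by a `response.create` event so the model speaks the result.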

Audio quality / latency tips

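  • PCM16 mono. OpenAI Realtime's pcm16 format expects 24 kHz; 16 kHz is common for standalone STT pipelines.
  • Echo cancellation (browser native).
  • Server VAD vs client VAD: choose per environment.
  • WebRTC > WebSocket (latency).
  • Background music / noise → apply noise suppression.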
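
Microphones often capture stereo at 44.1 or 48 kHz, so getting to mono PCM at a lower rate usually means a downmix and a resample. The sketch below uses plain linear interpolation with no low-pass filter, which is an assumption made for brevity; `downmixToMono` and `resampleLinear` are hypothetical names.

```typescript
/** Average interleaved stereo samples (L, R, L, R, ...) into mono. */
function downmixToMono(interleaved: Float32Array): Float32Array {
  const frames = Math.floor(interleaved.length / 2);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

/** Resample by linear interpolation between neighbouring input samples. */
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1); // clamp at the last sample
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

Without a low-pass filter, downsampling this way can alias high frequencies, so a production path would normally filter first or use the platform's own resampler.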

Cost

Audio input:  $0.10 / minute (approx.)
Audio output: $0.20 / minute
→ 5-minute call = $1.50 (5 min × ($0.10 + $0.20))
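
The arithmetic behind that estimate can be written as a small helper. It uses the approximate per-minute rates quoted above, expressed in cents to avoid floating-point rounding; the constant and function names are hypothetical, and actual pricing should be checked before relying on the numbers.

```typescript
// Approximate rates from the figures above, in cents per minute.
const INPUT_CENTS_PER_MIN = 10;
const OUTPUT_CENTS_PER_MIN = 20;

/** Estimated call cost in cents for the given minutes of input and output audio. */
function estimateCallCostCents(inputMinutes: number, outputMinutes: number): number {
  return inputMinutes * INPUT_CENTS_PER_MIN + outputMinutes * OUTPUT_CENTS_PER_MIN;
}
```

The 5-minute figure counts both directions as billed for the full call: `estimateCallCostCents(5, 5)` is 150 cents. On the same assumption a 60-minute call comes to 1800 cents, or $18.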

A chained pipeline (Whisper + GPT + TTS) can be cheaper in some cases, at the cost of higher latency.

LiveKit Agents (alternative)

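from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

# Assumes a livekit-agents 0.x release that still ships VoiceAssistant;
# later releases renamed this class.
async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    agent = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    agent.start(ctx.room)

if __name__ == "__main__":
    # The worker CLI calls entrypoint for each job; there is no @agent decorator.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))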

🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| Quick prototype | OpenAI Realtime |
| Production-grade framework | Pipecat / LiveKit Agents |
| Low cost / own stack | Whisper + LLM + ElevenLabs TTS |
| Telephony integration | Twilio + Pipecat |
| Very low latency (gaming) | Own stack + edge |

Anti-patterns

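  • Server passes the raw API key to the client: the key leaks. Use an ephemeral key.
  • No interruption handling: the conversation feels awkward.
  • VAD threshold too sensitive: the agent cuts off its own response.
  • Long instructions resent every turn: latency increases. Set them once per session.
  • Synchronous tool execution that hangs for 5 seconds: the user hears silence. Acknowledge immediately, then deliver the result.
  • Accepting the next response before audio output finishes: the audio overlaps.
  • No cost monitoring: a 1-hour call costs about $18 or more at the rates above.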

🤖 Hints for LLM Use

  • WebRTC beats WebSocket on latency.
  • Server VAD + interruption handling + ephemeral key: the three essentials.
  • Pipecat is a production-grade framework.

🔗 Related Documents