---
id: wiki-2026-0508-voice-assistant-architecture
title: Voice Assistant Architecture
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Voice AI Architecture, Conversational AI Pipeline, ASR-LLM-TTS Stack]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [voice-ai, asr, tts, llm, architecture, real-time]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: LiveKit/Pipecat/OpenAI-Realtime
---

# Voice Assistant Architecture

## In one line

> **"The core of a voice assistant is an ASR → LLM → TTS pipeline that streams end to end with sub-500ms latency."** In the 2023 Whisper / GPT-4 era, cascaded pipelines were mainstream. In the 2026 Realtime API era, native speech-to-speech models (GPT-4o Realtime, Gemini 2 Live) have emerged and bring latency down to roughly 300ms, which is starting to make the pure cascade obsolete. In production, hybrid pipelines still dominate for control and cost reasons.

## Core concepts

### Cascaded pipeline (classical)

- **VAD (Voice Activity Detection)**: Silero / WebRTC VAD. Detects speech boundaries.
- **ASR**: Whisper Large v3 / Deepgram Nova-3 / AssemblyAI Universal-2. Emits streaming partial transcripts.
- **LLM**: Claude Opus 4.7 / GPT-5 / Llama 3.3. Handles reasoning and persona.
- **TTS**: ElevenLabs Flash v2 / Cartesia Sonic / OpenAI tts-1-hd. Emits streaming audio chunks.
- **Turn-taking logic**: end-of-turn detection, interruption handling, barge-in.

### Speech-to-Speech (S2S) models

- **GPT-4o Realtime**: WebSocket, ~320ms first-byte audio.
- **Gemini 2 Live API**: WebSocket, native multimodal.
- **Moshi (Kyutai)**: open-source full-duplex, ~200ms.
- Advantage: preserves emotion, prosody, and pauses. Disadvantage: weaker tool calling and structured output.

### Production patterns

1. **Edge VAD + cloud ASR**: avoids uploading audio that contains no speech.
2. **Streaming everywhere**: ASR partial → LLM partial prompt → TTS chunk by chunk.
3. **Interruption handling**: when user speech is detected, cancel the LLM stream and stop TTS audio.
4. **Function calling layer**: structured tool invocation (calendar, search, IoT).

## 💻 Patterns

### Pattern 1: LiveKit Agents (full pipeline, 2026 standard)

```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import anthropic, cartesia, deepgram, silero

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        # Claude is reached through the dedicated Anthropic plugin
        llm=anthropic.LLM(model="claude-opus-4-7"),
        # "warm-female" is a placeholder; Cartesia expects a voice ID
        tts=cartesia.TTS(model="sonic-2", voice="warm-female"),
    )
    agent = Agent(instructions="You are a helpful kitchen timer assistant.")
    await session.start(agent=agent, room=ctx.room)
    await session.generate_reply(instructions="Greet the user warmly.")

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

### Pattern 2: OpenAI Realtime API (S2S WebSocket)

```python
import asyncio, base64, json, os
import websockets

API_KEY = os.environ["OPENAI_API_KEY"]

def play_audio(pcm: bytes) -> None:
    """Placeholder: hand decoded PCM audio to your playback device."""

async def main():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026",
        # websockets >= 14 takes `additional_headers`; older releases use `extra_headers`
        additional_headers={"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "realtime=v1"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "turn_detection": {"type": "server_vad"},
                # description and parameters schema are elided in this note
                "tools": [{"type": "function", "name": "get_weather"}],
            },
        }))
        async for msg in ws:
            event = json.loads(msg)
            if event["type"] == "response.audio.delta":
                play_audio(base64.b64decode(event["delta"]))

asyncio.run(main())
```

### Pattern 3: Pipecat custom pipeline

```python
import asyncio
from deepgram import LiveOptions
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Module paths differ between Pipecat releases; check the installed version.
# URL, DG_KEY, ANTH_KEY, EL_KEY are supplied by the caller (for example from env vars).
async def main():
    # One transport instance provides both the input and the output processor
    transport = DailyTransport(URL, None, "assistant",
                               DailyParams(audio_in_enabled=True, audio_out_enabled=True))
    pipeline = Pipeline([
        transport.input(),
        DeepgramSTTService(api_key=DG_KEY, live_options=LiveOptions(model="nova-3")),
        # A full agent also wraps the LLM with context aggregators (omitted here)
        AnthropicLLMService(api_key=ANTH_KEY, model="claude-opus-4-7"),
        ElevenLabsTTSService(api_key=EL_KEY, voice_id="..."),
        transport.output(),
    ])
    await PipelineRunner().run(PipelineTask(pipeline))

asyncio.run(main())
```

### Pattern 4: Interruption handling

```python
import asyncio

class InterruptManager:
    def __init__(self, tts):
        # `tts` is any client exposing async stop() and flush_buffer() (placeholder interface)
        self.tts = tts
        self.current_response = None

    async def on_user_speech_started(self):
        if self.current_response:
            self.current_response.cancel()  # cancel LLM stream
            await self.tts.stop()           # stop audio playback
            await self.tts.flush_buffer()   # drop audio queued but not yet played

    async def on_user_speech_ended(self, transcript: str):
        self.current_response = asyncio.create_task(self.respond(transcript))

    async def respond(self, transcript: str):
        """Stream the LLM reply into TTS; application-specific."""
        raise NotImplementedError
```

### Pattern 5: Mid-conversation function calling

```python
import asyncio

tools = [{
    "name": "set_timer",
    "description": "Set a kitchen timer",
    "input_schema": {
        "type": "object",
        "properties": {"minutes": {"type": "integer"}},
        "required": ["minutes"],
    },
}]

background_tasks = set()  # keep references so running timers are not garbage-collected

async def ring_after(minutes: int) -> None:
    # Stand-in for an application-defined timer
    await asyncio.sleep(minutes * 60)
    print("Timer finished")

async def handle_tool_call(name: str, args: dict) -> str:
    if name == "set_timer":
        task = asyncio.create_task(ring_after(args["minutes"]))  # do not block the voice loop
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)
        return f"Timer set for {args['minutes']} minutes"
    return f"Unknown tool: {name}"

# LLM streams tool_use → execute → feed result back → continue speaking
```

### Pattern 6: VAD-gated ASR (cost saving)

```python
import numpy as np
import torch
from silero_vad import load_silero_vad

vad = load_silero_vad()
SAMPLE_RATE = 16000  # Silero expects 512-sample frames at 16 kHz

async def stream_audio(audio_iter, asr):
    # `audio_iter` yields 512-sample int16 PCM frames as bytes;
    # `asr` is any client with an async transcribe(bytes) method (placeholder interface)
    buffer = []
    async for chunk in audio_iter:
        samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
        speech_prob = vad(torch.from_numpy(samples), SAMPLE_RATE).item()
        if speech_prob > 0.5:
            buffer.append(chunk)
        elif buffer:
            # End of utterance: send buffered audio to ASR
            await asr.transcribe(b"".join(buffer))
            buffer = []
    if buffer:
        # Flush speech still buffered when the stream ends
        await asr.transcribe(b"".join(buffer))
```

### Pattern 7: Latency budget breakdown (target <800ms total)

```text
User speaks → VAD endpoint detect:   150ms
ASR partial → final transcript:      100ms (streaming, parallel)
LLM TTFT (time-to-first-token):      250ms
TTS first audio chunk:               150ms
Network + jitter buffer:             150ms
Total perceived latency:            ~800ms
```

## Decision criteria

| Situation | Approach |
|---|---|
| Lowest latency / natural prosody | OpenAI Realtime / Gemini Live (S2S) |
| Tool-heavy / structured | Cascaded (Claude/GPT-5 + LiveKit Agents) |
| Open-source / on-prem | Whisper + Llama 3.3 + XTTS-v2 |
| Phone / telephony | Twilio + LiveKit Agents |
| Browser-only client | Web Speech API + WebRTC (limited) |
| Multilingual | Deepgram Nova-3 + Cartesia Sonic |

**Default**: a LiveKit Agents cascaded pipeline (Deepgram + Claude/GPT-5 + Cartesia) for production flexibility.

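
To make the latency side of this decision concrete, the per-stage figures from the Pattern 7 budget can be totalled and compared against a target. The following is a minimal sketch under that assumption; `Stage`, `CASCADED_BUDGET`, and `summarize` are illustrative names introduced here, not part of any library, and the numbers are the note's own estimates rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


# Per-stage estimates copied from the Pattern 7 budget (not measured values).
CASCADED_BUDGET = [
    Stage("vad_endpoint", 150),
    Stage("asr_final", 100),
    Stage("llm_ttft", 250),
    Stage("tts_first_chunk", 150),
    Stage("network_jitter", 150),
]


def summarize(stages: list[Stage], target_ms: int) -> dict:
    """Total the stage latencies and report headroom against the target."""
    total = sum(s.latency_ms for s in stages)
    slowest = max(stages, key=lambda s: s.latency_ms)
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,  # negative means over budget
        "slowest_stage": slowest.name,
    }


report = summarize(CASCADED_BUDGET, target_ms=800)
print(report)
```

With these figures the cascade lands exactly on the 800ms target with no headroom, and LLM time-to-first-token is the largest single stage, so it is the first place to look when a candidate stack needs to get faster.
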
## 🔗 Graph

- Parent: [[Real-Time-Systems]]
- Adjacent: [[Whisper]] · [[ElevenLabs]] · [[LiveKit]] · [[WebRTC]]

## 🤖 LLM usage

**When to use**: selecting pipeline components, calculating a latency budget, designing interruption logic.

**When not to use**: subjective evaluation of audio quality, which requires human listening tests.

## ❌ Anti-patterns

- **No streaming**: waiting for the full transcript → multi-second latency.
- **Missing VAD**: sending silence to ASR → extra cost and latency.
- **No interrupt handling**: the assistant keeps talking over the user when the user starts to speak.
- **Synchronous tool calls**: a long-running tool blocks the audio response. Send an acknowledgement ("checking...") and run the tool in parallel.
- **Over-engineered S2S**: using the Realtime API for simple Q&A → 5-10x the cost without benefit.
- **No turn-detection tuning**: default endpointing cuts the user off mid-sentence.

## 🧪 Verification / duplicates

- Verified: LiveKit Agents docs (2026), OpenAI Realtime API docs, Pipecat documentation, "Building Voice Agents" (Anthropic cookbook).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full voice assistant architecture guide |