"매 voice assistant 의 핵심 = 'ASR → LLM → TTS pipeline 의 sub-500ms latency 의 streaming end-to-end'.". 2023 Whisper / GPT-4 era 의 cascaded pipeline 의 mainstream → 2026 Realtime API era 매 native speech-to-speech model (GPT-4o Realtime, Gemini 2 Live) 의 emerge — 매 latency 의 ~300ms 의 reduce 의 cascade 의 obsolete 의 begin. 매 production 의 hybrid pipelines 의 still dominant 의 control / cost reasons.
Key points
Cascaded pipeline (classical)
VAD (Voice Activity Detection): Silero / WebRTC VAD, used to detect speech boundaries.
    from livekit.agents import Agent, AgentSession
    from livekit.plugins import cartesia, deepgram, openai, silero


    async def entrypoint(ctx):
        # Cascaded pipeline: VAD -> STT -> LLM -> TTS
        session = AgentSession(
            vad=silero.VAD.load(),
            stt=deepgram.STT(model="nova-3"),
            # Anthropic model through the OpenAI plugin's LLM wrapper; verify that
            # this constructor exists in your installed plugin version.
            llm=openai.LLM.with_anthropic(model="claude-opus-4-7"),
            tts=cartesia.TTS(model="sonic-2", voice="warm-female"),
        )
        agent = Agent(instructions="You are a helpful kitchen timer assistant.")
        await session.start(agent=agent, room=ctx.room)
        await session.generate_reply(instructions="Greet the user warmly.")
    tools = [
        {
            "name": "set_timer",
            "description": "Set a kitchen timer",
            "input_schema": {
                "type": "object",
                "properties": {"minutes": {"type": "integer"}},
            },
        }
    ]


    async def handle_tool_call(name, args):
        if name == "set_timer":
            # Timer is an application-defined class, not shown here
            timer = Timer(minutes=args["minutes"])
            timer.start()
            return f"Timer set for {args['minutes']} minutes"


    # LLM streams tool_use → execute → feed result back → continue speaking
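A long-running tool should not hold up the audio response. One way to handle this, sketched below with the standard library only, is to start the tool as a background task, speak a short acknowledgement, and then speak the result once it arrives. The `speak` and `slow_tool` functions are hypothetical stand-ins for real TTS playback and a real tool; neither comes from the source.

```python
import asyncio


async def speak(text: str, spoken: list[str]) -> None:
    """Stand-in for TTS playback: records what would be spoken."""
    spoken.append(text)


async def slow_tool(minutes: int) -> str:
    """Stand-in for a long-running tool call."""
    await asyncio.sleep(0.05)  # simulated tool latency
    return f"Timer set for {minutes} minutes"


async def handle_tool_call_with_ack(minutes: int, spoken: list[str]) -> None:
    # Start the tool in the background so it does not block audio output.
    tool_task = asyncio.create_task(slow_tool(minutes))
    # Acknowledge immediately while the tool is still running.
    await speak("Checking...", spoken)
    # Speak the result once the tool has finished.
    result = await tool_task
    await speak(result, spoken)


if __name__ == "__main__":
    log: list[str] = []
    asyncio.run(handle_tool_call_with_ack(5, log))
    print(log)
```

The acknowledgement is spoken before the tool result exists, so the user hears something right away instead of silence.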
Pattern 6: VAD-gated ASR (cost saving)
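    import numpy as np
    import silero_vad
    import torch

    vad = silero_vad.load_silero_vad()


    async def stream_audio(audio_iter):
        # Assumes audio_iter yields 16 kHz, 16-bit mono PCM chunks of 512 samples
        # and that `asr` is the application's ASR client (not defined here).
        buffer = []
        for chunk in audio_iter:
            # Silero expects a float32 tensor scaled to [-1, 1], not raw bytes
            samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
            speech_prob = vad(torch.from_numpy(samples), 16000).item()
            if speech_prob > 0.5:
                buffer.append(chunk)
            elif buffer:
                # End of utterance: send buffered audio to ASR
                await asr.transcribe(b"".join(buffer))
                buffer = []
        if buffer:
            # Flush speech that is still buffered when the stream ends
            await asr.transcribe(b"".join(buffer))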
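Ending an utterance on the first non-speech frame tends to split sentences at short pauses. A common refinement is a "hangover": the utterance closes only after a run of consecutive silent frames. The sketch below works on per-frame speech/non-speech flags, so it is independent of any particular VAD. The function name `segment_utterances` and the default of 15 frames (about 480 ms at 32 ms per frame) are assumptions for illustration, not values from the source.

```python
from collections.abc import Iterable, Iterator


def segment_utterances(
    speech_flags: Iterable[bool],
    min_silence_frames: int = 15,
) -> Iterator[tuple[int, int]]:
    """Yield (start, end) frame indices of utterances, with end exclusive.

    An utterance ends only after `min_silence_frames` consecutive
    non-speech frames, so short pauses do not split it.
    """
    start = None      # index of the first speech frame of the open utterance
    last_speech = -1  # index of the most recent speech frame
    silence = 0       # consecutive non-speech frames seen so far
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                yield (start, last_speech + 1)
                start = None
                silence = 0
    if start is not None:
        # The stream ended while an utterance was still open.
        yield (start, last_speech + 1)


flags = [False, True, True, False, True, False, False, False, True]
print(list(segment_utterances(flags, min_silence_frames=3)))
print(list(segment_utterances(flags, min_silence_frames=1)))
```

With a threshold of 3, the one-frame pause at index 3 stays inside the first utterance; with a threshold of 1, it splits the utterance in two. Raising the threshold reduces mid-sentence cuts at the cost of a slower end-of-turn decision.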
When to use: selecting pipeline components, calculating a latency budget, designing interruption logic.
When not to use: subjective evaluation of audio quality, which requires human listening tests.
❌ Anti-patterns
No streaming: waiting for the full transcript leads to multi-second latency.
Missing VAD: sending silence to the ASR adds cost and latency.
No interrupt handling: the assistant talks over the user while the user is speaking.
Synchronous tool calls: a long-running tool blocks the audio response. Send an acknowledgement message ("checking...") and run the tool in parallel.
Over-engineered S2S: using the Realtime API for simple Q&A costs 5-10x more without any benefit.
No turn-detection tuning: default endpointing cuts the user off mid-sentence.
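Interrupt handling (barge-in) can be sketched with plain asyncio: playback runs as a task that is cancelled as soon as a "user started speaking" signal fires. The `play_response` stand-in and the `asyncio.Event` used as the signal are assumptions for illustration; in a real pipeline the signal would come from the VAD and playback from the TTS stream.

```python
import asyncio


async def play_response(chunks: list[str], played: list[str]) -> None:
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)  # simulated duration of one audio chunk


async def run_turn(
    chunks: list[str],
    played: list[str],
    user_started_speaking: asyncio.Event,
) -> bool:
    """Play a response, stopping early on barge-in. Returns True if interrupted."""
    playback = asyncio.create_task(play_response(chunks, played))
    barge_in = asyncio.create_task(user_started_speaking.wait())
    # Wake up when playback finishes or the user starts talking, whichever is first.
    await asyncio.wait({playback, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = not playback.done()
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    # Let cancelled tasks finish unwinding before returning.
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return interrupted


async def demo() -> tuple[bool, list[str]]:
    played: list[str] = []
    event = asyncio.Event()

    async def user() -> None:
        await asyncio.sleep(0.03)  # the user starts speaking mid-response
        event.set()

    user_task = asyncio.create_task(user())
    interrupted = await run_turn(["a", "b", "c", "d", "e"], played, event)
    await user_task
    return interrupted, played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Cancelling the playback task is what stops the assistant from talking over the user; the pipeline can then route the new user audio to ASR as the next turn.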