---
id: wiki-2026-0508-real-time-operation
title: Real-time Operation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Real-time Systems, RTOS, Real-time Inference]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [real-time, latency, rtos, streaming, inference]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: vllm
---

# Real-time Operation

## In one line

> **"A missed deadline is a failure."** Real-time does not mean fast; it means staying within a predictable latency budget. In hard RT (RTOS, avionics) a missed deadline is catastrophic; in soft RT (video, LLM streaming) it degrades the UX.

## Key points

### Classification

- **Hard RT**: the deadline is absolute (pacemaker, ABS brake). RTOS: VxWorks, QNX, Zephyr.
- **Firm RT**: an occasional miss is tolerable, but a result is useless after its deadline (live video frame).
- **Soft RT**: best-effort; quality degrades on a miss (LLM token stream, web UI).

### Latency budgets

- **HFT**: <10 μs.
- **Game frame (60 fps)**: 16.7 ms.
- **VR frame (90 fps)**: 11.1 ms (motion-to-photon <20 ms).
- **Web interaction response**: <200 ms (feels near-instant).
- **LLM TTFT**: <500 ms (Claude Opus 4.7 streaming).
- **LLM inter-token**: <50 ms (20 tok/s, the minimum for comfortable reading).

### Web real-time

- **SSE**: server push over HTTP/1.1 and HTTP/2; simple. The default for LLM streaming.
- **WebSocket**: bidirectional, binary OK. Chat, multiplayer.
- **WebRTC**: P2P, sub-100 ms voice/video.
- **HTTP/3 + WebTransport**: emerging as of 2026; UDP-based (QUIC), multiplexed.

### AI real-time inference

- **vLLM**: PagedAttention; up to 24x throughput vs. a naive HuggingFace Transformers baseline (per the vLLM project's own benchmarks).
- **MLX (Apple Silicon)**: unified memory on M3/M4 makes local, near-real-time inference of Llama 3.x 70B feasible (typically quantized).
- **Speculative decoding**: a small draft model proposes tokens that the large model verifies, for a 2-3x speedup.
- **KV cache**: prefix sharing; the system prompt's KV entries are cached and reused across requests.
- **Prompt caching (Anthropic)**: up to 90% lower cost on cached input tokens, and lower TTFT.

### Applications

1. Token-by-token streaming in LLM chat.
2. Video conferencing (WebRTC).
3. Trading systems (kdb+, FPGA).
4. Robotics control loop (ROS 2 + Zephyr).
5. Live captioning (Whisper streaming).

## 💻 Patterns

### LLM streaming with prompt cache

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_SYSTEM_PROMPT = "..."  # placeholder; in practice 10k+ tokens

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        # Mark the system prompt as cacheable so later requests reuse it.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "..."}],
) as stream:
    # Print each text delta as soon as it arrives.
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

### SSE in FastAPI

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def event_stream():
    for i in range(100):
        # Each SSE event is a "data: ..." line terminated by a blank line.
        yield f"data: token {i}\n\n"
        await asyncio.sleep(0.05)  # 50 ms between tokens; does not block the loop


@app.get("/stream")
async def stream():
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

### vLLM batched inference server

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,      # shard the model across 4 GPUs
    enable_prefix_caching=True,  # reuse KV cache for shared prompt prefixes
)
params = SamplingParams(max_tokens=512, temperature=0.7)

# Placeholder prompts for illustration.
prompts = ["Define hard real-time.", "Define soft real-time."]

# This is the offline API; the engine still schedules with continuous batching.
# In the online server, a new request joins the running batch mid-flight.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

### Game loop (fixed timestep)

```rust
use std::time::Instant;

const DT: f32 = 1.0 / 60.0; // fixed physics timestep (60 Hz)

// Placeholder hooks; a real game supplies these.
fn physics_step(_dt: f32) {}
fn render(_alpha: f32) {}

fn main() {
    let mut acc = 0.0_f32; // unsimulated time carried between frames
    let mut last = Instant::now();
    loop {
        let now = Instant::now();
        acc += (now - last).as_secs_f32();
        last = now;
        // Consume elapsed time in fixed DT slices so physics stays deterministic.
        while acc >= DT {
            physics_step(DT);
            acc -= DT;
        }
        render(acc / DT); // interpolate between the last two physics states
    }
}
```

### RTOS task (Zephyr)

```c
#include <zephyr/kernel.h>

/* Application-provided hooks (declarations only). */
void read_sensors(void); void compute_pid(void); void actuate(void);

K_TIMER_DEFINE(ctrl_timer, NULL, NULL);

void control_loop(void *p1, void *p2, void *p3)
{
    /* Periodic timer: a k_sleep() after the work would stretch the period
     * by the loop's execution time, so the rate would drift below 100 Hz. */
    k_timer_start(&ctrl_timer, K_MSEC(10), K_MSEC(10));
    while (1) {
        k_timer_status_sync(&ctrl_timer); /* block until the next 10 ms tick */
        read_sensors();
        compute_pid();
        actuate();
    }
}

K_THREAD_DEFINE(ctrl_tid, 1024, control_loop, NULL, NULL, NULL, K_PRIO_PREEMPT(2), 0, 0);
```

## Decision criteria

| Situation | Approach |
|---|---|
| Safety-critical (medical, auto) | Hard RT: RTOS, formal verification |
| LLM chat | SSE streaming + prompt cache |
| Multiplayer game | UDP + WebRTC / custom protocol |
| Voice/video call | WebRTC |
| HFT | Kernel bypass (DPDK), FPGA |
| Robotics | ROS 2 + Zephyr/PREEMPT_RT Linux |

**Default**: SSE + Anthropic streaming for LLM, WebSocket for bidirectional chat.

## 🔗 Graph

- Parent: [[Distributed-Systems]]
- Variant: [[Streaming]]
- Applications: [[WebRTC]] · [[Game-Loop]]
- Adjacent: [[Latency-Optimization]] · [[LLM_Optimization_and_Deployment_Strategies|vLLM]]

## 🤖 LLM usage

**When**: user-facing chat (TTFT < 500 ms), long outputs (token-streaming UX), tool-use loops.

**When not**: batch processing (use the Batch API, 50% cheaper), embeddings (single-shot), latency-insensitive analytics.

## ❌ Anti-patterns

- **Block on full response**: the user stares at a spinner for 30 s; stream instead.
- **Claiming hard guarantees for soft RT**: stock Linux plus a garbage-collected runtime cannot deliver hard RT.
- **No timeout**: hung connections leak; set one, e.g. `httpx.Timeout(30.0, connect=5.0)`.
- **No backpressure**: the producer outpaces the consumer, buffers grow, and the process runs out of memory.
- **Synchronous in event loop**: `time.sleep` blocks the asyncio event loop; use `await asyncio.sleep` instead.

## 🧪 Verification / duplicates

- Verified (vLLM docs, Anthropic streaming API, WebRTC RFC, Zephyr docs).
- Trust level A.
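
The backpressure and event-loop anti-patterns listed above can be illustrated with a bounded queue. The following is a minimal sketch, not a production pipeline: the names `producer`, `consumer`, and `run_pipeline`, the `stats` dictionary, and the 1 ms simulated work are all assumptions made for illustration. A bounded `asyncio.Queue` makes `put()` suspend when the queue is full, so a fast producer is throttled to the consumer's pace rather than growing memory without limit.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_items: int, stats: dict) -> None:
    """Fast producer: put() suspends whenever the queue is full."""
    for i in range(n_items):
        await queue.put(i)  # backpressure point: waits for a free slot
        stats["peak_depth"] = max(stats["peak_depth"], queue.qsize())
    await queue.put(None)  # sentinel: tells the consumer to stop


async def consumer(queue: asyncio.Queue, results: list) -> None:
    """Slow consumer: simulates per-item work without blocking the event loop."""
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # async sleep; time.sleep would block the loop
        results.append(item)


async def run_pipeline(n_items: int = 50, maxsize: int = 8) -> tuple[list, int]:
    """Run both tasks concurrently and report the peak observed queue depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    results: list = []
    stats = {"peak_depth": 0}
    await asyncio.gather(producer(queue, n_items, stats), consumer(queue, results))
    return results, stats["peak_depth"]


if __name__ == "__main__":
    items, peak = asyncio.run(run_pipeline())
    print(f"consumed {len(items)} items, peak queue depth {peak}")
```

With `maxsize=0` the queue is unbounded and the same producer would enqueue everything immediately, which is the failure mode the anti-pattern describes. The bound trades a little producer latency for a hard cap on memory.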

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RT systems + web streaming + LLM inference unified |