2nd/10_Wiki/Topics/AI_and_ML/Real-time-Operation.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
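Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections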

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-real-time-operation
title: Real-time Operation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Real-time Systems, RTOS, Real-time Inference
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: real-time, latency, rtos, streaming, inference
raw_sources: (empty)
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language python, framework vllm

Real-time Operation

One-line summary

"A missed deadline is a failure." Real-time does not mean fast; it means staying within a predictable latency budget. In hard RT (RTOS, avionics) a missed deadline is catastrophic; in soft RT (video, LLM streaming) it degrades the UX.

Key points

Classification

  • Hard RT: the deadline is absolute (pacemaker, ABS brake). Runs on an RTOS: VxWorks, QNX, Zephyr.
  • Firm RT: an occasional miss is tolerable, but a result that arrives after its deadline is useless (live video frame).
  • Soft RT: best-effort; a miss degrades quality rather than causing failure (LLM token stream, web UI).
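The three classes differ mainly in what happens to a late result. A minimal sketch of that decision, assuming hypothetical names (`RTClass`, `on_result`) and action labels that are not part of any library:

```python
import enum


class RTClass(enum.Enum):
    HARD = "hard"
    FIRM = "firm"
    SOFT = "soft"


def on_result(rt_class: RTClass, elapsed_ms: float, deadline_ms: float) -> str:
    """Return the action taken for a result that took elapsed_ms to produce."""
    if elapsed_ms <= deadline_ms:
        return "use"               # on time: every class uses the result
    if rt_class is RTClass.HARD:
        return "fail-safe"         # a miss is a system failure; enter a safe state
    if rt_class is RTClass.FIRM:
        return "drop"              # a late result has no value; discard it
    return "use-degraded"          # soft: still usable, at lower quality
```

For example, a video frame that takes 40 ms against a 33 ms deadline is dropped under firm RT, while a late LLM token under soft RT is still shown.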

Latency budgets

  • HFT: <10 μs.
  • Game frame (60 fps): 16.7 ms.
  • VR frame (90 fps): 11.1 ms (motion-to-photon <20 ms).
  • Web TTI: <200 ms is perceived as instant.
  • LLM TTFT: <500 ms (Claude Opus 4.7 streaming).
  • LLM inter-token: <50 ms (20 tok/s, the minimum for comfortable reading).
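The frame and token figures above follow from simple arithmetic, and a budget is usually checked against tail latency rather than the mean. A small sketch, with hypothetical helper names and a simple quantile rule chosen for illustration:

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def tokens_per_second(inter_token_ms: float) -> float:
    """Token rate implied by a given gap between tokens."""
    return 1000.0 / inter_token_ms


def within_budget(samples_ms, budget_ms, quantile=0.99) -> bool:
    """True if the given quantile of observed latencies fits the budget."""
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx] <= budget_ms
```

A system whose mean latency fits the budget can still miss it at p99, which is why predictability matters more than average speed here.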

Web real-time

  • SSE: server push over HTTP/1.1 and HTTP/2; simple. The default for LLM streaming.
  • WebSocket: bidirectional, binary OK. Chat, multiplayer.
  • WebRTC: P2P, sub-100 ms voice/video.
  • HTTP/3 + WebTransport: emerging as of 2026; UDP-based (QUIC), multiplexed.

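AI real-time inference

  • vLLM: PagedAttention; up to 24x throughput vs. a naive Hugging Face Transformers baseline.
  • MLX (Apple Silicon): unified memory on M3/M4 makes local real-time inference of Llama 3.x 70B feasible.
  • Speculative decoding: a small draft model proposes tokens that the large model verifies, for a 2-3x speedup.
  • KV cache: prefix sharing; the cached entries for a shared system prompt are reused across requests.
  • Prompt caching (Anthropic): up to 90% cost reduction and lower TTFT.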
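To make the speculative decoding idea concrete, here is a toy greedy variant. It is a sketch under stated assumptions, not how vLLM or any other engine implements it: both "models" are plain functions returning the next token, `speculative_decode` is a hypothetical name, and real systems verify all draft tokens in one batched forward pass (which is where the speedup comes from) and typically use a sampling-based acceptance rule.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # toy stand-in: greedy next-token function


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target model checks each proposed token in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: take the target's token
                break
        else:
            # Every draft token matched, so the target contributes one more.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

With greedy verification the output is identical to decoding with the target model alone; the draft model only changes how many target steps can be checked per round.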

Applications

  1. Token-by-token streaming in LLM chat.
  2. Video conferencing (WebRTC).
  3. Trading systems (kdb+, FPGA).
  4. Robotics control loop (ROS 2 + Zephyr).
  5. Live captioning (Whisper streaming).

💻 Patterns

LLM streaming with prompt cache

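from anthropic import Anthropic

# Placeholder: substitute your own long, stable system prompt (10k+ tokens).
LARGE_SYSTEM_PROMPT = "..."

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        # Mark the prefix as cacheable so later requests can reuse it.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "..."}],
) as stream:
    # Print each text delta as soon as it arrives.
    for text in stream.text_stream:
        print(text, end="", flush=True)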
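To check a stream like the one above against the TTFT and inter-token budgets, the chunk iterator can be wrapped in a timer. This is a minimal sketch with hypothetical helper names (`timed_stream`, `summarize`); it works on any iterable of text chunks and measures arrival times as seen by the consumer, so slow per-chunk processing on the client is included in the gaps.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def timed_stream(chunks: Iterable[str],
                 clock=time.perf_counter) -> Iterator[Tuple[str, float]]:
    """Yield (chunk, seconds since the previous chunk or since the start)."""
    prev = clock()
    for chunk in chunks:
        now = clock()
        yield chunk, now - prev
        prev = now


def summarize(gaps: List[float]) -> dict:
    """First gap is time to first token; the rest are inter-token gaps."""
    return {"ttft_s": gaps[0], "max_inter_token_s": max(gaps[1:], default=0.0)}
```

In use, the loop would iterate over `timed_stream(stream.text_stream)`, collect the gaps in a list, and compare `summarize(gaps)` with the 500 ms and 50 ms targets.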

SSE in FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def event_stream():
    for i in range(100):
        yield f"data: token {i}\n\n"
        await asyncio.sleep(0.05)

@app.get("/stream")
async def stream():
    return StreamingResponse(event_stream(), media_type="text/event-stream")
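On the receiving side, the frames emitted by the endpoint above are plain text: `data:` lines accumulate, and a blank line ends the event. A simplified parser sketch follows; `parse_sse` is a hypothetical helper that handles only `data` fields and comments, and ignores `event`, `id`, and `retry`.

```python
from typing import Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event from an iterable of SSE lines."""
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
        # Other fields are ignored in this sketch.
```

Multiple `data:` lines within one event are joined with newlines, which is how a multi-line payload is carried over SSE.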

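vLLM batched inference

from vllm import LLM, SamplingParams

# Placeholder inputs; replace with your own prompts.
prompts = ["Explain real-time systems in one sentence."]

# Model id adjusted to the published Instruct checkpoint; change it to the checkpoint you actually use.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4,
          enable_prefix_caching=True)  # reuse KV cache for shared prompt prefixes
params = SamplingParams(max_tokens=512, temperature=0.7)

# vLLM schedules these prompts with continuous batching. In the online
# server (vllm serve), new requests join the running batch mid-flight.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)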

Game loop (fixed timestep)

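use std::time::Instant;

const DT: f32 = 1.0 / 60.0; // fixed physics timestep (60 Hz)

// Hypothetical stubs; replace with the real simulation and renderer.
fn physics_step(_dt: f32) {}
fn render(_alpha: f32) {}

fn main() {
    let mut acc: f32 = 0.0; // elapsed time not yet simulated
    let mut last = Instant::now();

    loop {
        let now = Instant::now();
        acc += (now - last).as_secs_f32();
        last = now;
        // Consume elapsed time in fixed DT slices so physics stays deterministic.
        while acc >= DT {
            physics_step(DT);
            acc -= DT;
        }
        render(acc / DT); // interpolation factor in [0, 1)
    }
}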

RTOS task (Zephyr)

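#include <zephyr/kernel.h>

/* Periodic timer: k_sleep() after the work would stretch the period by the
 * execution time, so a k_timer is used here to hold a steady 100 Hz. */
K_TIMER_DEFINE(ctrl_timer, NULL, NULL);

/* read_sensors(), compute_pid() and actuate() are application-defined. */
void control_loop(void *p1, void *p2, void *p3) {
    k_timer_start(&ctrl_timer, K_MSEC(10), K_MSEC(10));
    while (1) {
        k_timer_status_sync(&ctrl_timer);  /* block until the next 10 ms tick */
        read_sensors();
        compute_pid();
        actuate();
    }
}

/* Defined after control_loop so the entry point is already declared. */
K_THREAD_DEFINE(ctrl_tid, 1024, control_loop, NULL, NULL, NULL,
                K_PRIO_PREEMPT(2), 0, 0);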

Decision criteria

| Situation | Approach |
|---|---|
| Safety-critical (medical, auto) | Hard RT: RTOS, formal verification |
| LLM chat | SSE streaming + prompt cache |
| Multiplayer game | UDP + WebRTC / custom protocol |
| Voice/video call | WebRTC |
| HFT | Kernel bypass (DPDK), FPGA |
| Robotics | ROS 2 + Zephyr/PREEMPT_RT Linux |

Default: SSE + Anthropic streaming for LLM, WebSocket for bidirectional chat.

🔗 Graph

🤖 LLM usage

When to use: user-facing chat (TTFT < 500 ms), long outputs (token-streaming UX), tool-use loops. When not to use: batch processing (use the Batch API, which is 50% cheaper), embeddings (single-shot), latency-insensitive analytics.

Anti-patterns

  • Blocking on the full response: the user watches a spinner for 30 s. Stream instead.
  • Claiming hard guarantees for a soft RT stack: Linux plus a garbage-collected runtime cannot deliver hard RT.
  • No timeout: hung connections leak. Set one, e.g. httpx.Timeout(30.0, connect=5.0).
  • No backpressure: the producer outpaces the consumer until the process runs out of memory.
  • Synchronous calls in the event loop: time.sleep blocks asyncio; use asyncio.sleep.
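Three of these fixes fit in one short asyncio sketch: a bounded queue for backpressure, `asyncio.sleep` instead of `time.sleep`, and an overall timeout. The function names and the sizes (`n`, `maxsize`, the 5 s limit) are illustrative assumptions.

```python
import asyncio


async def producer(queue: asyncio.Queue, n: int) -> None:
    for i in range(n):
        await queue.put(i)   # suspends while the queue is full: backpressure
    await queue.put(None)    # sentinel: no more items


async def consumer(queue: asyncio.Queue, seen: list, peak: list) -> None:
    while True:
        peak[0] = max(peak[0], queue.qsize())  # track the largest backlog seen
        item = await queue.get()
        if item is None:
            break
        seen.append(item)
        await asyncio.sleep(0.001)  # yields to the loop; time.sleep would stall it


async def run(n: int = 50, maxsize: int = 4):
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    seen: list = []
    peak = [0]
    # Bound the whole pipeline so a hung task cannot leak forever.
    await asyncio.wait_for(
        asyncio.gather(producer(queue, n), consumer(queue, seen, peak)),
        timeout=5.0,
    )
    return seen, peak[0]
```

Because the queue is bounded, memory use stays flat however fast the producer is; the producer simply waits for the consumer to catch up.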

🧪 Verification / duplicates

  • Verified (vLLM docs, Anthropic streaming API, WebRTC RFC, Zephyr docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RT systems + web streaming + LLM inference unified |