---
id: ai-multimodal-vision-patterns
title: Multimodal LLMs (Image / Audio / Video)
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, multimodal, vision, audio, vibe-coding]
tech_stack: { language: "TS / OpenAI / Anthropic / Gemini", applicable_to: ["Backend"] }
applied_in: []
aliases: [vision, image input, OCR, Whisper, audio, video, multimodal LLM]
---

# Multimodal LLM

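> LLMs take more than text: **image / audio / video input** is supported. Use vision for OCR, chart analysis, and UI inspection; use Whisper for STT. Gemini accepts video natively. Input size limits and cost differ by provider and modality.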

## 📖 Core Concepts
- Vision: image → text understanding.
- STT (Speech-to-Text): Whisper / Deepgram.
- TTS (Text-to-Speech): OpenAI / ElevenLabs.
- Video: Gemini 1.5/2 / Twelve Labs.
- Token cost: image tokens scale with pixel dimensions.

## 💻 Code Patterns

### Anthropic Vision
```ts
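import fs from 'node:fs';
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const base64 = fs.readFileSync('receipt.png').toString('base64'); // example path

const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  messages: [{
    role: 'user', content: [
      // media_type must match the actual image format
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: base64 } },
      { type: 'text', text: 'Extract all text from this receipt.' },
    ],
  }],
});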
```

Passing a URL directly (supported by some providers):
```ts
{ type: 'image', source: { type: 'url', url: 'https://...' } }
```

### OpenAI Vision (gpt-4o)
```ts
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const r = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user', content: [
      { type: 'image_url', image_url: { url: 'data:image/png;base64,...', detail: 'high' } },
      { type: 'text', text: 'Describe the chart.' },
    ],
  }],
});
```

`detail`: `low` (fewer tokens) / `high` (more accurate) / `auto` (the API chooses).
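To make the cost difference between `low` and `high` concrete, here is a rough estimator. It is a sketch that assumes the tile-based accounting OpenAI has published for GPT-4o-class models (flat 85 tokens for `low`; for `high`, 85 base tokens plus 170 per 512 px tile after downscaling). The function name is hypothetical, and other models or providers may count differently.

```typescript
// Estimate image input tokens under OpenAI's tile-based accounting.
// The constants (85 base, 170 per 512 px tile) are assumptions taken from
// the published GPT-4o guidance; verify them for the model you use.
type Detail = 'low' | 'high';

function estimateImageTokens(width: number, height: number, detail: Detail): number {
  if (detail === 'low') return 85; // flat cost regardless of image size

  // Step 1: shrink to fit inside a 2048 x 2048 square (never upscale).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768 px.
  const short = Math.min(1, 768 / Math.min(w, h));
  w *= short;
  h *= short;

  // Step 3: count 512 px tiles and add the base cost.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}
```

Under these assumptions a 1024 x 1024 image costs 765 tokens at `high` versus 85 at `low`, which is why `low` is worth trying first for coarse tasks.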

### OCR vs vision LLM
- Simple receipts / business cards: Tesseract / AWS Textract / Google Vision API (cheap and fast).
- Chart interpretation / tables that need semantic understanding: vision LLM.

### Whisper (STT)
```ts
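import fs from 'node:fs';

const r = await openai.audio.transcriptions.create({
  file: fs.createReadStream('audio.mp3'),
  model: 'whisper-1',
  language: 'ko', // ISO-639-1 hint for the spoken language
  response_format: 'verbose_json', // required for timestamp_granularities
  timestamp_granularities: ['segment', 'word'],
});

console.log(r.text);
console.log(r.segments); // [{ start, end, text, ... }], times in seconds
console.log(r.words);    // [{ word, start, end }]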
```
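A common next step is turning the timestamped segments into subtitles. The sketch below converts `{ start, end, text }` segments (times in seconds, as in the `verbose_json` response above) into SRT. The `Segment` interface and both function names are hypothetical helpers, not part of the OpenAI SDK.

```typescript
// Only the fields used here; verbose_json segments carry more.
interface Segment { start: number; end: number; text: string }

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}`)
    .join('\n\n') + '\n';
}
```

Usage would look like `fs.writeFileSync('out.srt', segmentsToSrt(r.segments ?? []))`.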

### TTS
```ts
const r = await openai.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'alloy', // alloy / echo / fable / onyx / nova / shimmer
  input: 'Hello world',
  response_format: 'mp3',
});
const buf = Buffer.from(await r.arrayBuffer());
fs.writeFileSync('out.mp3', buf);
```

### Streaming TTS (real-time)
```ts
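import { OpenAI } from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const response = await openai.audio.speech.create({
  model: 'tts-1', voice: 'nova', input: text, response_format: 'opus',
});
// response.body is a byte stream: feed chunks to the player as they arrive
// instead of waiting for the whole file.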
```

### ElevenLabs (human-like voices)
```ts
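import { ElevenLabsClient } from 'elevenlabs'; // v1.x package: the client class is ElevenLabsClient
const client = new ElevenLabsClient({ apiKey });
const stream = await client.textToSpeech.convertAsStream('voice-id', {
  text, model_id: 'eleven_turbo_v2_5', // v1.x request fields are snake_case
});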
```

### Realtime API (OpenAI)
- User speech in → spoken response out, with minimal delay.
- Transport: WebRTC or WebSocket.
- Use case: conversational voice agents.

```ts
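import WebSocket from 'ws';

const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview', {
  headers: { Authorization: `Bearer ${apiKey}`, 'OpenAI-Beta': 'realtime=v1' },
});

// `speaker` is assumed to be a writable audio sink opened elsewhere.
ws.on('message', (msg) => {
  const ev = JSON.parse(msg.toString()); // ws delivers raw Buffers, not strings
  if (ev.type === 'response.audio.delta') {
    const audio = Buffer.from(ev.delta, 'base64'); // base64-encoded audio chunk
    speaker.write(audio);
  }
});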
```
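The snippet above only covers the receiving side. For the sending side, a minimal sketch is shown here, assuming microphone audio arrives as `Float32Array` samples in [-1, 1] and the session uses 16-bit PCM input. The helper names are hypothetical; the event types (`input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`) are the client events of the OpenAI Realtime beta and should be checked against the current reference.

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM.
function floatToPcm16(samples: Float32Array): Buffer {
  const out = Buffer.alloc(samples.length * 2);
  samples.forEach((s, i) => {
    const clamped = Math.max(-1, Math.min(1, s)); // guard against clipping overflow
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    out.writeInt16LE(Math.round(scaled), i * 2);
  });
  return out;
}

// Build the JSON client event that appends audio to the server-side input buffer.
function appendAudioEvent(samples: Float32Array): string {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: floatToPcm16(samples).toString('base64'),
  });
}

// Usage with the socket above (sketch; micChunk is a hypothetical Float32Array):
// ws.send(appendAudioEvent(micChunk));
// ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// ws.send(JSON.stringify({ type: 'response.create' }));
```

When server-side voice activity detection is enabled, the explicit commit and `response.create` steps may be unnecessary.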

### Gemini Video
```ts
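import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAIFileManager, FileState } from '@google/generative-ai/server';

const genAI = new GoogleGenerativeAI(apiKey);
const fileManager = new GoogleAIFileManager(apiKey);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

// uploadFile resolves to { file }; video is processed asynchronously after upload.
let { file } = await fileManager.uploadFile('video.mp4', { mimeType: 'video/mp4' });
while (file.state === FileState.PROCESSING) {
  await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5 s
  file = await fileManager.getFile(file.name);
}

const r = await model.generateContent([
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } },
  'Summarize this video.',
]);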
```

### Image generation (DALL-E / Imagen / Stable Diffusion)
```ts
const r = await openai.images.generate({
  model: 'dall-e-3',
  prompt: 'A red cat in space',
  size: '1024x1024',
  quality: 'hd',
});
const url = r.data[0].url;
```

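### Cost Reduction
- Vision detail: low / auto / high.
- Downscale images before upload (about 1024 px on the long edge is usually enough).
- Cache: hash the image bytes and reuse the stored result for identical images.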
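The caching bullet can be sketched as an in-memory memoizer keyed by a SHA-256 hash. Everything here is illustrative: `analyzeCached`, `cacheKey`, and the `analyze` callback are hypothetical names, and including the prompt in the key is an added assumption, since the same image with a different question should not return a stale answer.

```typescript
import { createHash } from 'node:crypto';

// In-memory cache; swap for Redis or a database in a real deployment.
const cache = new Map<string, Promise<string>>();

// Key = SHA-256 over image bytes + prompt (NUL separator avoids ambiguity).
function cacheKey(image: Uint8Array, prompt: string): string {
  return createHash('sha256').update(image).update('\0').update(prompt).digest('hex');
}

// Call `analyze` (e.g. a vision LLM request) at most once per image/prompt pair.
function analyzeCached(
  image: Uint8Array,
  prompt: string,
  analyze: (image: Uint8Array, prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(image, prompt);
  let hit = cache.get(key);
  if (!hit) {
    hit = analyze(image, prompt);
    hit.catch(() => cache.delete(key)); // do not keep failed requests cached
    cache.set(key, hit);
  }
  return hit;
}
```

Storing the promise rather than the resolved value means concurrent requests for the same image share one upstream call.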

## 🤔 Decision Criteria
| Input | Recommendation |
|---|---|
| Receipt OCR | Vision LLM (Claude / GPT-4o) or Textract |
| Chart interpretation | Vision LLM |
| Transcribing user speech | Whisper / Deepgram (real-time) |
| Natural-sounding speech output | ElevenLabs / OpenAI TTS |
| Voice conversation agent | OpenAI Realtime / Pipecat |
| Video analysis | Gemini 1.5+ |
| Image generation | DALL-E / Flux / Imagen |

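## ❌ Anti-patterns
- **Sending large images as base64**: token cost grows with image size. Resize first.
- **`detail: 'high'` for every image**: `low` or `auto` is often enough.
- **Sending audio that contains PII to an external API as-is**: a privacy risk. Run Whisper on-prem.
- **Voice conversation without streaming responses**: latency of 5 s or more, which no longer feels real-time.
- **Feeding an entire video as input**: token count explodes with every minute of footage. Extract key frames.
- **Trusting OCR output as-is**: review it, or validate it through structured output.
- **Base64-encoding large files in memory**: use streaming or multipart upload instead.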
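For the point about not trusting OCR output, one option is to ask the model for JSON and validate it before use. The sketch below assumes a hypothetical receipt schema (`items` with `name` and `price`, plus `total`) and uses an arithmetic cross-check as an example of a domain rule. Real receipts with tax or discounts would need a looser check.

```typescript
// Hypothetical schema for the JSON the vision model is asked to return.
interface Receipt { items: { name: string; price: number }[]; total: number }

// Parse model output and reject anything that fails shape or arithmetic checks.
function parseReceipt(raw: string): Receipt {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== 'object' || data === null) throw new Error('not an object');

  const { items, total } = data as { items?: unknown; total?: unknown };
  if (!Array.isArray(items) || typeof total !== 'number') {
    throw new Error('missing items or total');
  }

  const parsed = items.map((it) => {
    if (typeof it?.name !== 'string' || typeof it?.price !== 'number') {
      throw new Error('bad line item');
    }
    return { name: it.name, price: it.price };
  });

  // Cross-check: line items should add up to the stated total (half-cent tolerance).
  const sum = parsed.reduce((acc, it) => acc + it.price, 0);
  if (Math.abs(sum - total) > 0.005) {
    throw new Error(`total ${total} does not match sum of items ${sum}`);
  }
  return { items: parsed, total };
}
```

A failed check is a signal to retry with a higher `detail` setting or route the image to manual review rather than store the result.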

## 🤖 Hints for LLM Use
- Vision input = base64 or URL.
- Whisper = the de facto standard for STT.
- Realtime APIs are the likely direction for voice agents.
- Cost levers: `detail` level, image dimensions, caching.

## 🔗 Related Documents
- [[AI_Function_Calling_Deep]]
- [[AI_Streaming_LLM_Response]]
- [[Frontend_Image_Optimization]]