---
id: ai-vision-agents
title: Vision Agents - Screen / OCR / Browser Automation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, vision, agent, automation, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [computer use, browser agent, screen agent, OCR agent, GUI automation, Claude Computer Use]
---
# Vision Agents
> An LLM looks at a screenshot, then clicks / types. **Anthropic Computer Use, OpenAI Operator, browser-use, Stagehand**. Action loop = screenshot → LLM → action → repeat.
## 📖 Core Concepts
- Input: screenshot-based or DOM-based.
- Actions: click(x, y), type(text), scroll, key.
- Browser: automated through Playwright / Selenium.
- Desktop: needs system permissions + Accessibility access.
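
The action vocabulary above can be modeled as a discriminated union, so that raw model output is validated before anything touches the screen. This is a minimal sketch: the `Action` type, its field names, and the `parseAction` helper are hypothetical names chosen for illustration, not part of any SDK.

```ts
// Hypothetical action vocabulary mirroring the list above.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string }
  | { kind: 'scroll'; dy: number }
  | { kind: 'key'; key: string };

const isNum = (v: unknown): v is number => typeof v === 'number' && Number.isFinite(v);
const isStr = (v: unknown): v is string => typeof v === 'string';

// Validate untrusted model output before executing it.
// Returns null for anything that does not match a known action shape.
function parseAction(raw: unknown): Action | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const { x, y, text, dy, key } = r;
  switch (r.kind) {
    case 'click':
      return isNum(x) && isNum(y) ? { kind: 'click', x, y } : null;
    case 'type':
      return isStr(text) ? { kind: 'type', text } : null;
    case 'scroll':
      return isNum(dy) ? { kind: 'scroll', dy } : null;
    case 'key':
      return isStr(key) ? { kind: 'key', key } : null;
    default:
      return null;
  }
}
```

Rejecting malformed actions with `null` lets the loop report the problem back to the model instead of executing a half-specified click.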
## 💻 Code Patterns
### Anthropic Computer Use
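```ts
import Anthropic from '@anthropic-ai/sdk';

// Host-side helpers, assumed to be implemented elsewhere (e.g. with Playwright or nut-js).
declare function screenshot(): Promise<Buffer>;
declare function clickAt(coordinate: [number, number]): Promise<void>;
declare function typeText(text: string): Promise<void>;
declare function pressKey(key: string): Promise<void>;
declare function scroll(direction: string, amount: number): Promise<void>;

const client = new Anthropic();

async function computerUseLoop(task: string) {
  const messages: any[] = [{ role: 'user', content: task }];
  // Hard cap on iterations so the agent cannot loop forever.
  for (let i = 0; i < 30; i++) {
    const r = await client.beta.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      tools: [{
        type: 'computer_20250124',
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      messages,
      betas: ['computer-use-2025-01-24'],
    });
    messages.push({ role: 'assistant', content: r.content });

    // Collect tool_use blocks; no tool calls means the model is done.
    const toolUses = r.content.filter((b: any) => b.type === 'tool_use') as any[];
    if (r.stop_reason === 'end_turn' || toolUses.length === 0) return r;

    // Run actions one at a time: GUI actions depend on ordering.
    const toolResults: any[] = [];
    for (const t of toolUses) {
      toolResults.push({
        type: 'tool_result' as const,
        tool_use_id: t.id,
        content: await executeAction(t.input),
      });
    }
    messages.push({ role: 'user', content: toolResults });
  }
  throw new Error('computerUseLoop: iteration cap reached before the task finished');
}

async function executeAction(input: any) {
  switch (input.action) {
    case 'screenshot': {
      const buf = await screenshot();
      return [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') } }];
    }
    case 'left_click':
      await clickAt(input.coordinate);
      return [{ type: 'text', text: 'clicked' }];
    case 'type':
      await typeText(input.text);
      return [{ type: 'text', text: 'typed' }];
    case 'key':
      // The key combination arrives in the `text` field.
      await pressKey(input.text);
      return [{ type: 'text', text: 'pressed' }];
    case 'scroll':
      // computer_20250124 sends scroll_direction / scroll_amount.
      await scroll(input.scroll_direction, input.scroll_amount);
      return [{ type: 'text', text: 'scrolled' }];
    default:
      // Tell the model instead of silently returning undefined.
      return [{ type: 'text', text: `unsupported action: ${input.action}` }];
  }
}
```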
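
The tool definition above declares a 1280×800 display. When the real screen has a different resolution, one common approach is to downscale screenshots to the declared size and scale the model's coordinates back up before clicking. The sketch below shows only that mapping; `Size` and `toScreenCoords` are hypothetical helpers, not part of the Anthropic SDK.

```ts
interface Size { width: number; height: number }

// Map a coordinate in the declared (model-facing) display space
// to the actual screen, clamped to the last valid pixel.
function toScreenCoords(
  [x, y]: [number, number],
  declared: Size,
  actual: Size,
): [number, number] {
  const sx = actual.width / declared.width;
  const sy = actual.height / declared.height;
  const clamp = (v: number, max: number) => Math.min(max, Math.max(0, v));
  return [
    clamp(Math.round(x * sx), actual.width - 1),
    clamp(Math.round(y * sy), actual.height - 1),
  ];
}
```

With this in place, a `left_click` handler would call `toScreenCoords(input.coordinate, declared, actual)` before passing the point to the host's click helper.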
### Browser agent (Playwright + Claude / GPT)
```ts
import { chromium } from 'playwright';
import Anthropic from '@anthropic-ai/sdk';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

async function browserLoop(task: string) {
  // ... agent loop with tools:
  const tools = [
    {
      name: 'screenshot',
      description: 'Take a screenshot of the page',
      input_schema: { type: 'object', properties: {} },
    },
    {
      name: 'click',
      description: 'Click element by visible text or selector',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
      },
    },
    {
      name: 'type',
      description: 'Type into input',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
        required: ['selector', 'text'],
      },
    },
    {
      name: 'goto',
      description: 'Navigate to URL',
      input_schema: {
        type: 'object',
        properties: { url: { type: 'string' } },
        required: ['url'],
      },
    },
  ];
}
```
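
The tool list above still needs something that turns a tool call into a browser action. A minimal sketch of such a dispatcher follows. `PageLike` is a hypothetical interface covering only the four Playwright `Page` methods used here (`goto`, `click`, `fill`, `screenshot`), so the function can be exercised with a fake page; `runTool` and the result shapes are illustrative names, not an existing API.

```ts
// Narrow view of a page. A Playwright Page is assumed to satisfy it structurally;
// Node's Buffer satisfies the screenshot return type.
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  screenshot(): Promise<{ toString(encoding: 'base64'): string }>;
}

type ToolInput = { selector?: string; text?: string; url?: string };
type ToolResult =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: 'image/png'; data: string } };

const textResult = (text: string): ToolResult[] => [{ type: 'text', text }];

async function runTool(page: PageLike, name: string, input: ToolInput): Promise<ToolResult[]> {
  try {
    switch (name) {
      case 'screenshot': {
        const png = await page.screenshot();
        return [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: png.toString('base64') } }];
      }
      case 'click': {
        // Prefer an explicit selector; fall back to Playwright's text= selector engine.
        const target = input.selector ?? (input.text ? `text=${input.text}` : undefined);
        if (!target) return textResult('error: click needs selector or text');
        await page.click(target);
        return textResult(`clicked ${target}`);
      }
      case 'type':
        if (!input.selector || input.text === undefined) return textResult('error: type needs selector and text');
        await page.fill(input.selector, input.text);
        return textResult(`typed into ${input.selector}`);
      case 'goto':
        if (!input.url) return textResult('error: goto needs url');
        await page.goto(input.url);
        return textResult(`navigated to ${input.url}`);
      default:
        return textResult(`error: unknown tool ${name}`);
    }
  } catch (err) {
    // Surface failures to the model as text so it can retry with a different target.
    return textResult(`error: ${err instanceof Error ? err.message : String(err)}`);
  }
}
```

Returning errors as text results, rather than throwing, keeps the agent loop alive when a selector does not match.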
### Stagehand (Browserbase, modern)
```ts
import { Stagehand } from '@browserbasehq/stagehand';
import { z } from 'zod';

const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();
const page = stagehand.page;
await page.goto('https://docs.example.com');

// Natural-language action
await page.act('Click the "Get Started" button');
await page.act('Type "search query" into the search bar');

// Natural-language extract (schema is a zod object)
const result = await page.extract({
  instruction: 'extract the first 3 article titles',
  schema: z.object({ titles: z.array(z.string()) }),
});

// Natural-language observe
const action = await page.observe({ instruction: 'Find the login button' });
```
→ Arguably the simplest production-ready browser agent.
### browser-use (Python, popular)
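```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic


async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on May 15',
        llm=ChatAnthropic(model='claude-opus-4-7'),
    )
    # agent.run() is a coroutine, so it must be awaited inside an event loop
    result = await agent.run()
    print(result)


asyncio.run(main())
```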
### Set-of-Marks (SoM)
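```
Overlay a numbered label on every clickable element in the screenshot.
The LLM then says something like "click element 7".
→ More accurate than coordinate-based clicking.
```
```ts
// Collect a bounding box + index for every clickable element
const labeled = await page.evaluate(() => {
  const elements = document.querySelectorAll('a, button, input, [role="button"]');
  // querySelectorAll returns a NodeList, which has no .map(); convert it first
  return Array.from(elements).map((el, i) => {
    const rect = el.getBoundingClientRect();
    return { idx: i, x: rect.x, y: rect.y, w: rect.width, h: rect.height };
  });
});
// Draw boxes + numbers on a canvas overlay → take the screenshot
```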
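
Once the model answers with something like "click element 7", the label has to be mapped back to a click point. A minimal sketch, assuming the `{ idx, x, y, w, h }` box shape collected above; `Mark` and `resolveMark` are hypothetical names, and the reply format is an assumption about how the model is prompted.

```ts
interface Mark { idx: number; x: number; y: number; w: number; h: number }

// Parse "element N" out of the model's reply and return the center of that box.
// Returns null if no label is mentioned, the label is unknown, or the box is empty.
function resolveMark(marks: Mark[], reply: string): { x: number; y: number } | null {
  const match = /element\s+(\d+)/i.exec(reply);
  if (!match) return null;
  const idx = Number(match[1]);
  const mark = marks.find((m) => m.idx === idx);
  if (!mark || mark.w <= 0 || mark.h <= 0) return null;
  return {
    x: Math.round(mark.x + mark.w / 2),
    y: Math.round(mark.y + mark.h / 2),
  };
}
```

Clicking the box center rather than a model-predicted pixel is what makes this approach less sensitive to small coordinate errors.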
### OCR agent (textract, paddle, tesseract)
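```ts
import Tesseract from 'tesseract.js';
import Anthropic from '@anthropic-ai/sdk';

// Image → text with a local OCR engine
const ocr = await Tesseract.recognize('document.png', 'eng+kor');
console.log(ocr.data.text);

// Or LLM vision (more accurate)
const anthropic = new Anthropic();
const llm = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: [
    { type: 'image', source: { ... } }, // image source left elided, as in the original note
    { type: 'text', text: 'Extract all text. Output JSON with fields.' },
  ]}],
});
```
→ Suited to receipts, forms, and tables.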
### Desktop automation (cross-platform)
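```ts
// Option 1: Anthropic computer-use container
// Option 2: nut-js (Node) / pyautogui (Python)
// Note: screen.find on an image needs an image-matching plugin registered with nut-js.
import { mouse, screen, Button, centerOf, straightTo, imageResource } from '@nut-tree-fork/nut-js';

// Locate the button image on screen, move to its center, then click
await mouse.move(straightTo(centerOf(screen.find(imageResource('button.png')))));
await mouse.click(Button.LEFT);
```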
### Anti-bot / detection
```
Sites detect bots → CAPTCHA / blocking.
Mitigations:
- Playwright stealth plugin
- Browserbase / Anchor (cloud): handles IP / fingerprint
- Reasonable delays / mouse movement
```
→ Check the site's ToS.
### Cost
```
Computer Use: every turn costs a screenshot plus an LLM call.
A large task (100 steps) = $5+.
```
→ Self-host a vision-capable LLM, or cache.
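
At $5+ per 100 steps, each step costs roughly $0.05, so a hard budget cap in the loop is cheap insurance. The sketch below accumulates cost from the `usage` token counts returned with each response and throws once the cap is exceeded. `createBudget` is a hypothetical helper, and the per-million-token prices in the usage note are placeholders, not real pricing.

```ts
interface Usage { input_tokens: number; output_tokens: number }
interface Price { inputUsdPerMTok: number; outputUsdPerMTok: number }

// Track cumulative spend and fail fast once the cap is exceeded.
function createBudget(capUsd: number, price: Price) {
  let spentUsd = 0;
  return {
    // Add one response's usage; returns the running total in USD.
    record(usage: Usage): number {
      spentUsd +=
        (usage.input_tokens / 1_000_000) * price.inputUsdPerMTok +
        (usage.output_tokens / 1_000_000) * price.outputUsdPerMTok;
      if (spentUsd > capUsd) {
        throw new Error(`budget cap exceeded: $${spentUsd.toFixed(2)} > $${capUsd.toFixed(2)}`);
      }
      return spentUsd;
    },
    spent: (): number => spentUsd,
  };
}
```

In an agent loop this would be called as `budget.record(r.usage)` right after each model call, for example with `createBudget(5, { inputUsdPerMTok: 3, outputUsdPerMTok: 15 })` where both prices are illustrative values.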
### Test
```
Hard: the same screen can look different on every run.
- Mock browser
- Recorded scenarios
- Smoke test (core paths such as "login")
```
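
One way to make the "recorded scenarios" idea concrete is to replace the model with a stub that replays a recorded action sequence, then assert on what the executor received. This is a sketch only: `Step`, `Model`, `scriptedModel`, and `runAgent` are hypothetical names, and a real harness would also replay recorded screenshots.

```ts
type Step = { action: string; target?: string; text?: string };
// A model maps the history so far to the next step, or null when the task is done.
type Model = (history: Step[]) => Promise<Step | null>;

// Deterministic stand-in for the LLM: replays a recording one step per call.
function scriptedModel(recording: Step[]): Model {
  return async (history) => recording[history.length] ?? null;
}

// Generic agent loop with an iteration cap; the executor is injected so tests can fake it.
async function runAgent(
  model: Model,
  execute: (step: Step) => Promise<void>,
  maxSteps = 30,
): Promise<Step[]> {
  const history: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = await model(history);
    if (step === null) return history;
    await execute(step);
    history.push(step);
  }
  throw new Error(`agent did not finish within ${maxSteps} steps`);
}
```

Because the model is deterministic here, a smoke test for the login path reduces to comparing the executed steps against the recording.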
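### Safety
```ts
// Ask the user to confirm before clicking anything that looks dangerous.
// getElementText and confirmWithUser are app-specific helpers assumed to exist.
const dangerous = ['delete', 'pay', 'send'];

async function guardClick(toolUse: { input: { action: string; coordinate: [number, number] } }) {
  if (toolUse.input.action === 'left_click') {
    // Resolve the label of whatever sits under the click point
    const target = await getElementText(toolUse.input.coordinate);
    if (dangerous.some(d => target.toLowerCase().includes(d))) {
      const ok = await confirmWithUser(`Click "${target}"?`);
      if (!ok) return { skipped: true };
    }
  }
  return { skipped: false };
}
```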
## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| General web automation | Stagehand (modern) |
| Advanced / open source | browser-use |
| Cloud-hosted browser | Browserbase + Stagehand |
| Desktop GUI | Anthropic Computer Use container |
| Form / receipt OCR | LLM Vision |
| Reliable existing flow | Playwright fixed script |
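## ❌ Anti-patterns
- **Hardcoded coordinates**: break across viewport / resolution differences. Use text / selectors instead.
- **Dangerous actions without confirmation**: payments / deletions happen automatically.
- **No max iterations**: the LLM can loop forever.
- **No cost monitoring**: the bill explodes.
- **Production scraping that ignores ToS**: blocking / legal risk.
- **Logging screen recordings**: captures PII / passwords.
- **Solving CAPTCHAs automatically**: almost always a ToS violation.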
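
The logging concern applies to action logs as well as screen recordings: the text of a `type` action may be a password. A minimal sketch of redacting it before the action is written to a log; the `action` / `text` / `coordinate` fields follow the Computer Use input shape used earlier in this note, while `ActionInput` and `redactForLog` are hypothetical names.

```ts
interface ActionInput {
  action: string;
  text?: string;
  coordinate?: [number, number];
}

// Return a copy that is safe to log: typed text is replaced by its length only.
function redactForLog(input: ActionInput): ActionInput {
  if (input.action === 'type' && input.text !== undefined) {
    return { ...input, text: `[redacted ${input.text.length} chars]` };
  }
  return input;
}
```

Only the copy goes to the log; the original input is still what gets executed.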
## 🤖 LLM Usage Hints
- Web: Stagehand is the quickest start.
- Computer Use: run it in a container (sandbox).
- Set-of-Marks improves accuracy.
- Confirm dangerous actions + set a budget cap.
## 🔗 Related Documents
- [[AI_Function_Calling_Deep]]
- [[AI_Agentic_Patterns]]
- [[AI_Multimodal_Vision_Patterns]]