---
id: ai-vision-agents
title: "Vision Agents: Screen / OCR / Browser Automation"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, vision, agent, automation, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [computer use, browser agent, screen agent, OCR agent, GUI automation, Claude Computer Use]
---

# Vision Agents

> An LLM looks at a screenshot, then clicks or types. Examples: **Anthropic Computer Use, OpenAI Operator, browser-use, Stagehand**. The action loop is screenshot → LLM → action → repeat.

## 📖 Core Concepts

- Input: screenshot-based or DOM-based.
- Actions: click(x, y), type(text), scroll, key.
- Browser: automation through Playwright / Selenium.
- Desktop: requires system permissions and Accessibility access.

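The screenshot → LLM → action → repeat loop can be sketched independently of any vendor SDK. The version below is a minimal sketch, not a reference implementation: `Action`, `Model`, `Environment`, and `runAgent` are hypothetical names introduced for illustration. The model and the environment are injected so the loop can be exercised with fakes.

```typescript
// Hypothetical action vocabulary mirroring the list above.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string }
  | { kind: 'scroll'; dy: number }
  | { kind: 'key'; key: string }
  | { kind: 'done'; summary: string };

// Assumed interfaces: a model that picks the next action, and an
// environment that can take screenshots and perform actions.
interface Model {
  next(task: string, screenshot: string, history: Action[]): Promise<Action>;
}
interface Environment {
  screenshot(): Promise<string>; // e.g. a base64-encoded PNG
  perform(action: Action): Promise<void>;
}

async function runAgent(
  task: string,
  model: Model,
  env: Environment,
  maxSteps = 30,
): Promise<{ summary: string; steps: number }> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await env.screenshot();                  // 1. observe
    const action = await model.next(task, shot, history); // 2. decide
    if (action.kind === 'done') return { summary: action.summary, steps: step };
    await env.perform(action);                            // 3. act
    history.push(action);
  }
  // A hard cap keeps a confused model from looping forever.
  throw new Error(`agent did not finish within ${maxSteps} steps`);
}
```

Because both dependencies are plain interfaces, the same loop can run against a scripted fake model in tests and against a real LLM client in production.
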
## 💻 Code Patterns

### Anthropic Computer Use
```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Platform helpers are not shown; implement them with your own driver.
declare function screenshot(): Promise<Buffer>;
declare function clickAt(coordinate: [number, number]): Promise<void>;
declare function typeText(text: string): Promise<void>;
declare function pressKey(keys: string): Promise<void>;
declare function scroll(direction: string, amount: number): Promise<void>;

async function computerUseLoop(task: string) {
  const messages: any[] = [{ role: 'user', content: task }];

  // Hard cap on iterations so the agent cannot loop forever.
  for (let i = 0; i < 30; i++) {
    const r = await client.beta.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      // The tool version and beta flag must match each other and the model.
      tools: [{
        type: 'computer_20250124',
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      messages,
      betas: ['computer-use-2025-01-24'],
    });

    messages.push({ role: 'assistant', content: r.content });

    const toolUses = r.content.filter((b: any) => b.type === 'tool_use') as any[];
    // No tool calls left: the model considers the task finished.
    if (r.stop_reason === 'end_turn' || toolUses.length === 0) return r;

    // Run actions one at a time; GUI actions must not race each other.
    const toolResults: any[] = [];
    for (const t of toolUses) {
      toolResults.push({
        type: 'tool_result' as const,
        tool_use_id: t.id,
        content: await executeAction(t.input),
      });
    }

    messages.push({ role: 'user', content: toolResults });
  }
  throw new Error('computerUseLoop: reached max iterations without finishing');
}

async function executeAction(input: any) {
  switch (input.action) {
    case 'screenshot': {
      const buf = await screenshot();
      return [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') } }];
    }
    case 'left_click':
      await clickAt(input.coordinate);
      return [{ type: 'text', text: 'clicked' }];
    case 'type':
      await typeText(input.text);
      return [{ type: 'text', text: 'typed' }];
    case 'key':
      await pressKey(input.text);
      return [{ type: 'text', text: 'pressed' }];
    case 'scroll':
      // computer_20250124 names these fields scroll_direction / scroll_amount.
      await scroll(input.scroll_direction, input.scroll_amount);
      return [{ type: 'text', text: 'scrolled' }];
    default:
      // Report unsupported actions to the model instead of returning undefined.
      return [{ type: 'text', text: `unsupported action: ${input.action}` }];
  }
}
```

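The model returns coordinates in the `display_width_px` × `display_height_px` space declared to the tool. When the physical screen is larger and screenshots are downscaled before sending, those coordinates have to be mapped back before clicking. The sketch below assumes a plain linear scale. `Size`, `toScreen`, and the sample resolutions are illustrative and not part of the Anthropic SDK.

```typescript
interface Size { width: number; height: number }

// Map a point from the (downscaled) space the model sees to real screen pixels.
function toScreen(point: [number, number], model: Size, real: Size): [number, number] {
  const [x, y] = point;
  if (x < 0 || y < 0 || x > model.width || y > model.height) {
    throw new RangeError(`coordinate (${x}, ${y}) is outside ${model.width}x${model.height}`);
  }
  return [
    Math.round((x * real.width) / model.width),
    Math.round((y * real.height) / model.height),
  ];
}

// Example values only: a 2560x1600 screen presented to the model as 1280x800.
const declared: Size = { width: 1280, height: 800 };
const physical: Size = { width: 2560, height: 1600 };
console.log(toScreen([640, 400], declared, physical)); // → [ 1280, 800 ]
```

Rejecting out-of-range points early turns a hallucinated coordinate into an explicit error that can be reported back to the model rather than a click in the wrong place.
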
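### Browser agent (Playwright + Claude / GPT)
```ts
import { chromium, type Page } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Tool definitions handed to the LLM (Anthropic-style `input_schema`).
const tools = [
  {
    name: 'screenshot',
    description: 'Take a screenshot of the page',
    input_schema: { type: 'object', properties: {} },
  },
  {
    name: 'click',
    description: 'Click element by visible text or selector',
    input_schema: {
      type: 'object',
      properties: { selector: { type: 'string' }, text: { type: 'string' } },
    },
  },
  {
    name: 'type',
    description: 'Type into input',
    input_schema: {
      type: 'object',
      properties: { selector: { type: 'string' }, text: { type: 'string' } },
      required: ['selector', 'text'],
    },
  },
  {
    name: 'goto',
    description: 'Navigate to URL',
    input_schema: { type: 'object', properties: { url: { type: 'string' } }, required: ['url'] },
  },
];

// Executes one tool call against the page. The agent loop itself (send `tools`
// to the model, run each tool_use, return the tool_result) has the same shape
// as the Computer Use loop.
async function runTool(page: Page, name: string, input: any): Promise<string> {
  switch (name) {
    case 'screenshot':
      return (await page.screenshot()).toString('base64');
    case 'click':
      // Prefer an explicit selector; fall back to visible text.
      if (input.selector) await page.locator(input.selector).click();
      else await page.getByText(input.text).first().click();
      return 'clicked';
    case 'type':
      await page.fill(input.selector, input.text);
      return 'typed';
    case 'goto':
      await page.goto(input.url);
      return 'navigated';
    default:
      return `unknown tool: ${name}`;
  }
}
```
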
### Stagehand (Browserbase, modern)
```ts
import { Stagehand } from '@browserbasehq/stagehand';
import { z } from 'zod';

const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();
const page = stagehand.page;

await page.goto('https://docs.example.com');

// Natural-language action
await page.act('Click the "Get Started" button');
await page.act('Type "search query" into the search bar');

// Natural-language extract, shaped by a zod schema
const result = await page.extract({
  instruction: 'extract the first 3 article titles',
  schema: z.object({ titles: z.array(z.string()) }),
});

// Natural-language observe: find candidates without acting on them
const action = await page.observe({ instruction: 'Find the login button' });

await stagehand.close();
```

→ The simplest route to a production-ready browser agent.

### browser-use (Python, popular)
```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic


async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on May 15',
        llm=ChatAnthropic(model='claude-opus-4-7'),
    )
    # agent.run() is a coroutine, so it has to run inside an event loop.
    result = await agent.run()
    print(result)


asyncio.run(main())
```

### Set-of-Marks (SoM)
```
Draw a numbered label on every clickable element in the screenshot.
The LLM then answers with something like "click element 7".
→ Usually more accurate than coordinate-based clicking.
```

```ts
// Collect an index and bounding box for every clickable element
const labeled = await page.evaluate(() => {
  // querySelectorAll returns a NodeList, which has no .map(); convert it first.
  const elements = Array.from(document.querySelectorAll('a, button, input, [role="button"]'));
  return elements.map((el, i) => {
    const rect = el.getBoundingClientRect();
    return { idx: i, x: rect.x, y: rect.y, w: rect.width, h: rect.height };
  });
});
// Next: draw each box and its number on a canvas, then take the screenshot
```

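Once the model answers with a label such as "click element 7", the label still has to be turned back into a click point. The following sketch assumes boxes shaped like the `{ idx, x, y, w, h }` records collected above. `Mark`, `visibleMarks`, `parseMarkReply`, and `centerOf` are hypothetical helper names, and the reply format is an assumption about how the model is prompted.

```typescript
interface Mark { idx: number; x: number; y: number; w: number; h: number }

// Drop zero-area (invisible) elements so the model is never offered one.
function visibleMarks(marks: Mark[]): Mark[] {
  return marks.filter((m) => m.w > 0 && m.h > 0);
}

// Parse replies such as "click element 7"; returns the label number or null.
function parseMarkReply(reply: string): number | null {
  const match = /element\s+(\d+)/i.exec(reply);
  return match ? Number(match[1]) : null;
}

// Translate a label number into the center point of its bounding box.
function centerOf(marks: Mark[], idx: number): { x: number; y: number } {
  const mark = marks.find((m) => m.idx === idx);
  if (!mark) throw new Error(`no element labeled ${idx}`);
  return { x: mark.x + mark.w / 2, y: mark.y + mark.h / 2 };
}
```

Clicking the box center instead of a model-supplied pixel keeps the click inside the element even when the screenshot is rescaled.
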
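### OCR agent (textract, paddle, tesseract)
```ts
import { readFile } from 'node:fs/promises';
import Tesseract from 'tesseract.js';
import Anthropic from '@anthropic-ai/sdk';

// Image → text with a classic OCR engine
const ocr = await Tesseract.recognize('document.png', 'eng+kor');
console.log(ocr.data.text);

// Or use LLM vision (often more accurate on messy layouts)
const anthropic = new Anthropic();
const image = await readFile('document.png');
const vision = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 2048, // required by the Messages API
  messages: [{ role: 'user', content: [
    { type: 'image', source: { type: 'base64', media_type: 'image/png', data: image.toString('base64') } },
    { type: 'text', text: 'Extract all text. Output JSON with fields.' },
  ]}],
});
```

→ Handles receipts, forms, and tables.
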
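When the prompt asks for JSON, the reply often arrives wrapped in a Markdown code fence or surrounded by a sentence or two, so it helps to parse defensively. This is a minimal sketch. `extractJson` is a hypothetical helper, and it only handles a single top-level JSON object.

```typescript
// Pull the first JSON object out of a model reply, tolerating a Markdown
// code fence and surrounding chatter. Returns null when nothing parses.
function extractJson(reply: string): Record<string, unknown> | null {
  const fence = '`'.repeat(3);
  const fenced = new RegExp(fence + '(?:json)?\\s*([\\s\\S]*?)' + fence, 'i').exec(reply);
  const candidate = fenced?.[1] ?? reply;

  // Keep only the outermost {...} span.
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  try {
    const parsed: unknown = JSON.parse(candidate.slice(start, end + 1));
    return typeof parsed === 'object' && parsed !== null
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null; // malformed JSON: let the caller retry or fall back to OCR
  }
}
```

Returning `null` instead of throwing lets the caller decide whether to re-prompt the model or fall back to a classic OCR engine.
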
### Desktop automation (cross-platform)
```ts
// Option 1: the Anthropic computer-use container
// Option 2: nut.js (Node) or pyautogui (Python)
import { mouse, screen, Button, straightTo, centerOf, imageResource } from '@nut-tree-fork/nut-js';

// Image search needs an image-finder plugin registered with nut.js.
await mouse.move(straightTo(centerOf(screen.find(imageResource('button.png')))));
await mouse.click(Button.LEFT);
```

### Anti-bot / detection
```
Sites detect bots → CAPTCHA / blocking.

Countermeasures:
- Playwright stealth plugin
- Browserbase / Anchor (cloud): handles IP / fingerprint
- Reasonable delays / mouse movement
```

→ Check the site's ToS.

### Cost
```
Computer Use: every turn costs one screenshot plus one LLM call.
A large task (100 steps) = $5+.
```

→ Self-host a vision-capable LLM, or cache.

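A rough estimate makes the "100 steps = $5+" figure easier to reason about. The sketch below uses a flat per-step model. `CostInputs` and `estimateCostUsd` are hypothetical names, and the token counts and prices in the example call are placeholders, not real pricing.

```typescript
interface CostInputs {
  steps: number;
  inputTokensPerStep: number;  // screenshot + prompt
  outputTokensPerStep: number; // the model's chosen action
  inputUsdPerMTok: number;     // price per million input tokens
  outputUsdPerMTok: number;    // price per million output tokens
}

// Flat per-step estimate. Real runs tend to cost more, because the growing
// history (including earlier screenshots) is resent on every turn unless it
// is cached or trimmed.
function estimateCostUsd(c: CostInputs): number {
  const perStepMicroUsd =
    c.inputTokensPerStep * c.inputUsdPerMTok +
    c.outputTokensPerStep * c.outputUsdPerMTok;
  return (c.steps * perStepMicroUsd) / 1_000_000;
}

// Placeholder numbers for illustration only.
console.log(estimateCostUsd({
  steps: 100,
  inputTokensPerStep: 3000,
  outputTokensPerStep: 200,
  inputUsdPerMTok: 15,
  outputUsdPerMTok: 75,
})); // → 6
```

With these placeholder inputs the screenshot-heavy input side accounts for three quarters of the total, which is why caching or trimming old screenshots is the first lever to pull.
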
### Test
```
Hard: the same screen can look different on every run.
- Mock browser
- Recorded scenarios
- Smoke test (a critical path such as "login")
```

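One way to use recorded scenarios despite run-to-run differences is to compare action sequences with a pixel tolerance rather than exact equality. The sketch below is one possible approach. `Recorded`, `firstMismatch`, and the 5-pixel default are assumptions chosen for illustration.

```typescript
type Recorded =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string };

// Compare a live run against a recording. Clicks may drift by `tolerancePx`;
// typed text must match exactly. Returns the index of the first mismatch,
// or -1 when the runs match.
function firstMismatch(expected: Recorded[], actual: Recorded[], tolerancePx = 5): number {
  const n = Math.max(expected.length, actual.length);
  for (let i = 0; i < n; i++) {
    const e = expected[i];
    const a = actual[i];
    if (!e || !a || e.kind !== a.kind) return i; // missing step or different action
    if (e.kind === 'click' && a.kind === 'click') {
      if (Math.abs(e.x - a.x) > tolerancePx || Math.abs(e.y - a.y) > tolerancePx) return i;
    } else if (e.kind === 'type' && a.kind === 'type') {
      if (e.text !== a.text) return i;
    }
  }
  return -1;
}
```

Reporting the index of the first divergence, rather than a boolean, points directly at the step where the live run left the recorded path.
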
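### Safety
```ts
// Hypothetical helpers: resolve the label under a point, and ask the user.
declare function getElementText(coordinate: [number, number]): Promise<string>;
declare function confirmWithUser(question: string): Promise<boolean>;

// Ask the user before clicking anything that looks dangerous.
const dangerous = ['delete', 'pay', 'send'];

async function confirmIfDangerous(toolUse: { input: any }) {
  if (toolUse.input.action === 'left_click') {
    const target = await getElementText(toolUse.input.coordinate);
    if (dangerous.some(d => target.toLowerCase().includes(d))) {
      const ok = await confirmWithUser(`Click "${target}"?`);
      if (!ok) return { skipped: true };
    }
  }
  return { skipped: false };
}
```
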
## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| General web automation | Stagehand (modern) |
| Advanced / open source | browser-use |
| Cloud-hosted browser | Browserbase + Stagehand |
| Desktop GUI | Anthropic Computer Use container |
| Form / receipt OCR | LLM Vision |
| Reliable existing flow | Playwright fixed script |

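## ❌ Anti-patterns
- **Hardcoded coordinates**: break when the viewport or resolution changes. Target text or selectors instead.
- **Dangerous actions without confirmation**: payments or deletions run automatically.
- **No max iterations**: the LLM can loop forever.
- **No cost monitoring**: the bill explodes.
- **Ignoring ToS when scraping in your own production**: leads to blocking or legal exposure.
- **Logging screen recordings**: captures PII and passwords.
- **Auto-solving CAPTCHAs**: almost always a ToS violation.
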
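The "no max iterations" and "no cost monitoring" items can be addressed by one small guard that is called once per agent turn. This is a minimal sketch: `BudgetGuard` is a hypothetical class, and the per-turn cost passed to `record` is assumed to come from your own usage accounting.

```typescript
class BudgetGuard {
  private steps = 0;
  private spentUsd = 0;
  private readonly maxSteps: number;
  private readonly maxUsd: number;

  constructor(maxSteps: number, maxUsd: number) {
    this.maxSteps = maxSteps;
    this.maxUsd = maxUsd;
  }

  // Call once per agent turn with that turn's estimated cost.
  // Throws as soon as either limit is exceeded, which stops the loop.
  record(costUsd: number): void {
    this.steps += 1;
    this.spentUsd += costUsd;
    if (this.steps > this.maxSteps) {
      throw new Error(`step limit ${this.maxSteps} exceeded`);
    }
    if (this.spentUsd > this.maxUsd) {
      throw new Error(`budget of $${this.maxUsd} exceeded`);
    }
  }

  get usage(): { steps: number; spentUsd: number } {
    return { steps: this.steps, spentUsd: this.spentUsd };
  }
}
```

A typical call site would create `new BudgetGuard(30, 5)` before the loop and call `guard.record(turnCost)` right after each model response, so a runaway task fails fast instead of surfacing on the invoice.
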
## 🤖 LLM Usage Hints
- Web: Stagehand is the quickest start.
- Computer Use: run it in a container (sandbox).
- Set-of-Marks improves accuracy.
- Confirm dangerous actions and set a budget cap.

## 🔗 Related Documents
- [[AI_Function_Calling_Deep]]
- [[AI_Agentic_Patterns]]
- [[AI_Multimodal_Vision_Patterns]]