| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-vision-agents | Vision Agents: Screen / OCR / Browser Automation | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Vision Agents
An LLM looks at a screenshot and clicks or types. Examples: Anthropic Computer Use, OpenAI Operator, browser-use, Stagehand. Action loop = screenshot → LLM → action → repeat.
📖 Core concepts
- Input: screenshot-based or DOM-based.
- Action: click(x, y), type(text), scroll, key.
- Browser: automation via Playwright / Selenium.
- Desktop: needs system permissions + Accessibility APIs.
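The action loop (screenshot → LLM → action → repeat) can be sketched without any SDK. This is a minimal sketch, not a vendor API: `Action`, `Env`, `Policy`, and `runActionLoop` are hypothetical names, and the policy stands in for whatever LLM call you use. The point is the shape: observe, decide, act, with a hard step cap.

```typescript
// Hypothetical types for illustration; not part of any SDK.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string }
  | { kind: 'done'; summary: string };

interface Env {
  screenshot(): Promise<string>;        // e.g. a base64-encoded PNG
  apply(action: Action): Promise<void>; // perform the click / keystrokes
}

// The policy wraps the LLM call: given task + screenshot + history, pick an action.
type Policy = (task: string, screenshot: string, history: Action[]) => Promise<Action>;

async function runActionLoop(task: string, env: Env, decide: Policy, maxSteps = 30) {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await env.screenshot();              // 1. observe
    const action = await decide(task, shot, history); // 2. decide
    history.push(action);
    if (action.kind === 'done') return { ok: true, steps: step + 1, history };
    await env.apply(action);                          // 3. act, then repeat
  }
  // Step cap reached: report failure instead of looping forever.
  return { ok: false, steps: maxSteps, history };
}
```

Keeping `Env` and `Policy` as parameters lets the same loop run against a real browser, a desktop, or a fake environment in tests.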
💻 Code patterns
Anthropic Computer Use
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
// Agent loop: call the model, run the requested actions in order, feed results back.
async function computerUseLoop(task: string) {
  const messages: any[] = [{ role: 'user', content: task }];
  for (let i = 0; i < 30; i++) { // hard cap on iterations
    const r = await client.beta.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      tools: [{
        type: 'computer_20250124',
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      messages,
      betas: ['computer-use-2025-01-24'],
    });
    messages.push({ role: 'assistant', content: r.content });
    // tool_use blocks
    const toolUses = r.content.filter((b: any) => b.type === 'tool_use') as any[];
    // Stop when the model is done or requested no action (e.g. max_tokens hit).
    if (r.stop_reason === 'end_turn' || toolUses.length === 0) return r;
    // Run actions one at a time: GUI actions are order-dependent, so no Promise.all.
    const toolResults: any[] = [];
    for (const t of toolUses) {
      toolResults.push({
        type: 'tool_result' as const,
        tool_use_id: t.id,
        content: await executeAction(t.input),
      });
    }
    messages.push({ role: 'user', content: toolResults });
  }
  throw new Error('computerUseLoop: reached 30 iterations without finishing');
}
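// Maps one computer-tool action to a local side effect.
// screenshot, clickAt, typeText, pressKey, scroll are helpers you supply (not shown).
async function executeAction(input: any) {
  switch (input.action) {
    case 'screenshot': {
      const buf = await screenshot();
      return [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') } }];
    }
    case 'left_click':
      await clickAt(input.coordinate);
      return [{ type: 'text', text: 'clicked' }];
    case 'type':
      await typeText(input.text);
      return [{ type: 'text', text: 'typed' }];
    case 'key':
      await pressKey(input.text);
      return [{ type: 'text', text: 'pressed' }];
    case 'scroll':
      // computer_20250124 sends scroll_direction / scroll_amount (plus coordinate).
      await scroll(input.scroll_direction, input.scroll_amount);
      return [{ type: 'text', text: 'scrolled' }];
    default:
      // Never return undefined: report unsupported actions back to the model.
      return [{ type: 'text', text: `unsupported action: ${input.action}` }];
  }
}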
Browser agent (Playwright + Claude / GPT)
import { chromium } from 'playwright';
import Anthropic from '@anthropic-ai/sdk';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
async function browserLoop(task: string) {
  // ... agent loop with tools:
  const tools = [
    {
      name: 'screenshot',
      description: 'Take a screenshot of the page',
      input_schema: { type: 'object', properties: {} },
    },
    {
      name: 'click',
      description: 'Click element by visible text or selector',
      // Neither field is required: the model passes one or the other.
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
      },
    },
    {
      name: 'type',
      description: 'Type into input',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
        required: ['selector', 'text'],
      },
    },
    {
      name: 'goto',
      description: 'Navigate to URL',
      input_schema: { type: 'object', properties: { url: { type: 'string' } }, required: ['url'] },
    },
  ];
}
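The four tools above still need a dispatcher that turns a tool call into a page action. A minimal sketch, with assumptions: `PageLike` is a hypothetical structural subset of Playwright's `Page` (a real `Page` should satisfy it), and `runTool` and the returned strings are illustrative names, not part of Playwright or the Anthropic SDK.

```typescript
// Hypothetical subset of Playwright's Page, so the dispatcher can be tested with a fake.
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  screenshot(): Promise<{ toString(encoding: 'base64'): string }>;
  getByText(text: string): { click(): Promise<void> };
}

type ToolInput = { selector?: string; text?: string; url?: string };

// Executes one tool call and returns a short string to send back as the tool result.
async function runTool(page: PageLike, name: string, input: ToolInput): Promise<string> {
  switch (name) {
    case 'screenshot':
      return (await page.screenshot()).toString('base64');
    case 'click':
      // Prefer a selector; fall back to visible text.
      if (input.selector) await page.click(input.selector);
      else if (input.text) await page.getByText(input.text).click();
      else return 'error: click needs selector or text';
      return 'clicked';
    case 'type':
      if (!input.selector || input.text === undefined) return 'error: type needs selector and text';
      await page.fill(input.selector, input.text);
      return 'typed';
    case 'goto':
      if (!input.url) return 'error: goto needs url';
      await page.goto(input.url);
      return `navigated to ${input.url}`;
    default:
      return `error: unknown tool ${name}`;
  }
}
```

Returning error strings instead of throwing lets the model see what went wrong and retry with a corrected call.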
Stagehand (Browserbase, modern)
import { Stagehand } from '@browserbasehq/stagehand';
import { z } from 'zod'; // needed for the extract schema below
const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();
const page = stagehand.page;
await page.goto('https://docs.example.com');
// Natural-language action
await page.act('Click the "Get Started" button');
await page.act('Type "search query" into the search bar');
// Natural-language extract
const result = await page.extract({
  instruction: 'extract the first 3 article titles',
  schema: z.object({ titles: z.array(z.string()) }),
});
// Natural-language observe
const action = await page.observe({ instruction: 'Find the login button' });
→ Probably the simplest path to a production-ready browser agent.
browser-use (Python, popular)
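import asyncio
from browser_use import Agent
from langchain_anthropic import ChatAnthropic
async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on May 15',
        llm=ChatAnthropic(model='claude-opus-4-7'),
    )
    # agent.run() is a coroutine, so await it inside an event loop
    return await agent.run()
result = asyncio.run(main())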
Set-of-Marks (SoM)
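Overlay a numbered label on every clickable element in the screenshot.
The LLM then answers with something like "click element 7".
→ Typically more accurate than coordinate-based clicking.
// Collect a bounding box + index for each clickable element
const labeled = await page.evaluate(() => {
  // querySelectorAll returns a NodeList, which has no .map(); convert it first
  const elements = Array.from(document.querySelectorAll('a, button, input, [role="button"]'));
  return elements.map((el, i) => {
    const rect = el.getBoundingClientRect();
    return { idx: i, x: rect.x, y: rect.y, w: rect.width, h: rect.height };
  });
});
// Draw the boxes + numbers on a canvas, then take the screenshot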
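Once the LLM answers with a label, the label has to be mapped back to a click point. A minimal sketch of that step, assuming the LLM reply contains the phrase "element N" and the marks have the `{ idx, x, y, w, h }` shape collected above; `parseMarkIndex` and `markCenter` are hypothetical helper names.

```typescript
interface Mark { idx: number; x: number; y: number; w: number; h: number }

// Pulls the label number out of a reply such as "click element 7".
function parseMarkIndex(reply: string): number | null {
  const m = /\belement\s+(\d+)\b/i.exec(reply);
  return m ? Number(m[1]) : null;
}

// Returns the center of the labeled box, or null for unknown / zero-size marks.
function markCenter(marks: Mark[], idx: number): { x: number; y: number } | null {
  const mark = marks.find((m) => m.idx === idx);
  if (!mark || mark.w <= 0 || mark.h <= 0) return null;
  return { x: Math.round(mark.x + mark.w / 2), y: Math.round(mark.y + mark.h / 2) };
}

const marks: Mark[] = [{ idx: 7, x: 100, y: 40, w: 80, h: 20 }];
const idx = parseMarkIndex('click element 7');
const point = idx === null ? null : markCenter(marks, idx); // { x: 140, y: 50 }
```

Rejecting unknown or zero-size marks keeps a hallucinated label from turning into a click at an arbitrary position.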
OCR agent (textract, paddle, tesseract)
// Image → text
import Tesseract from 'tesseract.js';
import Anthropic from '@anthropic-ai/sdk';
const ocr = await Tesseract.recognize('document.png', 'eng+kor');
console.log(ocr.data.text);
// Or LLM vision (more accurate)
const anthropic = new Anthropic();
const vision = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: [
    { type: 'image', source: { ... } }, // fill in the image source
    { type: 'text', text: 'Extract all text. Output JSON with fields.' },
  ]}],
});
→ Useful for receipts, forms, and tables.
Desktop automation (cross-platform)
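// Anthropic computer-use container
// or nut-tree (Node) / pyautogui (Python)
import { mouse, screen, Button, centerOf, straightTo, imageResource } from '@nut-tree-fork/nut-js';
// screen.find needs an image-finder plugin to be registered; mouse.move expects a path.
await mouse.move(straightTo(centerOf(screen.find(imageResource('button.png')))));
await mouse.click(Button.LEFT);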
Anti-bot / detection
Sites detect bots and respond with a CAPTCHA or a block.
Mitigations:
- Playwright stealth plugin
- Browserbase / Anchor (cloud): handles IP / fingerprinting
- Reasonable delays / mouse movement
→ Check the site's ToS first.
Cost
Computer Use: every turn costs a screenshot plus an LLM call.
A large task (~100 steps) can run $5+.
→ Self-host a vision-capable LLM, or cache.
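Because cost grows with every turn, a per-task budget cap is cheap insurance. A minimal sketch, assuming you feed it the `input_tokens` / `output_tokens` counts from each response's `usage`; `BudgetCap` is a hypothetical class, and prices are parameters because they vary by model and change over time.

```typescript
interface Usage { input_tokens: number; output_tokens: number }
interface Pricing { inputPerMTok: number; outputPerMTok: number } // USD per million tokens

// Accumulates spend per call and throws once the task exceeds its limit.
class BudgetCap {
  private spentUsd = 0;
  private readonly limitUsd: number;
  private readonly pricing: Pricing;

  constructor(limitUsd: number, pricing: Pricing) {
    this.limitUsd = limitUsd;
    this.pricing = pricing;
  }

  // Call once per LLM response; returns the running total in USD.
  record(usage: Usage): number {
    const cost =
      (usage.input_tokens * this.pricing.inputPerMTok +
        usage.output_tokens * this.pricing.outputPerMTok) / 1_000_000;
    this.spentUsd += cost;
    if (this.spentUsd > this.limitUsd) {
      throw new Error(`budget exceeded: $${this.spentUsd.toFixed(2)} > $${this.limitUsd.toFixed(2)}`);
    }
    return this.spentUsd;
  }
}
```

In an agent loop, call `record(r.usage)` right after each model response so that a runaway task fails fast instead of showing up on the invoice.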
Test
Hard: the same screen can render differently on each run.
- Mock browser
- Recorded scenarios
- Smoke tests (core paths such as "login")
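Recorded scenarios can be replayed without a browser: store each screenshot together with the action the agent took, then check that the agent still picks the same action. A minimal sketch under assumptions: `AgentAction`, `RecordedStep`, and `replayScenario` are hypothetical names, and `decide` stands in for your real agent call.

```typescript
// Hypothetical shapes for a recorded scenario; not tied to any library.
interface AgentAction { kind: string; target?: string; text?: string }
interface RecordedStep { screenshot: string; expected: AgentAction }
type Decide = (screenshot: string) => Promise<AgentAction>;

function sameAction(a: AgentAction, b: AgentAction): boolean {
  return a.kind === b.kind && a.target === b.target && a.text === b.text;
}

// Feeds each recorded screenshot to the agent and stops at the first divergence.
async function replayScenario(steps: RecordedStep[], decide: Decide) {
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    const actual = await decide(step.screenshot);
    if (!sameAction(actual, step.expected)) {
      return { passed: false, failedAt: i, actual };
    }
  }
  return { passed: true, failedAt: -1, actual: null };
}
```

Exact matching is strict by design; for flows where several actions are acceptable, loosen `sameAction` rather than deleting the scenario.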
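Safety
// Ask the user to confirm dangerous clicks.
// getElementText and confirmWithUser are helpers you supply (not shown).
const dangerous = ['delete', 'pay', 'send'];
async function guardClick(toolUse: any) {
  if (toolUse.input.action === 'left_click') {
    const target = await getElementText(toolUse.input.coordinate);
    if (dangerous.some(d => target.toLowerCase().includes(d))) {
      const ok = await confirmWithUser(`Click "${target}"?`);
      if (!ok) return { skipped: true };
    }
  }
  return { skipped: false };
}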
🤔 Decision guide
| Task | Recommendation |
|---|---|
| General web automation | Stagehand (modern) |
| Advanced / open source | browser-use |
| Cloud-hosted browser | Browserbase + Stagehand |
| Desktop GUI | Anthropic Computer Use container |
| Form / receipt OCR | LLM Vision |
| Reliable existing flow | Playwright fixed script |
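❌ Anti-patterns
- Hardcoded coordinates: break across viewports / resolutions. Target by text / selector instead.
- Dangerous actions without confirmation: automated payment / deletion.
- No max iterations: the LLM can loop forever.
- No cost monitoring: the bill explodes.
- Scraping in production while ignoring ToS: blocks / legal exposure.
- Logging screen recordings: leaks PII / passwords.
- Auto-solving CAPTCHAs: almost always a ToS violation.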
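For the logging anti-pattern, one option is to redact step logs before they are written. A minimal sketch, assuming a hypothetical `StepLog` shape for one agent step; it masks typed text (which may be a password) and drops screenshot payloads while keeping enough detail to debug.

```typescript
// Hypothetical log-entry shape for one agent step; not an SDK type.
interface StepLog {
  action: string;
  text?: string;
  coordinate?: [number, number];
  screenshotBase64?: string;
}

// Returns a copy that is safe to persist: no typed text, no image data.
function redactForLog(entry: StepLog): StepLog {
  const safe: StepLog = { action: entry.action };
  if (entry.coordinate) safe.coordinate = entry.coordinate;
  if (entry.text !== undefined) {
    // Typed text may be a password; key names such as "Enter" are kept.
    safe.text = entry.action === 'type' ? `[redacted ${entry.text.length} chars]` : entry.text;
  }
  if (entry.screenshotBase64 !== undefined) safe.screenshotBase64 = '[omitted]';
  return safe;
}
```

Redacting every `type` action is deliberately blunt: it costs some debuggability but does not depend on guessing which fields are sensitive.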
🤖 LLM usage hints
- Web = Stagehand for a quick start.
- Computer Use = running in a container (sandbox) is recommended.
- Set-of-Marks improves accuracy.
- Confirm dangerous actions + set a budget cap.