---
id: ai-vision-agents
title: Vision Agents - Screen / OCR / Browser Automation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, vision, agent, automation, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [computer use, browser agent, screen agent, OCR agent, GUI automation, Claude Computer Use]
---

# Vision Agents

> An LLM looks at a screenshot, then clicks / types. **Anthropic Computer Use, OpenAI Operator, browser-use, Stagehand**. Action loop = screenshot → LLM → action → repeat.

## 📖 Core concepts

- Screenshot- or DOM-based.
- Actions: click(x, y), type(text), scroll, key.
- Browser: Playwright / Selenium automation.
- Desktop: system permissions + Accessibility.

## 💻 Code patterns

### Anthropic Computer Use

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// screenshot, clickAt, typeText, pressKey and scroll are platform-specific
// helpers you supply yourself (ideally inside a sandboxed VM or container).

async function computerUseLoop(task: string) {
  const messages: any[] = [{ role: 'user', content: task }];

  // Hard cap on iterations so the loop cannot run forever.
  for (let i = 0; i < 30; i++) {
    const r = await client.beta.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      tools: [{
        type: 'computer_20250124',
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      messages,
      betas: ['computer-use-2025-01-24'],
    });

    messages.push({ role: 'assistant', content: r.content });
    if (r.stop_reason === 'end_turn') return r;

    // Run tool_use blocks one at a time: GUI actions depend on ordering,
    // so they must not race each other.
    const toolUses = r.content.filter((b: any) => b.type === 'tool_use');
    const toolResults: any[] = [];
    for (const t of toolUses as any[]) {
      const result = await executeAction(t.input);
      toolResults.push({
        type: 'tool_result' as const,
        tool_use_id: t.id,
        content: result,
      });
    }
    messages.push({ role: 'user', content: toolResults });
  }
  throw new Error('computerUseLoop: max iterations reached');
}

async function executeAction(input: any) {
  switch (input.action) {
    case 'screenshot': {
      const buf = await screenshot();
      return [{
        type: 'image',
        source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') },
      }];
    }
    case 'left_click':
      await clickAt(input.coordinate);
      return [{ type: 'text', text: 'clicked' }];
    case 'type':
      await typeText(input.text);
      return [{ type: 'text', text: 'typed' }];
    case 'key':
      await pressKey(input.text);
      return [{ type: 'text', text: 'pressed' }];
    case 'scroll':
      // The computer_20250124 tool sends scroll_direction / scroll_amount.
      await scroll(input.scroll_direction, input.scroll_amount);
      return [{ type: 'text', text: 'scrolled' }];
    default:
      // Tell the model about unsupported actions instead of returning undefined.
      return [{ type: 'text', text: `unsupported action: ${input.action}` }];
  }
}
```

### Browser agent (Playwright + Claude / GPT)

```ts
import { chromium } from 'playwright';
import Anthropic from '@anthropic-ai/sdk';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

async function browserLoop(task: string) {
  // ... agent loop with tools:
  const tools = [
    {
      name: 'screenshot',
      description: 'Take a screenshot of the page',
      input_schema: { type: 'object', properties: {} },
    },
    {
      name: 'click',
      description: 'Click element by visible text or selector',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
      },
    },
    {
      name: 'type',
      description: 'Type into input',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
      },
    },
    {
      name: 'goto',
      description: 'Navigate to URL',
      input_schema: { type: 'object', properties: { url: { type: 'string' } } },
    },
  ];
}
```

### Stagehand (Browserbase, modern)

```ts
import { Stagehand } from '@browserbasehq/stagehand';
import { z } from 'zod'; // needed for the extract schema below

const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();

const page = stagehand.page;
await page.goto('https://docs.example.com');

// Natural-language action
await page.act('Click the "Get Started" button');
await page.act('Type "search query" into the search bar');

// Natural-language extract
const result = await page.extract({
  instruction: 'extract the first 3 article titles',
  schema: z.object({ titles: z.array(z.string()) }),
});

// Natural-language observe
const action = await page.observe({ instruction: 'Find the login button' });
```

→ The simplest route to a production-ready browser agent.

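
The tool list in the Playwright section above stops before anything is executed. As a rough sketch of the missing piece, the dispatcher below maps each of those four tool names onto page calls and returns content blocks in the same shape the Computer Use example uses. `PageLike`, `ToolInput`, `ToolResult`, and `runTool` are hypothetical names introduced here. `PageLike` is meant to be a structural subset of Playwright's `Page`, so a real page should be passable as-is, though that is an assumption to verify against your Playwright version.

```typescript
// Minimal subset of a browser page that the dispatcher needs (hypothetical interface).
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  screenshot(): Promise<{ toString(encoding: 'base64'): string }>;
}

type ToolInput = { selector?: string; text?: string; url?: string };

type ToolResult =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: 'image/png'; data: string } };

const textResult = (text: string): ToolResult[] => [{ type: 'text', text }];

// Execute one tool call and describe the outcome for the model.
async function runTool(page: PageLike, name: string, input: ToolInput): Promise<ToolResult[]> {
  try {
    switch (name) {
      case 'screenshot': {
        const buf = await page.screenshot();
        return [{
          type: 'image',
          source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') },
        }];
      }
      case 'click': {
        // Prefer an explicit selector; otherwise fall back to a text= selector.
        const target = input.selector ?? (input.text ? `text=${input.text}` : undefined);
        if (!target) return textResult('error: click needs a selector or text');
        await page.click(target);
        return textResult(`clicked ${target}`);
      }
      case 'type': {
        if (!input.selector || input.text === undefined) {
          return textResult('error: type needs a selector and text');
        }
        await page.fill(input.selector, input.text);
        return textResult(`typed into ${input.selector}`);
      }
      case 'goto': {
        if (!input.url) return textResult('error: goto needs a url');
        await page.goto(input.url);
        return textResult(`navigated to ${input.url}`);
      }
      default:
        return textResult(`error: unknown tool ${name}`);
    }
  } catch (err) {
    // Report failures back to the model instead of crashing the agent loop.
    return textResult(`error: ${(err as Error).message}`);
  }
}
```

Returning errors as text rather than throwing lets the model see what went wrong (a missing selector, a timeout) and try a different action on the next turn.
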
### browser-use (Python, popular)

```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic


async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on May 15',
        llm=ChatAnthropic(model='claude-opus-4-7'),
    )
    # agent.run() is a coroutine, so it has to be awaited inside an event loop.
    return await agent.run()


result = asyncio.run(main())
```

### Set-of-Marks (SoM)

```
Overlay a numbered label on every clickable element in the screenshot.
The LLM then answers with something like "click element 7".
→ More accurate than coordinate-based clicking.
```

```ts
// Collect a bounding box + index for every clickable element.
// `page` is assumed to be a Playwright Page.
const labeled = await page.evaluate(() => {
  const elements = document.querySelectorAll('a, button, input, [role="button"]');
  // querySelectorAll returns a NodeList, which has no .map(); convert it first.
  return Array.from(elements).map((el, i) => {
    const rect = el.getBoundingClientRect();
    return { idx: i, x: rect.x, y: rect.y, w: rect.width, h: rect.height };
  });
});
// Next: draw the boxes + numbers on a canvas, then take the screenshot.
```

### OCR agent (textract, paddle, tesseract)

```ts
// Image → text
import Tesseract from 'tesseract.js';
import Anthropic from '@anthropic-ai/sdk';

const r = await Tesseract.recognize('document.png', 'eng+kor');
console.log(r.data.text);

// Or use LLM vision (more accurate)
const anthropic = new Anthropic();
const llmResult = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: [
    { type: 'image', source: { ... } },
    { type: 'text', text: 'Extract all text. Output JSON with fields.' },
  ]}],
});
```

→ Handles receipts / forms / tables.

### Desktop automation (cross-platform)

```ts
// Anthropic computer-use container,
// or nut-tree (Node) / pyautogui (Python).
// Image search in nut-js needs an image-matcher plugin to be installed.
import {
  mouse, screen, Button, straightTo, centerOf, imageResource,
} from '@nut-tree-fork/nut-js';

// Locate the button image on screen, move to its center, then click.
await mouse.move(straightTo(centerOf(screen.find(imageResource('button.png')))));
await mouse.click(Button.LEFT);
```

### Anti-bot / detection

```
Sites detect bots → CAPTCHA / blocking.
Countermeasures:
- Playwright stealth plugin
- Browserbase / Anchor (cloud): handle IP / fingerprint
- Reasonable delays / mouse movement
```

→ Check the site's ToS.

### Cost

```
Computer Use: every turn costs a screenshot + an LLM call.
A large task (100 steps) = $5+.
```

→ Self-host a vision-capable LLM, or cache.

### Test

```
Hard: the same screen can look different on every run.
- Mock browser
- Recorded scenarios
- Smoke test (core paths such as "login")
```

### Safety

```ts
// Ask the user to confirm dangerous actions before executing them.
// getElementText and confirmWithUser are app-specific helpers (not shown).
const dangerous = ['delete', 'pay', 'send'];

async function guardAction(toolUse: { input: any }) {
  if (toolUse.input.action === 'left_click') {
    const target = await getElementText(toolUse.input.coordinate);
    if (dangerous.some(d => target.toLowerCase().includes(d))) {
      const ok = await confirmWithUser(`Click "${target}"?`);
      if (!ok) return { skipped: true };
    }
  }
  return { skipped: false };
}
```

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| General web automation | Stagehand (modern) |
| Advanced / open source | browser-use |
| Cloud-hosted browser | Browserbase + Stagehand |
| Desktop GUI | Anthropic Computer Use container |
| Form / receipt OCR | LLM Vision |
| Reliable existing flow | Fixed Playwright script |

## ❌ Anti-patterns

- **Hardcoded coordinates**: break when the viewport / resolution differs. Use text / selectors instead.
- **Dangerous actions without confirmation**: payments / deletions run automatically.
- **No max iterations**: the LLM can loop forever.
- **No cost monitoring**: the bill explodes.
- **Ignoring ToS when scraping in your own prod**: blocking / legal risk.
- **Screen-recording logs**: capture PII / passwords.
- **Auto-solving CAPTCHAs**: almost always a ToS violation.

## 🤖 LLM usage hints

- Web = Stagehand for a quick start.
- Computer Use = container recommended (sandbox).
- Set-of-Marks improves accuracy.
- Confirm dangerous actions + budget cap.

## 🔗 Related documents

- [[AI_Function_Calling_Deep]]
- [[AI_Agentic_Patterns]]
- [[AI_Multimodal_Vision_Patterns]]
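
Returning to the Cost section and the "no max iterations" / "no cost monitoring" anti-patterns above: one way to enforce both limits is a small guard that the agent loop consults after every turn. This is only a sketch. `BudgetGuard` and `BudgetConfig` are hypothetical names, and the per-token prices are placeholders you would replace with your provider's actual rates.

```typescript
// Limits and per-token prices for one agent run (all values supplied by the caller).
interface BudgetConfig {
  maxSteps: number;
  maxUsd: number;
  usdPerInputToken: number;
  usdPerOutputToken: number;
}

class BudgetGuard {
  private cfg: BudgetConfig;
  private steps = 0;
  private spentUsd = 0;

  constructor(cfg: BudgetConfig) {
    this.cfg = cfg;
  }

  // Record one LLM turn using the token counts reported for that turn.
  record(inputTokens: number, outputTokens: number): void {
    this.steps += 1;
    this.spentUsd +=
      inputTokens * this.cfg.usdPerInputToken +
      outputTokens * this.cfg.usdPerOutputToken;
  }

  // Returns a reason string when the loop should stop, otherwise null.
  shouldStop(): string | null {
    if (this.steps >= this.cfg.maxSteps) {
      return `step limit reached (${this.steps})`;
    }
    if (this.spentUsd >= this.cfg.maxUsd) {
      return `budget exhausted ($${this.spentUsd.toFixed(2)})`;
    }
    return null;
  }

  get totalUsd(): number {
    return this.spentUsd;
  }
}
```

In a loop like `computerUseLoop`, the token counts could come from the response's `usage.input_tokens` and `usage.output_tokens` fields; call `record(...)` after each turn and break out as soon as `shouldStop()` returns a reason.
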