id: ai-vision-agents
title: Vision Agents: Screen / OCR / Browser Automation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, vision, agent, automation, vibe-coding
tech_stack: language: TS / Python; applicable_to: Backend
applied_in:
aliases: computer use, browser agent, screen agent, OCR agent, GUI automation, Claude Computer Use

Vision Agents

An LLM looks at a screenshot and then clicks or types. Examples: Anthropic Computer Use, OpenAI Operator, browser-use, Stagehand. The action loop is screenshot → LLM → action → repeat.

📖 Core concepts

  • Input: screenshot-based or DOM-based.
  • Actions: click(x, y), type(text), scroll, key.
  • Browser: automated through Playwright / Selenium.
  • Desktop: requires system permissions plus Accessibility APIs.
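
The action vocabulary above can be modeled as a discriminated union so that malformed model output is rejected before it reaches the mouse or keyboard. This is a minimal sketch: the `Action` type and the `parseAction` helper are hypothetical names for illustration and are not part of any SDK.

```typescript
// Hypothetical action vocabulary mirroring click(x, y), type(text), scroll, key.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string }
  | { kind: 'scroll'; direction: 'up' | 'down'; amount: number }
  | { kind: 'key'; key: string };

// Validate untrusted model output; return null when it matches no known action.
function parseAction(raw: unknown): Action | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const { kind, x, y, text, direction, amount, key } = raw as Record<string, unknown>;
  switch (kind) {
    case 'click':
      return typeof x === 'number' && typeof y === 'number' ? { kind: 'click', x, y } : null;
    case 'type':
      return typeof text === 'string' ? { kind: 'type', text } : null;
    case 'scroll':
      if ((direction === 'up' || direction === 'down') && typeof amount === 'number') {
        return { kind: 'scroll', direction, amount };
      }
      return null;
    case 'key':
      return typeof key === 'string' ? { kind: 'key', key } : null;
    default:
      return null;
  }
}
```

Rejecting anything that does not parse keeps the executor's `switch` exhaustive and gives the loop a clear point at which to report an error back to the model.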

💻 Code patterns

Anthropic Computer Use

import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();

async function computerUseLoop(task: string) {
  const messages: any[] = [{ role: 'user', content: task }];
  
  for (let i = 0; i < 30; i++) {
    const r = await client.beta.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      tools: [{
        type: 'computer_20250124',
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      messages,
      betas: ['computer-use-2025-01-24'],
    });
    
    messages.push({ role: 'assistant', content: r.content });
    
    if (r.stop_reason === 'end_turn') return r;
    
    // tool_use blocks
    const toolUses = r.content.filter(b => b.type === 'tool_use');
    const toolResults = await Promise.all(toolUses.map(async (t) => {
      const result = await executeAction(t.input);
      return {
        type: 'tool_result' as const,
        tool_use_id: t.id,
        content: result,
      };
    }));
    
    messages.push({ role: 'user', content: toolResults });
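  }
  // Max iterations reached without end_turn: fail loudly instead of returning undefined.
  throw new Error('computerUseLoop: max iterations reached');
}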

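// screenshot(), clickAt(), typeText(), pressKey(), and scroll() are placeholders
// for your own OS / browser bindings; they are not provided by the SDK.
async function executeAction(input: any) {
  switch (input.action) {
    case 'screenshot': {
      const buf = await screenshot();
      return [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: buf.toString('base64') } }];
    }
    case 'left_click':
      await clickAt(input.coordinate);
      return [{ type: 'text', text: 'clicked' }];
    case 'type':
      await typeText(input.text);
      return [{ type: 'text', text: 'typed' }];
    case 'key':
      await pressKey(input.text);
      return [{ type: 'text', text: 'pressed' }];
    case 'scroll':
      // computer_20250124 names these fields scroll_direction / scroll_amount.
      await scroll(input.scroll_direction, input.scroll_amount);
      return [{ type: 'text', text: 'scrolled' }];
    default:
      // Report unsupported actions back to the model instead of returning undefined.
      return [{ type: 'text', text: `unsupported action: ${input.action}` }];
  }
}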
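
The tool definition declares a 1280×800 display, so the coordinates the model returns live in that space. If the physical screen has a different resolution, the click point presumably has to be rescaled first. A minimal sketch, assuming the screenshot is downscaled to the declared size before it is sent; `Size`, `toScreen`, and the example resolutions are hypothetical.

```typescript
interface Size { width: number; height: number }

// Map a point from the model's coordinate space (the declared display size)
// to physical screen pixels, clamping the result to the screen bounds.
function toScreen(
  [x, y]: [number, number],
  declared: Size,
  actual: Size,
): [number, number] {
  const sx = Math.round((x * actual.width) / declared.width);
  const sy = Math.round((y * actual.height) / declared.height);
  return [
    Math.min(Math.max(sx, 0), actual.width - 1),
    Math.min(Math.max(sy, 0), actual.height - 1),
  ];
}

const declared: Size = { width: 1280, height: 800 };
const actual: Size = { width: 2560, height: 1600 };
// Centre of the declared display mapped onto the physical screen.
const [px, py] = toScreen([640, 400], declared, actual);
```

Calling `toScreen` inside `clickAt` keeps the scaling in one place, so the loop itself never needs to know the real resolution.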

Browser agent (Playwright + Claude / GPT)

import { chromium } from 'playwright';
import Anthropic from '@anthropic-ai/sdk';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

async function browserLoop(task: string) {
  // ... agent loop with tools:
  const tools = [
    {
      name: 'screenshot',
      description: 'Take a screenshot of the page',
      input_schema: { type: 'object', properties: {} },
    },
    {
      name: 'click',
      description: 'Click element by visible text or selector',
      input_schema: {
        type: 'object',
        properties: { selector: { type: 'string' }, text: { type: 'string' } },
      },
    },
    {
      name: 'type',
      description: 'Type into input',
      input_schema: { type: 'object', properties: { selector: { type: 'string' }, text: { type: 'string' } } },
    },
    {
      name: 'goto',
      description: 'Navigate to URL',
      input_schema: { type: 'object', properties: { url: { type: 'string' } } },
    },
  ];
}
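
The elided loop needs something that maps each `tool_use` block onto a Playwright call. One possible dispatcher is sketched below. `PageLike`, `clickTarget`, and `runTool` are hypothetical names, and `PageLike` lists only the `Page` methods used here (`goto`, `click`, `fill`, `screenshot`) so the sketch runs without a browser. Treating `text` as a Playwright `text=` selector is an assumption about how the `click` tool should behave.

```typescript
// Subset of Playwright's Page used by this sketch; a real Page satisfies it.
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  screenshot(): Promise<Uint8Array>;
}

type ToolInput = { selector?: string; text?: string; url?: string };
type ToolOutput =
  | { type: 'text'; text: string }
  | { type: 'image'; bytes: Uint8Array };

// Prefer an explicit selector; otherwise fall back to Playwright's text= engine.
function clickTarget(input: ToolInput): string | null {
  if (input.selector) return input.selector;
  if (input.text) return `text=${input.text}`;
  return null;
}

// Execute one tool call and describe the outcome for the model.
async function runTool(page: PageLike, name: string, input: ToolInput): Promise<ToolOutput> {
  switch (name) {
    case 'screenshot':
      return { type: 'image', bytes: await page.screenshot() };
    case 'click': {
      const target = clickTarget(input);
      if (!target) return { type: 'text', text: 'error: click needs selector or text' };
      await page.click(target);
      return { type: 'text', text: `clicked ${target}` };
    }
    case 'type':
      if (!input.selector || input.text === undefined) {
        return { type: 'text', text: 'error: type needs selector and text' };
      }
      await page.fill(input.selector, input.text);
      return { type: 'text', text: `typed into ${input.selector}` };
    case 'goto':
      if (!input.url) return { type: 'text', text: 'error: goto needs url' };
      await page.goto(input.url);
      return { type: 'text', text: `navigated to ${input.url}` };
    default:
      return { type: 'text', text: `error: unknown tool ${name}` };
  }
}
```

Returning error strings rather than throwing lets the model see what went wrong and retry with different arguments; the caller still has to base64-encode the `image` bytes before placing them in a `tool_result`.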

Stagehand (Browserbase, modern)

import { Stagehand } from '@browserbasehq/stagehand';
import { z } from 'zod'; // needed for the extract schema below

const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();
const page = stagehand.page;

await page.goto('https://docs.example.com');

// Natural-language action
await page.act('Click the "Get Started" button');
await page.act('Type "search query" into the search bar');

// Natural-language extract
const result = await page.extract({
  instruction: 'extract the first 3 article titles',
  schema: z.object({ titles: z.array(z.string()) }),
});

// Natural-language observe
const action = await page.observe({ instruction: 'Find the login button' });

→ The simplest production-ready browser agent.

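browser-use (Python)

import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on May 15',
        llm=ChatAnthropic(model='claude-opus-4-7'),
    )
    # agent.run() is a coroutine, so it has to be awaited inside an event loop.
    return await agent.run()

result = asyncio.run(main())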

Set-of-Marks (SoM)

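Draw a numbered label on each clickable element in the screenshot.
The LLM then answers with something like "click element 7".
→ More accurate than coordinate-based clicking.

// Collect a bounding box and an index for every clickable element
const labeled = await page.evaluate(() => {
  // querySelectorAll returns a NodeList, which has no map(); convert it first.
  const elements = Array.from(document.querySelectorAll('a, button, input, [role="button"]'));
  return elements.map((el, i) => {
    const rect = el.getBoundingClientRect();
    return { idx: i, x: rect.x, y: rect.y, w: rect.width, h: rect.height };
  });
});
// Draw the boxes and numbers on a canvas overlay, then take the screenshot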
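
Once the model replies with a label, the label has to be turned back into a click point. A minimal sketch, assuming marks shaped like the `{ idx, x, y, w, h }` objects collected above and a reply that contains the phrase "element N"; `Mark`, `parseMarkIndex`, and `markCenter` are hypothetical helpers.

```typescript
interface Mark { idx: number; x: number; y: number; w: number; h: number }

// Extract N from a reply such as 'click element 7'; null when no label is present.
function parseMarkIndex(reply: string): number | null {
  const m = /element\s+(\d+)/i.exec(reply);
  return m ? Number(m[1]) : null;
}

// Centre of the labelled box, or null when the label is unknown.
function markCenter(marks: Mark[], idx: number): { x: number; y: number } | null {
  const mark = marks.find((k) => k.idx === idx);
  if (!mark) return null;
  return { x: Math.round(mark.x + mark.w / 2), y: Math.round(mark.y + mark.h / 2) };
}

const marks: Mark[] = [
  { idx: 6, x: 0, y: 0, w: 50, h: 20 },
  { idx: 7, x: 100, y: 40, w: 80, h: 30 },
];
const idx = parseMarkIndex('click element 7');
const point = idx === null ? null : markCenter(marks, idx);
```

Because the click lands on the centre of a box measured from the live DOM, a hallucinated label resolves to `null` and can be rejected instead of producing a click at an arbitrary coordinate.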

OCR agent (textract, paddle, tesseract)

// Image → text
import Anthropic from '@anthropic-ai/sdk';
import Tesseract from 'tesseract.js';

const ocr = await Tesseract.recognize('document.png', 'eng+kor');
console.log(ocr.data.text);

// Or use LLM vision (typically more accurate)
const anthropic = new Anthropic();
const vision = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: [
    { type: 'image', source: { ... } },
    { type: 'text', text: 'Extract all text. Output JSON with fields.' },
  ]}],
});

→ Suited to receipts, forms, and tables.

Desktop automation (cross-platform)

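// Anthropic computer-use container,
// or nut-tree (Node) / pyautogui (Python)
import { mouse, keyboard, screen, straightTo, centerOf, imageResource, Button } from '@nut-tree-fork/nut-js';
// Image search assumes an image-finder plugin is installed and registered.
await mouse.move(straightTo(centerOf(screen.find(imageResource('button.png')))));
await mouse.click(Button.LEFT);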

Anti-bot / detection

Sites detect bots and respond with a CAPTCHA or a block.

Mitigations:
- Playwright stealth plugin
- Browserbase / Anchor (cloud), which handle IP and fingerprinting
- Reasonable delays and mouse movement

→ Check the site's ToS first.

Cost

Computer Use: every turn costs one screenshot plus one LLM call.
A large task (100 steps) can cost $5 or more.

→ Mitigate with a self-hosted vision-capable LLM or with caching.
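
One way to keep a long run from overspending is to add up token usage after every call and stop once a cap is hit. A rough sketch: the per-million-token prices and the per-step token counts are placeholder assumptions, not real pricing or measurements, and `BudgetTracker` is a hypothetical helper. The `input_tokens` / `output_tokens` fields mirror the `usage` object on Anthropic Messages API responses.

```typescript
interface Usage { input_tokens: number; output_tokens: number }

class BudgetTracker {
  private spentUsd = 0;
  private readonly capUsd: number;
  private readonly inputPerMTok: number;
  private readonly outputPerMTok: number;

  // Prices are placeholder USD-per-million-token values; substitute real rates.
  constructor(capUsd: number, inputPerMTok = 5, outputPerMTok = 25) {
    this.capUsd = capUsd;
    this.inputPerMTok = inputPerMTok;
    this.outputPerMTok = outputPerMTok;
  }

  // Add one response's usage and return the running total in USD.
  record(usage: Usage): number {
    this.spentUsd +=
      (usage.input_tokens * this.inputPerMTok + usage.output_tokens * this.outputPerMTok) / 1_000_000;
    return this.spentUsd;
  }

  get exceeded(): boolean {
    return this.spentUsd >= this.capUsd;
  }
}

// Simulate 100 steps of ~8,000 input tokens (screenshot plus history) and ~200 output tokens.
const budget = new BudgetTracker(5);
let steps = 0;
while (!budget.exceeded && steps < 100) {
  budget.record({ input_tokens: 8_000, output_tokens: 200 });
  steps++;
}
```

With these placeholder numbers each step costs $0.045, so 100 steps come to about $4.50. In the agent loop, `budget.record(r.usage)` would run after each response and the loop would exit when `budget.exceeded` becomes true.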

Test

Testing is hard because the same screen can differ from run to run.
- Mock browser
- Recorded scenarios
- Smoke tests (core paths such as "login")
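
For the recorded-scenario approach, one option is to capture the model's actions once and replay them against a fake executor, so the test never calls the LLM or a real browser. A minimal sketch; `RecordedStep`, `replay`, and the sample recording are hypothetical.

```typescript
interface RecordedStep {
  action: string;                 // e.g. 'left_click', 'type'
  input: Record<string, unknown>; // arguments captured from a real run
}

type Executor = (step: RecordedStep) => string;

// Replay every recorded step through the executor and collect its results in order.
function replay(recording: RecordedStep[], execute: Executor): string[] {
  return recording.map((step) => execute(step));
}

// Recording of a login path, as it might be saved to JSON after a real run.
const loginRecording: RecordedStep[] = [
  { action: 'left_click', input: { coordinate: [400, 300] } },
  { action: 'type', input: { text: 'user@example.com' } },
  { action: 'key', input: { text: 'Return' } },
];

// Fake executor: no browser, just one deterministic log line per step.
const log = replay(loginRecording, (s) => `${s.action}:${JSON.stringify(s.input)}`);
```

Comparing `log` against a stored snapshot turns the non-deterministic agent run into a deterministic regression test for the executor layer.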

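Safety

// Ask the user to confirm dangerous clicks.
// getElementText() and confirmWithUser() are placeholders for your own helpers.
const dangerous = ['delete', 'pay', 'send'];

async function guardClick(toolUse: { input: any }) {
  if (toolUse.input.action === 'left_click') {
    const target = await getElementText(toolUse.input.coordinate);
    if (dangerous.some(d => target.toLowerCase().includes(d))) {
      const ok = await confirmWithUser(`Click "${target}"?`);
      if (!ok) return { skipped: true };
    }
  }
  return { skipped: false };
}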

🤔 Decision criteria

| Task | Recommendation |
| --- | --- |
| General web automation | Stagehand (modern) |
| Advanced / open source | browser-use |
| Cloud-hosted browser | Browserbase + Stagehand |
| Desktop GUI | Anthropic Computer Use container |
| Form / receipt OCR | LLM Vision |
| Reliable existing flow | Playwright fixed script |

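Anti-patterns

  • Hardcoded coordinates: they break when the viewport or resolution changes. Prefer text or selectors.
  • Dangerous actions without confirmation: payments or deletions run automatically.
  • No max iterations: the LLM can loop forever.
  • No cost monitoring: the bill explodes.
  • Scraping in production while ignoring the ToS: blocking and legal risk.
  • Screen-recording logs: they can capture PII and passwords.
  • Solving CAPTCHAs automatically: almost always a ToS violation.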

🤖 LLM usage hints

  • Web: Stagehand is the quickest start.
  • Computer Use: run it in a container (sandbox).
  • Set-of-Marks improves accuracy.
  • Confirm dangerous actions and set a budget cap.

🔗 Related documents