2nd/10_Wiki/Topics/Coding/AI_Browser_Agent_Patterns.md
2026-05-10 22:08:15 +09:00


id: ai-browser-agent-patterns
title: Browser Agent: Playwright / Puppeteer / browser-use
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, agent, browser, vibe-coding
tech_stack: language = TS / Python; applicable_to = AI
applied_in:
aliases: browser agent, web agent, Playwright agent, browser-use, Computer Use, accessibility tree

Browser Agent

An LLM operates the browser: it clicks, types, and scrolls. Representative options are Anthropic Computer Use, browser-use, and Playwright combined with an LLM. This is the modern form of web automation.

📖 Core concepts

  • The input is a screenshot or an accessibility tree.
  • The LLM decides the next action (click x,y / type / scroll).
  • The loop repeats until the task is done.
  • There is a trade-off between reliability, cost, and speed.

💻 Code patterns

Playwright + LLM (simple)

import { chromium } from 'playwright';

// `llm` is a placeholder for your own LLM client wrapper, not a real library.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Screenshot → LLM
const screenshot = await page.screenshot();
const action = await llm.complete({
  system: 'You are a browser agent. Output JSON: {action: click|type|scroll, ...}',
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshot.toString('base64') } },
      { type: 'text', text: 'Search for "hello world"' },
    ]},
  ],
});

// Execute the chosen action (scroll is left unhandled in this minimal version)
if (action.action === 'click') await page.mouse.click(action.x, action.y);
else if (action.action === 'type') await page.keyboard.type(action.text);

await browser.close();
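
The snippet above trusts whatever the model returns. A minimal sketch of validating that output before it reaches Playwright could look like this. The `Action` shape, the `deltaY` field, and the `done` variant are assumptions made for illustration; only `click`, `type`, and `scroll` appear in the prompt above.

```typescript
// Hypothetical action schema; field names mirror the snippet above.
type Action =
  | { action: 'click'; x: number; y: number }
  | { action: 'type'; text: string }
  | { action: 'scroll'; deltaY: number }
  | { action: 'done' };

function parseAction(raw: string): Action {
  // Models often wrap JSON in prose, so take the span from the first '{' to the last '}'.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('No JSON object in model output');

  const data: unknown = JSON.parse(raw.slice(start, end + 1));
  const { action, x, y, text, deltaY } = data as Record<string, unknown>;
  const isNum = (v: unknown): v is number => typeof v === 'number' && Number.isFinite(v);

  switch (action) {
    case 'click':
      if (isNum(x) && isNum(y)) return { action: 'click', x, y };
      break;
    case 'type':
      if (typeof text === 'string') return { action: 'type', text };
      break;
    case 'scroll':
      if (isNum(deltaY)) return { action: 'scroll', deltaY };
      break;
    case 'done':
      return { action: 'done' };
  }
  // Anything else is rejected instead of being executed.
  throw new Error(`Invalid action: ${JSON.stringify(data)}`);
}
```

Rejecting malformed output early keeps a hallucinated field (for example a string where a coordinate is expected) from turning into a stray click.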

Anthropic Computer Use

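import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Computer use is a beta feature: the tool type and the beta flag must match
// the model you call, so check the current Anthropic docs for valid pairs.
const r = await client.beta.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  betas: ['computer-use-2024-10-22'],
  tools: [{
    type: 'computer_20241022',
    name: 'computer',
    display_width_px: 1024,
    display_height_px: 768,
  }],
  messages: [{
    role: 'user',
    content: [
      { type: 'image', source: { ... } },
      { type: 'text', text: 'Find the login button and click it' },
    ],
  }],
});

// Any tool_use block in r.content must be executed by your own code
for (const c of r.content) {
  if (c.type === 'tool_use' && c.name === 'computer') {
    const { action, coordinate } = c.input as { action: string; coordinate?: [number, number] };
    if (action === 'left_click' && coordinate) await page.mouse.click(...coordinate);
    // ...
  }
}

→ Claude provides computer control as a native tool; your code still executes each action.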
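
One detail the snippet leaves implicit: the model answers in the coordinate space it was shown (here 1024×768). If the real viewport has a different size, the coordinates need rescaling before `page.mouse.click`. A minimal sketch, assuming the screenshot was resized to the declared display size; `Size` and `scaleCoordinate` are hypothetical names.

```typescript
interface Size { width: number; height: number }

// Map a point from the coordinate space the model saw to the real viewport.
function scaleCoordinate(
  [x, y]: [number, number],
  modelSpace: Size,
  viewport: Size,
): [number, number] {
  const sx = viewport.width / modelSpace.width;
  const sy = viewport.height / modelSpace.height;
  // Clamp so a slightly out-of-range prediction still lands inside the page.
  const clamp = (v: number, max: number) => Math.min(Math.max(v, 0), max - 1);
  return [
    clamp(Math.round(x * sx), viewport.width),
    clamp(Math.round(y * sy), viewport.height),
  ];
}
```

Usage would be along the lines of `page.mouse.click(...scaleCoordinate(coordinate, { width: 1024, height: 768 }, page.viewportSize()!))`.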

browser-use (Python framework)

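import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI  # import path varies by browser-use version

async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on Jun 1',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    return await agent.run()

result = asyncio.run(main())

→ The library handles the loop, accessibility extraction, and stability.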

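Accessibility tree (DOM-based)

// page.accessibility is deprecated in recent Playwright versions;
// locator.ariaSnapshot() is the newer alternative.
const snapshot = await page.accessibility.snapshot();
// e.g. { role: 'WebArea', name: 'Page', children: [{ role: 'button', name: 'Login' }, ...] }

// Pass the accessibility tree to the LLM
const action = await llm.complete({
  prompt: `Tree: ${JSON.stringify(snapshot)}\nTask: ...`,
});

→ More precise than a screenshot, and no vision model is required.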
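
Sending the raw `JSON.stringify(snapshot)` spends tokens on braces and unnamed wrapper nodes. A minimal sketch of flattening the tree into indented lines first; the `AxNode` shape is a simplified assumption (real snapshots carry more fields) and the list of skipped roles is only a heuristic.

```typescript
// Simplified node shape, loosely modeled on the snapshot comment above.
interface AxNode { role: string; name?: string; children?: AxNode[] }

// Structural roles that add little for the model when they have no name.
const SKIPPED_ROLES = new Set(['generic', 'none', 'presentation']);

function flattenTree(node: AxNode, depth = 0, out: string[] = []): string[] {
  const skip = SKIPPED_ROLES.has(node.role) && !node.name;
  if (!skip) {
    const label = node.name ? ` "${node.name}"` : '';
    out.push(`${'  '.repeat(depth)}${node.role}${label}`);
  }
  // Children of a skipped wrapper are hoisted to the wrapper's depth.
  for (const child of node.children ?? []) {
    flattenTree(child, skip ? depth : depth + 1, out);
  }
  return out;
}
```

The prompt can then use `flattenTree(snapshot).join('\n')` in place of the JSON dump.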

Element ID assignment

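// Assign an ID to every interactive element so the LLM can click by ID.
await page.evaluate(() => {
  document.querySelectorAll('button, a, input').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i));
  });
});

// Screenshot with the ID labels visible (drawing the labels is not shown here)
// LLM: "click element with id 5"
await page.click('[data-agent-id="5"]');

→ Coordinates are brittle (they shift on resize). IDs stay stable until the DOM changes, so reassign them after each navigation or re-render.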

Selector strategy

// The LLM generates a CSS selector
const action = await llm.complete({
  prompt: `Click the "Subscribe" button. Output: {selector}`,
});

await page.click(action.selector);
// ❌ "button:nth-child(3)": brittle, depends on position
// ✅ "button:has-text('Subscribe')": semantic, depends on content

→ Playwright's semantic selectors (text- and role-based) are more robust than positional ones.
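
Since the selector comes from the model, it can help to screen it before use. The check below is a rough heuristic sketch, not a Playwright feature; the patterns and the chain-length threshold are assumptions chosen for illustration.

```typescript
// Positional pseudo-classes tend to break when the layout shifts.
const POSITIONAL_PATTERNS: RegExp[] = [/:nth-child\(/, /:nth-of-type\(/];

// Assumed threshold: four or more child combinators counts as a long chain.
const MAX_CHILD_COMBINATORS = 3;

function isBrittleSelector(selector: string): boolean {
  if (POSITIONAL_PATTERNS.some((p) => p.test(selector))) return true;
  const combinators = (selector.match(/>/g) ?? []).length;
  return combinators > MAX_CHILD_COMBINATORS;
}
```

When the check fires, one option is to ask the LLM for a text- or role-based selector instead, or to fall back to element IDs.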

Loop until task done

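for (let i = 0; i < 50; i++) {
  const screenshot = await page.screenshot();
  const action = await llm.complete({ ... });

  // Same field name as the earlier snippets; 'done' must be in the prompt's action schema
  if (action.action === 'done') break;

  await execute(action); // `execute` is your own action dispatcher
  // Note: the Playwright docs discourage 'networkidle'; it can stall on pages that poll
  await page.waitForLoadState('networkidle');
}

→ Cap the number of iterations to prevent an infinite loop.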
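
The same loop can be written with the browser and the LLM injected as functions, which makes the step cap testable without either. This is a sketch under assumed names (`AgentDeps`, `Step`, `runAgent`), not an API from any library mentioned here.

```typescript
interface Step { action: string; args?: Record<string, unknown> }

interface AgentDeps<Obs> {
  observe: () => Promise<Obs>;                              // e.g. take a screenshot
  decide: (obs: Obs, history: Step[]) => Promise<Step>;     // e.g. call the LLM
  execute: (step: Step) => Promise<void>;                   // e.g. drive Playwright
}

interface RunResult { status: 'done' | 'max_steps'; steps: Step[] }

async function runAgent<Obs>(deps: AgentDeps<Obs>, maxSteps = 50): Promise<RunResult> {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const obs = await deps.observe();
    const step = await deps.decide(obs, steps);
    steps.push(step);
    // The model signals completion with a 'done' action.
    if (step.action === 'done') return { status: 'done', steps };
    await deps.execute(step);
  }
  // The cap was hit: report it instead of looping forever.
  return { status: 'max_steps', steps };
}
```

Returning a status rather than throwing lets the caller decide whether hitting the cap is a failure or a reason to hand off to a human.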

Form filling

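// Tag each form field so the LLM can refer to it by a real selector
const fields = await page.evaluate(() =>
  [...document.querySelectorAll('input, select, textarea')].map((el, i) => {
    el.setAttribute('data-agent-field', String(i));
    return {
      selector: `[data-agent-field="${i}"]`,
      name: el.getAttribute('name'),
      type: el.getAttribute('type') ?? el.tagName.toLowerCase(),
    };
  })
);

// Assumes llm.complete returns a parsed array: [{ selector, value }, ...]
const fills = await llm.complete({
  prompt: `Fill form for "Alice, alice@x.com": ${JSON.stringify(fields)}`,
});

for (const fill of fills) {
  // page.fill works for input and textarea; use page.selectOption for <select>
  await page.fill(fill.selector, fill.value);
}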

Multi-step task

"Order pizza":
1. Open URL
2. Click "Sign in"
3. Type email + password
4. Navigate to menu
5. Add pizza to cart
6. Checkout
7. Confirm

→ Each step costs at least one LLM call.

Error handling

try {
  await page.click(selector, { timeout: 5000 });
} catch (e) {
  // The element is missing or not visible
  const screenshot = await page.screenshot();
  // Attach `screenshot` as an image part; '[image]' below only marks where it goes
  const action = await llm.complete({
    prompt: `Click failed: ${(e as Error).message}. Current screen: [image]. What to do?`,
  });
  // → Retry / scroll / try a different selector
}
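
One way to structure the "different selector" path is to have the model propose several candidates and try them in order, keeping the error messages for the follow-up prompt. A minimal sketch; `clickWithFallback` and `ClickFn` are hypothetical names, and the click function is injected so the logic does not depend on Playwright.

```typescript
type ClickFn = (selector: string) => Promise<void>;

// Try each candidate selector in order and return the one that worked.
async function clickWithFallback(click: ClickFn, selectors: string[]): Promise<string> {
  const errors: string[] = [];
  for (const selector of selectors) {
    try {
      await click(selector);
      return selector;
    } catch (e) {
      // Keep the reason so it can be fed back to the LLM if every candidate fails.
      errors.push(`${selector}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  throw new Error(`All selectors failed:\n${errors.join('\n')}`);
}
```

With Playwright this would be called as `clickWithFallback((s) => page.click(s, { timeout: 5000 }), candidates)`.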

Vision (multimodal)

// GPT-4V / Claude / Gemini read the screenshot directly.
const ss = await page.screenshot();
const r = await llm.complete({
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: ss.toString('base64') } },
      { type: 'text', text: 'Find the login button. Output coordinate.' },
    ]},
  ],
});

→ Vision input raises cost significantly.

Cost

1 task ≈ 10-100 LLM calls.
Each call ≈ $0.01 - $0.10 (more with vision).

Per task ≈ $0.10 - $10.

→ E-commerce automation is feasible, but every click has a dollar cost.
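
The per-task range is just the product of the two ranges above, and the same arithmetic extends to a monthly budget. A small sketch; the function names and the 100-tasks-per-day figure in the usage note are illustrative assumptions.

```typescript
interface Range { min: number; max: number }

// Cost per task: (number of LLM calls) x (cost per call), at both ends of the range.
function estimateTaskCost(calls: Range, costPerCallUsd: Range): Range {
  return {
    min: calls.min * costPerCallUsd.min,
    max: calls.max * costPerCallUsd.max,
  };
}

// Scale a per-task range to a period of `days` at `tasksPerDay`.
function estimatePeriodCost(taskCost: Range, tasksPerDay: number, days = 30): Range {
  return {
    min: taskCost.min * tasksPerDay * days,
    max: taskCost.max * tasksPerDay * days,
  };
}
```

For example, `estimateTaskCost({ min: 10, max: 100 }, { min: 0.01, max: 0.10 })` reproduces the $0.10 to $10 range, and at 100 tasks per day for 30 days that becomes roughly $300 to $30,000.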

Speed

One LLM call takes 1-5 sec.
1 task = 30 sec - 5 min.

→ Not faster than a human, but it runs 24/7 and in parallel.

Use case

- A new style of web scraping (authenticated pages, dynamic UI)
- E2E test authoring (the LLM generates the tests)
- QA bot ("Is feature X broken?")
- Form submission automation
- Personal assistant (booking tickets)
- Research agent (visit 5 sites, summarize)

The idea behind browser-use

- The DOM tree is the input
- Elements are numbered
- LLM: "click 5"
- Browser: look up which element has id 5, then execute

→ This removes coordinate brittleness.
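
The numbered-element flow can be sketched in a few lines. This illustrates the idea only and is not browser-use's actual implementation; `IndexedElement`, `renderElements`, and `resolveClick` are hypothetical names, and the `data-agent-id` attribute follows the element ID pattern shown earlier.

```typescript
interface IndexedElement { id: number; tag: string; text: string }

// Render the numbered list the model sees, one line per element.
function renderElements(elements: IndexedElement[]): string {
  return elements.map((e) => `[${e.id}] ${e.tag}: ${e.text}`).join('\n');
}

// Turn a reply such as "click 5" into a selector for the tagged element.
function resolveClick(reply: string, elements: IndexedElement[]): string {
  const match = /^\s*click\s+(\d+)\s*$/i.exec(reply);
  if (!match) throw new Error(`Unrecognized command: ${reply}`);
  const id = Number(match[1]);
  // Reject IDs the model invented instead of clicking something arbitrary.
  if (!elements.some((e) => e.id === id)) throw new Error(`Unknown element id: ${id}`);
  return `[data-agent-id="${id}"]`;
}
```

Checking the ID against the known list is what turns a hallucinated "click 42" into a recoverable error.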

Sandbox

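// Untrusted user input → isolate the browser
// Caution: '--no-sandbox' and '--disable-setuid-sandbox' DISABLE Chromium's own
// sandbox. They are sometimes needed inside a container running as root, in which
// case the isolation has to come from the container or VM itself.
const browser = await chromium.launch({
  chromiumSandbox: true, // keep Chromium's sandbox on where the host allows it
});

→ Run the browser inside a container or VM for real isolation.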

Persistence

const context = await browser.newContext({
  storageState: 'auth.json',  // reuse previously saved cookies
});
const page = await context.newPage();
// → the session stays logged in

→ No need to log in again for every task.
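
The `auth.json` file is typically produced once, after a manual or scripted login, with `await context.storageState({ path: 'auth.json' })`. Before reusing it, it can be worth checking whether the saved cookies have expired. A minimal sketch over a subset of the saved JSON; `hasLiveCookie` is a hypothetical helper, and treating session cookies as live is an assumption.

```typescript
// Subset of the JSON written by context.storageState({ path }).
interface StoredCookie {
  name: string;
  domain: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}
interface StorageState { cookies: StoredCookie[] }

function hasLiveCookie(state: StorageState, domain: string, nowSeconds: number): boolean {
  return state.cookies.some((c) => {
    // Match the exact domain or any subdomain form such as ".example.com".
    const domainMatches = c.domain === domain || c.domain.endsWith(`.${domain}`);
    // Session cookies are treated as live here; the server may still reject them.
    const notExpired = c.expires === -1 || c.expires > nowSeconds;
    return domainMatches && notExpired;
  });
}
```

If the check fails, the agent can run the login flow once and save a fresh state instead of discovering the expired session mid-task.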

Captcha pitfalls

- Automation triggers bot detection.
- LLMs generally cannot solve captchas.
- Scraping may violate the site's ToS.

→ Offer an option for the user to intervene manually,
or use a paid captcha-solving service.

Anti-detection

- Random delay
- Real user-agent
- Fingerprint randomize
- Residential proxy

→ These techniques lean toward ToS violations. Use them only for legitimate use cases.

Eval

# Task suite (WebArena, VisualWebArena)
# load_dataset, agent, and check are placeholders for your own harness
tasks = load_dataset('webarena')
success = 0
for t in tasks:
    result = agent.run(t)
    if check(result, t.expected):
        success += 1
print(f'Success rate: {success / len(tasks):.1%}')

→ 2026 SoTA: roughly 60-80% on standard tasks.

Limitations

- Captchas
- Highly dynamic SPAs (client-side state)
- Long tasks (10+ steps)
- Privacy / login handling
- Cost (many LLM calls)
- Inaccuracy (hallucinated actions)

Observability

// Action log: `log` is your own logging function, `ss` the screenshot for this step
log({
  step: i,
  action: action,
  screenshot: ss,
  url: page.url(),
});

// Replay later

→ Makes debugging easier.
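
A concrete shape for such a log is one JSON object per line (JSONL), which is easy to append to and to read back for replay. A minimal sketch; `ActionLog` and `LogEntry` are hypothetical names, and storing a screenshot file path instead of the raw bytes is an assumption to keep the log small.

```typescript
interface LogEntry {
  step: number;
  action: unknown;          // whatever the agent decided at this step
  url: string;
  screenshotPath?: string;  // path to the saved image, not the image itself
}

class ActionLog {
  private entries: LogEntry[] = [];

  record(entry: LogEntry): void {
    this.entries.push(entry);
  }

  // Serialize as JSONL: one entry per line.
  toJsonl(): string {
    return this.entries.map((e) => JSON.stringify(e)).join('\n');
  }

  // Parse a JSONL string back into entries, ignoring blank lines.
  static fromJsonl(text: string): LogEntry[] {
    return text
      .split('\n')
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line) as LogEntry);
  }
}
```

Replaying a run then amounts to iterating over `ActionLog.fromJsonl(text)` and feeding each `action` back to the executor.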

Real production

  • Devin (Cognition): a coding agent that also uses a browser.
  • Anthropic Computer Use: a native API.
  • OpenAI Operator (2025): a browser agent product.
  • Adept ACT-1: a model for web actions.

🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Simple scrape | Playwright (no LLM) |
| Auth + dynamic | Browser agent |
| QA / E2E | Test generation + run |
| Research | browser-use library |
| Production | Computer Use API |
| Cost-sensitive | Selector + tool (no vision) |
| High difficulty | Vision + multi-step |

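Anti-patterns

  • Coordinates only (no element IDs): brittle.
  • No max iteration: infinite loops.
  • A fresh login every time: extra cost and higher detection risk.
  • Assuming there is no captcha: breaks in production.
  • No logging: impossible to debug.
  • Ignoring ToS: legal risk.
  • Vision for every task: unnecessary cost.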

🤖 Hints for LLM use

  • Anthropic Computer Use is a native API.
  • browser-use is a framework aimed at production use.
  • Element ID > coordinate.
  • Accessibility tree > screenshot (for cost).

🔗 Related documents