---
id: ai-browser-agent-patterns
title: Browser Agent — Playwright / Puppeteer / browser-use
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, agent, browser, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [browser agent, web agent, Playwright agent, browser-use, Computer Use, accessibility tree]
---

# Browser Agent

> An LLM drives the browser: click, type, scroll. **Anthropic Computer Use, browser-use, Playwright + LLM**. The modern take on web automation.

## 📖 Core concepts
- Input: a screenshot or the accessibility tree.
- The LLM decides the next action (click x,y / type / scroll).
- Loop until the task is done.
- Trade-off between reliability, cost, and speed.
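In practice, "the LLM decides the action" means turning free-form model output into one of a few typed actions and rejecting anything malformed. A minimal sketch, assuming the model is prompted to reply with JSON carrying an `action` field; `BrowserAction` and `parseAction` are hypothetical names, not a library API.

```typescript
// Hypothetical action vocabulary for a browser agent.
type BrowserAction =
  | { action: 'click'; x: number; y: number }
  | { action: 'type'; text: string }
  | { action: 'scroll'; dy: number }
  | { action: 'done'; answer?: string };

// Parse and validate raw LLM output; returns null on anything malformed
// so the caller can re-prompt instead of executing garbage.
function parseAction(raw: string): BrowserAction | null {
  // Models often wrap JSON in prose or a Markdown code fence; keep the outermost {...}.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let obj: any;
  try {
    obj = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }

  const isNum = (v: unknown): v is number => typeof v === 'number' && Number.isFinite(v);
  switch (obj?.action) {
    case 'click':
      return isNum(obj.x) && isNum(obj.y) ? { action: 'click', x: obj.x, y: obj.y } : null;
    case 'type':
      return typeof obj.text === 'string' ? { action: 'type', text: obj.text } : null;
    case 'scroll':
      return isNum(obj.dy) ? { action: 'scroll', dy: obj.dy } : null;
    case 'done':
      return typeof obj.answer === 'string' ? { action: 'done', answer: obj.answer } : { action: 'done' };
    default:
      return null;
  }
}
```

Returning `null` instead of throwing lets the loop re-prompt the model with an error message rather than executing a half-parsed action.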

## 💻 코드 패턴

### Playwright + LLM (simple)
```ts
import { chromium } from 'playwright';

// `llm` is a placeholder for your model client; assume it returns the parsed JSON action.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Screenshot → LLM
const screenshot = await page.screenshot(); // PNG Buffer
const action = await llm.complete({
  system: 'You are a browser agent. Output JSON: {action: click|type|scroll, ...}',
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshot.toString('base64') } },
      { type: 'text', text: 'Search for "hello world"' },
    ]},
  ],
});

// Execute the chosen action
if (action.action === 'click') await page.mouse.click(action.x, action.y);
if (action.action === 'type') await page.keyboard.type(action.text);
if (action.action === 'scroll') await page.mouse.wheel(0, action.dy);

await browser.close();
```
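Executing the action is a plain dispatch. The sketch below assumes a minimal `PageLike` interface that mirrors the subset of Playwright's `Page` used above (`mouse.click`, `mouse.wheel`, `keyboard.type`), so a real `Page` can be passed in while a fake is used in unit tests; `PageLike`, `executeAction`, and `fakePage` are illustrative names.

```typescript
// The subset of Playwright's Page the agent needs (a real Page satisfies it structurally).
interface PageLike {
  mouse: {
    click(x: number, y: number): Promise<void>;
    wheel(deltaX: number, deltaY: number): Promise<void>;
  };
  keyboard: { type(text: string): Promise<void> };
}

type Action =
  | { action: 'click'; x: number; y: number }
  | { action: 'type'; text: string }
  | { action: 'scroll'; dy: number };

// Exhaustive switch: adding a new action kind without handling it is a compile error.
async function executeAction(page: PageLike, a: Action): Promise<void> {
  switch (a.action) {
    case 'click':
      return page.mouse.click(a.x, a.y);
    case 'type':
      return page.keyboard.type(a.text);
    case 'scroll':
      return page.mouse.wheel(0, a.dy);
    default: {
      const exhaustive: never = a;
      throw new Error(`unknown action: ${JSON.stringify(exhaustive)}`);
    }
  }
}

// A recording fake for testing agent logic without launching a browser.
function fakePage(log: string[]): PageLike {
  return {
    mouse: {
      click: async (x, y) => { log.push(`click ${x},${y}`); },
      wheel: async (dx, dy) => { log.push(`wheel ${dx},${dy}`); },
    },
    keyboard: { type: async (t) => { log.push(`type ${t}`); } },
  };
}
```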

### Anthropic Computer Use
```ts
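import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY
// `page` is a Playwright Page, as in the previous example.

// Computer Use is a beta tool: the tool version and beta flag must match the model
// (computer_20241022 ↔ 'computer-use-2024-10-22'; check the docs for newer versions).
const r = await client.beta.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  betas: ['computer-use-2024-10-22'],
  tools: [{
    type: 'computer_20241022',
    name: 'computer',
    display_width_px: 1024,
    display_height_px: 768,
  }],
  messages: [{
    role: 'user',
    content: [
      { type: 'image', source: { ... } },
      { type: 'text', text: 'Find the login button and click it' },
    ],
  }],
});

// r.content contains tool_use blocks → execute them
for (const c of r.content) {
  if (c.type === 'tool_use' && c.name === 'computer') {
    const { action, coordinate } = c.input as { action: string; coordinate?: [number, number] };
    if (action === 'left_click' && coordinate) await page.mouse.click(...coordinate);
    // ... then send a tool_result (e.g. a fresh screenshot) back and repeat
  }
}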
```

→ Claude gets a native computer-control tool (screenshots + mouse/keyboard actions); it can drive any screen, not only a browser.
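Computer Use coordinates refer to the screenshots Claude is shown, which should match the `display_width_px` × `display_height_px` declared in the tool. If you downscale screenshots from a larger real viewport, the returned coordinates must be scaled back before clicking. A minimal sketch, assuming uniform per-axis resizing; `toViewport` is a hypothetical helper, and with Playwright the real size can come from `page.viewportSize()`.

```typescript
interface Size { width: number; height: number }

// Map a coordinate from the model's (declared display) space to the real viewport.
// Assumes the screenshot was resized per axis from `viewport` down to `declared`.
function toViewport(
  [x, y]: [number, number],
  declared: Size,
  viewport: Size,
): [number, number] {
  const sx = viewport.width / declared.width;
  const sy = viewport.height / declared.height;
  // Round to whole pixels and clamp so a slightly-off answer never lands outside the page.
  const clamp = (v: number, max: number) => Math.min(Math.max(Math.round(v), 0), max - 1);
  return [clamp(x * sx, viewport.width), clamp(y * sy, viewport.height)];
}
```

For example, with a 1024×768 declared display and a 2048×1536 viewport, `[512, 384]` maps to `[1024, 768]`.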

### browser-use (Python framework)
```python
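import asyncio

from browser_use import Agent
# Older browser-use releases accept LangChain chat models; newer ones ship their own wrappers.
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on Jun 1',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    result = await agent.run()  # run() is async
    print(result)

asyncio.run(main())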
```

→ The library handles the loop, page-state extraction (DOM / accessibility), and robustness.

### Accessibility tree (DOM-based)
```ts
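// page.accessibility is deprecated in recent Playwright releases; locator.ariaSnapshot() is the newer alternative.
const snapshot = await page.accessibility.snapshot();
// { name: 'Page', children: [{ role: 'button', name: 'Login' }, ...] }

// Send the a11y tree to the LLM (`llm` is a placeholder client)
const action = await llm.complete({
  prompt: `Tree: ${JSON.stringify(snapshot)}\nTask: ...`,
});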
```

→ Usually more precise than screenshots, and no vision model is needed (text input is also cheaper).
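The raw snapshot is verbose, so agents typically flatten it into one numbered line per interactive node: fewer tokens, and short handles the LLM can answer with. A minimal sketch, assuming the `{ role, name, children }` node shape shown above; the `INTERACTIVE` role set and the `[index] role "name"` line format are illustrative choices, not a standard.

```typescript
// Node shape as returned by the accessibility snapshot (subset).
interface AXNode {
  role: string;
  name?: string;
  children?: AXNode[];
}

// Roles exposed to the LLM as actionable targets (an assumption; tune per site).
const INTERACTIVE = new Set(['button', 'link', 'textbox', 'checkbox', 'combobox', 'searchbox']);

// Depth-first walk that emits `[index] role "name"` for interactive nodes only.
function flattenAX(root: AXNode): { lines: string[]; nodes: AXNode[] } {
  const lines: string[] = [];
  const nodes: AXNode[] = [];
  const walk = (n: AXNode) => {
    if (INTERACTIVE.has(n.role)) {
      lines.push(`[${nodes.length}] ${n.role} "${n.name ?? ''}"`);
      nodes.push(n);
    }
    n.children?.forEach(walk);
  };
  walk(root);
  return { lines, nodes };
}
```

The LLM then answers with an index, and `nodes[index]` gives the role and name to target, for example with Playwright's `getByRole`.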

### Element ID assignment
```ts
// Tag every interactive element with an ID → the LLM clicks by ID.
await page.evaluate(() => {
  document.querySelectorAll('button, a, input').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i)); // setAttribute expects a string
  });
});

// Show the labels to the LLM (drawn on the screenshot or listed as text)
// LLM: "click element with id 5"
await page.click('[data-agent-id="5"]');
```

→ Coordinates are brittle (they break on resize or layout shifts). IDs are stable within a step; re-number after every navigation or DOM change.
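Because IDs come from a single numbering pass, the LLM's reply should be checked against the current ID range before clicking. A minimal sketch, assuming replies like "click element with id 5" or "click 5" and the `data-agent-id` attribute assigned above; `resolveAgentId` is a hypothetical helper.

```typescript
// Turn an LLM reply into a selector for the data-agent-id attribute.
// Returns null when no ID is found or the ID is outside the current numbering.
function resolveAgentId(reply: string, assignedCount: number): string | null {
  // Prefer an explicit "id 5" / "id: 5"; fall back to the first bare number.
  const m = reply.match(/\bid\s*[:=]?\s*(\d+)\b/i) ?? reply.match(/\b(\d+)\b/);
  if (!m) return null;
  const id = Number(m[1]);
  // Out-of-range IDs are hallucinations or refer to a previous page state.
  if (id >= assignedCount) return null;
  return `[data-agent-id="${id}"]`;
}
```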

### Selector strategy
```ts
// The LLM generates a selector (`llm` is a placeholder client returning parsed JSON)
const action = await llm.complete({
  prompt: `Click the "Subscribe" button. Output: {selector}`,
});

await page.click(action.selector);
// ❌ "button:nth-child(3)" — brittle
// ✅ "button:has-text('Subscribe')" — semantic
```

→ Playwright's semantic selectors (text, role) are more robust, e.g. `page.getByRole('button', { name: 'Subscribe' })`.
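LLM-generated selectors can be screened before use, so positional ones get sent back for a semantic rewrite. A minimal heuristic sketch; the pattern list and the depth threshold are assumptions, not rules from Playwright.

```typescript
// Positional constructs that break on small layout changes.
const BRITTLE_PATTERNS: RegExp[] = [
  /:nth-child\(/,
  /:nth-of-type\(/,
  /:nth-last-child\(/,
  /\bnth=\d+/, // Playwright's positional nth= selector
];

// Returns the reasons a selector looks brittle (empty array = acceptable).
function brittlenessReasons(selector: string, maxDepth = 3): string[] {
  const reasons: string[] = [];
  for (const p of BRITTLE_PATTERNS) {
    if (p.test(selector)) reasons.push(`positional: ${p.source}`);
  }
  // Ignore quoted text (e.g. has-text('Subscribe now')) when counting hops.
  const unquoted = selector.replace(/(['"]).*?\1/g, '""');
  // Count descendant/child combinator hops, e.g. "div > ul li" = 2 hops.
  const hops = unquoted.trim().split(/\s*>+\s*|\s+/).filter(Boolean).length - 1;
  if (hops > maxDepth) reasons.push(`too deep: ${hops} hops`);
  return reasons;
}
```

A non-empty result can be fed back to the LLM ("selector rejected: positional") instead of clicking.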

### Loop until task done
```ts
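const MAX_STEPS = 50; // hard cap on iterations
for (let i = 0; i < MAX_STEPS; i++) {
  const screenshot = await page.screenshot();
  const action = await llm.complete({ ... }); // `llm` and `execute` are placeholders

  if (action.type === 'done') break;

  await execute(action);
  // Playwright discourages 'networkidle'; waiting for a specific element is often more reliable.
  await page.waitForLoadState('networkidle');
}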
```

→ Cap the iteration count to prevent infinite loops.
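The same loop can be written once with the observe / decide / act steps injected, which makes the step cap testable without a browser. A minimal sketch; `runAgentLoop`, `Step`, and the hook shapes are hypothetical, and in real use `observe` would take a screenshot or snapshot and `act` would call Playwright.

```typescript
type Step<A> = { kind: 'act'; action: A } | { kind: 'done'; answer: string };

interface LoopResult {
  status: 'done' | 'max_steps';
  steps: number; // number of actions executed
  answer?: string;
}

async function runAgentLoop<O, A>(opts: {
  observe: () => Promise<O>;
  decide: (obs: O, step: number) => Promise<Step<A>>;
  act: (action: A) => Promise<void>;
  maxSteps?: number;
}): Promise<LoopResult> {
  const maxSteps = opts.maxSteps ?? 50;
  for (let i = 0; i < maxSteps; i++) {
    const obs = await opts.observe();
    const next = await opts.decide(obs, i);
    if (next.kind === 'done') return { status: 'done', steps: i, answer: next.answer };
    await opts.act(next.action);
  }
  // Budget exhausted: report it explicitly instead of looping forever.
  return { status: 'max_steps', steps: maxSteps };
}
```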

### Form filling
```ts
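// Extract form fields for the LLM
const fields = await page.evaluate(() =>
  [...document.querySelectorAll<HTMLInputElement | HTMLSelectElement | HTMLTextAreaElement>('input, select, textarea')]
    .map((el, i) => {
      el.setAttribute('data-agent-id', String(i)); // stable handle (outerHTML is not a selector)
      return {
        selector: `[data-agent-id="${i}"]`,
        type: el.type,
        name: el.name,
        label: el.labels?.[0]?.textContent?.trim() ?? '',
      };
    })
);

// `llm` is a placeholder; assume it returns [{ selector, value }, ...]
const fills = await llm.complete({
  prompt: `Fill form for "Alice, alice@x.com": ${JSON.stringify(fields)}`,
});

for (const fill of fills) {
  await page.fill(fill.selector, fill.value); // use page.selectOption for <select>
}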
```
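Between the LLM's fill plan and `page.fill()` it helps to allow only selectors that came from the extraction step, so a hallucinated selector is rejected instead of typed into an arbitrary element. A minimal sketch; the `FieldInfo` / `Fill` shapes and `validateFills` are assumptions.

```typescript
interface FieldInfo { selector: string; type: string; name: string }
interface Fill { selector: string; value: string }

// Split the LLM's plan into fills we trust and entries we reject.
function validateFills(fields: FieldInfo[], fills: unknown): { ok: Fill[]; rejected: unknown[] } {
  const known = new Set(fields.map((f) => f.selector));
  const ok: Fill[] = [];
  const rejected: unknown[] = [];
  if (!Array.isArray(fills)) return { ok, rejected: [fills] };
  for (const f of fills) {
    // Selector must come from extraction, and the value must be a string.
    const valid =
      typeof f?.selector === 'string' && known.has(f.selector) && typeof f?.value === 'string';
    if (valid) ok.push({ selector: f.selector, value: f.value });
    else rejected.push(f);
  }
  return { ok, rejected };
}
```

Rejected entries can be returned to the model as feedback for one more attempt.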

### Multi-step task
```
"Order pizza":
1. Open URL
2. Click "Sign in"
3. Type email + password
4. Navigate to menu
5. Add pizza to cart
6. Checkout
7. Confirm

→ Each step is at least one LLM call.
```

### Error handling
```ts
try {
  await page.click(selector, { timeout: 5000 });
} catch (e) {
  // The element is missing or not visible
  const screenshot = await page.screenshot();
  // `llm` is a placeholder; attach `screenshot` as an image part in a real multimodal call
  const action = await llm.complete({
    prompt: `Click failed: ${(e as Error).message}. Current screen: [image]. What to do?`,
  });
  // → Retry / scroll / try a different selector
}
```
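The try/catch above generalizes to a bounded retry where each failure message is handed to a recovery step (for example, asking the LLM for a new selector). A minimal sketch; `retryWithFeedback` and its `attempt` / `recover` hooks are hypothetical.

```typescript
// Try `attempt`; on failure, ask `recover` for a new hint and try again, up to maxTries.
async function retryWithFeedback<T>(
  attempt: (hint: string | undefined) => Promise<T>,
  recover: (error: string, tryNo: number) => Promise<string>,
  maxTries = 3,
): Promise<T> {
  let hint: string | undefined;
  let lastError = '';
  for (let tryNo = 1; tryNo <= maxTries; tryNo++) {
    try {
      return await attempt(hint);
    } catch (e) {
      lastError = e instanceof Error ? e.message : String(e);
      // Only spend an LLM call on recovery if another attempt will follow.
      if (tryNo < maxTries) hint = await recover(lastError, tryNo);
    }
  }
  throw new Error(`gave up after ${maxTries} tries: ${lastError}`);
}
```

In use, `attempt` would call `page.click(hint ?? selector, { timeout: 5000 })` and `recover` would send the error plus a fresh screenshot to the model.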

### Vision (multimodal)
```ts
// Vision models (GPT-4V / Claude / Gemini) read the screenshot directly.
const ss = await page.screenshot();
const r = await llm.complete({
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: ss.toString('base64') } },
      { type: 'text', text: 'Find the login button. Output coordinate.' },
    ]},
  ],
});
```

→ Vision adds significant cost: every step sends a screenshot as image input tokens.
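A rough way to quantify this: Anthropic's vision documentation estimates an image at about `width × height / 750` input tokens (larger images are downscaled first, so this only holds below the size limit). A minimal sketch of that arithmetic, rounded up here, with the price per million input tokens left as a parameter; other providers count image tokens differently.

```typescript
// Approximate input tokens for one image (Anthropic's published rule of thumb).
function imageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

// Screenshot cost for a whole task, given a price per million input tokens.
function screenshotCost(steps: number, width: number, height: number, usdPerMTok: number): number {
  return (steps * imageTokens(width, height) * usdPerMTok) / 1_000_000;
}
```

At 50 steps with 1024×768 screenshots, that is about 52k input tokens from images alone, before any text.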

### Cost
```
1 task ≈ 10-100 LLM calls.
Each call ≈ $0.01 - $0.10 (more with vision).

Per task ≈ $0.10 - $10.

→ E-commerce automation is feasible, but every click has a price.
```
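With 10-100 calls per task, a runaway loop is also a cost problem. A minimal sketch of a per-task budget guard that accumulates token usage after each call (most model APIs report input/output token counts in a `usage` field) and signals when to stop; `TaskBudget` and the prices passed in are placeholders, not real rates.

```typescript
class TaskBudget {
  private spentUsd = 0;
  private readonly maxUsd: number;
  private readonly usdPerMInput: number;
  private readonly usdPerMOutput: number;

  constructor(maxUsd: number, usdPerMInput: number, usdPerMOutput: number) {
    this.maxUsd = maxUsd;
    this.usdPerMInput = usdPerMInput;
    this.usdPerMOutput = usdPerMOutput;
  }

  // Call once per LLM response with its token counts.
  record(inputTokens: number, outputTokens: number): void {
    this.spentUsd += (inputTokens * this.usdPerMInput + outputTokens * this.usdPerMOutput) / 1_000_000;
  }

  get spent(): number { return this.spentUsd; }

  // Check this in the agent loop before the next call.
  get exhausted(): boolean { return this.spentUsd >= this.maxUsd; }
}
```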

### Speed
```
LLM call 1-5 sec.
1 task = 30 sec - 5 min.

→ Not faster than a human, but it runs 24/7 and in parallel.
```
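Parallelism is where the throughput comes from, but it needs a cap (browser memory, rate limits). A minimal worker-pool sketch that runs tasks with at most `limit` in flight and keeps results in input order; `mapWithConcurrency` is a hypothetical helper, and each task would typically use its own browser context.

```typescript
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS runs callbacks on one thread
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```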

### Use case
```
- A new kind of web scraping (auth + dynamic UI)
- Writing E2E tests (the LLM generates the tests)
- QA bot ("Is feature X broken?")
- Form submission automation
- Personal assistant (booking tickets)
- Research agent (visit 5 sites, summarize)
```

### The browser-use idea
```
- The DOM tree is the input
- Elements are numbered
- LLM: "click 5"
- Browser: look up element 5 → execute

→ Removes coordinate brittleness.
```

### Sandbox
```ts
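// Untrusted user input → run the browser inside an isolated container / VM.
// Note: these flags DISABLE Chromium's own sandbox (often required when running as
// root in Docker), so the container / VM must provide the isolation.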
const browser = await chromium.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
```

→ The real isolation comes from the container / VM.
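Besides isolating the process, an agent driven by untrusted input can be kept on a host allowlist. A minimal sketch; `isAllowedUrl` and the allowlist contents are placeholders, and in Playwright the check could sit in a `page.route('**/*', ...)` handler that calls `route.continue()` or `route.abort()` based on `route.request().url()` (subresources from CDNs then need their hosts listed too).

```typescript
// Allow only https URLs whose host is an allowlisted domain or a true subdomain of one.
function isAllowedUrl(raw: string, allowedHosts: string[]): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparsable → block
  }
  if (url.protocol !== 'https:') return false;
  const host = url.hostname.toLowerCase();
  // "evil-example.com" must not match "example.com", so require an exact match or a dot boundary.
  return allowedHosts.some((h) => host === h || host.endsWith(`.${h}`));
}
```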

### Persistence
```ts
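// After logging in once (`loginContext` = the context you logged in with),
// save cookies + localStorage to disk
await loginContext.storageState({ path: 'auth.json' });

// Later runs: start from the saved state
const context = await browser.newContext({
  storageState: 'auth.json',  // reuse previously saved cookies
});
const page = await context.newPage();
// → already logged in (until the session expires)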
```

→ No need to log in for every task.
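Saved sessions expire. In Playwright's storage state file, each cookie's `expires` is a Unix timestamp in seconds (`-1` for session cookies). A minimal sketch that decides whether `auth.json` is still worth reusing, assuming the auth cookie's name is known; `sessionStillValid` and the `sessionCookie` parameter are hypothetical.

```typescript
// Subset of Playwright's storageState JSON.
interface StoredCookie { name: string; domain: string; expires: number }
interface StorageState { cookies: StoredCookie[] }

// True if the named auth cookie exists and won't expire within `marginSec`.
function sessionStillValid(
  state: StorageState,
  sessionCookie: string,
  nowSec = Math.floor(Date.now() / 1000),
  marginSec = 300,
): boolean {
  const c = state.cookies.find((ck) => ck.name === sessionCookie);
  if (!c) return false;
  // Session cookie: no expiry recorded; the server may still reject it.
  if (c.expires === -1) return true;
  return c.expires > nowSec + marginSec;
}
```

Load the file with `JSON.parse(fs.readFileSync('auth.json', 'utf8'))`; if the check fails, log in again and re-save.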

### Captcha pitfalls
```
- Automation triggers bot detection.
- LLMs can't reliably solve captchas.
- Scraping may violate the site's ToS.

→ Give the user a manual-intervention option,
or use a captcha-solving service ($).
```
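For the manual-intervention option, the agent first has to notice the captcha and pause. A minimal heuristic sketch; the marker list is an assumption and far from complete. With Playwright, the inputs could come from `page.frames().map((f) => f.url())` and `page.innerText('body')`.

```typescript
// Markers that suggest a captcha / bot challenge (assumed, not exhaustive).
const CAPTCHA_MARKERS = [/recaptcha/i, /hcaptcha/i, /challenges\.cloudflare\.com/i, /\bcaptcha\b/i];

type Verdict = { kind: 'continue' } | { kind: 'needs_human'; reason: string };

// Check iframe URLs and visible text; hand control to a human on any match.
function checkForCaptcha(frameUrls: string[], visibleText: string): Verdict {
  for (const src of [...frameUrls, visibleText]) {
    const hit = CAPTCHA_MARKERS.find((re) => re.test(src));
    if (hit) return { kind: 'needs_human', reason: `matched ${hit.source}` };
  }
  return { kind: 'continue' };
}
```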

### Anti-detection
```
- Random delay
- Real user-agent
- Fingerprint randomize
- Residential proxy

→ These push toward ToS violations. Legitimate use cases only.
```

### Eval
```python
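# Task suite (WebArena, VisualWebArena). load_dataset / agent / check are placeholders:
# both benchmarks ship their own environments and evaluation harnesses.
tasks = load_dataset('webarena')
success = 0
for t in tasks:
    result = agent.run(t)
    if check(result, t.expected):
        success += 1
print(f'Success rate: {success / len(tasks):.1%}')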
```

→ Reported 2026 SoTA: roughly 60-80% on standard tasks (varies by benchmark and setup).

### Limitations
```
- Captcha
- Highly dynamic SPAs (state)
- Long tasks (10+ steps)
- Privacy / login
- Cost (many LLM calls)
- Inaccuracy (hallucination)
```

### Observability
```ts
// Action log
log({
  step: i,
  action: action,
  screenshot: ss,
  url: page.url(),
});

// Replay later
```

→ Makes debugging and replay easy.
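A common concrete form for this log is one JSON object per line (JSONL), with screenshots saved to files (e.g. `page.screenshot({ path })`) and referenced by path so the log stays small. A minimal sketch; the `StepRecord` fields are assumptions.

```typescript
interface StepRecord {
  step: number;
  url: string;
  action: unknown;
  screenshotPath: string;
  ts: string; // ISO timestamp
}

// Serialize steps as JSON Lines (one record per line, trailing newline).
function toJsonl(records: StepRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}

// Parse a log back for replay / debugging; blank lines are ignored.
function fromJsonl(text: string): StepRecord[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as StepRecord);
}
```

Appending one line per step also means a crashed run still leaves a readable log up to the failure.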

### In production
- **Devin** (Cognition): a coding agent that also uses a browser.
- **Anthropic Computer Use**: native API.
- **OpenAI Operator** (2025): browser agent product.
- **Adept ACT-1**: web action model.

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Simple scrape | Playwright (no LLM) |
| Auth + dynamic UI | Browser agent |
| QA / E2E | Generate tests + run them |
| Research | browser-use library |
| Production | Computer Use API |
| Cost-sensitive | Selectors + tools (no vision) |
| High difficulty | Vision + multi-step |

## ❌ Anti-patterns
- **Coordinates only (no element ID)**: brittle.
- **No max iteration**: infinite loop.
- **Fresh login every time**: cost / bot detection.
- **Assuming there are no captchas**: breaks in production.
- **No logging**: impossible to debug.
- **Ignoring ToS**: legal risk.
- **Vision for every task**: cost.

## 🤖 LLM usage hints
- Anthropic Computer Use is a native API.
- browser-use is a framework aimed at production use.
- Element ID > coordinate.
- Accessibility tree > screenshot (cost).

## 🔗 Related docs
- [[AI_Multi_Agent_Coordination]]
- [[AI_Tool_Composition_Deep]]
- [[Testing_Playwright_Advanced]]