[G1-Sync] Manual knowledge update
---
id: ai-browser-agent-patterns
title: Browser Agent - Playwright / Puppeteer / browser-use
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, agent, browser, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [browser agent, web agent, Playwright agent, browser-use, Computer Use, accessibility tree]
---

# Browser Agent

> An LLM drives the browser: click, type, scroll. **Anthropic Computer Use, browser-use, Playwright + LLM**. The modern take on web automation.

## 📖 Core concepts
- The input is a screenshot or the accessibility tree.
- The LLM decides the next action (click at x,y / type / scroll).
- The loop repeats until the task is done.
- The approach trades off reliability, cost, and speed.

## 💻 Code patterns

### Playwright + LLM (simple)
```ts
import { chromium } from 'playwright';

// `llm` is a placeholder for your own model client; it is assumed to return parsed JSON.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Screenshot → LLM
const screenshot = await page.screenshot(); // PNG Buffer by default
const action = await llm.complete({
  system: 'You are a browser agent. Output JSON: {action: click|type|scroll, ...}',
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshot.toString('base64') } },
      { type: 'text', text: 'Search for "hello world"' },
    ]},
  ],
});

// Execute the chosen action
if (action.action === 'click') await page.mouse.click(action.x, action.y);
if (action.action === 'type') await page.keyboard.type(action.text);
// `deltaY` is an assumed field name for the scroll distance in pixels.
if (action.action === 'scroll') await page.mouse.wheel(0, action.deltaY ?? 600);

await browser.close();
```
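
The snippet above executes whatever JSON the model returns. A minimal sketch of validating that reply first, in Python for brevity. The field names (`action`, `x`, `y`, `text`) follow the prompt above; `parse_action` and `ALLOWED` are hypothetical names introduced here, not part of any library.

```python
import json

# Required fields per action; the shape mirrors the JSON requested in the prompt.
ALLOWED = {
    "click": {"x", "y"},
    "type": {"text"},
    "scroll": set(),
}


def parse_action(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the allowed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    kind = data.get("action")
    if not isinstance(kind, str) or kind not in ALLOWED:
        raise ValueError(f"unknown action: {kind!r}")
    missing = ALLOWED[kind] - set(data)
    if missing:
        raise ValueError(f"missing fields for {kind}: {sorted(missing)}")
    return data
```

On a `ValueError`, the agent can send the error text back to the model and ask again instead of acting on a malformed reply.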

### Anthropic Computer Use
```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Computer use is a beta tool: call it through client.beta.messages with a beta flag.
// The tool version and beta flag must match the model; check the current docs.
const r = await client.beta.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  betas: ['computer-use-2024-10-22'],
  tools: [{
    type: 'computer_20241022',
    name: 'computer',
    display_width_px: 1024,
    display_height_px: 768,
  }],
  messages: [{
    role: 'user',
    content: [
      { type: 'image', source: { ... } },
      { type: 'text', text: 'Find the login button and click it' },
    ],
  }],
});

// r.content contains tool_use blocks → execute them (`page` is a Playwright page)
for (const c of r.content) {
  if (c.type === 'tool_use' && c.name === 'computer') {
    const { action, coordinate } = c.input as { action: string; coordinate?: [number, number] };
    if (action === 'left_click' && coordinate) await page.mouse.click(...coordinate);
    // ...
  }
}
```

→ Claude emits native computer-control actions as `tool_use` blocks; your code executes them and returns a `tool_result` (typically a fresh screenshot).
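
The tool above declares a 1024×768 display, so returned coordinates live in that space. If screenshots are downscaled from a larger real viewport before being sent (an assumption about your setup), the coordinates need to be mapped back before clicking. A minimal sketch in Python; `scale_to_viewport` is a hypothetical helper:

```python
def scale_to_viewport(coordinate, declared_size, viewport_size):
    """Map (x, y) from the declared display space to the real viewport."""
    x, y = coordinate
    dw, dh = declared_size
    vw, vh = viewport_size
    sx = round(x * vw / dw)
    sy = round(y * vh / dh)
    # Clamp so a point on the far edge never lands outside the viewport.
    return (min(max(sx, 0), vw - 1), min(max(sy, 0), vh - 1))
```

For example, a click at the center of the declared display, `(512, 384)`, maps to `(1024, 768)` on a 2048×1536 viewport.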

### browser-use (Python framework)
```python
import asyncio

from browser_use import Agent
# langchain.chat_models is deprecated; newer browser-use releases may also ship
# their own chat model classes, so check the README for your installed version.
from langchain_openai import ChatOpenAI


async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on Jun 1',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    result = await agent.run()
    print(result)


asyncio.run(main())
```

→ The library handles the agent loop, DOM/accessibility extraction, and stability concerns.
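
### Accessibility tree (DOM-based)
```ts
// Note: page.accessibility is deprecated in recent Playwright releases;
// locator.ariaSnapshot() is the newer alternative.
const snapshot = await page.accessibility.snapshot();
// { role: 'WebArea', name: 'Page', children: [{ role: 'button', name: 'Login' }, ...] }

// Pass the accessibility tree to the LLM (`llm` is a placeholder client)
const action = await llm.complete({
  prompt: `Tree: ${JSON.stringify(snapshot)}\nTask: ...`,
});
```

→ Usually more precise than a screenshot, and no vision model is needed. Content drawn on a canvas or lacking accessible names will not appear in the tree.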
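
Raw JSON spends many tokens on braces and keys. A minimal sketch of flattening the tree into indented text before prompting, in Python. It assumes nodes shaped like `{role, name, children}` as in the example comment above; `flatten_tree` is a hypothetical helper.

```python
def flatten_tree(node: dict, depth: int = 0) -> list[str]:
    """Render a {role, name, children} node as indented 'role "name"' lines."""
    label = node.get("role", "unknown")
    name = node.get("name")
    if name:
        label += f' "{name}"'
    lines = ["  " * depth + label]
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


snapshot = {
    "role": "WebArea",
    "name": "Page",
    "children": [
        {"role": "button", "name": "Login"},
        {"role": "textbox", "name": "Email"},
    ],
}
prompt_tree = "\n".join(flatten_tree(snapshot))
```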

### Element ID assignment
```ts
// Tag every interactive element with an ID so the LLM can click by ID.
await page.evaluate(() => {
  document.querySelectorAll('button, a, input').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i));
  });
});

// Show the IDs to the model (labels drawn on the screenshot, or a text list).
// LLM: "click element with id 5"
await page.click('[data-agent-id="5"]');
```

→ Coordinates are brittle (they break on resize or scroll). IDs stay stable until the DOM re-renders, so re-tag after each navigation.
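
A minimal sketch of the bookkeeping on the agent side, in Python: number the elements for the prompt, then resolve the model's reply back to a selector. It assumes elements arrive as dicts with `tag` and `text` keys and that the model answers in the form `click N`; both function names are hypothetical.

```python
import re


def build_listing(elements: list[dict]) -> tuple[str, dict[int, str]]:
    """Number the elements; return (text listing for the prompt, id -> selector map)."""
    lines, selectors = [], {}
    for i, el in enumerate(elements):
        lines.append(f'[{i}] <{el["tag"]}> {el["text"]}')
        selectors[i] = f'[data-agent-id="{i}"]'
    return "\n".join(lines), selectors


def resolve(reply: str, selectors: dict[int, str]) -> str:
    """Turn a reply such as 'click 5' into a selector, rejecting unknown IDs."""
    match = re.fullmatch(r"\s*click\s+(\d+)\s*", reply)
    if not match or int(match.group(1)) not in selectors:
        raise ValueError(f"cannot resolve reply: {reply!r}")
    return selectors[int(match.group(1))]
```

Rejecting IDs that were never issued catches a hallucinated target before it reaches the browser.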

### Selector strategy
```ts
// The LLM generates a selector (`llm` is a placeholder client returning parsed JSON)
const action = await llm.complete({
  prompt: `Click the "Subscribe" button. Output: {selector}`,
});

await page.click(action.selector);
// ❌ "button:nth-child(3)": brittle, breaks when the layout changes
// ✅ "button:has-text('Subscribe')": semantic, tied to the visible text
// ✅ page.getByRole('button', { name: 'Subscribe' }): role-based locator
```

→ Playwright's semantic selectors are more robust.
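
One way to enforce this is to screen model-generated selectors before using them. A rough heuristic in Python, not an exhaustive rule set; `is_brittle` and the pattern list are assumptions made for illustration.

```python
import re

# Patterns that tie a selector to DOM position rather than meaning.
BRITTLE = re.compile(r":nth-(child|of-type)\(|:(first|last)-child|^/")


def is_brittle(selector: str) -> bool:
    """True for positional CSS pseudo-classes and absolute XPath."""
    return bool(BRITTLE.search(selector))
```

When a selector is flagged, the agent can ask the model for a text- or role-based alternative instead.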

### Loop until task done
```ts
const MAX_STEPS = 50;
for (let i = 0; i < MAX_STEPS; i++) {
  const screenshot = await page.screenshot();
  const action = await llm.complete({ ... });

  if (action.action === 'done') break;

  await execute(action);
  // 'networkidle' may never fire on pages that poll, so cap the wait and move on.
  await page.waitForLoadState('networkidle', { timeout: 5000 }).catch(() => {});
}
```

→ Cap the number of iterations to prevent an infinite loop.
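
A step cap alone still lets a stuck agent burn all 50 calls on the same action. A minimal sketch in Python that adds repeat detection on top of the cap. The `observe`, `decide`, and `execute` callables stand in for the screenshot, LLM call, and browser action; `run_loop` and its return shape are hypothetical, and `max_repeats` is assumed to be at least 2.

```python
def run_loop(observe, decide, execute, max_steps=50, max_repeats=3):
    """Observe → decide → act until 'done', a step cap, or a stuck agent."""
    history = []
    for step in range(max_steps):
        action = decide(observe(), history)
        if action["action"] == "done":
            return {"status": "done", "steps": step}
        # The same action max_repeats times in a row usually means the agent is stuck.
        recent = history[-(max_repeats - 1):]
        if len(recent) == max_repeats - 1 and all(a == action for a in recent):
            return {"status": "stuck", "steps": step}
        execute(action)
        history.append(action)
    return {"status": "max_steps", "steps": max_steps}
```

Returning a status instead of breaking silently lets the caller distinguish success from a timeout or a stall.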
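
### Form filling
```ts
// Extract form fields for the LLM
const fields = await page.evaluate(() =>
  [...document.querySelectorAll('input, select, textarea')].map((el, i) => {
    // outerHTML is not a valid selector, so tag each field and select by that tag.
    el.setAttribute('data-agent-field', String(i));
    return {
      selector: `[data-agent-field="${i}"]`,
      name: el.getAttribute('name'),
      type: el.getAttribute('type') ?? el.tagName.toLowerCase(),
    };
  })
);

// `llm` is a placeholder; assumed to return a parsed array of { selector, value }.
const fills = await llm.complete({
  prompt: `Fill form for "Alice, alice@x.com": ${JSON.stringify(fields)}`,
});

for (const fill of fills) {
  // page.fill covers input and textarea; a <select> needs page.selectOption instead.
  await page.fill(fill.selector, fill.value);
}
```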
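
The model may return fills for fields that do not exist on the page. A minimal sketch in Python of filtering its output against the extracted fields before touching the page. It assumes the `{selector, value}` shape used above; `validate_fills` is a hypothetical helper.

```python
def validate_fills(fields: list[dict], fills: list[dict]) -> list[dict]:
    """Keep only fills that target a known selector and carry a string value."""
    known = {f["selector"] for f in fields}
    valid = []
    for fill in fills:
        selector, value = fill.get("selector"), fill.get("value")
        if isinstance(selector, str) and selector in known and isinstance(value, str):
            valid.append({"selector": selector, "value": value})
    return valid
```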

### Multi-step task
```
"Order pizza":
1. Open URL
2. Click "Sign in"
3. Type email + password
4. Navigate to menu
5. Add pizza to cart
6. Checkout
7. Confirm

→ Each step is at least one LLM call.
```

### Error handling
```ts
try {
  await page.click(selector, { timeout: 5000 });
} catch (e) {
  // The element is missing or not visible
  const screenshot = await page.screenshot();
  const action = await llm.complete({
    prompt: `Click failed: ${(e as Error).message}. Current screen: [image]. What to do?`,
  });
  // → Retry / scroll / try a different selector
}
```
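
Recovery should be bounded, or a missing element turns into an endless chain of LLM calls. A minimal sketch in Python: `click` and `ask_llm` are stand-ins for the browser call and the model call, and `click_with_recovery` is a hypothetical helper.

```python
def click_with_recovery(click, ask_llm, selector, max_attempts=3):
    """Try a click; on failure ask the model for a new selector, up to max_attempts."""
    errors = []
    for attempt in range(max_attempts):
        try:
            click(selector)
            return selector
        except Exception as exc:  # a real agent would catch the driver's timeout error
            errors.append(f"{selector}: {exc}")
            # Only spend another LLM call if an attempt remains.
            if attempt + 1 < max_attempts:
                selector = ask_llm(errors)
    raise RuntimeError("click failed after retries: " + "; ".join(errors))
```

Passing the full error list to the model gives it the selectors that already failed, so it is less likely to suggest them again.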

### Vision (multimodal)
```ts
// A multimodal model (GPT-4V / Claude / Gemini) looks at the screenshot.
const ss = await page.screenshot();
const r = await llm.complete({
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: ss.toString('base64') } },
      { type: 'text', text: 'Find the login button. Output coordinate.' },
    ]},
  ],
});
```

→ Vision input raises cost significantly.

### Cost
```
1 task ≈ 10-100 LLM calls.
Each call = $0.01 - $0.10 (more with vision).

Task = $0.10 - $10.

→ Viable for e-commerce automation, but every click costs money.
```
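
The range above is just the product of the two ranges. A small Python sketch of the arithmetic, using the figures from this note as defaults; `task_cost_range` is a hypothetical helper for plugging in your own numbers.

```python
def task_cost_range(calls=(10, 100), cost_per_call=(0.01, 0.10)):
    """Return (low, high) task cost in dollars from call-count and per-call ranges."""
    low = calls[0] * cost_per_call[0]
    high = calls[1] * cost_per_call[1]
    return round(low, 2), round(high, 2)


low, high = task_cost_range()
print(f"${low:.2f} - ${high:.2f} per task")
```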

### Speed
```
LLM call: 1-5 sec.
1 task = 30 sec - 5 min.

→ Not faster than a human, but it runs 24/7 and in parallel.
```

### Use cases
```
- Next-generation web scraping (auth + dynamic UI)
- E2E test authoring (the LLM generates tests)
- QA bot ("Is feature X broken?")
- Form submission automation
- Personal assistant (book tickets)
- Research agent (visit 5 sites, summarize)
```

### The idea behind browser-use
```
- The DOM tree is the input
- Elements are numbered
- LLM: "click 5"
- Browser: which element has id 5? → execute

→ Solves coordinate brittleness.
```

### Sandbox
```ts
// Untrusted input → keep the browser sandboxed.
// Do NOT pass '--no-sandbox' / '--disable-setuid-sandbox': those flags turn Chromium's sandbox off.
const browser = await chromium.launch({
  chromiumSandbox: true, // Playwright launches Chromium without its sandbox by default
});
```

→ For stronger isolation, run the browser inside a container or VM.

### Persistence
```ts
// Save once after logging in: await context.storageState({ path: 'auth.json' });
const context = await browser.newContext({
  storageState: 'auth.json', // reuse previously saved cookies and localStorage
});
const page = await context.newPage();
// → stays logged in
```

→ No need to log in for every task.
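
Saved state goes stale, so it helps to check it before reuse. A minimal sketch in Python, assuming the JSON layout that `storageState` writes: a `cookies` list whose entries carry `expires` as Unix seconds, with `-1` for session cookies. `expired_cookies` is a hypothetical helper.

```python
import time


def expired_cookies(state: dict, now=None) -> list[str]:
    """Names of cookies in a storage-state dict whose expiry time has passed."""
    now = time.time() if now is None else now
    return [
        c["name"]
        for c in state.get("cookies", [])
        # expires == -1 marks a session cookie with no fixed expiry.
        if c.get("expires", -1) != -1 and c["expires"] <= now
    ]
```

If the session cookie your site relies on shows up in the result, log in again and re-save the state instead of starting the task.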

### Captcha pitfalls
```
- Automation triggers bot detection.
- LLMs generally cannot solve captchas.
- Possible ToS violation (scraping).

→ Offer a manual-intervention option for the user,
  or use a captcha-solving service ($).
```

### Anti-detection
```
- Random delay
- Real user-agent
- Fingerprint randomization
- Residential proxy

→ These lean toward ToS violations. Use only for legitimate use cases.
```

### Eval
```python
# Task suite (WebArena, VisualWebArena)
# load_dataset, agent, and check are placeholders for your own harness.
tasks = load_dataset('webarena')
success = 0
for t in tasks:
    result = agent.run(t)
    if check(result, t.expected):
        success += 1
print(f'Success rate: {success / len(tasks):.1%}')
```

→ 2026 SoTA: roughly 60-80% success on standard tasks (a rough figure; check current leaderboards).

### Limitations
```
- Captcha
- Highly dynamic SPAs (state)
- Long tasks (10+ steps)
- Privacy / login
- Cost (many LLM calls)
- Inaccuracy (hallucination)
```

### Observability
```ts
// Action log, written inside the agent loop (`log` is a placeholder for your logger)
log({
  step: i,
  action: action,
  screenshot: ss,
  url: page.url(),
});

// Replay later
```

→ Debug-friendly.
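
A minimal sketch of one possible log format, in Python: one JSON line per step, with screenshots referenced by file path rather than embedded. The record fields follow the log call above; `log_step` and `load_steps` are hypothetical helpers.

```python
import json


def log_step(path, step, action, url, screenshot_path=None):
    """Append one step as a JSON line; the screenshot is stored as a path only."""
    record = {"step": step, "action": action, "url": url, "screenshot": screenshot_path}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_steps(path):
    """Read the log back in order for replay or debugging."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only file survives a crash mid-run, so the steps leading up to a failure are still available.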

### Real production
- **Devin** (Cognition): a coding agent that also uses a browser.
- **Anthropic Computer Use**: native API.
- **OpenAI Operator** (2025): browser agent product.
- **Adept ACT-1**: web actions.

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Simple scrape | Playwright (no LLM) |
| Auth + dynamic | Browser agent |
| QA / E2E | Generate tests + run |
| Research | browser-use library |
| Production | Computer Use API |
| Cost-sensitive | Selector + tool (no vision) |
| High difficulty | Vision + multi-step |

## ❌ Anti-patterns
- **Coordinates only (no element ID)**: brittle.
- **No max iteration**: infinite loop.
- **Fresh login every time**: cost / detection.
- **Assuming no captcha**: breaks in production.
- **No log**: impossible to debug.
- **Ignoring ToS**: legal risk.
- **Vision for every task**: cost.

## 🤖 LLM usage hints
- Anthropic Computer Use is a native tool API.
- browser-use is a production-oriented framework.
- Element ID > coordinate.
- Accessibility tree > screenshot (cost).

## 🔗 Related documents
- [[AI_Multi_Agent_Coordination]]
- [[AI_Tool_Composition_Deep]]
- [[Testing_Playwright_Advanced]]