| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-browser-agent-patterns | Browser Agent — Playwright / Puppeteer / browser-use | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 |
Browser Agent
An LLM drives the browser: click, type, scroll. Examples: Anthropic Computer Use, browser-use, Playwright + LLM. This is the modern form of web automation.
📖 Core concepts
- The input is a screenshot or the accessibility tree.
- The LLM decides the action (click x,y / type / scroll).
- Loop until the task is done.
- Trade-off between reliability, cost, and speed.
💻 Code patterns
Playwright + LLM (simple)
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
// Screenshot → LLM
// `llm` is a placeholder for your model client; it is assumed to return parsed JSON.
const screenshot = await page.screenshot();
const action = await llm.complete({
  system: 'You are a browser agent. Output JSON: {action: click|type|scroll, ...}',
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshot.toString('base64') } },
      { type: 'text', text: 'Search for "hello world"' },
    ]},
  ],
});
// Execute the chosen action
if (action.action === 'click') await page.mouse.click(action.x, action.y);
if (action.action === 'type') await page.keyboard.type(action.text);
// `deltaY` is an assumed field name for the scroll distance
if (action.action === 'scroll') await page.mouse.wheel(0, action.deltaY ?? 500);
await browser.close();
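The snippet above trusts whatever the model returns. A minimal sketch of validating the raw output before executing it, assuming the model replies with a JSON string; the `Action` type and its field names (`x`, `y`, `text`, `deltaY`) are hypothetical and only mirror the prompt above:

```typescript
// Hypothetical action schema; field names are assumptions, not a library API.
type Action =
  | { action: 'click'; x: number; y: number }
  | { action: 'type'; text: string }
  | { action: 'scroll'; deltaY: number };

// Parse raw model output into an Action, or return null if it is malformed.
function parseAction(raw: string): Action | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;
  const d = data as Record<string, unknown>;
  if (d.action === 'click' && typeof d.x === 'number' && typeof d.y === 'number') {
    return { action: 'click', x: d.x, y: d.y };
  }
  if (d.action === 'type' && typeof d.text === 'string') {
    return { action: 'type', text: d.text };
  }
  if (d.action === 'scroll' && typeof d.deltaY === 'number') {
    return { action: 'scroll', deltaY: d.deltaY };
  }
  return null; // unknown action or missing fields
}
```

A `null` result is a natural point to re-prompt the model instead of sending a bad click to the page.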
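Anthropic Computer Use
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
// Computer use is a beta tool: the tool `type` and the `betas` flag are versioned
// and must match the model, so check the current docs before reusing these values.
const r = await client.beta.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  betas: ['computer-use-2024-10-22'],
  tools: [{
    type: 'computer_20241022',
    name: 'computer',
    display_width_px: 1024,
    display_height_px: 768,
  }],
  messages: [{
    role: 'user',
    content: [
      { type: 'image', source: { ... } },
      { type: 'text', text: 'Find the login button and click it' },
    ],
  }],
});
// r.content contains tool_use blocks → execute them
for (const c of r.content) {
  if (c.type === 'tool_use' && c.name === 'computer') {
    const { action, coordinate } = c.input;
    if (action === 'left_click') await page.mouse.click(...coordinate);
    // ...
  }
}
→ Claude ships computer use as a native tool.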
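browser-use (Python framework)
import asyncio
from browser_use import Agent
# Newer browser-use releases ship their own chat model classes; check the version you install.
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on Jun 1',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    result = await agent.run()
    print(result)

asyncio.run(main())
→ The library handles the loop, accessibility extraction, and stability.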
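Accessibility tree (DOM-based)
// Note: page.accessibility is deprecated in recent Playwright releases;
// locator.ariaSnapshot() is the newer alternative.
const snapshot = await page.accessibility.snapshot();
// { name: 'Page', children: [{ role: 'button', name: 'Login' }, ...] }
// Pass the accessibility tree to the LLM
const action = await llm.complete({
  prompt: `Tree: ${JSON.stringify(snapshot)}\nTask: ...`,
});
→ More precise than a screenshot, and no vision model is needed.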
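`JSON.stringify` on a full snapshot spends many tokens on braces and quotes. A minimal sketch of flattening the tree into indented text first; the `AxNode` shape is an assumption based on the snapshot example above, and real snapshots carry more fields:

```typescript
// Assumed node shape; not the full Playwright snapshot type.
interface AxNode {
  role: string;
  name?: string;
  children?: AxNode[];
}

// Render the tree as one indented line per node, e.g. `  button "Login"`.
function flattenAxTree(node: AxNode, depth = 0): string[] {
  const label = node.name ? `${node.role} "${node.name}"` : node.role;
  const lines = [`${'  '.repeat(depth)}${label}`];
  for (const child of node.children ?? []) {
    lines.push(...flattenAxTree(child, depth + 1));
  }
  return lines;
}
```

The joined lines (`flattenAxTree(snapshot).join('\n')`) can replace the JSON dump in the prompt.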
Element ID assignment
// Add an ID to every element → the LLM clicks by ID.
await page.evaluate(() => {
  document.querySelectorAll('button, a, input').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i));
  });
});
// Screenshot with the labels visible
// LLM: "click element with id 5"
await page.click('[data-agent-id="5"]');
→ Coordinates are brittle (resize). IDs are stable.
Selector strategy
// The LLM generates a CSS selector
const action = await llm.complete({
  prompt: `Click the "Subscribe" button. Output: {selector}`,
});
await page.click(action.selector);
// ❌ "button:nth-child(3)" — brittle
// ✅ "button:has-text('Subscribe')" — semantic
→ Playwright's semantic selectors are robust.
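One way to enforce the brittle vs. semantic distinction is to screen model-generated selectors before clicking. This is a heuristic sketch only: `isBrittleSelector` is a hypothetical helper, and the "generated class name" pattern is an assumption about how CSS-in-JS tools name classes:

```typescript
// Heuristic: flag selectors that depend on position or on generated class names.
function isBrittleSelector(selector: string): boolean {
  const positional = /:nth-(child|of-type)\(|:first-child|:last-child/;
  const hashedClass = /\.[a-z]+-[0-9a-f]{5,}/i; // e.g. .css-1a2b3c (assumed pattern)
  return positional.test(selector) || hashedClass.test(selector);
}
```

When the check returns `true`, the agent can ask the model for a text- or role-based selector instead of executing the click.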
Loop until task done
for (let i = 0; i < 50; i++) {
  const screenshot = await page.screenshot();
  const action = await llm.complete({ ... });
  if (action.action === 'done') break; // same `action` field as the other examples
  await execute(action);
  await page.waitForLoadState('networkidle');
}
→ Cap the number of iterations to prevent an infinite loop.
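The loop is easier to test when the LLM call and the browser call are injected. A minimal sketch under that assumption; `LoopDeps`, `Step`, and `runLoop` are hypothetical names, and `decide` / `execute` stand in for the model client and Playwright:

```typescript
type Step = { action: 'done' } | { action: 'click'; target: string };

interface LoopDeps {
  decide: (history: Step[]) => Promise<Step>; // stands in for the LLM call
  execute: (step: Step) => Promise<void>;     // stands in for the Playwright call
}

// Run decide/execute until the model says done or the step budget runs out.
async function runLoop(
  deps: LoopDeps,
  maxSteps: number,
): Promise<{ done: boolean; steps: Step[] }> {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = await deps.decide(steps);
    if (step.action === 'done') return { done: true, steps };
    await deps.execute(step);
    steps.push(step);
  }
  // Budget exhausted: report failure instead of looping on.
  return { done: false, steps };
}
```

Returning `done: false` lets the caller tell a finished task apart from one that merely hit the cap.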
Form filling
// Extract the form fields for the LLM
const fields = await page.evaluate(() =>
  [...document.querySelectorAll('input, select, textarea')].map(el => ({
    // Build a usable selector; outerHTML is markup, not a selector
    selector: el.id ? `#${el.id}` : `[name="${el.name}"]`,
    type: el.type,
  }))
);
// `fills` is assumed to be a parsed array of { selector, value }
const fills = await llm.complete({
  prompt: `Fill form for "Alice, alice@x.com": ${JSON.stringify(fields)}`,
});
for (const fill of fills) {
  await page.fill(fill.selector, fill.value);
}
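Since the model can hallucinate fields, it may help to drop any fill whose selector was never offered to it. A minimal sketch, assuming the model output has already been parsed from JSON; `Fill` and `validateFills` are hypothetical names:

```typescript
interface Fill {
  selector: string;
  value: string;
}

// Keep only fills whose selector was actually in the extracted field list.
function validateFills(raw: unknown, knownSelectors: string[]): Fill[] {
  if (!Array.isArray(raw)) return [];
  const known = new Set(knownSelectors);
  const fills: Fill[] = [];
  for (const item of raw) {
    if (typeof item !== 'object' || item === null) continue;
    const { selector, value } = item as Record<string, unknown>;
    if (typeof selector === 'string' && typeof value === 'string' && known.has(selector)) {
      fills.push({ selector, value });
    }
  }
  return fills;
}
```

Only the validated entries would then be passed to `page.fill`.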
Multi-step task
"Order pizza":
1. Open URL
2. Click "Sign in"
3. Type email + password
4. Navigate to menu
5. Add pizza to cart
6. Checkout
7. Confirm
→ Every step is one LLM call.
Error handling
try {
  await page.click(selector, { timeout: 5000 });
} catch (e) {
  // The element is missing or not visible
  const screenshot = await page.screenshot();
  const action = await llm.complete({
    prompt: `Click failed: ${e.message}. Current screen: [image]. What to do?`,
  });
  // → Retry / scroll / different selector
}
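Not every failure needs another LLM call. A heuristic sketch that maps a failed click to a cheap recovery step first; `chooseRecovery` and the strategy names are hypothetical, and the message patterns are assumptions about typical Playwright error text:

```typescript
type Recovery = 'scroll-and-retry' | 'narrow-selector' | 'ask-llm' | 'give-up';

// Pick a recovery step from the error, giving up once the attempt budget is spent.
function chooseRecovery(
  err: { name?: string; message: string },
  attempt: number,
  maxAttempts = 3,
): Recovery {
  if (attempt >= maxAttempts) return 'give-up';
  // Selector matched several elements: tighten it rather than retrying as-is.
  if (/strict mode violation/i.test(err.message)) return 'narrow-selector';
  // Element never became actionable: it may be off-screen or still loading.
  if (err.name === 'TimeoutError' || /timeout/i.test(err.message)) return 'scroll-and-retry';
  return 'ask-llm';
}
```

The `ask-llm` branch corresponds to the re-prompt shown in the catch block above.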
Vision (multimodal)
// GPT-4V / Claude / Gemini look at the screenshot.
// `ss` is a screenshot Buffer from page.screenshot().
const r = await llm.complete({
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', data: ss.toString('base64') } },
      { type: 'text', text: 'Find the login button. Output coordinate.' },
    ]},
  ],
});
→ Vision input raises cost significantly.
Cost
1 task ≈ 10-100 LLM calls.
Each call = $0.01 - $0.10 (more with vision).
Task = $0.10 - $10.
→ E-commerce automation is feasible, but every click costs money.
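The per-task range follows from multiplying the two ranges above. A small sketch of that arithmetic; `estimateTaskCost` is a hypothetical helper and the monthly volume of 1,000 tasks is an assumed figure for illustration:

```typescript
interface Range {
  min: number;
  max: number;
}

// Multiply the per-task call count by the per-call price to get a cost range in USD.
function estimateTaskCost(calls: Range, usdPerCall: Range): Range {
  return { min: calls.min * usdPerCall.min, max: calls.max * usdPerCall.max };
}

// 10-100 calls at $0.01-$0.10 each gives roughly $0.10-$10 per task.
const perTask = estimateTaskCost({ min: 10, max: 100 }, { min: 0.01, max: 0.1 });

// Assumed volume: 1,000 tasks per month gives roughly $100-$10,000.
const monthly: Range = { min: perTask.min * 1000, max: perTask.max * 1000 };
```

The two orders of magnitude between the bounds are why the call count per task is worth measuring before scaling up.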
Speed
Each LLM call takes 1-5 sec.
1 task = 30 sec - 5 min.
→ Not faster than a human, but it runs 24/7 and in parallel.
Use case
- A new kind of web scraping (auth + dynamic UI)
- E2E test authoring (the LLM generates the tests)
- QA bot ("X feature broken?")
- Form submission automation
- Personal assistant (book ticket)
- Research agent (visit 5 site, summarize)
The idea behind browser-use
- The DOM tree is the input
- Elements are numbered
- LLM: "click 5"
- Browser: look up which element has id 5 → execute
→ Solves coordinate brittleness.
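The numbering idea can be sketched without any library. This is an illustration of the pattern, not browser-use's actual implementation; `ElementInfo`, `buildElementPrompt`, and `resolveClick` are hypothetical names, and the `data-agent-id` attribute is assumed to have been set on the page beforehand:

```typescript
interface ElementInfo {
  tag: string;
  text: string;
}

// Build the numbered list shown to the model, one line per element.
function buildElementPrompt(elements: ElementInfo[]): string {
  return elements.map((el, i) => `[${i}] <${el.tag}> ${el.text}`).join('\n');
}

// Turn a reply such as "click 5" into a selector, or null if the index is invalid.
function resolveClick(reply: string, elementCount: number): string | null {
  const match = /^click\s+(\d+)$/i.exec(reply.trim());
  if (!match) return null;
  const id = Number(match[1]);
  return id < elementCount ? `[data-agent-id="${id}"]` : null;
}
```

Rejecting out-of-range indices catches the case where the model invents an element that is not on the page.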
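Sandbox
// Untrusted user input → sandboxed browser
// Do not pass '--no-sandbox' or '--disable-setuid-sandbox' here: those flags turn Chromium's sandbox off.
const browser = await chromium.launch({
  chromiumSandbox: true, // keep Chromium's own sandbox enabled
});
→ For stronger isolation, run the browser inside a container or VM.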
Persistence
const context = await browser.newContext({
  storageState: 'auth.json', // reuse previously saved cookies
});
const page = await context.newPage();
// → stays logged in
→ No login on every task.
Captcha pitfalls
- Automation triggers bot detection.
- The LLM cannot solve captchas.
- Possible ToS violation (scraping).
→ Give the user an option to intervene manually,
or use a paid captcha-solving service ($).
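The manual-intervention option needs some trigger. A heuristic sketch that checks page text for challenge wording and signals a handoff; `needsHumanHandoff` is a hypothetical helper and the marker list is an assumption, since real challenge pages vary widely:

```typescript
// Heuristic markers; treat this list as an assumption, not a complete detector.
const CAPTCHA_MARKERS = ['captcha', 'verify you are human', 'unusual traffic'];

// Return true when the page looks like a bot challenge and a person should take over.
function needsHumanHandoff(pageText: string): boolean {
  const text = pageText.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => text.includes(marker));
}
```

When it returns `true`, the agent loop can pause and notify the user rather than spend more LLM calls on a page it cannot pass.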
Anti-detection
- Random delay
- Real user-agent
- Fingerprint randomize
- Residential proxy
→ This leans toward ToS violation. Use only for legitimate use cases.
Eval
# Task suite (WebArena, VisualWebArena)
# load_dataset, agent, and check are placeholders for your own harness.
tasks = load_dataset('webarena')
success = 0
for t in tasks:
    result = agent.run(t)
    if check(result, t.expected):
        success += 1
print(f'Success rate: {success / len(tasks):.1%}')
→ 2026 SoTA: 60-80% on standard tasks.
Limitations
- Captcha
- Highly dynamic SPAs (state)
- Long tasks (10+ steps)
- Privacy / login
- Cost (more LLM calls)
- Inaccuracy (hallucination)
Observability
// Action log (`log` is a placeholder; `ss` is the screenshot for this step)
log({
  step: i,
  action: action,
  screenshot: ss,
  url: page.url(),
});
// Replay later
→ Debug-friendly.
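A minimal sketch of a log that can actually be replayed, assuming screenshots are written to disk and only their paths are logged; `ActionLog` and `LogEntry` are hypothetical names:

```typescript
interface LogEntry {
  step: number;
  action: string;
  url: string;
  screenshotPath?: string; // path on disk, so the log itself stays small
}

class ActionLog {
  private entries: LogEntry[] = [];

  record(entry: LogEntry): void {
    this.entries.push(entry);
  }

  // One JSON object per line, so the log can be appended to a file and read back.
  toJsonl(): string {
    return this.entries.map((e) => JSON.stringify(e)).join('\n');
  }

  // Parse a JSONL string back into entries, skipping blank lines.
  static fromJsonl(text: string): LogEntry[] {
    return text
      .split('\n')
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line) as LogEntry);
  }
}
```

Replaying then means iterating over `ActionLog.fromJsonl(...)` and re-executing each action against a fresh page.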
Real production
- Devin (Cognition): a coding agent that also drives a browser.
- Anthropic Computer Use: native API.
- OpenAI Operator (2025): browser agent product.
- Adept ACT-1: web action.
🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Simple scrape | Playwright (no LLM) |
| Auth + dynamic | Browser agent |
| QA / E2E | Generate tests + run |
| Research | Browser-use library |
| Production | Computer Use API |
| Cost-sensitive | Selector + tool (no vision) |
| High difficulty | Vision + multi-step |
❌ Anti-patterns
- Coordinates only (no element ID): brittle.
- No max iteration: infinite loop.
- Fresh login on every task: cost / detection.
- Assuming there is no captcha: breaks in production.
- No log: impossible to debug.
- Ignoring ToS: legal risk.
- Vision for every task: cost.
🤖 LLM usage hints
- Anthropic Computer Use is native.
- Browser-use is a production framework.
- Element ID > coordinate.
- Accessibility tree > screenshot (cost).