---
id: ai-browser-agent-patterns
title: Browser Agent - Playwright / Puppeteer / browser-use
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, agent, browser, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [browser agent, web agent, Playwright agent, browser-use, Computer Use, accessibility tree]
---

# Browser Agent

> An LLM drives the browser: click, type, scroll. **Anthropic Computer Use, browser-use, Playwright + LLM**. The modern approach to web automation.

## 📖 Core concepts

- The input is a screenshot or an accessibility tree.
- The LLM decides the next action (click x,y / type / scroll).
- Loop until the task is done.
- Reliability, cost, and speed trade off against each other.

## 💻 Code patterns

### Playwright + LLM (simple)

```ts
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Screenshot → LLM.
// `llm` is a placeholder for whichever model client you use.
const screenshot = await page.screenshot();
const action = await llm.complete({
  system: 'You are a browser agent. Output JSON: {action: click|type|scroll, ...}',
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', data: screenshot.toString('base64') } },
      { type: 'text', text: 'Search for "hello world"' },
    ]},
  ],
});

// Execute the chosen action
if (action.action === 'click') await page.mouse.click(action.x, action.y);
if (action.action === 'type') await page.keyboard.type(action.text);
```

### Anthropic Computer Use

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Computer use is a beta tool: the tool `type` and the `betas` flag are versioned
// together, so confirm the pair that matches your model in the Anthropic docs.
const r = await client.beta.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  betas: ['computer-use-2024-10-22'],
  tools: [{
    type: 'computer_20241022',
    name: 'computer',
    display_width_px: 1024,
    display_height_px: 768,
  }],
  messages: [{
    role: 'user',
    content: [
      { type: 'image', source: { ...
        } },
      { type: 'text', text: 'Find the login button and click it' },
    ],
  }],
});

// r.content contains tool_use blocks → execute them
// (`page` is the Playwright page from the previous example)
for (const c of r.content) {
  if (c.type === 'tool_use' && c.name === 'computer') {
    const { action, coordinate } = c.input as { action: string; coordinate?: [number, number] };
    if (action === 'left_click' && coordinate) await page.mouse.click(...coordinate);
    // ...
  }
}
```

→ Claude exposes computer use as a native tool.

### browser-use (Python framework)

```python
import asyncio

from browser_use import Agent
# Depending on the browser-use version, the chat model may instead come from browser_use itself.
from langchain_openai import ChatOpenAI


async def main():
    agent = Agent(
        task='Find the cheapest flight from Seoul to Tokyo on Jun 1',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    return await agent.run()


result = asyncio.run(main())
```

→ The library handles the loop, accessibility extraction, and stability.

### Accessibility tree (DOM-based)

```ts
// Note: page.accessibility is deprecated in Playwright; newer versions offer
// locator.ariaSnapshot(), e.g. page.locator('body').ariaSnapshot().
const snapshot = await page.accessibility.snapshot();
// { name: 'Page', children: [{ role: 'button', name: 'Login' }, ...] }

// Pass the accessibility tree to the LLM
const action = await llm.complete({
  prompt: `Tree: ${JSON.stringify(snapshot)}\nTask: ...`,
});
```

→ More accurate than a screenshot, and no vision model is needed.

### Element ID assignment

```ts
// Tag every interactive element with an ID → the LLM clicks by ID.
await page.evaluate(() => {
  document.querySelectorAll('button, a, input').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i)); // setAttribute expects a string
  });
});

// The screenshot sent to the LLM should show these IDs as visible labels
// LLM: "click element with id 5"
await page.click('[data-agent-id="5"]');
```

→ Coordinates are brittle (they break on resize). IDs are stable.

### Selector strategy

```ts
// The LLM generates a CSS selector
const action = await llm.complete({
  prompt: `Click the "Subscribe" button. Output: {selector}`,
});
await page.click(action.selector);

// ❌ "button:nth-child(3)": brittle
// ✅ "button:has-text('Subscribe')": semantic
```

→ Playwright's semantic selectors are more robust.

### Loop until task done

```ts
for (let i = 0; i < 50; i++) {
  const screenshot = await page.screenshot();
  const action = await llm.complete({ ... });
  // Same `action` field as in the first example
  if (action.action === 'done') break;
  await execute(action);
  await page.waitForLoadState('networkidle');
}
```

→ Cap the number of iterations to prevent an infinite loop.
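
The loop above hands whatever the model returns straight to `execute`. A minimal sketch of validating that output first, assuming the JSON action shape used in the prompts in this note. The `AgentAction` type, the `parseAction` helper, and the `dy` scroll field are hypothetical names introduced for illustration, not part of any library.

```ts
// Hypothetical action schema mirroring the {action: click|type|scroll, ...} prompt format.
type AgentAction =
  | { action: 'click'; x: number; y: number }
  | { action: 'type'; text: string }
  | { action: 'scroll'; dy: number } // `dy` is an assumed field name
  | { action: 'done' };

// Parse raw model output and reject anything that is not a well-formed action.
function parseAction(raw: string): AgentAction {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error(`Model output is not valid JSON: ${raw.slice(0, 80)}`);
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('Model output must be a JSON object');
  }
  const obj = data as Record<string, unknown>;
  switch (obj.action) {
    case 'click':
      if (typeof obj.x === 'number' && typeof obj.y === 'number') {
        return { action: 'click', x: obj.x, y: obj.y };
      }
      break;
    case 'type':
      if (typeof obj.text === 'string') return { action: 'type', text: obj.text };
      break;
    case 'scroll':
      if (typeof obj.dy === 'number') return { action: 'scroll', dy: obj.dy };
      break;
    case 'done':
      return { action: 'done' };
  }
  // Unknown action name, or a known action with missing / mistyped fields.
  throw new Error(`Unsupported or malformed action: ${JSON.stringify(obj)}`);
}
```

Inside the loop, the raw model reply would pass through `parseAction` before `execute`, so a malformed reply can be turned into a retry instead of a stray click.
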

### Form filling

```ts
// Collect the form fields for the LLM
const fields = await page.evaluate(() =>
  [...document.querySelectorAll('input, select, textarea')].map(el => ({
    // Build a usable CSS selector; outerHTML is not a valid selector for page.fill()
    selector: el.id ? `#${el.id}` : `[name="${el.getAttribute('name')}"]`,
    type: el.getAttribute('type'),
    html: el.outerHTML, // context for the LLM only
  }))
);

// Assumes the LLM returns an array of { selector, value }
const fills = await llm.complete({
  prompt: `Fill form for "Alice, alice@x.com": ${JSON.stringify(fields)}`,
});
for (const fill of fills) {
  await page.fill(fill.selector, fill.value);
}
```

### Multi-step task

```
"Order pizza":
1. Open URL
2. Click "Sign in"
3. Type email + password
4. Navigate to menu
5. Add pizza to cart
6. Checkout
7. Confirm
→ Every step is an LLM call.
```

### Error handling

```ts
try {
  await page.click(selector, { timeout: 5000 });
} catch (e) {
  // The element is missing or not visible
  const screenshot = await page.screenshot();
  const action = await llm.complete({
    prompt: `Click failed: ${(e as Error).message}. Current screen: [image]. What to do?`,
  });
  // → Retry / scroll / try a different selector
}
```

### Vision (multimodal)

```ts
// GPT-4V / Claude / Gemini look at the screenshot.
const r = await llm.complete({
  messages: [
    { role: 'user', content: [
      { type: 'image', source: { type: 'base64', data: ss.toString('base64') } },
      { type: 'text', text: 'Find the login button. Output coordinate.' },
    ]},
  ],
});
```

→ Vision raises cost significantly.

### Cost

```
1 task ≈ 10-100 LLM calls.
Each call = $0.01 - $0.10 (more with vision).
Task = $0.10 - $10.
→ E-commerce automation is feasible, but every click costs money.
```

### Speed

```
LLM call: 1-5 sec.
1 task = 30 sec - 5 min.
→ Not faster than a human. The advantage is 24/7 operation and parallelism.
```

### Use case

```
- A new kind of web scraping (auth + dynamic UI)
- E2E test authoring (the LLM generates tests)
- QA bot ("Is feature X broken?")
- Form submission automation
- Personal assistant (book tickets)
- Research agent (visit 5 sites, summarize)
```

### The browser-use idea

```
- The DOM tree is the input
- Elements are numbered
- LLM: "click 5"
- Browser: which element is id 5? → execute
→ Solves coordinate brittleness.
```

### Sandbox

```ts
// Untrusted user input → run the browser sandboxed.
// Do NOT pass '--no-sandbox' / '--disable-setuid-sandbox' here:
// those flags turn Chromium's sandbox off.
const browser = await chromium.launch({
  chromiumSandbox: true,
});
```

→ Running inside a container / VM is the safe option.

### Persistence

```ts
const context = await browser.newContext({
  storageState: 'auth.json', // reuse previously saved cookies
});
const page = await context.newPage();
// → stays logged in
```

→ No login on every task.

### Captcha pitfalls

```
- Automation triggers bot detection.
- The LLM cannot solve captchas.
- Possible ToS violation (scraping).
→ Offer a manual-intervention option for the user, or use a captcha-solving service ($).
```

### Anti-detection

```
- Random delay
- Real user-agent
- Fingerprint randomization
- Residential proxy
→ This leans toward ToS violation. Use only for legitimate use cases.
```

### Eval

```python
# Task suite (WebArena, VisualWebArena).
# load_dataset, agent, and check are placeholders for your own harness.
tasks = load_dataset('webarena')
success = 0
for t in tasks:
    result = agent.run(t)
    if check(result, t.expected):
        success += 1
print(f'Success rate: {success / len(tasks):.1%}')
```

→ 2026 SoTA: 60-80% on standard tasks.

### Limitations

```
- Captcha
- Highly dynamic SPAs (state)
- Long tasks (10+ steps)
- Privacy / login
- Cost (more LLM calls)
- Inaccuracy (hallucination)
```

### Observability

```ts
// Action log
log({
  step: i,
  action: action,
  screenshot: ss,
  url: page.url(),
});
// Replay later
```

→ Debug-friendly.

### Real production

- **Devin** (Cognition): a code agent that also uses a browser.
- **Anthropic Computer Use**: native API.
- **OpenAI Operator** (2025): browser agent product.
- **Adept ACT-1**: web actions.

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Simple scrape | Playwright (no LLM) |
| Auth + dynamic | Browser agent |
| QA / E2E | Generate tests + run |
| Research | Browser-use library |
| Production | Computer Use API |
| Cost-sensitive | Selector + tool (no vision) |
| High difficulty | Vision + multi-step |

## ❌ Anti-patterns

- **Coordinates only (no element ID)**: brittle.
- **No max iteration**: infinite loop.
- **Fresh login every time**: cost / detection.
- **Assuming no captcha**: breaks in production.
- **No log**: impossible to debug.
- **Ignoring ToS**: legal risk.
- **Vision for every task**: cost.

## 🤖 LLM usage hints

- Anthropic Computer Use is native.
- Browser-use is a production framework.
- Element ID > coordinate.
- Accessibility tree > screenshot (cost).

## 🔗 Related documents

- [[AI_Multi_Agent_Coordination]]
- [[AI_Tool_Composition_Deep]]
- [[Testing_Playwright_Advanced]]