
We tested browser agents on 20 real websites - here's where they break
Browser agents (browser-use, Stagehand, Skyvern, Playwright-based tools) promise to automate web interactions. Log in to a site, search for a product, add to cart — all autonomously. But how reliable are they, actually? We measured it.

The Setup

We built a benchmark suite that tests whether agents correctly identify interactive elements on real production websites:

- 20 websites: GitHub, Amazon, Airbnb, Booking.com, eBay, LinkedIn, Stripe, Hacker News, Wikipedia, Google, Zalando, Shopify, Target, and more
- Ground truth: manually annotated endpoints per site — what a human would identify as login forms, search bars, checkout buttons, and navigation menus
- Metrics: precision, recall, and F1 score

We didn't test agent execution (clicking, typing). We tested something more fundamental: does the agent understand what's on the page before it acts?

The Results

Category | Failure Rate | What Goes Wrong
Login/Auth | ~30% miss rate | Agent
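The scoring described above can be sketched in a few lines. This is our reconstruction, not the benchmark's actual code: it treats the agent's reported elements and the human annotations as sets of selectors (the selector names below are hypothetical) and computes precision, recall, and F1 per site.

```python
def score_site(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Compare agent-identified elements against manually annotated ground truth."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical annotations for one site: what a human marked as the
# login form, search bar, checkout button, and nav menu.
ground_truth = {"#login-form", "#search-bar", "#nav-menu", "#checkout-btn"}
# The agent found two real elements and one false positive.
agent_output = {"#login-form", "#search-bar", "#hero-banner"}

print(score_site(agent_output, ground_truth))
```

An agent that misses a login form hurts recall; one that hallucinates interactive elements hurts precision, which is why the benchmark reports both alongside F1.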
Continue reading on Dev.to



