
We tested browser agents on 20 real websites - here's where they break
Browser agents (browser-use, Stagehand, Skyvern, Playwright-based tools) promise to automate web interactions. Log in to a site, search for a product, add to cart — all autonomously. But how reliable are they, actually? We measured it.

The Setup

We built a benchmark suite that tests whether agents correctly identify interactive elements on real production websites:

- 20 websites: GitHub, Amazon, Airbnb, Booking.com, eBay, LinkedIn, Stripe, Hacker News, Wikipedia, Google, Zalando, Shopify, Target, and more
- Ground truth: manually annotated endpoints per site — what a human would identify as login forms, search bars, checkout buttons, and navigation menus
- Metrics: precision, recall, and F1 score

We didn't test agent execution (clicking, typing). We tested something more fundamental: does the agent understand what's on the page before it acts?

The Results

Category | Failure Rate | What Goes Wrong
Login/Auth | ~30% miss rate | Agent
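The scoring described above can be sketched in a few lines. This is our reconstruction, not the benchmark's actual code: it treats the agent's reported elements and the human annotations as sets of selectors (the selector names below are hypothetical) and computes precision, recall, and F1 per site.

```python
def score_site(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Compare agent-identified elements against manually annotated ground truth."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical annotations for one site: what a human marked as the
# login form, search bar, checkout button, and nav menu.
ground_truth = {"#login-form", "#search-bar", "#nav-menu", "#checkout-btn"}
# The agent found two real elements and one false positive.
agent_output = {"#login-form", "#search-bar", "#hero-banner"}

print(score_site(agent_output, ground_truth))
```

An agent that misses a login form hurts recall; one that hallucinates interactive elements hurts precision, which is why the benchmark reports both alongside F1.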
Continue reading on Dev.to



