How I built a scraper that actually works on Cloudflare sites

via Dev.to Python · anybrowse · 3h ago

I was building a research agent. It needed to read news sites, pull earnings reports, scrape job listings. Three hours in, half my URLs were returning empty strings or Cloudflare challenge pages. Not errors. Just nothing useful.

That is when I realized the scraping ecosystem is mostly broken for anything that is not a static blog.

Why scraping keeps failing

There are three things killing most scrapers right now.

JavaScript rendering. A lot of sites ship an empty HTML shell and hydrate via React or Vue. Fetch the URL directly and you get a div with an id and nothing else.

Bot detection. Cloudflare, PerimeterX, DataDome -- they fingerprint your browser. Missing plugins, wrong screen resolution, suspiciously perfect mouse timing. A vanilla Playwright script fails all of these in about 30 seconds.

IP reputation. Datacenter IPs are flagged before your code even runs. AWS, Hetzner, DigitalOcean -- blocked by default on half the sites worth scraping.

You can fight each of these individually.
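
Because these failures come back as normal responses rather than errors, it helps to classify what a fetch actually returned before deciding how to fight it. The following is a minimal sketch, not the approach from the full article: the function names, the 200-character threshold, and the marker strings are all assumptions chosen for illustration, and the markers are rough heuristics rather than an official or complete list of Cloudflare signatures.

```python
import re

# Assumption: heuristic substrings often seen on Cloudflare interstitial pages.
# They are illustrative only and will miss some challenges.
CHALLENGE_MARKERS = (
    "<title>just a moment",
    "checking your browser before accessing",
    "cf-chl-",
)

# Strip <script>/<style> bodies first so inline JS does not count as content.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html: str) -> str:
    """Return a rough approximation of the text a reader would see."""
    without_scripts = SCRIPT_STYLE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", without_scripts)
    return " ".join(text.split())


def classify_response(html: str, min_text_chars: int = 200) -> str:
    """Label a fetched body as 'empty', 'challenge', 'js_shell', or 'ok'.

    min_text_chars is a hypothetical threshold; tune it per site.
    """
    if not html or not html.strip():
        return "empty"
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "challenge"
    if len(visible_text(html)) < min_text_chars:
        # Markup is present but almost no text: likely a shell hydrated by JS.
        return "js_shell"
    return "ok"
```

A result of "js_shell" suggests retrying with a real browser, while "challenge" points at fingerprinting or IP reputation, which a plain re-fetch will not fix.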
