
I scraped 800 products and got garbage data. Here's what fixed it
Scraped an e-commerce site last week for product prices. Got 823 rows back. Felt productive until I opened the CSV and saw stuff like "$19.99\n\n " and "Price: $24.99 (was $29.99)" in the same column. Zero consistency. Fun times.

The mess

Thought I could just grab .find('span', class_='price').text and call it done. Nope. The site had like 4 different price formats:

- Regular price: <span class="price">$19.99</span>
- Sale price: <span class="price"><strike>$29.99</strike> $19.99</span>
- Out of stock: <span class="price">Unavailable</span>
- Random whitespace everywhere: <span class="price">\n $19.99\n </span>

Plus some products had prices buried in JavaScript instead of HTML. Those came back as empty strings.

My first attempt:

    from bs4 import BeautifulSoup
    import requests

    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    prices = []
    for product in soup.find_all('div', class_=
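For reference, here is one way the four formats listed above could be normalized into a single numeric column. This is a rough sketch and not necessarily the fix the post goes on to describe, since that part is cut off in this excerpt. The function name `parse_price` and the regex patterns are made up for illustration. It uses only the standard library so it runs on its own; with BeautifulSoup you would pass it something like `str(tag)` for each price span. It assumes dollar prices and that struck-through or "(was ...)" amounts are old prices.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical patterns, assuming the formats shown in the post.
# Old prices wrapped in <strike>, <s>, or <del> should not count as the current price.
_STRUCK = re.compile(r"<(strike|s|del)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
# Any remaining HTML tag.
_TAG = re.compile(r"<[^>]+>")
# Parenthesised notes such as "(was $29.99)".
_WAS_NOTE = re.compile(r"\(\s*was\b[^)]*\)", re.IGNORECASE)
# A dollar amount, with optional thousands separators and cents.
_AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{1,2})?)")


def parse_price(fragment: str) -> Optional[Decimal]:
    """Return the current price found in an HTML or text fragment, or None."""
    text = _STRUCK.sub(" ", fragment)   # drop struck-through old prices
    text = _TAG.sub(" ", text)          # drop remaining markup
    text = _WAS_NOTE.sub(" ", text)     # drop "(was $X)" notes
    match = _AMOUNT.search(text)
    if match is None:
        # "Unavailable", or an empty string from a JavaScript-rendered price
        return None
    return Decimal(match.group(1).replace(",", ""))


samples = [
    '<span class="price">$19.99</span>',
    '<span class="price"><strike>$29.99</strike> $19.99</span>',
    '<span class="price">Unavailable</span>',
    '<span class="price">\n $19.99\n </span>',
    "Price: $24.99 (was $29.99)",
    "",
]
for sample in samples:
    print(repr(sample), "->", parse_price(sample))
```

Returning `None` instead of an empty string keeps "no price" distinguishable from a real value in the CSV. `Decimal` avoids float rounding on currency. Prices rendered by JavaScript still come back as `None` here, because this only cleans what is already in the HTML.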




