Introduction
Web scraping is the process of automatically extracting data from websites. As the online storefront of the world's largest company by revenue, Walmart.com is a treasure trove of valuable data for everyone from individual bargain hunters to large enterprises.
In this guide, we'll walk through how to leverage Python to programmatically scrape product information from Walmart.com, including:
- Product names
- Prices
- Images
- Descriptions
Whether you want to monitor prices, get notified when high-demand items are back in stock, analyze a competitor, or gain market insights, web scraping empowers you to efficiently collect the Walmart data you need at scale.
While we'll be focusing on Walmart, the same techniques can be adapted for any major e-commerce site. So let's dive in and learn how to scrape some data!
Getting Started
We'll be using Python 3 for scraping Walmart.com. Our scraping toolkit will consist of:
- requests: for making HTTP requests to fetch web pages
- BeautifulSoup: for parsing HTML and XML documents
- json: for working with JSON data
You can install the two third-party packages with pip:
$ pip install requests beautifulsoup4
We'll also use Python's built-in json module, which doesn't require any installation.
With our environment set up, we're ready to fetch some pages from Walmart.com! But first, a word of caution…
A Note on Web Scraping Etiquette
While web scraping itself is not illegal, some websites explicitly prohibit it in their terms of service. Even if scraping is permitted, bombarding a site with a high volume of rapid-fire requests is considered abusive and can get your IP address banned.
As we proceed with scraping Walmart.com, we'll discuss best practices like inspecting robots.txt, setting a custom user agent string, throttling requests, and using proxies to stay in Walmart's good graces. Web scraping is a power that should be wielded responsibly.
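As a first step, Python's built-in urllib.robotparser module can tell you whether a path is disallowed before you ever fetch it. Here's a minimal sketch (the product path is just an illustration):
from urllib import robotparser

# Download and parse Walmart's robots.txt rules
rp = robotparser.RobotFileParser()
rp.set_url("https://www.walmart.com/robots.txt")
rp.read()

# Check whether a generic crawler may fetch a given path
print(rp.can_fetch("*", "https://www.walmart.com/ip/592161882"))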
Fetching Walmart.com Product Pages
To scrape product information from Walmart, the first step is to fetch the HTML of a product page. The URL structure follows this pattern:
https://www.walmart.com/ip/[product-name]/[product-id]
For example, here's a URL for an HP Chromebook:
https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882
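Note that the numeric ID at the end is what actually identifies the product. If you have a list of product IDs, you could build URLs with a small helper like this (this sketch assumes Walmart resolves /ip/<id> without the name slug, which is worth verifying yourself):
def product_url(product_id):
    # Assumption: Walmart redirects /ip/<id> to the canonical product page,
    # so the human-readable slug can be omitted -- verify before relying on it.
    return f"https://www.walmart.com/ip/{product_id}"

print(product_url(592161882))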
Let's use requests to fetch this page in Python:
import requests
url = "https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882"
response = requests.get(url)
print(response.text)
When you run this code, you'll probably see the HTML for an "Access Denied" page in your console. Walmart has sophisticated bot protection in place and recognizes that the request is coming from an automated script instead of a real browser.
To get around this, we need to make our scraper look more human by setting a User-Agent header that mimics a real web browser. You can find your browser's user agent string by Googling "what is my user agent".
Here's the same request with a Firefox user agent passed in the headers:
headers = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
}
response = requests.get(url, headers=headers)
With this change, you should now see the actual product page HTML printed to your console. We successfully tricked Walmart into thinking our request came from a browser!
Parsing Product Information with BeautifulSoup
Now that we have the raw HTML, we need to extract the relevant product details. We'll use BeautifulSoup to parse the HTML document and locate our data points.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
Let's start by finding the product name. If you inspect the page source in your browser, you'll see that the title is in an <h1> tag with an itemprop="name" attribute (among others):
<h1 itemprop="name">HP 11.6" Chromebook, AMD A4, 4GB RAM, 32GB Storage, Black 16W64UT#ABA</h1>
We can use BeautifulSoup's find() method to select this element:
product_name = soup.find("h1", itemprop="name").get_text().strip()
print(product_name)
For the price, we'll target this tag:
<span itemprop="price">$129</span>
Here's how to parse it:
product_price = soup.find("span", itemprop="price").get_text()
print(product_price)
Product images are trickier because there are thumbnails and multiple image sizes to choose from. To keep things simple, we'll just save the URL of the default main product image. It's located in this HTML:
<img loading="lazy" alt="HP 11.6" Chromebook, AMD A4, 4GB RAM, 32GB Storage, Black 16W64UT#ABA" class="max-w-full max-h-full p-auto pa3" src="https://i5.walmartimages.com/asr/ab3f8b97-355a-4304-928f-e0e4ab60def7.a74c83dffb4ee3d298e0625b4b72bf59.jpeg?odnHeight=612&odnWidth=612&odnBg=FFFFFF">
To extract the image source URL:
product_image = soup.find("img", class_="max-w-full max-h-full p-auto pa3")["src"]  # class_ with an underscore, because class is a reserved word in Python
print(product_image)
Finally, let's grab the product description. This is where things get a bit hairy. If you inspect the page, you'll notice the description isn't contained in the static HTML served by Walmart. Instead, it's rendered dynamically by JavaScript.
Luckily, we can still access this data. If you view the page source and search for the product description, you'll find it buried inside a big JSON object within a <script type="application/ld+json"> tag, where Walmart embeds its schema.org product metadata.
Rather than use BeautifulSoup, we can use regex to extract the JSON:
import re
import json
data = re.search(r'<script type="application/ld\+json"[^>]*>(.+?)</script',
                 response.text,
                 flags=re.MULTILINE | re.DOTALL)
json_data = json.loads(data.group(1))
product_description = json_data["description"]
print(product_description)
The product description is returned with HTML tags intact. If you want just the plain text, you can use BeautifulSoup again to parse and extract it:
description_soup = BeautifulSoup(product_description, "html.parser")
print(description_soup.get_text())
Putting it all together, here's the complete code for scraping a single Walmart product page:
import requests
from bs4 import BeautifulSoup
import re
import json
headers = {
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
}
url = "https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
product_name = soup.find("h1", attrs={"itemprop": "name"}).get_text().strip()
print(product_name)
product_price = soup.find("span", itemprop="price").get_text()
print(product_price)
product_image = soup.find("img", class_="max-w-full max-h-full p-auto pa3")["src"]
print(product_image)
data = re.search(r'<script type="application/ld\+json"[^>]*>(.+?)</script',
                 response.text,
                 flags=re.MULTILINE | re.DOTALL)
json_data = json.loads(data.group(1))
product_description = json_data["description"]
description_soup = BeautifulSoup(product_description, "html.parser")
print(description_soup.get_text())
Scaling Up Your Walmart.com Scraper
To monitor a large catalog of products or scrape multiple categories on Walmart.com, you'll need to scale up your scraper to automatically fetch and parse many URLs.
Some key considerations and techniques for taking your Walmart scraper to the next level:
Saving Scraped Walmart Data
Instead of just printing out the scraped data, you'll likely want to save the product details to a file or database. You can use Python's built-in csv module to create spreadsheets, or a MySQL or MongoDB client library to save to a database.
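For example, here's a minimal sketch that writes the fields we scraped above to a CSV file with csv.DictWriter (it assumes the product_name, product_price, product_image, and description_soup variables from the earlier script):
import csv

row = {
    "name": product_name,
    "price": product_price,
    "image": product_image,
    "description": description_soup.get_text(),
}

with open("walmart_products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "image", "description"])
    writer.writeheader()  # column headers
    writer.writerow(row)  # one row per scraped product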
Handling Pagination
Most Walmart.com category pages have a "Load More" button that lazy loads products via JS as you scroll instead of linking to pages like /page2, /page3, etc. To scrape every product in a category, you'll need to simulate clicking this button.
Libraries like Selenium can automate this by remotely controlling a real browser. Then you can extract the product URLs from each page and loop through them with the techniques we covered above.
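Here's a rough sketch with Selenium (the category URL and button selector below are placeholders — inspect the live page for the real markup, and note you'll need a browser driver such as geckodriver installed):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()  # requires geckodriver on your PATH
driver.get("https://www.walmart.com/browse/laptops")  # placeholder category URL

while True:
    try:
        # Placeholder selector -- check the real page for the button's markup
        button = driver.find_element(By.XPATH, "//button[contains(., 'Load more')]")
        button.click()
        time.sleep(3)  # give the lazy-loaded products time to render
    except Exception:
        break  # no "Load More" button left; we've reached the end

# Collect every product page link (/ip/ URLs) from the fully loaded page
product_urls = [a.get_attribute("href")
                for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/ip/']")]
driver.quit()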
Setting Rate Limits
Crawling a site as large as Walmart requires making a huge number of requests. Going too fast will likely get you banned. Use the time module to pause your scraper between requests. A good rule of thumb is to wait at least 10-15 seconds between each request and avoid scraping during peak traffic hours.
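In practice that can be as simple as a randomized sleep in your crawl loop. A sketch, reusing the product_urls list gathered in the pagination step:
import time
import random

for url in product_urls:
    response = requests.get(url, headers=headers)
    # ... parse the product page as shown earlier ...
    time.sleep(random.uniform(10, 15))  # wait 10-15 seconds between requests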
Rotating User Agents & IP Addresses
For large scraping jobs, you'll need to randomize your user agent and IP address for each request to avoid detection. You can create a pool of user agent strings and free proxy servers and have your scraper randomly select one for each request.
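A minimal sketch of that pattern, assuming url is the page you're about to fetch (the user agent strings are real browser strings, but the proxy addresses are placeholders you'd swap for your own pool):
import random

user_agents = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36",
]
proxy_pool = ["203.0.113.10:8080", "203.0.113.11:8080"]  # placeholder addresses

proxy = random.choice(proxy_pool)
response = requests.get(
    url,
    headers={"User-Agent": random.choice(user_agents)},
    proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
)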
Solving CAPTCHAs
At a certain scale, you'll likely start running into CAPTCHAs. Walmart has robust anti-bot measures in place. Services like 2Captcha or DeathByCaptcha can be used to programmatically solve them, but you'll want to weigh the ROI, as these services can get expensive at high volumes.
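Before paying to solve challenges, it's worth at least detecting them so your scraper can back off instead of parsing garbage. A crude heuristic sketch (the marker strings are assumptions — inspect a real blocked response to refine them):
def looks_like_captcha(response):
    # Crude heuristic: challenge pages tend to mention robots or CAPTCHAs.
    # The marker strings below are assumptions -- check a real blocked page.
    markers = ["captcha", "robot or human"]
    body = response.text.lower()
    return any(marker in body for marker in markers)

if looks_like_captcha(response):
    print("Blocked! Slow down, rotate proxies, or solve the challenge.")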
Using Scraping Services
For high volume scraping, you may want to consider a professional web scraping tool like Scrapy Cloud or a managed scraping service like ScrapingBee. These handle a lot of the complexity around IP rotation, CAPTCHAs, and parsing JavaScript heavy sites like Walmart.com out of the box.
Final Words
We covered a lot! You now have a solid foundation for scraping product data from Walmart.com using Python. You learned how to make requests, parse HTML with BeautifulSoup, pull embedded JSON out of the page, and extract key product attributes like name, price, image, and description.
To take your Walmart scraper to the next level, you'll want to explore the techniques for scraping at scale like handling pagination, saving data, rate limiting requests, switching user agents and proxies, and dealing with CAPTCHAs.
For complex, large scale scraping jobs, you may want to leverage an existing web scraping framework like Scrapy or an AI-powered scraping service like ScrapingBee to save development time and headaches.
Whatever your use case, web scraping is an incredibly powerful tool for extracting insights and value from the world's largest retailers like Walmart. By following the techniques covered in this guide and remembering to always scrape ethically, you'll be well on your way to building your own data sets and applications powered by Walmart's product catalog.
Happy scraping!