Walmart is the world's largest retailer, with over 10,000 stores across 24 countries. Given its enormous inventory and rich product data, Walmart is an extremely valuable target for web scraping.
In this comprehensive guide, we'll walk through how to build a web scraper to extract Walmart product data at scale.
Overview
Here's a quick overview of the key steps we'll cover:
- Finding Products to Scrape
- Using the Walmart search API
- Parsing category pages
- Dealing with result limits
- Scraping Product Pages
- Parsing product data
- Scraping media, pricing, specs etc.
- Avoiding Blocking
- Randomizing delays
- Using proxies
- Mimicking real browsers
- Putting It All Together
- Search API → product URLs → scrape
- Handling large result sets
By the end, you'll have a fully functioning Walmart scraper in Python, ready to extract thousands of products. Let's get started!
Setup
We'll be using Python along with several key packages:
- requests – for making HTTP requests to Walmart's API and webpages
- beautifulsoup4 – HTML parsing
- pandas – for data manipulation
Install them via pip:
pip install requests beautifulsoup4 pandas
We'll also use proxies to avoid blocks; these can be purchased from various providers.
Finding Products to Scrape
The first step is discovering product URLs or IDs to feed into our scraper. There are a couple of approaches we can use:
Using the Search API
Walmart offers a search API that returns structured JSON data. We can query this API to find products matching keywords.
Let's try it out for "laptop":
import requests

api_url = "https://www.walmart.com/terra-firma/api/search"
params = {
    "query": "laptop",
    "sort": "price_low",
    "page": 1,
    "affiliateId": "test",
}

response = requests.get(api_url, params=params)
data = response.json()

print(data["items"][0]["productId"])
# prints a product ID, e.g. 1GY23EA#ABA
This API returns paginated results in a structured JSON format containing:
- productId – the Walmart ID for that product
- title – name of the product
- description – short text description
- price – current price
- And more…
We can iterate through pages to collect IDs and data.
One limitation is that the API only allows fetching up to 1,000 results per query. To get more coverage, we'll have to use other approaches too.
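The page-by-page iteration can be sketched as a simple pagination loop. The stopping conditions here (an empty `items` list, or hitting the 1,000-result cap) are assumptions; adjust them to whatever the live API actually returns:

```python
import requests

API_URL = "https://www.walmart.com/terra-firma/api/search"
MAX_RESULTS = 1000  # the API stops returning items past this point

def collect_product_ids(query):
    """Page through the search API, collecting product IDs until
    the results run dry or we hit the result cap."""
    ids = []
    page = 1
    while len(ids) < MAX_RESULTS:
        params = {"query": query, "page": page}
        items = requests.get(API_URL, params=params).json().get("items", [])
        if not items:
            break  # no more pages
        ids.extend(item["productId"] for item in items)
        page += 1
    return ids[:MAX_RESULTS]
```

Called as `collect_product_ids("laptop")`, this returns up to 1,000 IDs ready to feed into the product scraper.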
Parsing Category Pages
Walmart also provides browseable category pages we can parse:
https://www.walmart.com/browse/electronics/laptops/3944_3951_132959
These pages contain the product grids we see on the Walmart site.
To extract products, we‘ll use Beautiful Soup:
from bs4 import BeautifulSoup
import requests

url = "https://www.walmart.com/browse/electronics/laptops/3944_3951_132959"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

products = soup.select(".search-result-gridview-item")

for product in products:
    title_el = product.select_one(".search-result-product-title")
    title = title_el.text
    link = title_el["href"]
    print(title, link)
This parses the Grid/List view products, grabbing the title and URL.
We can then feed the URLs into our product scraper.
Category pages can contain thousands of products across many pages, so this method has great coverage.
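Since each category spans many pages, we can generate the per-page URLs up front and feed each one through the parsing code above. The `?page=N` query parameter is an assumption about how Walmart's browse pages paginate; verify it against the live site:

```python
def category_page_urls(base_url, max_pages=24):
    """Build the URL for every page of a category listing."""
    return [f"{base_url}?page={n}" for n in range(1, max_pages + 1)]

urls = category_page_urls(
    "https://www.walmart.com/browse/electronics/laptops/3944_3951_132959"
)
print(urls[0])    # ...3944_3951_132959?page=1
print(len(urls))  # 24
```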
Dealing with Limits
Between the search API and category pages, we can discover tens of thousands of products. But there are some limits to consider:
- The search API only returns up to 1,000 results per query
- Each category runs to at most 24 pages, with ~50 products per page
So for a comprehensive scrape, we'll have to get creative:
- Use multiple search queries with narrowing filters
- Scrape across multiple category pages
- Expand the scope, e.g. scraping all laptop categories across Electronics
With a bit of iteration, we can build up a large corpus of 10,000+ product URLs suitable for feeding into our scraper.
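One way to narrow a broad query is to split it into price bands, so each sub-query stays under the 1,000-result cap. The `min_price`/`max_price` parameter names below are hypothetical; substitute whatever filter parameters the search API actually accepts:

```python
def price_banded_queries(query, bands):
    """Expand one broad query into several narrower ones via price filters."""
    return [
        {"query": query, "min_price": lo, "max_price": hi}
        for lo, hi in bands
    ]

# Four bands covering the laptop price range
bands = [(0, 300), (300, 600), (600, 1000), (1000, 5000)]
queries = price_banded_queries("laptop", bands)
print(len(queries))  # 4
```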
Scraping Product Pages
Once we have product URLs or IDs, we can scrape the data from the product pages themselves.
Walmart product pages have a rich set of information we can extract:
- Title, description
- Images
- Price, sales data
- Specifications
- Seller info
- Reviews
- Related products
- Stock availability
And more.
Let's walk through scraping some key pieces.
Scraping Product Details
Product pages embed a JavaScript object, window.__WML_REDUX_INITIAL_STATE__, containing much of the structured data:
<script>
window.__WML_REDUX_INITIAL_STATE__ = {
  "product": {
    "id": "1GY23EA#ABA",
    "usItemId": "497219257",
    "name": "HP 14-inch Laptop, Intel Core i3-1005G1, 4GB SDRAM, 128GB SSD, Pale Gold, Windows 10 Home",
    "description": "A laptop with the performance you need and..."
    ...
  }
  ...
}
</script>
We can extract this and parse the JSON to get fields like:
import json
import requests
from bs4 import BeautifulSoup

product_url = "https://www.walmart.com/ip/497219257"
response = requests.get(product_url)
soup = BeautifulSoup(response.text, "html.parser")

# Find the script tag holding the Redux state assignment
script = next(
    s for s in soup.find_all("script")
    if s.string and "__WML_REDUX_INITIAL_STATE__" in s.string
)
# Strip the "window.__WML_REDUX_INITIAL_STATE__ = " prefix and trailing ";"
raw_json = script.string.split("=", 1)[1].strip().rstrip(";")
product_data = json.loads(raw_json)["product"]

title = product_data["name"]
walmart_id = product_data["usItemId"]
description = product_data["description"]

print(title)
# "HP 14-inch Laptop, Intel Core i3-1005G1, 4GB SDRAM, 128GB SSD, Pale Gold, Windows 10 Home"
This JSON field contains most of the core product information we would want to extract.
Scraping Media
The product media, such as images, live under another key of the same Redux state, imageAssets:
<script>
window.__WML_REDUX_INITIAL_STATE__.pdpData.item.imageAssets = [
{
"assetSize": "medium",
"assetUrl": "https://i5.walmartimages.com/...",
"baseAsset": {...},
"thumbnailUrl": "https://i5.walmartimages.com/..."
},
{...}
];
</script>
We can scrape and iterate through the assets to find the URLs of different sizes:
images = []

for asset in product_data["imageAssets"]:
    img_url = asset["assetUrl"]
    images.append(img_url)

print(images[0])
# "https://i5.walmartimages.com/asr/e95444a3-2e8b-41d2-a585-4f3ea9fc51b6.345fba144e9df8a6d290b2ed3857e90b.jpeg"
This allows us to get all the product images at different resolutions.
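If we only want one size class, the assets can be filtered on assetSize. Note that "large" below is an assumption about the possible values; "medium" is the only size shown in the snippet above:

```python
def urls_by_size(image_assets, size):
    """Keep only the asset URLs matching one size class."""
    return [a["assetUrl"] for a in image_assets if a.get("assetSize") == size]

# Illustrative assets in the shape shown above
sample_assets = [
    {"assetSize": "medium", "assetUrl": "https://i5.walmartimages.com/med.jpeg"},
    {"assetSize": "large", "assetUrl": "https://i5.walmartimages.com/lg.jpeg"},
]
print(urls_by_size(sample_assets, "large"))
# ['https://i5.walmartimages.com/lg.jpeg']
```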
Scraping Price and Inventory
For key details like pricing and availability, the data is contained in yet another script tag:
<script>
window.__WML_REDUX_INITIAL_STATE__.pdpData.item.priceInfo = {
  "priceDisplayCodes": {
    "rollback": true,
    "reducedPrice": true
  },
  "currentPrice": {
    "currencyUnit": "USD",
    "price": 399
  }
  ...
}
</script>
We can parse out the pricing fields:
price_data = product_data["priceInfo"]

regular_price = price_data["wasPrice"]["price"]  # 499
sale_price = price_data["currentPrice"]["price"]  # 399
on_sale = price_data["priceDisplayCodes"].get("rollback", False)  # True

print(f"On sale for {sale_price}, regular {regular_price}")
And similarly for stock status, contained in the availabilityStatus field:
in_stock = product_data["availabilityStatus"] == "IN_STOCK"
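Not every product exposes every field (for instance, items that were never discounted may lack wasPrice), so a defensive helper using .get with defaults is safer than direct indexing. A sketch, run on sample data shaped like the JSON above:

```python
def extract_price_info(product_data):
    """Pull pricing fields without raising KeyError on missing entries."""
    info = product_data.get("priceInfo", {})
    return {
        "price": info.get("currentPrice", {}).get("price"),
        "was_price": info.get("wasPrice", {}).get("price"),
        "on_sale": info.get("priceDisplayCodes", {}).get("rollback", False),
    }

sample = {
    "priceInfo": {
        "priceDisplayCodes": {"rollback": True},
        "currentPrice": {"currencyUnit": "USD", "price": 399},
    }
}
print(extract_price_info(sample))
# {'price': 399, 'was_price': None, 'on_sale': True}
```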
Putting this all together, we can build scrapers for product details, media, pricing, inventory, and more!
Avoiding Blocks
When scraping Walmart at scale, we'll likely encounter blocks from making too many requests. Here are some tips to avoid this:
- Limit request rate – stick to 2–3 requests per second max
- Randomize delays – insert random 2–5 second delays between requests
- Rotate user agents – spoof different desktop browser user agents
- Use proxies – route traffic through residential proxy services
- Retry on blocks – if blocked, pause scraping for 30+ minutes
With these precautions, we can scrape thousands of Walmart products safely.
Some paid proxy services also offer advanced rotating IPs and headers to avoid blocks. These can help for larger scale scraping.
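These precautions can be bundled into a single request helper. The proxy endpoints and user-agent strings below are placeholders; substitute your provider's real addresses and current browser UAs:

```python
import random
import time

import requests

USER_AGENTS = [  # placeholder desktop UAs -- keep these current
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = [  # placeholder endpoints from a proxy provider
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def polite_get(url):
    """One GET with a random delay, user agent, and proxy."""
    time.sleep(random.uniform(2, 5))  # randomized 2-5 second delay
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        headers={"User-Agent": random.choice(USER_AGENTS)},
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
```

Swapping requests.get for polite_get in the scraping loop applies all three precautions on every request.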
Putting It All Together
Finally, let's tie together the key components into a complete Walmart web scraper.
The general flow will be:
- Discover products using search API and category pages
- Collect product URLs
- Iterate through URLs to scrape each product page
- Extract details, media, pricing, inventory etc.
- Save scraped product data to CSV/JSON
Here's example code:
from bs4 import BeautifulSoup
import requests, json, time, random

# Product URL extraction functions...

def scrape_search_api(query):
    # Search API logic...
    ...

def scrape_category_pages(url):
    # Category parsing logic...
    ...

product_urls = []
product_urls.extend(scrape_search_api("laptops"))
product_urls.extend(scrape_category_pages("https://www..."))

# Add proxies here...

for url in product_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract product data...
    product = {
        "name": name,
        "description": description,
        "price": price,
        "images": images,
        "in_stock": in_stock,
    }

    # Save product to CSV, database, etc...

    # Random delay
    time.sleep(random.uniform(2, 5))
This implements the key pieces we've covered:
- Generating product URLs to feed into the scraper
- Parsing each product page with BeautifulSoup
- Extracting details, media, pricing, inventory
- Adding proxies and random delays to avoid blocks
- Saving scraped data to file
With this structure, we can scrape and extract thousands of Walmart products robustly.
The full code would contain more advanced error handling, multithreading, and so on, but this covers the core logic and workflow.
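The save step can lean on pandas, which we installed in Setup but have not yet used. A sketch with illustrative records standing in for the scraping loop's output:

```python
import pandas as pd

# Illustrative records -- in practice this list is filled by the scraping loop
scraped = [
    {"name": "HP 14-inch Laptop", "price": 399, "in_stock": True},
    {"name": "Lenovo IdeaPad 3", "price": 449, "in_stock": False},
]

df = pd.DataFrame(scraped)
df.to_csv("walmart_products.csv", index=False)
df.to_json("walmart_products.json", orient="records")
print(df.shape)  # (2, 3)
```

Appending each scraped product dict to a list and converting once at the end avoids rewriting the file on every iteration.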
Summary
In this guide we walked through building a comprehensive web scraper for Walmart product data using Python.
The key techniques included:
- Using Walmart's search API and category pages to generate product URLs
- Parsing product pages and extracting details, media, pricing and inventory
- Avoiding blocks with proxies, delays, and spoofing
- Tying together search → product scrape → save workflow
These approaches can extract thousands of Walmart products robustly. The data can then be used for price monitoring, market research, dropshipping and more.
With a few enhancements like multithreading and database storage, you'll have a powerful Walmart scraping solution ready for large-scale deployment.