How to Web Scrape Walmart.com

Walmart is the world's largest retailer with over 10,000 stores across 24 countries. Given its enormous inventory and rich product data, Walmart is an extremely valuable target for web scraping.

In this comprehensive guide, we'll walk through how to build a web scraper to extract Walmart product data at scale.

Overview

Here's a quick overview of the key steps we'll cover:

Finding Products to Scrape
- Using the Walmart search API
- Parsing category pages
- Dealing with result limits
Scraping Product Pages
- Parsing product data
- Scraping media, pricing, specs etc.
Avoiding Blocks
- Randomizing delays
- Using proxies
- Mimicking real browsers
Putting It All Together
- Search API → product URLs → scrape
- Handling large result sets

By the end, you'll have a working outline of a Walmart scraper in Python that can be extended to extract thousands of products. Let's get started!

Setup

We'll be using Python along with several key packages:

requests – for making HTTP requests to Walmart's API and webpages
beautifulsoup4 – for HTML parsing
pandas – for data manipulation

Install them via pip:

pip install requests beautifulsoup4 pandas

We'll also use proxies to avoid blocks, which can be purchased from various providers.
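As a rough sketch of how a purchased proxy is usually wired in: requests accepts a proxies mapping from URL scheme to proxy URL. The helper name build_proxies and the host, port, and credentials below are placeholders for illustration, not values from any real provider.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Return a mapping in the shape accepted by requests' `proxies=` argument."""
    auth = ""
    if username and password:
        # Percent-encode credentials so characters like "@" or ":" stay valid in a URL.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{auth}{host}:{port}"
    # The same HTTP proxy endpoint is used for both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder endpoint and credentials (assumptions, replace with your provider's values).
proxies = build_proxies("proxy.example.com", 8000, "user", "p@ss")
print(proxies["https"])
```

The resulting dictionary can then be passed along with each request, for example requests.get(url, proxies=proxies).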

Finding Products to Scrape

The first step is discovering product URLs or IDs to feed into our scraper. There are a couple of approaches we can use:

Using the Search API

Walmart offers a search API that returns structured JSON data. We can query this API to find products matching keywords.

Let's try it out for "laptop":

import requests

api_url = "https://www.walmart.com/terra-firma/api/search"

params = {
  "query": "laptop",
  "sort": "price_low",
  "page": 1,
  "affiliateId": "test",
}

response = requests.get(api_url, params=params, timeout=30)
response.raise_for_status()  # fail early on blocked or failed requests
data = response.json()

print(data["items"][0]["productId"])
# prints a product ID, e.g. 1GY23EA#ABA

This API returns paginated results in a structured JSON format containing:

productId – the Walmart ID for that product
title – name of the product
description – short text description
price – current price
And more…

We can iterate through pages to collect IDs and data.

One limitation is that the API only allows fetching up to 1,000 results. To get more coverage, we'll have to use other approaches too.
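The page-by-page collection described above could be sketched roughly as follows. The fetch_page callable is a hypothetical stand-in for the real API request, and the response shape ({"items": [{"productId": ...}]}) is assumed from the example earlier in this section. The max_pages guard is an added safety limit, not something the API documents.

```python
def collect_product_ids(fetch_page, max_results=1000, max_pages=100):
    """Walk result pages until they run out or the result cap is reached.

    `fetch_page(page)` is a hypothetical callable returning the decoded JSON
    for one page, assumed to look like {"items": [{"productId": ...}, ...]}.
    """
    product_ids = []
    seen = set()
    for page in range(1, max_pages + 1):
        items = fetch_page(page).get("items", [])
        if not items:
            break  # an empty page means there are no more results
        for item in items:
            product_id = item.get("productId")
            if product_id and product_id not in seen:
                seen.add(product_id)
                product_ids.append(product_id)
        if len(product_ids) >= max_results:
            break  # reached the result cap
    return product_ids[:max_results]


# Stand-in for the real API call: three pages of two items, then an empty page.
def fake_fetch(page):
    if page > 3:
        return {"items": []}
    return {"items": [{"productId": f"id-{page}-{n}"} for n in range(2)]}


print(collect_product_ids(fake_fetch))
```

In a real run, fetch_page would wrap the requests.get call shown earlier, passing the page number in the params dictionary.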

Parsing Category Pages

Walmart also provides browseable category pages we can parse:

https://www.walmart.com/browse/electronics/laptops/3944_3951_132959?povid=113750++2019-11-04+15%3A05%3A24.517-06%3A00&povid=113750++2019-11-04+15%3A05%3A24.517-06%3A00&affinityOverride=default

These pages contain the product grids we see on the Walmart site.

To extract products, we'll use Beautiful Soup:

from bs4 import BeautifulSoup
import requests

url = "https://www.walmart.com/browse/electronics/laptops/3944_3951_132959"

response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on blocked or failed requests
soup = BeautifulSoup(response.text, "html.parser")

# Each grid tile is one product
products = soup.select(".search-result-gridview-item")

for product in products:
  title_link = product.select_one(".search-result-product-title")
  if title_link is None or not title_link.get("href"):
    continue  # skip tiles without a title link

  title = title_link.get_text(strip=True)
  link = title_link["href"]

  print(title, link)

This parses the products in the grid view, grabbing each title and URL.

We can then feed the URLs into our product scraper.

Category pages can contain thousands of products across many pages, so this method has great coverage.
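Links pulled from a product grid may be site-relative or carry query strings, so it can help to normalize them before scraping. The sketch below uses only the standard library. The helper name normalize_product_links and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.walmart.com"


def normalize_product_links(hrefs, base_url=BASE_URL):
    """Turn raw href values into absolute, de-duplicated product URLs."""
    urls = []
    seen = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        # Drop the query string and fragment so the same product is not kept twice.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in seen:
            seen.add(clean)
            urls.append(clean)
    return urls


# Hypothetical href values of the kind a category grid might contain.
hrefs = [
    "/ip/example-laptop/497219257",
    "/ip/example-laptop/497219257?ref=grid",
    "https://www.walmart.com/ip/another-laptop/123456789",
]
print(normalize_product_links(hrefs))
```

De-duplicating at this stage also keeps the later product scrape from fetching the same page twice when a product appears on several category pages.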

Dealing with Limits

Between the search API and category pages, we can discover tens of thousands of products. But there are some limits to consider:

The search API only returns up to 1,000 results per query
Each category is capped at 24 pages of ~50 products each, or roughly 1,200 products

So for a comprehensive scrape, we'll have to get creative:

Use multiple search queries with narrowing filters
Scrape across multiple category pages
Expand the scope, e.g. scraping all laptops across Electronics

With a bit of iteration, we can build up a large corpus of 10,000+ product URLs suitable for feeding into our scraper.
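One way to picture the "multiple narrowed queries" idea is to combine a base keyword with filter terms and merge the results by product ID. This is only a sketch. The brand and size terms are illustrative, and search is a hypothetical callable standing in for whichever discovery method is used.

```python
from itertools import product


def narrowed_queries(base, *filter_groups):
    """Combine a base keyword with one term from each filter group."""
    return [" ".join((*terms, base)) for terms in product(*filter_groups)]


def merge_results(search, queries):
    """Run each query through `search` and keep each product ID once.

    `search(query)` is a hypothetical callable returning a list of product IDs.
    """
    merged = {}
    for query in queries:
        for product_id in search(query):
            # Remember which query surfaced the product first.
            merged.setdefault(product_id, query)
    return merged


queries = narrowed_queries("laptop", ["hp", "dell"], ["14 inch", "15 inch"])
print(queries)
```

Each narrowed query stays under the per-query result cap, while the merged dictionary keeps the combined corpus free of duplicates.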

Scraping Product Pages

Once we have product URLs or IDs, we can scrape the data from the product pages themselves.

Walmart product pages have a rich set of information we can extract:

Title, description
Images
Price, sales data
Specifications
Seller info
Reviews
Related products
Stock availability

And more.

Let's walk through scraping some key pieces.

Scraping Product Details

Product pages contain a JavaScript object called window.__WML_REDUX_INITIAL_STATE__ with much of the structured data:

<script>
  window.__WML_REDUX_INITIAL_STATE__ = {
    "product": {
      "id": "1GY23EA#ABA",
      "usItemId": "497219257", 
      "name": "HP 14-inch Laptop, Intel Core i3-1005G1, 4GB SDRAM, 128GB SSD, Pale Gold, Windows 10 Home",
      "description": "A laptop with the performance you need and..."
      ...
    }
    ...
  }
</script>

We can extract this and parse the JSON to get fields like:

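import json
import requests
from bs4 import BeautifulSoup

product_url = "https://www.walmart.com/ip/497219257"
marker = "window.__WML_REDUX_INITIAL_STATE__ ="

response = requests.get(product_url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Find the script whose text contains the state assignment shown above
script = soup.find("script", string=lambda text: text and marker in text)
if script is None:
  raise ValueError("State object not found; the page layout may have changed or the request was blocked")

# Drop the "window... =" prefix, then decode the first JSON value and ignore any trailing ";"
raw_json = script.string.split(marker, 1)[1].lstrip()
state, _ = json.JSONDecoder().raw_decode(raw_json)
product_data = state["product"]

title = product_data["name"]
walmart_id = product_data["usItemId"]
description = product_data["description"]

print(title)
# "HP 14-inch Laptop, Intel Core i3-1005G1, 4GB SDRAM, 128GB SSD, Pale Gold, Windows 10 Home"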

This JSON field contains most of the core product information we would want to extract.
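The state object can also be pulled out of raw HTML without an HTML parser, which makes the parsing step easy to test offline. The sketch below assumes the object is embedded as a plain JavaScript assignment, as in the excerpt above. The function name extract_initial_state and the sample HTML are made up for illustration.

```python
import json
import re

# Matches the assignment itself, not longer paths such as "...STATE__.pdpData = ".
STATE_PATTERN = re.compile(r"window\.__WML_REDUX_INITIAL_STATE__\s*=\s*")


def extract_initial_state(html):
    """Return the decoded state object embedded in `html`, or None if absent."""
    match = STATE_PATTERN.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a trailing ";" or "</script>").
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


# Made-up page fragment for an offline check.
sample_html = """
<script>
  window.__WML_REDUX_INITIAL_STATE__ = {"product": {"usItemId": "497219257", "name": "Example Laptop"}};
</script>
"""

state = extract_initial_state(sample_html)
print(state["product"]["name"])
```

If the embedded text is not valid JSON, raw_decode raises json.JSONDecodeError, which is a useful signal that the page format has changed.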

Scraping Media

The product media like images are contained in another script block, imageAssets:

<script>
window.__WML_REDUX_INITIAL_STATE__.pdpData.item.imageAssets = [
  {
    "assetSize": "medium",
    "assetUrl": "https://i5.walmartimages.com/...", 
    "baseAsset": {...},
    "thumbnailUrl": "https://i5.walmartimages.com/..." 
  },
  {...}
];
</script>

We can scrape and iterate through the assets to find the URLs of different sizes:

images = []

for asset in product_data["imageAssets"]:
  img_url = asset["assetUrl"]
  images.append(img_url)

print(images[0])
# "https://i5.walmartimages.com/asr/e95444a3-2e8b-41d2-a585-4f3ea9fc51b6.345fba144e9df8a6d290b2ed3857e90b.jpeg"

This allows us to get all the product images at different resolutions.
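To pick a particular resolution, the assets can be grouped by their assetSize field. This is a sketch that assumes each asset is a dictionary with the assetSize and assetUrl keys shown in the excerpt. The sample URLs are placeholders.

```python
from collections import defaultdict


def group_images_by_size(image_assets):
    """Map each assetSize to its unique assetUrl values, in first-seen order."""
    grouped = defaultdict(list)
    for asset in image_assets:
        url = asset.get("assetUrl")
        if not url:
            continue  # skip entries without a URL
        size = asset.get("assetSize", "unknown")
        if url not in grouped[size]:
            grouped[size].append(url)
    return dict(grouped)


# Placeholder assets in the shape shown above.
sample_assets = [
    {"assetSize": "medium", "assetUrl": "https://i5.walmartimages.com/a.jpeg"},
    {"assetSize": "large", "assetUrl": "https://i5.walmartimages.com/b.jpeg"},
    {"assetSize": "medium", "assetUrl": "https://i5.walmartimages.com/a.jpeg"},
    {},
]
print(group_images_by_size(sample_assets))
```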

Scraping Price and Inventory

For key details like pricing and availability, the data is contained in yet another script tag:

<script>
window.__WML_REDUX_INITIAL_STATE__.pdpData.item.priceInfo =  {
  "priceDisplayCodes": {
    "rollback": true,
    "reducedPrice": true    
  },
  "currentPrice": {
    "currencyUnit": "USD", 
    "price": 399
  }
  ...

We can parse out the pricing fields:

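price_data = product_data["priceInfo"]

sale_price = price_data["currentPrice"]["price"] # 399
# wasPrice is not part of the excerpt above, so fall back to the current price if it is missing
regular_price = price_data.get("wasPrice", {}).get("price", sale_price) # e.g. 499
# Check the flag's value, not just whether the key exists
on_sale = price_data["priceDisplayCodes"].get("rollback", False) # True

print(f"On sale for {sale_price}, regular {regular_price}")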

And similarly for stock status, which is stored in the availabilityStatus field:

in_stock = product_data["availabilityStatus"] == "IN_STOCK"
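Since not every product is likely to carry every pricing field, a more defensive version of the same parsing might look like the sketch below. The function name parse_offer is hypothetical, and the sample data simply mirrors the priceInfo excerpt above.

```python
def parse_offer(product_data):
    """Summarize price and stock fields, tolerating missing keys."""
    price_info = product_data.get("priceInfo") or {}
    current = (price_info.get("currentPrice") or {}).get("price")
    was = (price_info.get("wasPrice") or {}).get("price")
    codes = price_info.get("priceDisplayCodes") or {}
    return {
        "price": current,
        # Fall back to the current price when no earlier price is listed.
        "regular_price": was if was is not None else current,
        "on_sale": bool(codes.get("rollback") or codes.get("reducedPrice")),
        "in_stock": product_data.get("availabilityStatus") == "IN_STOCK",
    }


# Sample data mirroring the excerpt above (no wasPrice present).
sample = {
    "priceInfo": {
        "priceDisplayCodes": {"rollback": True, "reducedPrice": True},
        "currentPrice": {"currencyUnit": "USD", "price": 399},
    },
    "availabilityStatus": "IN_STOCK",
}
print(parse_offer(sample))
```

Returning None for missing values, rather than raising, lets a long scrape continue past products with incomplete data.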

Putting this all together, we can build scrapers for product details, media, pricing, inventory, and more!

Avoiding Blocks

When scraping Walmart at scale, we'll likely encounter blocks from too many requests. Here are some tips to avoid this:

Limit request rate – stay under 2-3 requests per second in total, across all workers
Randomize delays – insert random 2-5 second delays between requests on each connection
Rotate user agents – send a variety of realistic desktop browser user agents
Use proxies – route traffic through residential proxy services
Back off on blocks – if blocked, pause scraping for 30+ minutes before retrying

With these precautions, we can scrape thousands of Walmart products with a much lower risk of being blocked.

Some paid proxy services also offer advanced rotating IPs and headers to avoid blocks. These can help for larger scale scraping.
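Several of the tips above can be combined in one small wrapper. The following is a sketch under stated assumptions: fetch is a hypothetical callable (for example a thin wrapper around requests.get), the user-agent strings are examples, and treating HTTP 403 and 429 as block signals is an assumption rather than documented Walmart behavior.

```python
import random
import time

# Example desktop user-agent strings; any current browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

BLOCK_STATUSES = {403, 429}  # assumed signals of a block


def polite_get(fetch, url, max_attempts=3, delay_range=(2, 5),
               block_pause=30 * 60, sleep=time.sleep):
    """Fetch `url` with a random delay, a rotated User-Agent, and a long pause on blocks.

    `fetch(url, headers)` is a hypothetical callable returning an object
    with a `status_code` attribute. Returns None if every attempt is blocked.
    """
    for attempt in range(1, max_attempts + 1):
        sleep(random.uniform(*delay_range))  # random 2-5 second delay
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = fetch(url, headers)
        if response.status_code not in BLOCK_STATUSES:
            return response
        if attempt < max_attempts:
            sleep(block_pause)  # blocked: pause for 30 minutes before retrying
    return None


# Offline demonstration with stand-ins for the network call and for sleeping.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 200])
pauses = []
result = polite_get(lambda url, headers: FakeResponse(next(statuses)),
                    "https://www.walmart.com/ip/497219257", sleep=pauses.append)
print(result.status_code, len(pauses))
```

Passing sleep as a parameter keeps the function testable without real waiting. In production the default time.sleep applies, and fetch could be something like lambda url, headers: requests.get(url, headers=headers, timeout=30).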

Putting It All Together

Finally, let's tie together the key components into a complete Walmart web scraper.

The general flow will be:

Discover products using search API and category pages
Collect product URLs
Iterate through URLs to scrape each product page
Extract details, media, pricing, inventory etc.
Save scraped product data to CSV/JSON

Here's example code:

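import json
import random
import time

import requests
from bs4 import BeautifulSoup

# Product URL extraction functions...

def scrape_search_api(query):
  # Search API logic goes here; should return a list of product URLs
  return []

def scrape_category_pages(url):
  # Category parsing logic goes here; should return a list of product URLs
  return []

def extract_product(soup):
  # Product parsing logic goes here (details, media, pricing, inventory)
  name = description = price = in_stock = None
  images = []

  return {
    "name": name,
    "description": description,
    "price": price,
    "images": images,
    "in_stock": in_stock
  }

product_urls = []

product_urls.extend(scrape_search_api("laptops"))
product_urls.extend(scrape_category_pages("https://www..."))

# Add proxies here...

scraped_products = []

for url in product_urls:

  response = requests.get(url, timeout=30)

  soup = BeautifulSoup(response.text, "html.parser")

  # Extract product data...
  product = extract_product(soup)

  # Save product to CSV, database, etc...
  scraped_products.append(product)

  # Random delay
  time.sleep(random.uniform(2, 5))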

This implements the key pieces we‘ve covered:

Generating product URLs to feed into the scraper
Parsing each product page with BeautifulSoup
Extracting details, media, pricing, inventory
Adding proxies and random delays to avoid blocks
Saving scraped data to file

With this structure, we can scrape and extract thousands of Walmart products robustly.

The full code would contain more advanced error handling, multithreading etc. But this covers the core logic and workflow.
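For the saving step, a minimal sketch using the standard csv module might look like this. The field list follows the product dictionary used above. The function name write_products_csv and the sample row are made up, and storing the image list as a JSON string is just one possible choice.

```python
import csv
import io
import json

FIELDS = ["name", "description", "price", "images", "in_stock"]


def write_products_csv(products, out_file):
    """Write product dicts as CSV rows; the image list is stored as a JSON string."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for product in products:
        row = dict(product)
        row["images"] = json.dumps(row.get("images", []))
        writer.writerow(row)


# Made-up sample row in the shape produced by the scraper loop.
products = [{
    "name": "Example Laptop",
    "description": "A sample row",
    "price": 399,
    "images": ["https://i5.walmartimages.com/a.jpeg"],
    "in_stock": True,
}]

# An in-memory buffer keeps the example self-contained; for a real file use
# open("products.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_products_csv(products, buffer)
print(buffer.getvalue())
```

The same list of dictionaries can also be loaded into pandas with pandas.DataFrame(products) for further analysis.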

Summary

In this guide we walked through building a comprehensive web scraper for Walmart product data using Python.

The key techniques included:

Using Walmart's search API and category pages to generate product URLs
Parsing product pages and extracting details, media, pricing and inventory
Avoiding blocks with proxies, delays, and spoofing
Tying together search → product scrape → save workflow

These approaches can extract thousands of Walmart products robustly. The data can then be used for price monitoring, market research, dropshipping and more.

With a few enhancements like multithreading and database storage, you'll have a powerful Walmart scraping solution ready for large-scale deployment.