How to Scrape Walmart.com with Python (Step by Step Guide)

Scraping Walmart.com can provide valuable data for various purposes. In this guide, we'll walk you through the process of scraping Walmart.com efficiently and effectively.

When it comes to web scraping Walmart.com using Python, there are several libraries available that can facilitate the process. Here are some popular Python web scraping libraries that you can use to scrape data from Walmart.com:

  • Beautiful Soup: Beautiful Soup is a widely used library for web scraping in Python. It provides convenient methods for parsing HTML and XML documents, making it easy to extract data from web pages. You can use Beautiful Soup in combination with other libraries to scrape data from Walmart.com.
  • Requests: The Requests library is commonly used for making HTTP requests in Python. It allows you to send HTTP requests to Walmart.com and retrieve the HTML content of web pages. With Requests, you can fetch the necessary web pages and then use other libraries like Beautiful Soup to parse the data.
  • Selenium: Selenium is a powerful web scraping library that enables browser automation. It can be used to interact with web pages dynamically, making it useful for scraping websites with JavaScript-based functionality. Selenium allows you to automate tasks like clicking buttons, filling forms, and navigating through pages, which can be beneficial for scraping Walmart.com.
  • Scrapy: Scrapy is a robust web scraping framework in Python. It provides a high-level, efficient, and extensible platform for scraping data from websites. Scrapy simplifies the process of building web crawlers, allowing you to scrape data from Walmart.com at scale.
  • LXML: LXML is a Python library that provides a fast and easy-to-use interface for parsing XML and HTML documents. It is commonly used in combination with Requests and Beautiful Soup to scrape data from websites. LXML offers XPath support, which allows you to extract specific elements from the HTML structure of Walmart.com.

These libraries provide different functionalities and levels of flexibility, so you can choose the one that best suits your specific scraping needs for Walmart.com. Consider exploring their documentation and examples to understand how to utilize them effectively for your scraping projects.
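As a quick illustration of how these libraries fit together, the sketch below parses HTML with LXML and queries it with XPath. The markup here is a made-up stand-in, not Walmart's actual page structure:

```python
from lxml import html

# A small static document standing in for a fetched Walmart page.
doc = html.fromstring("""
<html><body>
  <div class="product"><h2>iPad 9th Gen</h2><span class="price">$249</span></div>
  <div class="product"><h2>iPad Air</h2><span class="price">$559</span></div>
</body></html>
""")

# XPath expressions target elements by attributes rather than position,
# which makes them easier to maintain as pages change.
names = doc.xpath('//div[@class="product"]/h2/text()')
prices = doc.xpath('//span[@class="price"]/text()')
print(list(zip(names, prices)))
```

In a real scraper, the static string would be replaced with the response body fetched via Requests.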

How to Scrape Walmart.com

Step 1: Build A List Of Walmart Product URLs

When scraping Walmart.com, the first step is to design a web crawler that generates a list of product URLs to scrape. The easiest way to do this is to use the Walmart Search page, which returns up to 40 products per page. The URL for the search page contains several parameters that you can customize:

  • q is the search query, such as ipad.
  • sort is the sorting order of the query, such as best_seller.
  • page is the page number, such as 1.

Note that Walmart only returns a maximum of 25 pages. If you want more results for your query, you can be more specific with your search terms or change the sorting parameter.
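The search URL described above can be assembled with `urllib.parse.urlencode`, which also takes care of escaping the query string. A minimal helper (the function name is just for illustration):

```python
from urllib.parse import urlencode

def build_search_url(keyword, page=1, sort='best_seller'):
    """Build a Walmart search URL for the given keyword, sort order, and page."""
    params = {'q': keyword, 'sort': sort, 'page': page}
    return 'https://www.walmart.com/search?' + urlencode(params)

print(build_search_url('ipad', page=2))
# https://www.walmart.com/search?q=ipad&sort=best_seller&page=2
```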

The list of products returned in the response is available as hidden JSON data on the page. You just need to extract the JSON blob in the <script id="__NEXT_DATA__" type="application/json"> tag and parse it into JSON. This JSON response contains the data you're looking for.

Here is an example Python script that retrieves all products for a given keyword from all 25 pages:

```python
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

def create_walmart_product_url(product):
    return 'https://www.walmart.com' + product.get('canonicalUrl', '').split('?')[0]

headers = {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}
product_url_list = []
keyword = 'ipad'

for page in range(1, 26):
    try:
        payload = {'q': keyword, 'sort': 'best_seller', 'page': page, 'affinityOverride': 'default'}
        walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
        response = requests.get(walmart_search_url, headers=headers)
        if response.status_code == 200:
            html_response = response.text
            soup = BeautifulSoup(html_response, "html.parser")
            script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
            if script_tag is not None:
                json_blob = json.loads(script_tag.get_text())
                product_list = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
                product_urls = [create_walmart_product_url(product) for product in product_list]
                product_url_list.extend(product_urls)
                if len(product_urls) == 0:
                    break
    except Exception as e:
        print('Error', e)

print(product_url_list)
```

The output will be a list of product URLs.

Step 2: Scraping Walmart Product Data

The Walmart Search request also returns a lot more information than just the product URLs. You can get the product name, price, image URL, rating, and the number of reviews from the JSON blob as well. Depending on what data you need, you might not need to request each product page because you can get the data from the search results.

To extract the product data from the list, you can use a function like this:

```python
def extract_product_data(product):
    return {
        'url': create_walmart_product_url(product),
        'name': product.get('name', ''),
        'description': product.get('description', ''),
        'image_url': product.get('image', ''),
        'average_rating': product.get('rating', {}).get('averageRating'),
        'number_reviews': product.get('rating', {}).get('numberOfReviews'),
    }
```

The complete script below combines this function with the URL generation from Step 1, extracting product data directly from the search results rather than requesting each product page.

Please note that this script only works for up to 25 pages of search results per query due to Walmart's limitations. If you need to scrape more data, you'll need to modify your queries or change the sorting parameters.

Here's the Python script:

```python
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

def create_walmart_product_url(product):
    return 'https://www.walmart.com' + product.get('canonicalUrl', '').split('?')[0]

def extract_product_data(product):
    return {
        'url': create_walmart_product_url(product),
        'name': product.get('name', ''),
        'description': product.get('description', ''),
        'image_url': product.get('image', ''),
        'average_rating': product.get('rating', {}).get('averageRating'),
        'number_reviews': product.get('rating', {}).get('numberOfReviews'),
    }

headers = {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}
product_url_list = []
product_data_list = []

# Walmart Search Keyword
keyword = 'ipad'

# Loop Through Walmart Pages Until No More Products (Walmart returns at most 25 pages)
for page in range(1, 26):
    try:
        payload = {'q': keyword, 'sort': 'best_seller', 'page': page, 'affinityOverride': 'default'}
        walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
        response = requests.get(walmart_search_url, headers=headers)

        if response.status_code == 200:
            html_response = response.text
            soup = BeautifulSoup(html_response, "html.parser")
            script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
            if script_tag is not None:
                json_blob = json.loads(script_tag.get_text())
                product_list = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
                product_urls = [create_walmart_product_url(product) for product in product_list]
                product_url_list.extend(product_urls)
                product_data = [extract_product_data(product) for product in product_list]
                product_data_list.extend(product_data)
                if len(product_urls) == 0:
                    break

    except Exception as e:
        print('Error', e)

print(product_url_list)
print(product_data_list)
```

This script will output two lists. product_url_list will contain the URLs of each product, and product_data_list will contain dictionaries with product data (name, description, image URL, average rating, and number of reviews) for each product.
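To persist the scraped results, `product_data_list` can be written straight to CSV with the standard library. The sample row below is illustrative; the field names match the keys returned by `extract_product_data`:

```python
import csv

# Sample rows shaped like the dictionaries extract_product_data returns.
product_data_list = [
    {'url': 'https://www.walmart.com/ip/123', 'name': 'iPad',
     'description': 'Tablet', 'image_url': 'https://i5.walmartimages.com/a.jpg',
     'average_rating': 4.7, 'number_reviews': 312},
]

with open('walmart_products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=product_data_list[0].keys())
    writer.writeheader()                 # column headers from the dict keys
    writer.writerows(product_data_list)  # one row per product
```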

Walmart Anti-Bot Protection

When scraping Walmart.com, it's essential to consider the anti-bot protection measures in place. Walmart employs various techniques to prevent automated scraping, including CAPTCHAs, rate limiting, and session-based tracking. To overcome these challenges, you can employ strategies such as:

  • Rotating User-Agent headers and IP addresses (for example, via a proxy pool) so that requests don't all share a single fingerprint.
  • Adding randomized delays between requests to stay within reasonable rate limits.
  • Reusing a session with cookies so your traffic looks more like a normal browsing session.
  • Falling back to browser automation (such as Selenium) for pages that require JavaScript rendering.
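A minimal sketch of two common mitigations, randomized delays between requests and a rotating pool of User-Agent strings. The agent strings and timing values are illustrative only and are no guarantee of avoiding detection:

```python
import random
import time

# An illustrative pool of User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15",
]

def polite_headers():
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep a random interval so requests don't arrive at a machine-like pace."""
    time.sleep(random.uniform(min_s, max_s))

print(polite_headers()["User-Agent"])
```

In the scripts above, `polite_headers()` would replace the fixed `headers` dict, and `polite_delay()` would be called once per loop iteration.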

FAQs: Frequently Asked Questions

Q1. Is it legal to scrape Walmart.com?

Scraping a website like Walmart.com raises legal concerns. While scraping publicly available data may generally be permissible, it's crucial to review Walmart's terms of service and consult with legal professionals to ensure compliance with applicable laws.

Q2. How often should I scrape Walmart.com?

The frequency of scraping should be determined by the nature of your project and Walmart's policies. Excessive scraping can potentially strain Walmart's servers and violate their terms of service. Consider implementing reasonable intervals between scraping sessions to avoid disruption or potential penalties.

Q3. Can I scrape product reviews from Walmart.com?

Scraping product reviews can provide valuable insights. However, it's important to respect the privacy and intellectual property rights of users and adhere to Walmart's policies. Review Walmart's terms of service and consult legal professionals to ensure compliance when scraping product reviews.

Q4. How can I handle changes to Walmart's website structure?

Walmart.com undergoes occasional updates and redesigns, which may impact the structure of their web pages. To handle these changes, regularly monitor and adapt your scraping code. Here are a few strategies to handle website structure changes:

  • Maintain a robust scraping framework: Build a modular and flexible scraping framework that can easily accommodate changes. Separate your scraping logic from the website-specific code, making it easier to update when needed.
  • Monitor for changes: Regularly check Walmart's website for any noticeable changes in the HTML structure or CSS classes used for product information. This can be done manually or by implementing automated monitoring scripts that alert you to any modifications.
  • Use CSS selectors and XPath: Instead of relying on specific HTML element IDs or classes, utilize CSS selectors or XPath expressions to extract data. These methods are more resilient to changes in the underlying structure of the website.
  • Handle errors gracefully: Implement error handling mechanisms to handle unexpected changes in the website's structure. This could include fallback options, retry logic, or error logging to help identify and address any issues that arise.
  • Stay updated with APIs: If available, consider using Walmart's official APIs for accessing product data. APIs provide a more stable and structured way to retrieve information, as they are specifically designed to be used by developers and are less prone to frequent changes.

Remember, scraping websites is an evolving process, and you need to adapt to changes over time. Regular maintenance and monitoring will help ensure your scraping code remains effective and accurate.


Conclusion

Scraping Walmart.com can provide valuable data for various purposes, but it's important to be mindful of legal considerations and Walmart's policies. By following the steps outlined in this guide and staying vigilant for changes, you can successfully scrape Walmart.com and retrieve the desired product data for your projects.
