
The Ultimate Guide to Scraping Amazon with Python (2023)

Scraping data from Amazon, one of the largest e-commerce platforms, can provide valuable insights. Whether you're looking to compare prices, analyze customer reviews, or track product availability, web scraping can be a useful tool. This guide provides detailed instructions and Python code examples for scraping Amazon.

Web Scraping Ethics

Before we dive in, it's important to note that scraping should be done responsibly to minimize demand on the website's servers. One way to do this is by focusing on the Amazon Search page, where you can extract basic product data such as name, price, image URL, rating, and number of reviews. This approach will significantly reduce the number of requests you need to make to Amazon, making your scraper faster and cheaper to run.

Python Web Scraping Libraries

Python offers a plethora of libraries for web scraping, and choosing the right one depends on your specific needs and level of comfort with Python. Here are some of the most commonly used libraries:

  1. Requests: A popular Python library for making HTTP requests. It abstracts the complexities of making requests behind a simple API, allowing you to send HTTP/1.1 requests with various methods like GET, POST, and others.
  2. BeautifulSoup: Used for parsing HTML and XML documents and extracting data. It creates a parse tree from the page source code that can be used to extract data in a hierarchical and more readable manner.
  3. Scrapy: An open-source Python framework designed specifically for web scraping. It's a versatile framework that can handle a wide range of scraping tasks and is capable of scraping large data sets.
  4. Selenium: A powerful tool for programmatically controlling a web browser. It's very handy for web scraping because it can handle all types of website content, including JavaScript-generated content. It also allows for user interactions like clicking and scrolling.
  5. Parsel: Used for extracting data from HTML and XML using XPath and CSS selectors. It's built on top of the lxml library, making it flexible and easy to use. Requests and Parsel are the pairing used throughout this guide, as shown in the short sketch after this list.
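
If you're new to this stack, here's a minimal sketch of how Requests and Parsel fit together (the URL and selectors are placeholders; swap in the page and elements you actually want to scrape):

import requests
from parsel import Selector

# Fetch a page with Requests, then hand the raw HTML to Parsel for parsing.
response = requests.get('https://example.com')
sel = Selector(text=response.text)

# CSS and XPath selectors can be mixed freely on the same Selector object.
title = sel.css('h1::text').get()
links = sel.xpath('//a/@href').getall()
print(title, links)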

Scraping Product Data from Amazon Search Pages

The first step in scraping Amazon is to extract data from the search pages. The Python Requests and Parsel libraries can be used for this task. Here's an example script that scrapes product data from all available Amazon search pages for a given keyword (e.g., 'iPad'):

import requests   
from parsel import Selector   
from urllib.parse import urljoin   

keyword_list = ['ipad']  
product_overview_data = []  

for keyword in keyword_list:  
    url_list = [f'https://www.amazon.com/s?k={keyword}&page=1']  
    for url in url_list:  
        try:  
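            # Note: Amazon often blocks requests that use the default
            # Requests User-Agent, so in practice you may need to pass
            # browser-like headers via requests.get(url, headers=...).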
            response = requests.get(url)  

            if response.status_code == 200:  
                sel = Selector(text=response.text)  

                # Extract Product Page  
                search_products = sel.css("div.s-result-item[data-component-type=s-search-result]")  
                for product in search_products:  
                    relative_url = product.css("h2>a::attr(href)").get()
                    if not relative_url:
                        continue  # skip results without a product link
                    asin = relative_url.split('/')[3] if len(relative_url.split('/')) >= 4 else None
                    product_url = urljoin('https://www.amazon.com/', relative_url).split("?")[0]
                    product_overview_data.append(
                        {
                            "keyword": keyword,
                            "asin": asin,
                            "url": product_url,
                            "ad": "/slredirect/" in product_url,
                            "title": product.css("h2>a>span::text").get(),
                            "price": product.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
                            "real_price": product.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
                            "rating": product.css("span[aria-label~=stars]::attr(aria-label)").get(),
                            "rating_count": product.css("span[aria-label~=stars] + span::attr(aria-label)").get(),
                            "thumbnail_url": product.xpath(".//img[has-class('s-image')]/@src").get(),
                        }
                        }  
                    )  

                # Get All Pages  
                if "&page=1" in url:  
                    available_pages = sel.xpath(  
                        '//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'  
                    ).getall()  

                    for page in available_pages:  
                        search_url_paginated = f'https://www.amazon.com/s?k={keyword}&page={page}'  
                        url_list.append(search_url_paginated)  

        except Exception as e:  
            print("Error", e)

This script will collect an array of product data, each represented as a dictionary with the following keys:

  • keyword: The search keyword used (e.g., 'iPad')
  • asin: The unique Amazon Standard Identification Number of the product
  • url: The URL of the product
  • ad: A Boolean indicating whether the product is an ad
  • title: The title of the product
  • price: The price of the product
  • real_price: The original price of the product before any discounts
  • rating: The rating of the product
  • rating_count: The number of ratings the product has received
  • thumbnail_url: The URL of the product's thumbnail image

The script also identifies all available pages for the search keyword and appends them to the url_list for scraping.
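
Once the loop finishes, you may want to persist the results. Here's a minimal sketch that writes the collected data to a JSON file (the filename is arbitrary):

import json

# Save the scraped search results so they can be analyzed later.
with open('amazon_search_results.json', 'w') as f:
    json.dump(product_overview_data, f, indent=2)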

Scraping Product Data from Amazon Product Pages

Once you have a list of Amazon product URLs, you can scrape all the product data from each individual Amazon product page. Here's an example script using the Python Requests and Parsel libraries to do this:

import json
import re

import requests
from parsel import Selector

product_urls = [  
    'https://www.amazon.com/2021-Apple-10-2-inch-iPad-Wi-Fi/dp/B09G9FPHY6/ref=sr_1_1',  
]  

product_data_list = []  

for product_url in product_urls:  
    try:  
        response = requests.get(product_url)  

        if response.status_code == 200:  
            sel = Selector(text=response.text)  
            image_matches = re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", response.text)
            image_data = json.loads(image_matches[0]) if image_matches else []
            variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s*({.+?}),\n', response.text)
            feature_bullets = [bullet.strip() for bullet in sel.css("#feature-bullets li ::text").getall()]  
            price = sel.css('.a-price span[aria-hidden="true"] ::text').get("")  
            if not price:  
                price = sel.css('.a-price .a-offscreen ::text').get("")  
            product_data_list.append({  
                "name": sel.css("#productTitle::text").get("").strip(),  
                "price": price,  
                "stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),  
                "rating_count": sel.css("div[data-hook=total-review-count] ::text").get("").strip(),  
                "feature_bullets": feature_bullets,  
                "images": image_data,  
                "variant_data": variant_data,  
            })  
    except Exception as e:
        print("Error", e)

This script collects an array of product data, with each product represented as a dictionary with the following keys:

  • name: The name of the product
  • price: The price of the product
  • stars: The product's star rating
  • rating_count: The total number of reviews the product has received
  • feature_bullets: A list of the product's feature bullets
  • images: A list of high-resolution images of the product
  • variant_data: Data about the product's variants (e.g., different colors or sizes available)

It's worth noting that this script is designed to extract data from product pages with a specific layout. If Amazon changes the layout of its product pages, the script may need to be updated.
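
Note also that the variant_data field is captured as raw JSON strings by the regex. If you want to work with the variants programmatically, a sketch like the following can parse them, assuming Amazon's embedded dimensionValuesDisplayData object remains valid JSON:

import json

for product in product_data_list:
    # Each raw string maps variant ASINs to their display attributes
    # (e.g., color or storage size).
    for raw in product["variant_data"]:
        variants = json.loads(raw)
        for asin, attributes in variants.items():
            print(asin, attributes)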

Additional Considerations

While the above scripts provide a starting point for scraping Amazon, there are additional considerations to take into account for a complete and robust scraping solution:

1. Handling Dynamic Content

Some Amazon product pages use dynamic content, which requires JavaScript to load. If you try to scrape these pages with the methods described above, you might find that some of the data you want is missing. In these cases, you'll need to use a tool that can render JavaScript, like Selenium or Puppeteer.
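
Here's a minimal headless-browser sketch using Selenium (it assumes Selenium 4+ and a chromedriver on your PATH; the product URL is just an example). Once the browser has rendered the page, you can reuse the same Parsel parsing approach:

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Launch headless Chrome and let it execute the page's JavaScript.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.amazon.com/dp/B09G9FPHY6')  # example product URL
    sel = Selector(text=driver.page_source)
    print(sel.css('#productTitle::text').get('').strip())
finally:
    driver.quit()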

2. Respecting Robots.txt

Amazon's robots.txt file tells web crawlers which pages they're allowed to visit. While this file isn't legally binding, ignoring it could lead to your IP address being banned. It's best to respect the robots.txt file to avoid any potential issues.
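
Python's standard library can check robots.txt for you. Here's a small sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

# Download and parse Amazon's robots.txt, then check a URL against it.
rp = RobotFileParser('https://www.amazon.com/robots.txt')
rp.read()

url = 'https://www.amazon.com/s?k=ipad'
if rp.can_fetch('*', url):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt -- skip this URL')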

3. Rate Limiting

Amazon might limit the number of requests you can make in a certain time period. If you make too many requests too quickly, Amazon could ban your IP address, in which case you'd need to route your traffic through proxies. To avoid this, you can use techniques like throttling your requests or rotating IP addresses.
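
Here's a minimal sketch of both techniques: a randomized delay between requests, plus an optional pool of rotating proxies (the proxy entries are placeholders for whatever provider you use):

import random
import time

import requests

# Placeholder proxy pool -- substitute real endpoints from your provider.
PROXIES = [
    # {'http': 'http://user:pass@proxy1:8000', 'https': 'http://user:pass@proxy1:8000'},
]

urls = [f'https://www.amazon.com/s?k=ipad&page={page}' for page in range(1, 4)]
for url in urls:
    proxy = random.choice(PROXIES) if PROXIES else None
    response = requests.get(url, proxies=proxy)
    print(url, response.status_code)
    time.sleep(random.uniform(1, 3))  # wait 1-3 seconds between requests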

4. Ethical Considerations

Web scraping can put significant demand on a website's server, so it's important to scrape responsibly. If you can get the data you need from fewer pages, it's more ethical to do so. For example, if you only need basic product data (name, price, image URL, rating, number of reviews, etc.), you can scrape this data from the search pages instead of the product pages, reducing the number of requests you need to make by roughly a factor of 20, since each search page lists around 20 products.


In conclusion, while web scraping can be a powerful tool for extracting data from websites like Amazon, it's important to use these techniques responsibly and with respect to the website's terms of service and the demands you're placing on its servers.
