How to Scrape StockX e-commerce Data with Python

Scraping e-commerce websites like StockX can provide valuable insights into product availability, pricing trends, and more. In this comprehensive guide, we'll walk through how to scrape StockX product data using Python.

Overview of StockX

For the uninitiated, StockX is an online marketplace that focuses on reselling sneakers, streetwear, handbags, and other high-demand products. StockX has often been described as the "stock market of things" because it facilitates real-time bidding on products just like a stock exchange.

Some key things to know about StockX:

Products are typically limited editions from brands like Nike, Adidas, Supreme, Louis Vuitton etc.
Instead of fixed prices, products are bid on just like stocks. Prices fluctuate based on supply and demand.
StockX authenticates and verifies every item on their marketplace.
In addition to bidding, buyers can also purchase items outright at the lowest "Ask" price.

Given the hype around many of the products on StockX, having up-to-date pricing and availability data can give resellers a significant edge. This makes StockX an ideal target for web scraping.

Is it Legal to Scrape StockX?

Before we dig into the technical details, let's briefly discuss the legality of scraping StockX.

In general, scraping publicly accessible pages is widely treated as permissible as long as you do not:

Try to circumvent blocks or access private data.
Cause damage, for example by overloading servers.
Violate the website's Terms of Service.

At the time of writing, StockX's Terms of Use did not appear to explicitly prohibit web scraping or data collection, but terms change, so check the current version before you start. As long as we scrape responsibly and do not republish large portions of StockX's content, collecting their marketplace data stays within these limits.

Scraping StockX Product Pages

With the legality squared away, let's look at how to scrape data from StockX product pages.

Here is an example product page for a Nike x Off-White Hoodie:

This page contains all the key data points we would want to extract:

Product title, images, description etc.
Real-time bid and ask prices.
Historical price chart.
Sales volume over time.
Available sizes and colorways.

Simply scraping the HTML of this page would give us access to some of this data. However, StockX actually loads much of the structured data (prices, charts etc.) from a JSON object contained in a <script> tag.

Here's what that hidden JSON data looks like for the Nike hoodie example:

{
  "product": {
    "id": "0183b5c5-c4bd-47a7-9461-adc72b5acb39",
    "title": "Nike Sportswear Club Fleece Hoodie (Off-White / Black)",
    // ...other fields omitted for brevity

    "market": { 
      "lowestAsk": 289,
      "highestBid": 265,

      "volatility": {
        "week": 15.27
      },

      "charts": [
        {
          "name": "SALES",
          "series": [
            {
              "name": "DEADSTOCK",
              "data": [
                // historical sales data
              ]
            }  
          ]
        },
        {
          "name": "PRICE premium",
          "series": [
            {
              "name": "USED",
              "data": [
                // historical price data
              ]
            },
            {  
              "name": "NEW NO BOX",
              "data": []
            },
            {
              "name": "NEW NO LACES",
              "data": [] 
            }
          ]
        }
      ] 
    },

    "variants": [
      {
        "traits": { "size": "S" },
        "market": {
          // bid/ask data for Small size 
        }
      },
      {
        "traits": { "size": "M" },
        // etc...
      }
    ]
  }
}

As you can see, this object contains a wealth of structured data on pricing, volatility, sales history, price charts, and product variants. Scraping this would be much more useful than just extracting text from the HTML.

Here is some sample code to extract this hidden JSON data from a page using the parsel library:

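import json
from parsel import Selector

def extract_json_data(html):
    """Return the parsed JSON from the page's __NEXT_DATA__ script tag, or None if absent."""
    sel = Selector(text=html)

    # ::text selects the raw contents of the script tag
    json_data = sel.css('script#__NEXT_DATA__::text').get()
    if json_data:
        return json.loads(json_data)

    return None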

To fetch and parse a page:

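import requests
# Assumes the function above was saved in a file named extract_json_data.py
from extract_json_data import extract_json_data

url = 'https://stockx.com/nike-sportswear-club-fleece-hoodie-off-white-black'

response = requests.get(url, timeout=10)
response.raise_for_status()
data = extract_json_data(response.text)

if data is None:
    raise ValueError('No __NEXT_DATA__ JSON found; the request may have been blocked')

print(data['product']['title'])
# Prints: Nike Sportswear Club Fleece Hoodie (Off-White / Black)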

And that's the basic technique for scraping key data from StockX product pages! The hidden JSON object contains a wealth of data across multiple nested levels, so you would need to traverse it to extract the specific fields you want.
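To make the traversal step concrete, here is a minimal sketch that flattens the nested object into a small summary. It runs against a trimmed sample shaped like the JSON shown earlier. The sample values and the assumption that each variant's `market` object uses the same `lowestAsk` and `highestBid` keys as the product-level one are illustrative, and `summarize_product` is a hypothetical helper name.

```python
import json

# Trimmed sample shaped like the hidden JSON above; per-variant values are made up
SAMPLE = json.loads("""
{
  "product": {
    "title": "Nike Sportswear Club Fleece Hoodie (Off-White / Black)",
    "market": {"lowestAsk": 289, "highestBid": 265},
    "variants": [
      {"traits": {"size": "S"}, "market": {"lowestAsk": 295, "highestBid": 260}},
      {"traits": {"size": "M"}}
    ]
  }
}
""")


def summarize_product(data):
    """Flatten the nested product object into the handful of fields we care about."""
    product = data.get("product", {})
    market = product.get("market", {})
    sizes = []
    for variant in product.get("variants", []):
        # Assumption: variant-level market data mirrors the product-level keys
        variant_market = variant.get("market", {})
        sizes.append({
            "size": variant.get("traits", {}).get("size"),
            "lowest_ask": variant_market.get("lowestAsk"),
            "highest_bid": variant_market.get("highestBid"),
        })
    return {
        "title": product.get("title"),
        "lowest_ask": market.get("lowestAsk"),
        "highest_bid": market.get("highestBid"),
        "sizes": sizes,
    }


summary = summarize_product(SAMPLE)
print(summary["title"], summary["lowest_ask"], summary["highest_bid"])
```

Using `.get()` with defaults at every level means a missing branch yields `None` instead of raising `KeyError`, which is useful because not every product or size carries every field.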

Some key tips when scraping product pages:

Use a header with a valid User-Agent to mimic a real browser. This helps avoid blocks.
Set the Accept-Encoding header so the response is compressed (gzip), saving bandwidth.
Add timeouts, retries and other robustness best practices. StockX is very popular so expect failures.
Consider using asynchronous requests if scraping multiple pages.
Be wary of scraping too aggressively as that risks getting blocked.
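The header and retry tips above can be sketched with the standard library alone. This is a rough illustration and not a hardened client: `DEFAULT_HEADERS` and `fetch_with_retries` are hypothetical names, the User-Agent string is only an example value, and the `fetch` argument stands in for whatever HTTP call you use.

```python
import time

# Hypothetical default headers; the User-Agent string is only an example value
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is exhausted.

    After each failed attempt except the last, waits base_delay * 2**attempt
    seconds (exponential backoff). Re-raises the last error if all attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as error:  # in real code, catch only network-related errors
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With requests, the `fetch` argument could be a small function that calls `requests.get(url, headers=DEFAULT_HEADERS, timeout=10)` and then `raise_for_status()` on the response, so that HTTP error codes also trigger a retry.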

Finding Products to Scrape

Now that we can scrape individual products, we need a way to discover products to scrape.

StockX has two main entry points for this:

Sitemaps – StockX provides XML sitemaps listing all products.

Search Pages – We can scrape search results for products matching keywords, brands etc.

Let's look at both approaches:

Scraping Sitemaps

StockX's sitemap index is located at https://stockx.com/sitemap_index.xml. This aggregates sitemaps for various categories and locales.

Here is how we could scrape and parse it in Python using the Scrapy library:

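import scrapy

# Namespace defined by the sitemap protocol; the XPath prefix "s" maps to it
SITEMAP_NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

class StockXSitemapSpider(scrapy.Spider):
    name = 'stockx_sitemap'

    def start_requests(self):
        yield scrapy.Request('https://stockx.com/sitemap_index.xml')

    def parse(self, response):
        # The index lists child sitemaps; follow each one
        for sitemap_url in response.xpath('//s:sitemap/s:loc/text()', namespaces=SITEMAP_NS).getall():
            yield response.follow(sitemap_url.strip(), callback=self.parse_sitemap)

    def parse_sitemap(self, response):
        # Each child sitemap lists page URLs; emit them as plain strings
        for item_url in response.xpath('//s:url/s:loc/text()', namespaces=SITEMAP_NS).getall():
            yield {'url': item_url.strip()}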

This follows every sitemap listed in the index and extracts the product URLs from each one, which we could feed into our product scraper.

The sitemaps provide a comprehensive list of all StockX products so this approach is great for broad crawls. However, it does not allow filtering or querying for specific products. For that, we'll need to scrape StockX search.
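If you would rather not pull in Scrapy for this step, the same extraction can be sketched with the standard library's `xml.etree.ElementTree`. The sample XML below is hand-written for illustration and its URLs are placeholders; `extract_locs` is a hypothetical helper name. The namespace URI is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol
SITEMAP_NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample index; the URLs are placeholders, not real StockX sitemaps
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products-2.xml</loc></sitemap>
</sitemapindex>"""


def extract_locs(xml_text):
    """Return every <loc> value from a sitemap index or a regular sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall(".//s:loc", SITEMAP_NS)
        if loc.text
    ]


print(extract_locs(SAMPLE_INDEX))
```

Because both `<sitemapindex>` and `<urlset>` documents store their URLs in `<loc>` elements, the same function works for the index and for each child sitemap it points to.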

Scraping Search Pages

StockX search allows looking up products by keyword, category filters (sneakers, streetwear etc.), brands, gender and more.

For example, here is a search for popular red sneakers:

These search result pages contain a subset of the product data – enough to determine if we want to scrape the full product page.

Once again, StockX loads this search data from a JSON object in a <script> tag. It looks like this:

{
  "results": {
    "products": [
      {
        "urlKey": "air-jordan-1-retro-high-dark-mocha",
        "title": "Air Jordan 1 Retro High Dark Mocha",
        "image": "https://image.goat.com/..." 
        //...
      },
      {
        "urlKey": "air-jordan-1-zoom-cmft-red-chili",
       //... 
      }
    ],
    "pageInfo": {
      "totalProducts": 223,
      "productsPerPage": 55,
      "page": 1, 
      "totalPages": 5      
    }
  }
}

It contains preview data for each product, along with pagination details that allow scraping additional pages.

Here is some sample code to scrape search results:

import concurrent.futures
import json
import requests
from parsel import Selector

def extract_search_data(html):
    sel = Selector(text=html)
    data = sel.css('script[data-zrr-shared-data-key]::text').get()
    if data is None:
        raise ValueError('Search JSON not found in page')

    # Extract JSON
    return json.loads(data)

def scrape_search_results(search_url, max_pages=10):
    # search_url must already contain a query string, since "&page=N" is appended

    # Fetch first page
    response = requests.get(search_url, timeout=10)
    search_data = extract_search_data(response.text)

    # Extract data from first page
    products = list(search_data['results']['products'])

    total_pages = min(max_pages, search_data['results']['pageInfo']['totalPages'])

    # Fetch remaining pages (2..total_pages) concurrently with a thread pool
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(requests.get, f"{search_url}&page={page}", timeout=10)
            for page in range(2, total_pages + 1)
        ]

        for future in concurrent.futures.as_completed(futures):
            data = extract_search_data(future.result().text)
            products.extend(data['results']['products'])

    return products  # List of all products from all pages

This lets us query for products that match specific search criteria, handle pagination, and fetch the remaining pages concurrently. From here, we could feed the results into our product scraper to get full data.
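One fragile spot in the pagination step is appending `&page=N` by hand, which produces a broken URL when the search URL has no query string yet. A small sketch using `urllib.parse` avoids that. The `page` parameter name comes from the code above; the helper names `with_page` and `page_urls` and the example search URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(search_url, page):
    """Return search_url with its page query parameter set to the given page."""
    parts = urlsplit(search_url)
    # Keep existing parameters, dropping any previous page value
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(search_url, total_pages, max_pages=10):
    """Build URLs for pages 2..N, where N is capped by both total_pages and max_pages."""
    last_page = min(total_pages, max_pages)
    return [with_page(search_url, page) for page in range(2, last_page + 1)]


# Hypothetical search URL, used only to show the output shape
for url in page_urls("https://stockx.com/search?s=red+sneakers", total_pages=5, max_pages=3):
    print(url)
```

Capping with `min(total_pages, max_pages)` keeps the scraper from requesting pages beyond what `pageInfo` reports.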

Some tips for scraping StockX search:

Monitor the totalPages field to avoid hitting pages that don't exist.
Add throttling and retries as search can be flaky.
Authenticate the request if possible to access more results per page.
Prefer asynchronous scraping when fetching multiple pages.
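For the throttling tip, one simple option is to enforce a minimum gap between consecutive requests. The sketch below is a single-threaded illustration; `Throttle` is a hypothetical class name, and the injectable `clock` and `sleep` arguments exist only to make it easy to test without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval, in seconds, between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last call: pause for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call throttle.wait() right before each request
throttle = Throttle(min_interval=0.01)
for page in range(3):
    throttle.wait()
    # the page fetch would go here
```

This version is not thread-safe, so when fetching pages from a thread pool you would need to guard `wait()` with a lock or throttle at the point where tasks are submitted.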

So in summary, sitemaps allow comprehensive scraping while search provides more control over the products scraped. Combine both approaches for best results!

Scaling Up with a Scraping Service

While the examples above will work fine for small scrapers, large-scale scraping brings additional challenges:

Avoiding blocks triggered by aggressive scraping.
Managing proxies and residential IPs to scrape from different locations.
Implementing caching, throttling, retries and async handling.
Rendering JavaScript and handling browser challenges.

Rather than building all this robustness yourself, services like Scrapfly provide a full web scraping API to handle these complexities.

Here is an example of how the product scraper could be ported to use Scrapfly:

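# Based on the Scrapfly Python SDK; check its documentation for current options
from scrapfly import ScrapflyClient, ScrapeConfig
from extract_json_data import extract_json_data

client = ScrapflyClient(key='XXX')

def scrape_product(url):
    # asp=True enables Scrapfly's anti-scraping-protection bypass
    result = client.scrape(ScrapeConfig(url=url, asp=True))

    # Call our existing parser
    data = extract_json_data(result.content)

    return data['product']

# Usage:
product_data = scrape_product('https://stockx.com/...')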

By handling all the scraping infrastructure and mitigating blocks, Scrapfly allows you to focus on the data extraction portion. Some other benefits include:

Millions of residential proxies to appear as real users.
Browser automation for dynamic content.
Intelligent caching and retries built-in.
Geographic controls for regional data.

Scrapfly also offers helpful additions like screenshots, HTML download, and GDPR compliance, making it a robust scraping platform.

Conclusion

In this post, we walked through various techniques for scraping data from StockX using Python:

Product pages can be parsed for their rich hidden JSON data via <script> tags. This provides detailed information on variants, sales, prices, volatility and more.
Sitemaps give a complete index of all products on StockX which we can crawl.
Search pages allow looking up products matching specific keywords, brands etc. Convenient for focused scraping.
For large-scale scraping, services like Scrapfly handle infrastructure and scraping challenges, allowing you to focus on data extraction.

With some clever parsing and asynchronous workflows, you can build a comprehensive StockX data scraper. The marketplace data provides valuable insights for resellers, retailers and more.

Hopefully this post provides a blueprint for rolling your own StockX scraper or leveraging services to accelerate your scraping project. Let me know if you have any other questions!
