
How to Scrape Goat for Valuable Fashion Data Using Python

Hey there! Goat has exploded as one of the hottest online marketplaces for reselling high-end fashion goods. In this post, I'll show you how to leverage Goat's data at scale by building a web scraper with Python.

Trust me – with the strategies I'll share, you'll be able to extract huge datasets covering Goat's entire product catalog.

This data can provide a goldmine of insights to boost your fashion business!

Let's get scraping!

Why You Should Definitely Scrape Goat

Here are some killer reasons for scraping data from Goat that I've picked up over my years as a data pro:

Price Monitoring – Track prices over time for specific items or brands. You can optimize your pricing strategy and determine ideal margins.

For instance, StockX reports that the average resale price for the coveted Air Jordan 1 Retro High Dior is $7500 in 2022, up 46% from 2020!

Demand Forecasting – Analyze historical product velocity on Goat to predict future demand by style. Super valuable for planning inventory buys.

Goat's data reveals the adidas Yeezy Boost 350 v2 is currently its #1 fastest-selling sneaker style, moving in just 2.9 days on average!

Competitive Intelligence – Monitor competitors' inventory on Goat across various products. Use this intel to benchmark your assortment and spot white space opportunities.

Market Analysis – Identify resale market potential for different product segments by analyzing historical sales and pricing data. Uncover areas with the juiciest margins.

According to Cowen Equity Research, the online resale market for sneakers alone is projected to hit $30 billion by 2030!

Product Development – Discover emerging trends in styles, materials, collabs and more by scraping Goat's vast catalog. Apply these insights to your own designs.

Inventory Alerts – Get notified as soon as rare grails or items on your wishlist become available on Goat. Crucial if you're looking to acquire limited goods.

Data Science – Construct massive datasets for training machine learning models – from demand forecasting to image classifiers for your online catalog.

Whether you're a reseller, retailer or brand, Goat's data can give you an edge in this hyper-competitive resale space worth billions.

Alright, now that you know why – let's get into how to scrape Goat with Python!

Step 1 – Getting Set Up for Scraping Goat

Before we start coding, we need to ensure we have the right tools for the job.

Here are the main prerequisites:

Python 3 – We'll use Python 3, ideally the latest stable version, which is 3.10 as of this writing.

requests Module – This brilliant module allows us to send HTTP requests in Python to download web pages.

lxml Module – For fast and efficient parsing of HTML pages so we can extract the data we want.

csv Module – To save our scraped dataset as a CSV file for easy analysis later on.

We can install the third-party modules using pip, the package manager for Python (csv ships with Python's standard library, so it needs no installation):

pip install requests lxml

I also recommend setting up a virtual environment for your scraper to avoid conflicts with system packages.
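A minimal setup with Python's built-in venv module might look like this (the directory name is arbitrary):

```shell
# Create an isolated environment for the scraper
python3 -m venv goat-scraper

# Activate it (on Windows: goat-scraper\Scripts\activate)
. goat-scraper/bin/activate
```

Then run the pip install command above inside the activated environment so the packages stay isolated from your system Python.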

With the basics covered, let's start scraping!

Scraping a Single Product Page on Goat

We'll first focus on scraping data from a single product page on Goat.

Let's use this Air Jordan 1 Retro High as an example:

Viewing the page source, you can see the product data we want lives in HTML elements like:

<h1 itemprop="name">Air Jordan 1 Retro High OG Bio Hack</h1>

<div itemprop="description">
  Jordan Brand officially unveiled its newest women's exclusive Air Jordan 1 High OG style, the "Bio Hack." The eye-catching color scheme features a mix of pink, purple, green and black shades throughout the leather upper, borrowing aesthetic cues from vintage video games.
</div>

To extract it, we'll:

  1. Download the page HTML
  2. Parse it to locate data elements
  3. Extract the element text and attributes

Let's see it in a scraper:

import requests
from lxml import html

product_url = ''  # the product page URL from above

page = requests.get(product_url)
tree = html.fromstring(page.content)

title = tree.xpath('//h1[@itemprop="name"]/text()')[0]
description = tree.xpath('//div[@itemprop="description"]/text()')[0]

print(title)
print(description)


Here we:

  • Use requests to download the page content
  • Pass it to lxml to parse as structured HTML
  • Query for elements using XPath syntax
  • Index into the results to extract the text

This prints:

Air Jordan 1 Retro High OG Bio Hack

Jordan Brand officially unveiled its newest women's exclusive Air Jordan 1 High OG style, the "Bio Hack." The eye-catching color scheme features a mix of pink, purple, green and black shades throughout the leather upper, borrowing aesthetic cues from vintage video games.

Sweet! With a few simple lines of Python, we extracted the core product fields.

Some other data points you can grab:

  • Price – //meta[@itemprop="price"]/@content
  • Brand – //meta[@itemprop="brand"]/@content
  • Image URL – //meta[@property="og:image"]/@content
  • SKU – //span[@itemprop="sku"]/text()

You can retrieve dozens of elements on the page in this way.
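To make this concrete, here is the same XPath approach run against a small inline snippet that mimics the metadata layout above. The snippet and its values are made up for illustration; Goat's live markup may differ:

```python
from lxml import html

# Simplified stand-in for a product page's metadata (values are illustrative)
sample = '''
<html><head>
  <meta itemprop="price" content="170.00">
  <meta itemprop="brand" content="Air Jordan">
  <meta property="og:image" content="https://example.com/shoe.png">
</head><body>
  <span itemprop="sku">CD0461-601</span>
</body></html>
'''

tree = html.fromstring(sample)

# Attribute values come back from @-style XPath queries
price = tree.xpath('//meta[@itemprop="price"]/@content')[0]
brand = tree.xpath('//meta[@itemprop="brand"]/@content')[0]
image = tree.xpath('//meta[@property="og:image"]/@content')[0]

# Element text comes back from text() queries
sku = tree.xpath('//span[@itemprop="sku"]/text()')[0]

print(price, brand, sku)  # 170.00 Air Jordan CD0461-601
```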

Now let's look at some key considerations when scraping product pages:

Handling Dynamic Content – If data is loaded dynamically via JavaScript, you won't see it in page source. Consider using Selenium or tools like ScrapeOps to render pages.

Parsing Dates – Use Python's datetime module to parse dates from strings like release dates.

Extracting Text – When grabbing text, call .strip() to remove extra whitespace and newlines.

CDATA Sections – Use lxml's tostring() function to extract CDATA text segments.

Attribute vs Text – Decide if you want to extract element attribute values or text for your use case.

Handling Errors – Wrap extraction in try/except blocks in case elements are missing on some pages.

With these tips and lxml's powerful XPath engine, you can robustly parse even complex product pages.
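Several of those tips can be folded into one small helper. The sketch below strips whitespace, returns None when an element is missing, and parses a date string with datetime (the markup and date format shown are assumptions for illustration):

```python
from datetime import datetime
from lxml import html

def first_or_none(tree, expr):
  """Return the first XPath match as stripped text, or None if missing."""
  matches = tree.xpath(expr)
  return matches[0].strip() if matches else None

# Tiny inline snippet for demonstration
tree = html.fromstring('<div><span class="release">  2020-09-03\n</span></div>')

raw = first_or_none(tree, '//span[@class="release"]/text()')       # '2020-09-03'
missing = first_or_none(tree, '//span[@class="sold-out"]/text()')  # None

release_date = datetime.strptime(raw, '%Y-%m-%d').date()
print(release_date)  # 2020-09-03
```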

Next let's level up…

Scraping Search Results and Pagination

Now that we can scrape individual products, it's time to fetch data in bulk!

We'll build a scraper that:

  1. Sends search queries to Goat's website
  2. Extracts all products across paginated results
  3. Handles pagination as it loops through the pages
  4. Stores the scraped data to CSV

This will allow us to extract hundreds or even thousands of products based on search filters.

Here's how it works:

import csv
from urllib.parse import urlencode

import requests
from lxml import html

BASE_URL = ''  # Goat's search results URL goes here


def scrape_products(query, pages=5):

  with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    writer.writerow(['title', 'url', 'price'])

    page = 1

    while page <= pages:

      params = {
        'query': query,
        'page': page,
      }

      q = urlencode(params)
      url = f'{BASE_URL}?{q}'

      print(f'Scraping page {page}')

      r = requests.get(url)
      tree = html.fromstring(r.content)

      products = tree.xpath('//a[contains(@class, "product-link")]')

      for product in products:
        title = product.xpath('.//div[@class="product-name"]/text()')[0]
        url = product.xpath('./@href')[0]
        try:
          price = product.xpath('.//div[@class="product-price"]/text()')[0].replace('$', '')
        except IndexError:
          price = None

        writer.writerow([title, url, price])

      page += 1

scrape_products('jordan', pages=2)

Here's how it works step-by-step:

  • We define a scrape_products() function that accepts the search query and max pages to scrape

  • Inside, we open a CSV file for writing and initialize the page counter

  • In a loop, we construct the search URL with query and page params

  • We GET the page HTML and use XPath to find all products

  • For each product, we extract key fields like title, URL and price

  • These are written to the CSV as rows

  • Finally, we increment the page counter to advance to the next page

We use a try/except when extracting price to handle any errors.

The key is the page parameter that allows us to paginate until all results are scraped!

By tweaking the query, you can build product datasets around any facet – brand, gender, type, release date and more!
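Building a filtered search URL is just a matter of adding keys to the params dict before urlencode. The filter key names here are hypothetical; inspect the query strings Goat's site actually produces for the real names:

```python
from urllib.parse import urlencode

# Hypothetical filter keys -- check Goat's real search URLs for the actual names
params = {'query': 'yeezy', 'gender': 'men', 'page': 3}

print(urlencode(params))  # query=yeezy&gender=men&page=3
```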

Saving Images and Media Files

In addition to product data, you may also want to download product images from Goat's CDN.

This allows you to maintain a media library synced with your scraped catalogs.

To download images, grab the product image URLs, then:

import requests

product_img_url = ''  # the product image URL

# Extract the filename from the URL
filename = product_img_url.split('/')[-1]

print(f'Downloading {filename}')

r = requests.get(product_img_url, stream=True)

with open(filename, 'wb') as f:
  for chunk in r.iter_content(1024):
    f.write(chunk)
We can use a similar approach to download any media assets associated with products:

  • Video trailers
  • 3D models
  • Product manuals / specs

Having access to these digital assets locally can be invaluable!

Now let's switch gears and talk about how to handle blocks…

Rotating Proxies to Avoid Getting Blocked

A common pitfall when scraping platforms like Goat at scale is getting blocked.

Goat employs some sophisticated bot protection and may flag excessive scraping traffic as abusive.

Here are some tips to scrape under the radar:

Use Residential Proxies – Datacenter IPs are easy to detect. Residential proxies from providers like GeoSurf appear as normal user traffic.

Rotate Different Proxies – Change proxies every few requests instead of reusing the same IP. Proxy rotation is easy to implement in Python.
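As a sketch of that rotation, the snippet below cycles through a pool of placeholder endpoints (the addresses are made up; substitute your provider's) and builds the proxies dict that requests accepts:

```python
import itertools

# Placeholder proxy endpoints -- substitute your provider's real addresses
PROXIES = [
  'http://user:pass@proxy1.example.com:8000',
  'http://user:pass@proxy2.example.com:8000',
  'http://user:pass@proxy3.example.com:8000',
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy_config():
  """Return a requests-style proxies dict, advancing through the pool."""
  proxy = next(proxy_pool)
  return {'http': proxy, 'https': proxy}

# Usage: requests.get(url, proxies=next_proxy_config(), timeout=10)
```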

Limit Request Rate – Add delays of 5-10 seconds between requests. Setting a reasonable cadence helps distribute load.

Randomize Delays – Vary wait times between queries instead of using fixed intervals. This mimics human patterns.

from time import sleep
from random import randint

sleep(randint(5, 10))  # random delay between 5-10 seconds

Watch for 429 or 503 Errors – These status codes indicate you are temporarily blocked. Pausing when you encounter them allows IP rotation.
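A minimal retry helper along those lines might look like this (the retry count and delays are arbitrary starting points, not Goat-specific values):

```python
import time
import requests

def get_with_backoff(url, retries=3, base_delay=30):
  """Retry a GET, pausing progressively longer on 429/503 responses."""
  for attempt in range(retries):
    resp = requests.get(url, timeout=10)
    if resp.status_code not in (429, 503):
      return resp
    wait = base_delay * (attempt + 1)
    print(f'Got {resp.status_code}, pausing {wait}s before retrying')
    time.sleep(wait)
  return resp
```

Pausing like this also gives a rotating proxy pool time to move you onto a fresh IP.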

Use Proxy Manager Services – Tools like BrightData handle proxy cycling and provide clean residential IPs out of the box.

With the right proxy setup, you can extract huge datasets from Goat successfully!

Other Handy Scraping Best Practices

Here are some additional tips from my scraper toolkit:

  • Review Goat's robots.txt file to identify any restricted scraping activities

  • Check for updates to page structures before running scrapers, as sites change often

  • Save scraped data incrementally rather than re-requesting, but check for updates too

  • Scrape data selectively rather than downloading unnecessary information

  • Use sitemaps and internal search to find more product pages to index if desired

  • Crawl politely during off-peak hours to minimize server load

  • Consider using headless browsers like Selenium if pages are highly dynamic

  • Containerize scrapers in Docker for easier scaling, deployment and team usage

These practices ensure your Goat scraping yields maximum value with minimal fuss!
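As one concrete example of the first tip, Python's built-in urllib.robotparser can evaluate paths against robots.txt rules. The rules below are illustrative, not Goat's actual file:

```python
from urllib import robotparser

# Illustrative robots.txt body -- fetch and parse the real one in practice
rules = '''\
User-agent: *
Disallow: /account/
'''

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('MyScraper', 'https://www.goat.com/sneakers'))          # True
print(rp.can_fetch('MyScraper', 'https://www.goat.com/account/settings'))  # False
```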

Wrapping Up

Phew, that was quite the journey!

In this post, you learned tons of techniques to build robust Goat web scrapers in Python:

  • Extracting product data from individual pages with requests and lxml

  • Scaling up to search results scraping using pagination

  • Downloading images and media assets

  • Avoiding blocks with residential proxies and other best practices

The knowledge you've gained can help you tap into the goldmine of product insights within Goat's catalogs.

You're now equipped to enrich your fashion merchant stack and unlock serious value from Goat's platform.

I enjoyed having you along for the ride! Scraping can be challenging, but also incredibly rewarding when you see millions of rows of parsed data come pouring in.

Hopefully this guide brought you closer to meeting your fashion data goals. Feel free to reach out if you need any help or have questions along the journey.

Happy scraping!
