
How to Scrape Goat for Valuable Fashion Data Using Python

Hey there! Goat has exploded as one of the hottest online marketplaces for reselling high-end fashion goods. In this post, I'll show you how to leverage Goat's data at scale by building a web scraper with Python.

Trust me – with the strategies I'll share, you'll be able to extract huge datasets covering Goat's entire product catalog.

This data can provide a goldmine of insights to boost your fashion business!

Let's get scraping!

Why You Should Definitely Scrape Goat

Here are some killer reasons for scraping data from Goat that I've picked up over my years as a data pro:

Price Monitoring – Track prices over time for specific items or brands. You can optimize your pricing strategy and determine ideal margins.

For instance, StockX reports that the average resale price for the coveted Air Jordan 1 Retro High Dior is $7500 in 2022, up 46% from 2020!

Demand Forecasting – Analyze historical product velocity on Goat to predict future demand by style. Super valuable for planning inventory buys.

Goat's data reveals the adidas Yeezy Boost 350 v2 is currently its #1 fastest-selling sneaker style, moving in just 2.9 days on average!

Competitive Intelligence – Monitor competitors' inventory on Goat across various products. Use this intel to benchmark your assortment and spot white space opportunities.

Market Analysis – Identify resale market potential for different product segments by analyzing historical sales and pricing data. Uncover areas with the juiciest margins.

According to Cowen Equity Research, the online resale market for sneakers alone is projected to hit $30 billion by 2030!

Product Development – Discover emerging trends in styles, materials, collabs and more by scraping Goat's vast catalog. Apply these insights to your own designs.

Inventory Alerts – Get notified as soon as rare grails or items on your wishlist become available on Goat. Crucial if you're looking to acquire limited goods.

Data Science – Construct massive datasets for training machine learning models – from demand forecasting to image classifiers for your online catalog.

Whether you're a reseller, retailer or brand, Goat's data can give you an edge in this hyper-competitive resale space worth billions.

Alright, now that you know why – let's get into how to scrape Goat with Python!

Step 1 – Getting Set Up for Scraping Goat

Before we start coding, we need to ensure we have the right tools for the job.

Here are the main prerequisites:

Python 3 – We'll use Python 3, ideally the latest stable version, which is 3.10 as of this writing.

requests Module – This brilliant module allows us to send HTTP requests in Python to download web pages.

lxml Module – For fast and efficient parsing of HTML pages so we can extract the data we want.

csv Module – To save our scraped dataset as a CSV file for easy analysis later on.

We can install the third-party modules using pip, the package manager for Python (csv ships with Python's standard library, so it needs no installation):

pip install requests lxml

I also recommend setting up a virtual environment for your scraper to avoid conflicts with system packages.
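A minimal setup with Python's built-in venv module might look like this (the directory name is arbitrary):

```shell
# Create an isolated environment for the scraper
python3 -m venv goat-scraper

# Activate it (on Windows: goat-scraper\Scripts\activate)
. goat-scraper/bin/activate
```

Then run the pip install command above inside the activated environment so the packages stay isolated from your system Python.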

With the basics covered, let's start scraping!

Scraping a Single Product Page on Goat

We'll first focus on scraping data from a single product page on Goat.

Let's use this Air Jordan 1 Retro High as an example:

Viewing the page source, you can see the product data we want lives in HTML elements like:

<h1 itemprop="name">Air Jordan 1 Retro High OG Bio Hack</h1>

<div itemprop="description">
  Jordan Brand officially unveiled its newest women's exclusive Air Jordan 1 High OG style, the "Bio Hack." The eye-catching color scheme features a mix of pink, purple, green and black shades throughout the leather upper, borrowing aesthetic cues from vintage video games.
</div>

To extract it, we'll:

  1. Download the page HTML
  2. Parse it to locate data elements
  3. Extract the element text and attributes

Let's see it in a scraper:

import requests
from lxml import html

product_url = ''  # the product page URL from above

page = requests.get(product_url)
tree = html.fromstring(page.content)

title = tree.xpath('//h1[@itemprop="name"]/text()')[0]
description = tree.xpath('//div[@itemprop="description"]/text()')[0]

print(title)
print(description)


Here we:

  • Use requests to download the page content
  • Pass it to lxml to parse as structured HTML
  • Query for elements using XPath syntax
  • Index into the results to extract the text

This prints:

Air Jordan 1 Retro High OG Bio Hack

Jordan Brand officially unveiled its newest women's exclusive Air Jordan 1 High OG style, the "Bio Hack." The eye-catching color scheme features a mix of pink, purple, green and black shades throughout the leather upper, borrowing aesthetic cues from vintage video games.

Sweet! With a few simple lines of Python, we extracted the core product fields.

Some other data points you can grab:

  • Price – //meta[@itemprop="price"]/@content
  • Brand – //meta[@itemprop="brand"]/@content
  • Image URL – //meta[@property="og:image"]/@content
  • SKU – //span[@itemprop="sku"]/text()

You can retrieve dozens of elements on the page in this way.
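To make this concrete, here is the same XPath approach run against a small inline snippet that mimics the metadata layout above. The snippet and its values are made up for illustration; Goat's live markup may differ:

```python
from lxml import html

# Simplified stand-in for a product page's metadata (values are illustrative)
sample = '''
<html><head>
  <meta itemprop="price" content="170.00">
  <meta itemprop="brand" content="Air Jordan">
  <meta property="og:image" content="https://example.com/shoe.png">
</head><body>
  <span itemprop="sku">CD0461-601</span>
</body></html>
'''

tree = html.fromstring(sample)

# Attribute values come back from @-style XPath queries
price = tree.xpath('//meta[@itemprop="price"]/@content')[0]
brand = tree.xpath('//meta[@itemprop="brand"]/@content')[0]
image = tree.xpath('//meta[@property="og:image"]/@content')[0]

# Element text comes back from text() queries
sku = tree.xpath('//span[@itemprop="sku"]/text()')[0]

print(price, brand, sku)  # 170.00 Air Jordan CD0461-601
```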

Now let's look at some key considerations when scraping product pages:

Handling Dynamic Content – If data is loaded dynamically via JavaScript, you won't see it in page source. Consider using Selenium or tools like ScrapeOps to render pages.

Parsing Dates – Use Python's datetime module to parse dates from strings like release dates.

Extracting Text – When grabbing text, call .strip() to remove extra whitespace and newlines.

CDATA Sections – Use lxml's tostring() function to extract CDATA text segments.

Attribute vs Text – Decide if you want to extract element attribute values or text for your use case.

Handling Errors – Wrap extraction in try/except blocks in case elements are missing on some pages.

With these tips and lxml's powerful XPath engine, you can robustly parse even complex product pages.
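Several of those tips can be folded into one small helper. The sketch below strips whitespace, returns None when an element is missing, and parses a date string with datetime (the markup and date format shown are assumptions for illustration):

```python
from datetime import datetime
from lxml import html

def first_or_none(tree, expr):
  """Return the first XPath match as stripped text, or None if missing."""
  matches = tree.xpath(expr)
  return matches[0].strip() if matches else None

# Tiny inline snippet for demonstration
tree = html.fromstring('<div><span class="release">  2020-09-03\n</span></div>')

raw = first_or_none(tree, '//span[@class="release"]/text()')       # '2020-09-03'
missing = first_or_none(tree, '//span[@class="sold-out"]/text()')  # None

release_date = datetime.strptime(raw, '%Y-%m-%d').date()
print(release_date)  # 2020-09-03
```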

Next let's level up…

Scraping Search Results and Pagination

Now that we can scrape individual products, it's time to fetch data in bulk!

We'll build a scraper that:

  1. Sends search queries to Goat's website
  2. Extracts all products across paginated results
  3. Handles pagination as it loops through the pages
  4. Stores the scraped data to CSV

This will allow us to extract hundreds or even thousands of products based on search filters.

Here's how it works:

import csv
from urllib.parse import urlencode

import requests
from lxml import html

BASE_URL = ''  # Goat's search results URL goes here


def scrape_products(query, pages=5):

  with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    writer.writerow(['title', 'url', 'price'])

    page = 1

    while page <= pages:

      params = {
        'query': query,
        'page': page,
      }

      q = urlencode(params)
      url = f'{BASE_URL}?{q}'

      print(f'Scraping page {page}')

      r = requests.get(url)
      tree = html.fromstring(r.content)

      products = tree.xpath('//a[contains(@class, "product-link")]')

      for product in products:
        title = product.xpath('.//div[@class="product-name"]/text()')[0]
        url = product.xpath('./@href')[0]
        try:
          price = product.xpath('.//div[@class="product-price"]/text()')[0].replace('$', '')
        except IndexError:
          price = None

        writer.writerow([title, url, price])

      page += 1

scrape_products('jordan', pages=2)

Here's how it works step-by-step:

  • We define a scrape_products() function that accepts the search query and max pages to scrape

  • Inside, we open a CSV file for writing and initialize the page counter

  • In a loop, we construct the search URL with query and page params

  • We GET the page HTML and use XPath to find all products

  • For each product, we extract key fields like title, URL and price

  • These are written to the CSV as rows

  • Finally, we increment the page counter to advance to the next page

We use a try/except when extracting price to handle any errors.

The key is the page parameter that allows us to paginate until all results are scraped!

By tweaking the query, you can build product datasets around any facet – brand, gender, type, release date and more!
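Building a filtered search URL is just a matter of adding keys to the params dict before urlencode. The filter key names here are hypothetical; inspect the query strings Goat's site actually produces for the real names:

```python
from urllib.parse import urlencode

# Hypothetical filter keys -- check Goat's real search URLs for the actual names
params = {'query': 'yeezy', 'gender': 'men', 'page': 3}

print(urlencode(params))  # query=yeezy&gender=men&page=3
```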

Saving Images and Media Files

In addition to product data, you may also want to download product images from Goat's CDN.

This allows you to maintain a media library synced with your scraped catalogs.

To download images, grab the product image URLs, then:

import requests

product_img_url = ''  # the product image URL

# Extract the filename from the URL
filename = product_img_url.split('/')[-1]

print(f'Downloading {filename}')

r = requests.get(product_img_url, stream=True)

with open(filename, 'wb') as f:
  for chunk in r.iter_content(1024):
    f.write(chunk)
We can use a similar approach to download any media assets associated with products:

  • Video trailers
  • 3D models
  • Product manuals / specs

Having access to these digital assets locally can be invaluable!

Now let's switch gears and talk about how to handle blocks…

Rotating Proxies to Avoid Getting Blocked

A common pitfall when scraping platforms like Goat at scale is getting blocked.

Goat employs some sophisticated bot protection and may flag excessive scraping traffic as abusive.

Here are some tips to scrape under the radar:

Use Residential Proxies – Datacenter IPs are easy to detect. Residential proxies from providers like GeoSurf appear as normal user traffic.

Rotate Different Proxies – Change proxies every few requests instead of reusing the same IP. Proxy rotation is easy to implement in Python.
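As a sketch of that rotation, the snippet below cycles through a pool of placeholder endpoints (the addresses are made up; substitute your provider's) and builds the proxies dict that requests accepts:

```python
import itertools

# Placeholder proxy endpoints -- substitute your provider's real addresses
PROXIES = [
  'http://user:pass@proxy1.example.com:8000',
  'http://user:pass@proxy2.example.com:8000',
  'http://user:pass@proxy3.example.com:8000',
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy_config():
  """Return a requests-style proxies dict, advancing through the pool."""
  proxy = next(proxy_pool)
  return {'http': proxy, 'https': proxy}

# Usage: requests.get(url, proxies=next_proxy_config(), timeout=10)
```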

Limit Request Rate – Add delays of 5-10 seconds between requests. Setting a reasonable cadence helps distribute load.

Randomize Delays – Vary wait times between queries instead of using fixed intervals. This mimics human patterns.

from time import sleep
from random import randint

sleep(randint(5, 10))  # random delay between 5-10 seconds

Watch for 429 or 503 Errors – These status codes indicate you are temporarily blocked. Pausing when you encounter them allows IP rotation.
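A minimal retry helper along those lines might look like this (the retry count and delays are arbitrary starting points, not Goat-specific values):

```python
import time
import requests

def get_with_backoff(url, retries=3, base_delay=30):
  """Retry a GET, pausing progressively longer on 429/503 responses."""
  for attempt in range(retries):
    resp = requests.get(url, timeout=10)
    if resp.status_code not in (429, 503):
      return resp
    wait = base_delay * (attempt + 1)
    print(f'Got {resp.status_code}, pausing {wait}s before retrying')
    time.sleep(wait)
  return resp
```

Pausing like this also gives a rotating proxy pool time to move you onto a fresh IP.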

Use Proxy Manager Services – Tools like BrightData handle proxy cycling and provide clean residential IPs out of the box.

With the right proxy setup, you can extract huge datasets from Goat successfully!

Other Handy Scraping Best Practices

Here are some additional tips from my scraper toolkit:

  • Review Goat's robots.txt file to identify any restricted scraping activities

  • Check for updates to page structures before running scrapers, as sites change often

  • Save scraped data incrementally rather than re-requesting, but check for updates too

  • Scrape data selectively rather than downloading unnecessary information

  • Use sitemaps and internal search to find more product pages to index if desired

  • Crawl politely during off-peak hours to minimize server load

  • Consider using headless browsers like Selenium if pages are highly dynamic

  • Containerize scrapers in Docker for easier scaling, deployment and team usage

These practices ensure your Goat scraping yields maximum value with minimal fuss!
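As one concrete example of the first tip, Python's built-in urllib.robotparser can evaluate paths against robots.txt rules. The rules below are illustrative, not Goat's actual file:

```python
from urllib import robotparser

# Illustrative robots.txt body -- fetch and parse the real one in practice
rules = '''\
User-agent: *
Disallow: /account/
'''

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('MyScraper', 'https://www.goat.com/sneakers'))          # True
print(rp.can_fetch('MyScraper', 'https://www.goat.com/account/settings'))  # False
```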

Wrapping Up

Phew, that was quite the journey!

In this post, you learned tons of techniques to build robust Goat web scrapers in Python:

  • Extracting product data from individual pages with requests and lxml

  • Scaling up to search results scraping using pagination

  • Downloading images and media assets

  • Avoiding blocks with residential proxies and other best practices

The knowledge you've gained can help you tap into the goldmine of product insights within Goat's catalogs.

You're now equipped to enrich your fashion merchant stack and unlock serious value from Goat's platform.

I enjoyed having you along for the ride! Scraping can be challenging, but also incredibly rewarding when you see millions of rows of parsed data come pouring in.

Hopefully this guide brought you closer to meeting your fashion data goals. Feel free to reach out if you need any help or have questions along the journey.

Happy scraping!
