Fashionphile is one of the largest and most popular second-hand fashion marketplaces online. With luxury brands like Chanel, Louis Vuitton, Hermès, and more, it's a treasure trove of high-end fashion data. In this guide, we'll walk through different techniques to scrape Fashionphile product listings at scale using Python.
Why Scrape Fashionphile?
Here are some of the key reasons one might want to scrape data from Fashionphile:
- Market Research – Fashionphile sells thousands of luxury items across hundreds of designer brands. Scraping this data provides great insights into second-hand market prices, demand, inventory levels, and more. This is invaluable market intelligence for luxury fashion retailers.
- Price Monitoring – With constantly changing inventory, it's useful to monitor prices over time for pricing studies. Web scraping enables continuous monitoring and tracking as new items get listed on Fashionphile.
- Inventory Monitoring – Luxury resellers like Fashionphile get new consignments daily. Scraping gives insights into new inventory being listed across designers, categories, prices, etc.
- Keyword Research – Product titles, descriptions and tags are a goldmine of keywords. These can be extracted via web scraping and used for SEO and advertising campaigns.
- Competitor Research – Understanding assortment, pricing and promotions for a competitor like Fashionphile is key for any luxury reseller. Web scraping provides the data behind these insights.
- Lead Generation – Contact information of high-end fashion sellers can be valuable leads for customer acquisition efforts. Items list the city/state of the seller.
In summary, Fashionphile is a prime target for web scraping due to the depth of high-quality data. The potential use cases are plentiful.
Overview of Fashionphile's Website
Before we dive into code, let's briefly understand how Fashionphile's website is structured:
- Product Pages – Each product has its own page (e.g. https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734) with details like title, description, price, images, shipping, etc.
- Category Pages – Listings can be browsed by category (e.g. https://www.fashionphile.com/shop/chanel-bags) with pagination.
- Search Pages – Search queries produce paginated results (e.g. https://www.fashionphile.com/shop?search=chanel+classic+flap).
- Sitemaps – XML sitemaps list out all product URLs for indexing by search engines.
This structure is quite typical for any ecommerce website. The key then is figuring out how to extract structured data from the underlying HTML.
Scrape Product Page Data
Let's start with extracting details from a single product page. Here's an example:
https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734
Viewing the page source, we can see that the product data is conveniently available as a JSON object inside a <script> tag:
<script id="__NEXT_DATA__" type="application/json">
{
  "props": {
    "pageProps": {
      ...
      "productPageReducer": {
        "productData": {
          "id": 1007734,
          "title": "CHANEL Lambskin Quilted Medium Double Flap Black",
          ...
          "price": 4795,
          "images": [
            "https://prod-images.fashionphile.com/016652c727c49d9ddeaae9f22efb7e1a/4b54909a5093afb22a592857bc5e586a.jpg",
            ...
          ]
        }
      }
    }
  }
}
</script>
This structured data can be conveniently extracted using CSS selectors:
import json
import requests
from parsel import Selector

product_url = 'https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734'

# Fetch product page HTML
product_html = requests.get(product_url).text

# Extract the JSON object from the __NEXT_DATA__ script tag
json_data = Selector(product_html).css('script#__NEXT_DATA__::text').get()

# Deserialize JSON
data = json.loads(json_data)

# Drill down to product data
product = data['props']['pageProps']['productPageReducer']['productData']

print(product['title'])
print(product['price'])
print(product['images'])
The key steps are:
- Use Selector to extract the <script> tag containing the JSON data
- Deserialize the JSON string into a Python dictionary with json.loads()
- Drill down into the nested structure to get the product details
And we have cleanly extracted the core product data! The same approach applies to any product URL.
This JSON technique is very convenient compared to parsing HTML directly. Many modern websites use similar JSON data structures with all the underlying product data.
Crawl Category Listings
Now that we can extract data for individual products, let's look at collecting data at scale.
We'll start with crawling category pages, which contain paginated listings for a designer. Here's an example Chanel category:
https://www.fashionphile.com/shop/chanel-bags
This page contains 72 products spread over 4 pages. To extract all listings we‘ll need to:
- Fetch the page HTML
- Extract total number of pages from the HTML
- Loop through each page and extract products
Here's how it looks in Python:
import math
import json
import requests
from parsel import Selector

category_url = 'https://www.fashionphile.com/shop/chanel-bags'

def scrape_products(html):
    # Extract JSON data
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    # Get products array
    products = data['props']['pageProps']['categoryPageReducer']['categoryData']['products']
    return products

def get_total_pages(html):
    # Extract pagination data
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    # Calculate pages
    total_products = data['props']['pageProps']['categoryPageReducer']['categoryData']['totalProducts']
    per_page = data['props']['pageProps']['categoryPageReducer']['categoryData']['productsPerPage']
    total_pages = math.ceil(total_products / per_page)
    return total_pages

# Fetch first page
first_page_html = requests.get(category_url).text

# Get products from page
products = scrape_products(first_page_html)

# Get total pages
total_pages = get_total_pages(first_page_html)

# Crawl remaining pages
for page in range(2, total_pages + 1):
    url = f'{category_url}?page={page}'
    html = requests.get(url).text
    page_products = scrape_products(html)
    # Combine with master products list
    products.extend(page_products)

print(len(products))
# Prints total products scraped across all pages
The key steps are:
- Parse the total number of pages from the first page's HTML
- Iterate through each page by incrementing the ?page= parameter
- Extract the products JSON from each page
- Combine all products into a single list
This paginated crawling technique can be applied to any category, search or sorted listings page.
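For example, the same helper functions can be pointed at a search URL. Here is a minimal sketch, under the assumption that search pages embed their results in the same __NEXT_DATA__ structure as category pages; inspect the page source and adjust the JSON path if it differs:
# Reuses scrape_products() and get_total_pages() defined above.
# Assumption: search pages expose the same categoryPageReducer structure.
import requests

search_url = 'https://www.fashionphile.com/shop?search=chanel+classic+flap'

first_page_html = requests.get(search_url).text
results = scrape_products(first_page_html)

for page in range(2, get_total_pages(first_page_html) + 1):
    # The URL already has a query string, so append the page with '&'
    html = requests.get(f'{search_url}&page={page}').text
    results.extend(scrape_products(html))

print(len(results))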
Leverage Sitemaps to Crawl All Product URLs
Now that we can extract products from individual pages, we need a way to discover all product URLs to crawl.
One easy option is to leverage the Fashionphile sitemap.
Sitemaps provide a list of all URLs on a website for search engine crawlers. Here is the Fashionphile sitemap index:
https://www.fashionphile.com/sitemap-index.xml
This XML file contains references to all the individual sitemap files. We first need to parse out all the listed sitemaps:
import requests
import xml.etree.ElementTree as ET

sitemap_index = 'https://www.fashionphile.com/sitemap-index.xml'

# Sitemap files use the sitemaps.org XML namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Parse XML
root = ET.fromstring(requests.get(sitemap_index).text)

# Get all sitemap URLs
sitemaps = [elem.text for elem in root.findall('sm:sitemap/sm:loc', ns)]
print(sitemaps)
This prints out URLs of all the sub-sitemaps like:
[‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap1.xml‘,
‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap2.xml‘,
‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap3.xml‘,
...]
We can then loop through each sub-sitemap and extract product URLs:
product_urls = []

for sitemap in sitemaps:
    # Parse XML (reuses the ns namespace map defined above)
    root = ET.fromstring(requests.get(sitemap).text)
    # Get all product URLs
    urls = [elem.text for elem in root.findall('sm:url/sm:loc', ns)]
    # Add to master list
    product_urls.extend(urls)

print(len(product_urls))
# Prints total product URLs
So using the sitemap index we can aggregate a master list of all 500k+ product URLs on Fashionphile!
We can then feed this into our product crawl code to scrape fresh data for every product listing on the site.
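For instance, the URL list can be fed straight into the product-page extraction from earlier. A minimal, sequential sketch follows; scrape_product below simply wraps the __NEXT_DATA__ parsing shown in the product section, and at real scale you would add concurrency, retries and proxies:
import json
import time
import requests
from parsel import Selector

def scrape_product(html):
    # Same __NEXT_DATA__ parsing as in the product page section
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    return data['props']['pageProps']['productPageReducer']['productData']

all_products = []
for url in product_urls:
    html = requests.get(url).text
    all_products.append(scrape_product(html))
    time.sleep(1)  # crude politeness delay; swap for async requests + proxies at scale

print(len(all_products))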
Crawl Based on Historical Listings
Sitemaps provide the most up-to-date view of listings on Fashionphile.
Another approach is to leverage historical scrape data to guide crawls.
The methodology would be:
- Do a large scale scrape of Fashionphile and store the product listings
- In each subsequent scrape, parse out all listings from the historical data
- For each historical URL, check if it still returns a 200 status code: if 200 OK, scrape the updated product data; if 404 Not Found, remove the deleted listing from the dataset
- For any new listings, add them to the dataset
This allows keeping the scraped dataset in sync with latest inventory and prices on Fashionphile.
You can even track price history and other trends over time as products get updated. The historical data powers more advanced analytics.
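Here is a minimal sketch of that re-check loop, assuming listings were previously stored as a dict keyed by URL. The stored_listings name and the reuse of scrape_product from above are illustrative, not part of any Fashionphile API:
import requests

def resync(stored_listings):
    """Re-check previously scraped URLs and prune dead listings.

    stored_listings: dict mapping product URL -> last scraped product data.
    """
    updated = {}
    for url, old_data in stored_listings.items():
        response = requests.get(url)
        if response.status_code == 200:
            # Listing still live: re-scrape for updated price and details
            updated[url] = scrape_product(response.text)
        elif response.status_code == 404:
            # Listing removed: drop it from the dataset
            continue
        else:
            # Transient error: keep the old record and retry on the next run
            updated[url] = old_data
    return updated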
Scrape Data At Scale with Scrapy
All the techniques covered so far use basic Python libraries like Requests and Parsel for scraping.
For large scale production scrapes, a dedicated web crawling framework like Scrapy is recommended.
Scrapy provides several advantages:
- Powerful crawling with in-built scheduler, downloader, spider classes etc.
- Asynchronous IO for very fast scraping
- Built-in caching, throttling, cookies, proxy handling
- Item pipelines to cleanly store scraped data
- Extensive debugging information in logs
- Handy crawl output like stats and reports
Here is a simple Scrapy spider to crawl Fashionphile products:
import scrapy
import json

class FashionphileSpider(scrapy.Spider):
    name = 'fashionphile'

    def start_requests(self):
        yield scrapy.Request(url='https://www.fashionphile.com/shop/louis-vuitton-bags')

    def parse(self, response):
        # Extract JSON data
        json_data = response.css('script#__NEXT_DATA__::text').get()
        data = json.loads(json_data)

        # Yield product items
        for product in data['props']['pageProps']['categoryPageReducer']['categoryData']['products']:
            yield {
                'title': product['title'],
                'price': product['price'],
                'url': product['url'],
            }

        # Crawl to next page (response.follow resolves relative URLs)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
This implements pagination logic and product parsing in a much cleaner fashion.
The spider can then be run to efficiently crawl tens of thousands of listings:
scrapy crawl fashionphile -o products.json
Overall, Scrapy provides a very scalable platform for production web scraping.
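To lean on Scrapy's built-in throttling and caching mentioned above, a few project settings go a long way. The values below are an illustrative starting point, not tuned for Fashionphile specifically:
# settings.py (illustrative values)
BOT_NAME = 'fashionphile_scraper'

# Automatically throttle request rate based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Cache responses locally so repeated runs don't re-hit the site
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400

# Keep per-domain concurrency modest and retry transient failures
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RETRY_TIMES = 3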
Use Proxies to Avoid Blocking
A challenge with scraping any commercial site at scale is inevitably getting blocked.
Fashionphile employs several anti-scraping mechanisms:
- Rate Limiting – Limits the number of requests sent from a single IP.
- CAPTCHAs – Requires solving a CAPTCHA after a certain number of requests.
- IP Blocking – Bans IP addresses making large volumes of requests.
Rotating proxies is an effective strategy to avoid these limitations:
- Shared Proxies – Use proxy services like BrightData, Oxylabs, Smartproxy, etc. to get thousands of shared residential and mobile proxies.
- Self-hosted Proxies – For fully private proxies, use proxy management tools like ProxyMesh to orchestrate servers across datacenters.
- Proxy Rotation – Rotate proxies per request or every few hundred requests to constantly use new IPs.
Here is an example using BrightData proxies with each Scrapy request:
import scrapy

def get_proxy():
    # Call your proxy provider's API (e.g. BrightData) to fetch a fresh proxy.
    # Should return a string like 'user:pass@host:port'
    ...

class ProxyMiddleware:
    def process_request(self, request, spider):
        # Attach a new proxy to every outgoing request
        request.meta['proxy'] = f'http://{get_proxy()}'

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        # Re-schedule the failed request; it will pick up a new proxy
        return request

class FashionphileSpider(scrapy.Spider):
    name = 'fashionphile'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Adjust the module path to wherever ProxyMiddleware lives in your project
            'myproject.middlewares.ProxyMiddleware': 500,
        }
    }

    # Rest of spider code
The key pieces are:
- Enabling the ProxyMiddleware in DOWNLOADER_MIDDLEWARES
- Getting a new proxy before each request
- Passing the proxy via request.meta
This ensures every request uses a different IP and avoids any blocks.
Conclusion
Scraping a site like Fashionphile requires dealing with modern JavaScript rendered pages, pagination, proxies and more.
Here are some key takeaways:
- Leverage JSON data structures instead of parsing HTML directly
- Paginate through category/search pages to scale data
- Use sitemaps or historical data to get all product URLs
- Move to Scrapy once at production scale
- Rotate proxies to avoid bot protection
The data available is incredibly valuable for market research, monitoring competitors, lead generation and more.
Putting in the engineering effort pays dividends across business use cases. This guide should provide a blueprint to build your own Fashionphile scraper.
Let me know if you have any questions! I'm happy to share more tips.