Fashionphile is one of the largest and most popular second-hand fashion marketplaces online. With luxury brands like Chanel, Louis Vuitton, Hermès, and more, it's a treasure trove of high-end fashion data. In this guide, we'll walk through different techniques to scrape Fashionphile product listings at scale using Python.
Why Scrape Fashionphile?
Here are some of the key reasons one might want to scrape data from Fashionphile:
- Market Research – Fashionphile sells thousands of luxury items across hundreds of designer brands. Scraping this data provides great insights into second-hand market prices, demand, inventory levels, and more. This is invaluable market intelligence for luxury fashion retailers.
- Price Monitoring – With constantly changing inventory, it's useful to monitor prices over time for pricing studies. Web scraping enables continuous monitoring and tracking as new items get listed on Fashionphile.
- Inventory Monitoring – Luxury resellers like Fashionphile get new consignments daily. Scraping gives insights into new inventory being listed across designers, categories, prices, etc.
- Keyword Research – Product titles, descriptions and tags are a goldmine of keywords. These can be extracted via web scraping and used for SEO and advertising campaigns.
- Competitor Research – Understanding assortment, pricing and promotions for a competitor like Fashionphile is key for any luxury reseller. Web scraping provides the data behind these insights.
- Lead Generation – Contact information of high-end fashion sellers can be valuable leads for customer acquisition efforts. Items list the city/state of the seller.
In summary, Fashionphile is a prime target for web scraping due to the depth of high-quality data. The potential use cases are plentiful.
Overview of Fashionphile's Website
Before we dive into code, let's briefly understand how Fashionphile's website is structured:
- Product Pages – Each product has its own page (e.g. https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734) with details like title, description, price, images, shipping, etc.
- Category Pages – Listings can be browsed by category (e.g. https://www.fashionphile.com/shop/chanel-bags) with pagination.
- Search Pages – Search queries produce paginated results (e.g. https://www.fashionphile.com/shop?search=chanel+classic+flap).
- Sitemaps – XML sitemaps list out all product URLs for indexing by search engines.
This structure is quite typical for any ecommerce website. The key then is figuring out how to extract structured data from the underlying HTML.
Scrape Product Page Data
Let's start with extracting details from a single product page. Here's an example:
https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734
Viewing the page source, we can see that the product data is conveniently available as a JSON object inside a <script> tag:
<script id="__NEXT_DATA__" type="application/json">
{
  "props": {
    "pageProps": {
      ...
      "productPageReducer": {
        "productData": {
          "id": 1007734,
          "title": "CHANEL Lambskin Quilted Medium Double Flap Black",
          ...
          "price": 4795,
          "images": [
            "https://prod-images.fashionphile.com/016652c727c49d9ddeaae9f22efb7e1a/4b54909a5093afb22a592857bc5e586a.jpg",
            ...
          ]
        }
      }
    }
  }
}
</script>
This structured data can be conveniently extracted using CSS selectors:
import json
import requests
from parsel import Selector

product_url = 'https://www.fashionphile.com/p/chanel-lambskin-quilted-medium-double-flap-black-1007734'

# Fetch product page HTML
product_html = requests.get(product_url).text

# Extract the JSON object from the __NEXT_DATA__ script tag
json_data = Selector(product_html).css('script#__NEXT_DATA__::text').get()

# Deserialize JSON
data = json.loads(json_data)

# Drill down to product data
product = data['props']['pageProps']['productPageReducer']['productData']

print(product['title'])
print(product['price'])
print(product['images'])
The key steps are:
- Use Selector to extract the <script> tag containing the JSON data
- Deserialize the JSON string into a Python dictionary with json.loads()
- Drill down into the nested structure to get the product details
And we have cleanly extracted the core product data! The same approach applies to any product URL.
This JSON technique is very convenient compared to parsing HTML directly. Many modern websites use similar JSON data structures with all the underlying product data.
Crawl Category Listings
Now that we can extract data for individual products, let's look at collecting data at scale.
We'll start with crawling category pages, which contain paginated listings for a designer. Here's an example Chanel category:
https://www.fashionphile.com/shop/chanel-bags
This page contains 72 products spread over 4 pages. To extract all listings we‘ll need to:
- Fetch the page HTML
- Extract total number of pages from the HTML
- Loop through each page and extract products
Here's how it looks in Python:
import math
import json
import requests
from parsel import Selector

category_url = 'https://www.fashionphile.com/shop/chanel-bags'

def scrape_products(html):
    # Extract JSON data
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    # Get products array
    products = data['props']['pageProps']['categoryPageReducer']['categoryData']['products']
    return products

def get_total_pages(html):
    # Extract pagination data
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    # Calculate pages
    total_products = data['props']['pageProps']['categoryPageReducer']['categoryData']['totalProducts']
    per_page = data['props']['pageProps']['categoryPageReducer']['categoryData']['productsPerPage']
    total_pages = math.ceil(total_products / per_page)
    return total_pages

# Fetch first page
first_page_html = requests.get(category_url).text

# Get products from page
products = scrape_products(first_page_html)

# Get total pages
total_pages = get_total_pages(first_page_html)

# Crawl remaining pages
for page in range(2, total_pages + 1):
    url = f'{category_url}?page={page}'
    html = requests.get(url).text
    page_products = scrape_products(html)
    # Combine with master products list
    products.extend(page_products)

print(len(products))
# Prints total products scraped across all pages
The key steps are:
- Parse the total number of pages from the first page's HTML
- Iterate through each page by incrementing the ?page= parameter
- Extract the products JSON from each page
- Combine all products into a single list
This paginated crawling technique can be applied to any category, search or sorted listings page.
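For example, the same helper functions can be pointed at a search URL. Here is a minimal sketch, under the assumption that search pages embed their results in the same __NEXT_DATA__ structure as category pages; inspect the page source and adjust the JSON path if it differs:
# Reuses scrape_products() and get_total_pages() defined above.
# Assumption: search pages expose the same categoryPageReducer structure.
import requests

search_url = 'https://www.fashionphile.com/shop?search=chanel+classic+flap'

first_page_html = requests.get(search_url).text
results = scrape_products(first_page_html)

for page in range(2, get_total_pages(first_page_html) + 1):
    # The URL already has a query string, so append the page with '&'
    html = requests.get(f'{search_url}&page={page}').text
    results.extend(scrape_products(html))

print(len(results))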
Leverage Sitemaps to Crawl All Product URLs
Now that we can extract products from individual pages, we need a way to discover all product URLs to crawl.
One easy option is to leverage the Fashionphile sitemap.
Sitemaps provide a list of all URLs on a website for search engine crawlers. Here is the Fashionphile sitemap index:
https://www.fashionphile.com/sitemap-index.xml
This XML file contains references to all the individual sitemap files. We first need to parse out all the listed sitemaps:
import requests
import xml.etree.ElementTree as ET

sitemap_index = 'https://www.fashionphile.com/sitemap-index.xml'

# Sitemap files use the sitemaps.org XML namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Parse XML
root = ET.fromstring(requests.get(sitemap_index).text)

# Get all sitemap URLs
sitemaps = [elem.text for elem in root.findall('sm:sitemap/sm:loc', ns)]
print(sitemaps)
This prints out URLs of all the sub-sitemaps like:
[‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap1.xml‘,
‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap2.xml‘,
‘https://s3-us-west-2.amazonaws.com/fashionphile/sitemaps/category-sitemap3.xml‘,
...]
We can then loop through each sub-sitemap and extract product URLs:
product_urls = []

for sitemap in sitemaps:
    # Parse XML (reuses the ns namespace map defined above)
    root = ET.fromstring(requests.get(sitemap).text)
    # Get all product URLs
    urls = [elem.text for elem in root.findall('sm:url/sm:loc', ns)]
    # Add to master list
    product_urls.extend(urls)

print(len(product_urls))
# Prints total product URLs
So using the sitemap index we can aggregate a master list of all 500k+ product URLs on Fashionphile!
We can then feed this into our product crawl code to scrape fresh data for every product listing on the site.
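For instance, the URL list can be fed straight into the product-page extraction from earlier. A minimal, sequential sketch follows; scrape_product below simply wraps the __NEXT_DATA__ parsing shown in the product section, and at real scale you would add concurrency, retries and proxies:
import json
import time
import requests
from parsel import Selector

def scrape_product(html):
    # Same __NEXT_DATA__ parsing as in the product page section
    json_data = Selector(html).css('script#__NEXT_DATA__::text').get()
    data = json.loads(json_data)
    return data['props']['pageProps']['productPageReducer']['productData']

all_products = []
for url in product_urls:
    html = requests.get(url).text
    all_products.append(scrape_product(html))
    time.sleep(1)  # crude politeness delay; swap for async requests + proxies at scale

print(len(all_products))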
Crawl Based on Historical Listings
Sitemaps provide the most up-to-date view of listings on Fashionphile.
Another approach is to leverage historical scrape data to guide crawls.
The methodology would be:
- Do a large scale scrape of Fashionphile and store the product listings
- In each subsequent scrape, parse out all listings from the historical data
- For each historical URL, check if it still returns a 200 status code: if 200 OK, scrape the updated product data; if 404 Not Found, remove the deleted listing from the dataset
- For any new listings, add them to the dataset
This allows keeping the scraped dataset in sync with latest inventory and prices on Fashionphile.
You can even track price history and other trends over time as products get updated. The historical data powers more advanced analytics.
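Here is a minimal sketch of that re-check loop, assuming listings were previously stored as a dict keyed by URL. The stored_listings name and the reuse of scrape_product from above are illustrative, not part of any Fashionphile API:
import requests

def resync(stored_listings):
    """Re-check previously scraped URLs and prune dead listings.

    stored_listings: dict mapping product URL -> last scraped product data.
    """
    updated = {}
    for url, old_data in stored_listings.items():
        response = requests.get(url)
        if response.status_code == 200:
            # Listing still live: re-scrape for updated price and details
            updated[url] = scrape_product(response.text)
        elif response.status_code == 404:
            # Listing removed: drop it from the dataset
            continue
        else:
            # Transient error: keep the old record and retry on the next run
            updated[url] = old_data
    return updated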
Scrape Data At Scale with Scrapy
All the techniques covered so far use basic Python libraries like Requests and Parsel for scraping.
For large scale production scrapes, a dedicated web crawling framework like Scrapy is recommended.
Scrapy provides several advantages:
- Powerful crawling with in-built scheduler, downloader, spider classes etc.
- Asynchronous IO for very fast scraping
- Built-in caching, throttling, cookies, proxy handling
- Item pipelines to cleanly store scraped data
- Extensive debugging information in logs
- Handy crawl output like stats and reports
Here is a simple Scrapy spider to crawl Fashionphile products:
import scrapy
import json

class FashionphileSpider(scrapy.Spider):
    name = 'fashionphile'

    def start_requests(self):
        yield scrapy.Request(url='https://www.fashionphile.com/shop/louis-vuitton-bags')

    def parse(self, response):
        # Extract JSON data
        json_data = response.css('script#__NEXT_DATA__::text').get()
        data = json.loads(json_data)

        # Yield product items
        for product in data['props']['pageProps']['categoryPageReducer']['categoryData']['products']:
            yield {
                'title': product['title'],
                'price': product['price'],
                'url': product['url'],
            }

        # Crawl to next page (response.follow resolves relative URLs)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
This implements pagination logic and product parsing in a much cleaner fashion.
The spider can then be run to efficiently crawl tens of thousands of listings:
scrapy crawl fashionphile -o products.json
Overall, Scrapy provides a very scalable platform for production web scraping.
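To lean on Scrapy's built-in throttling and caching mentioned above, a few project settings go a long way. The values below are an illustrative starting point, not tuned for Fashionphile specifically:
# settings.py (illustrative values)
BOT_NAME = 'fashionphile_scraper'

# Automatically throttle request rate based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Cache responses locally so repeated runs don't re-hit the site
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400

# Keep per-domain concurrency modest and retry transient failures
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RETRY_TIMES = 3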
Use Proxies to Avoid Blocking
A challenge with scraping any commercial site at scale is inevitably getting blocked.
Fashionphile employs several anti-scraping mechanisms:
- Rate Limiting – Limits the number of requests sent from a single IP.
- CAPTCHAs – Requires solving a CAPTCHA after a certain number of requests.
- IP Blocking – Bans IP addresses making large volumes of requests.
Rotating proxies is an effective strategy to avoid these limitations:
- Shared Proxies – Use proxy services like BrightData, Oxylabs, Smartproxy, etc. to get thousands of shared residential and mobile proxies.
- Self-hosted Proxies – For fully private proxies, use proxy management tools like ProxyMesh to orchestrate servers across datacenters.
- Proxy Rotation – Rotate proxies per request or every few hundred requests to constantly use new IPs.
Here is an example using BrightData proxies with each Scrapy request:
import scrapy

def get_proxy():
    # Call your proxy provider's API (e.g. BrightData) to fetch a fresh proxy.
    # Should return a string like 'user:pass@host:port'
    ...

class ProxyMiddleware:
    def process_request(self, request, spider):
        # Attach a new proxy to every outgoing request
        request.meta['proxy'] = f'http://{get_proxy()}'

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        # Re-schedule the failed request; it will pick up a new proxy
        return request

class FashionphileSpider(scrapy.Spider):
    name = 'fashionphile'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Adjust the module path to wherever ProxyMiddleware lives in your project
            'myproject.middlewares.ProxyMiddleware': 500,
        }
    }

    # Rest of spider code
The key pieces are:
- Enabling the ProxyMiddleware in DOWNLOADER_MIDDLEWARES
- Getting a new proxy before each request
- Passing the proxy via request.meta
This ensures every request uses a different IP and avoids any blocks.
Conclusion
Scraping a site like Fashionphile requires dealing with modern JavaScript rendered pages, pagination, proxies and more.
Here are some key takeaways:
- Leverage JSON data structures instead of parsing HTML directly
- Paginate through category/search pages to scale data
- Use sitemaps or historical data to get all product URLs
- Move to Scrapy once at production scale
- Rotate proxies to avoid bot protection
The data available is incredibly valuable for market research, monitoring competitors, lead generation and more.
Putting in the engineering effort pays dividends across business use cases. This guide should provide a blueprint to build your own Fashionphile scraper.
Let me know if you have any questions! I'm happy to share more tips.