As a real estate investor and data scientist, I'm fascinated by the insights hiding in real estate listing data. Redfin, one of the largest real estate sites with over 5 million monthly visitors, offers an abundance of juicy listing details just waiting to be extracted.
In this step-by-step guide, you'll learn how I use Python to scrape key Redfin real estate data to fuel my property investment algorithms.
By the end, you'll be able to build your own Redfin listing scraper to amass datasets on pricing, housing market trends and more. Let's dive in!
Why Scrape Redfin Listing Data?
Before we get into the technical nitty-gritty, let's discuss why you may want to scrape Redfin listings in the first place.
1. Market Research
Redfin has over 2 million active real estate listings across the US – that's a massive dataset for analyzing pricing, demand and trends across regions. Extracting and crunching this data can reveal valuable market insights.
2. Competitor Research
Studying what homes your competitor real estate agents have sold and for what price points can give you an edge in negotiations.
3. Algorithm Development
Scraped real estate data can serve as input for automated home valuation algorithms using machine learning. With enough historical examples you can make accurate price predictions.
4. Home Search Tool
Scrape all listings in a city updated daily to build a super-charged real estate search tool or investment dashboard.
In short, if you can extract and leverage Redfin's abundant listing data, the possibilities are endless. Now let's see how to do just that.
Overview of Available Redfin Listing Data
Before scraping, it helps to understand exactly what data we can extract from Redfin. Here are some of the main data fields available on a typical listing page:
Address Details
- Full address
- ZIP code
- County
- Lot size
- Parcel ID
Price Info
- Current listing price
- Price history and changes
- Price drop details
- Price per square foot
Property Details
- Square footage
- Number of bedrooms
- Number of bathrooms
- Year built
- Property type (house, condo, multi-family)
Description
- Full listing description
- List of amenities
Photos & Virtual Tour
- Main listing photos
- Additional neighborhood photos
- 3D Scan Tour (if available)
- Drone photos (if available)
Agent Info
- Listing agent details
- Agent photo
- Office details
- Agent profile & contact info
Interactive Maps
- Boundary outlines
- School district shapes
- Commute time layers
- Neighborhood insights
And much more…
As you can see, each Redfin listing contains a treasure trove of data. Now let's see how we can extract it.
Scraping a Redfin Listing Page
There are a couple of approaches to scraping a Redfin listing:
- Parse the rendered HTML – messy and prone to breaking
- Extract the raw JSON data – the future-proof method
I highly recommend scraping the JSON. Redfin uses React to render listing pages, loading data asynchronously from internal JSON APIs.
We can mimic the browser and fetch this data directly. Here is a Python code snippet to extract the raw listing JSON:
import requests
import re
import json

listing_url = "https://www.redfin.com/CA/San-Francisco/123-Main-St-94104/home/123456"

response = requests.get(listing_url)

# locate the embedded JSON payload in the page source
match = re.search(r"window\.__data__ = (.+);", response.text)
json_data = json.loads(match.group(1))

print(json_data.keys())
This reaches into the guts of the Redfin page to pull out the complete listing data. Much easier than parsing the rendered HTML!
The JSON structure is nested, containing every piece of data displayed. Let's see how to parse it into something more usable.
Parsing Key Listing Data Fields
The raw listing JSON contains a ton of unnecessary technical meta-data. We want to extract just the important stuff.
Here's an example parser to pull fields into a clean Python dictionary:
def parse_listing(data):
    return {
        "price": data["homeDetails"]["price"],
        "address": data["homeDetails"]["address"]["line1"],
        "beds": data["homeDetails"]["beds"],
        "baths": data["homeDetails"]["baths"],
        "sqft": data["homeDetails"]["sqft"],
        "built": data["homeDetails"]["yearBuilt"],
        "desc": data["homeDetails"]["description"],
        "photos": [p["highResPath"] for p in data["homeDetails"]["photos"]],
    }

data = parse_listing(json_data)
print(data)
With some simple JSON traversal we can extract key fields into usable formats.
Now a caveat… while the above works, JSON structures can change unexpectedly, breaking scrapers. Best practice is to wrap parsing logic in try/except blocks to gracefully handle any field changes Redfin may push.
This ensures your scraper keeps chugging along, instead of crashing.
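As a minimal sketch of that defensive pattern (the "homeDetails" keys mirror the example parser above and remain assumptions about Redfin's JSON layout):

```python
def safe_parse_listing(data):
    """Parse listing fields, tolerating missing or renamed keys."""
    details = data.get("homeDetails", {})  # assumed top-level key, as in the parser above
    listing = {}
    for field, key in [
        ("price", "price"),
        ("beds", "beds"),
        ("baths", "baths"),
        ("sqft", "sqft"),
    ]:
        try:
            listing[field] = details[key]
        except KeyError:
            listing[field] = None  # field missing: record None instead of crashing
    return listing

# A listing missing "baths" still parses instead of raising KeyError
print(safe_parse_listing({"homeDetails": {"price": 950000, "beds": 2, "sqft": 1100}}))
```

Recording None for absent fields also makes it easy to audit which keys Redfin changed after the fact.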
Finding Redfin Listing Pages to Scrape
Alright, we can scrape individual listing pages… but how do we find listings to feed into our scraper?
Redfin offers a few options for sourcing listing URLs:
Search Pages
Simulate searches through Redfin's website, paginating by adjusting the page parameter:
https://www.redfin.com/city/37449/CA/San-Francisco/filter/property-type=house,max-price=1.5M,page=2
Pros:
- Mirrors user search experience
Cons:
- Easy to get blocked scraping aggressively
- Hard to paginate all listings
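The pagination above can be generated programmatically; this sketch assumes the filter-string and page-parameter format shown in the example URL:

```python
def search_page_urls(base, filters, max_pages):
    """Build paginated Redfin search URLs from a base city URL and a filter string."""
    return [f"{base}/filter/{filters},page={page}" for page in range(1, max_pages + 1)]

urls = search_page_urls(
    "https://www.redfin.com/city/37449/CA/San-Francisco",
    "property-type=house,max-price=1.5M",
    max_pages=3,
)
for u in urls:
    print(u)
```

Capping max_pages conservatively also helps avoid the aggressive-scraping blocks mentioned above.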
Location Sitemaps
Redfin provides detailed sitemaps splitting listings by city, county, neighborhood, ZIP code and more.
For example:
https://www.redfin.com/sitemap-location-CA-San-Francisco-buy.xml
https://www.redfin.com/sitemap-neighborhood-CA-San-Francisco-Mission-District-buy.xml
Pros:
- Structured sitemap format
- Granular segmentation of listings
Cons:
- Sitemaps need to be manually discovered
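Sitemaps are conventionally advertised in a site's robots.txt, which gives one way to discover them automatically (assuming Redfin follows that convention; the helper names here are my own):

```python
import requests

def parse_sitemap_lines(robots_text):
    """Extract the URLs from Sitemap: entries in robots.txt text."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

def discover_sitemaps(robots_url):
    """Fetch robots.txt and return any sitemap URLs it advertises."""
    return parse_sitemap_lines(requests.get(robots_url, timeout=30).text)

# sitemaps = discover_sitemaps("https://www.redfin.com/robots.txt")
```

Sitemap index files found this way often link to the per-city and per-neighborhood sitemaps described above.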
Newest/Latest Sitemaps
Special sitemaps provide newly listed properties as well as recent listing updates:
https://www.redfin.com/newest_listings.xml
https://www.redfin.com/updated_listings.xml
Pros:
- Get fresh data as it appears
Cons:
- Only new or updated listings
Crawling Related Listings
Each listing page contains a "Similar Homes" section whose links we can recursively crawl.
Pros:
- Expand to related listings
Cons:
- Easy to crawl unrelated homes
My recommended approach is to combine sitemaps for breadth with crawling for depth.
Now let's look at continuously collecting the freshest listings.
Tracking New and Updated Listing Data
While scraping historical listing data is useful, we ideally want new and updated listings as they appear to keep our dataset current.
Redfin provides two key sitemaps for this:
Newest Sitemap
This feed lists newly published listings with the publish timestamp:
<url>
<loc>https://www.redfin.com/CA/San-Francisco/456-Park-St-94102/home/4567123</loc>
<lastmod>2023-03-01T22:10:35Z</lastmod>
</url>
Latest Sitemap
The latest sitemap indexes recently updated listings:
<url>
<loc>https://www.redfin.com/CA/San-Francisco/123-Main-St-94104/home/123456</loc>
<lastmod>2023-03-01T22:18:26Z</lastmod>
</url>
Here is sample Python code to continuously scrape these sitemaps:

import requests
from lxml import etree
from datetime import datetime

INIT_TIME = datetime(2023, 2, 1)  # initial run

# standard sitemap XML namespace
NSMAP = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def scrape_sitemaps():
    newest_url = "https://www.redfin.com/newest_listings.xml"
    updated_url = "https://www.redfin.com/updated_listings.xml"
    new_listings = []

    # fetch newest sitemap and parse listings
    response = requests.get(newest_url)
    dom = etree.XML(response.content)
    for url_el in dom.xpath("//s:url", namespaces=NSMAP):
        name = url_el.find("s:loc", namespaces=NSMAP).text
        published = url_el.find("s:lastmod", namespaces=NSMAP).text
        published_dt = datetime.strptime(published, "%Y-%m-%dT%H:%M:%SZ")
        # keep only listings published since the initial run
        if published_dt > INIT_TIME:
            new_listings.append(name)

    # same process for the updated-listings sitemap
    response = requests.get(updated_url)
    dom = etree.XML(response.content)
    for url_el in dom.xpath("//s:url", namespaces=NSMAP):
        name = url_el.find("s:loc", namespaces=NSMAP).text
        updated = url_el.find("s:lastmod", namespaces=NSMAP).text
        updated_dt = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ")
        if updated_dt > INIT_TIME and name not in new_listings:
            new_listings.append(name)

    # scrape each new listing...
    for url in new_listings:
        data = scrape_listing(url)  # your listing scraper from earlier
        print("Scraped new listing:", url)
        # save listing data to a database...
Scheduling this to run every hour ensures you catch the latest real estate data.
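A bare-bones hourly scheduler is just a sleep loop (cron or a task queue works equally well); the job passed in would be the scrape_sitemaps function defined above:

```python
import time

def run_hourly(job, interval_seconds=3600, max_runs=None):
    """Call job() every interval_seconds; max_runs limits iterations (None = run forever)."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)  # wait before the next sitemap check

# run_hourly(scrape_sitemaps)  # check the sitemaps once per hour
```

For production use, a proper scheduler adds retries and logging that a bare loop lacks.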
Now let's look at how to scrape Redfin listings at scale without getting blocked.
Avoiding Blocks at Scale with Proxies
While our scrapers work fine extracting a few listings, scraping aggressively will likely get you blocked.
Redfin employs advanced bot protection and abuse prevention monitoring to prevent large scale scraping.
To scrape safely at scale, we need proxies. Proxies provide new IP addresses to route each request, distributing the load and avoiding patterns.
There are two common approaches:
1. Residential Proxies
Using proxy IPs from residential ISPs like Comcast or Verizon mimics real home users, making you appear less suspicious.
2. Datacenter Proxies
Proxies hosted with cloud providers like Amazon AWS and Google Cloud are cheap and fast, though their IP ranges are easier for sites to flag.
I recommend a mix of both proxy types for optimal performance and avoidance of blocks.
Let's look at an example using the BrightData proxy API (the SDK interface shown here is illustrative – check your provider's documentation for the exact calls):

from brightdata.sdk import BrightData

proxy = BrightData("YOUR_API_KEY")

urls = [
    # list of URLs to scrape...
]

for i, url in enumerate(urls):
    # alternate residential and datacenter proxies to vary our footprint
    with proxy.browser(residential=(i % 2 == 0)) as browser:
        browser.get(url)
        html = browser.page_source
        # parse listing data from the HTML...

The API handles proxy rotation on each request, letting us focus on data extraction.
With the right proxy solution, you can scrape Redfin listings at scale without breakages.
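If you would rather stay with plain requests, the same rotation idea can be sketched by cycling through a pool of proxy endpoints per request (the proxy addresses below are placeholders, not real endpoints):

```python
import itertools
import requests

PROXIES = [  # placeholder proxy endpoints; substitute your provider's addresses
    "http://user:pass@residential-proxy.example.com:8000",
    "http://user:pass@datacenter-proxy.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch_with_rotation(url):
    """Fetch a URL, routing each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```

Cycling a mixed residential/datacenter pool spreads requests across IPs, which is the core of what the managed APIs automate.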
Scraping Redfin Listings: Next Steps
We covered a ton of ground in this guide! Let's recap the key points:
- Redfin provides a wealth of real estate listing data
- We can extract JSON data instead of parsing HTML
- Several techniques exist for finding listings to scrape
- Tracking newest/latest sitemaps identifies fresh listings
- Proxies help avoid blocks when scraping at scale
The code examples provide a template – with more error handling and refinements you can build a robust Redfin listing scraper.
Potential next steps include:
- Storing data in a SQL or NoSQL database
- Building a custom API to serve aggregated listing data
- Analyzing data to reveal insights into pricing, demand and more
- Creating a real estate tool or dashboard on top of scraped listings
The possibilities are endless once you unlock Redfin's data at scale. Just be sure to follow ethical scraping practices and respect reasonable usage limits.
I hope this guide provides a solid starting point for your own Redfin scraping projects. Scrape on!