As a real estate investor and data scientist, I'm fascinated by the insights hiding in real estate listing data. Redfin, one of the largest real estate sites with over 5 million monthly visitors, offers an abundance of juicy listing details just waiting to be extracted.
In this step-by-step guide, you'll learn how I use Python to scrape key Redfin real estate data to fuel my property investment algorithms.
By the end, you'll be able to build your own Redfin listing scraper to amass datasets on pricing, housing market trends and more. Let's dive in!
Why Scrape Redfin Listing Data?
Before we get into the technical nitty-gritty, let's discuss why you may want to scrape Redfin listings in the first place.
1. Market Research
Redfin has over 2 million active real estate listings across the US – that's a massive dataset for analyzing pricing, demand and trends across regions. Extracting and crunching this data can reveal valuable market insights.
2. Competitor Research
Studying what homes your competitor real estate agents have sold and for what price points can give you an edge in negotiations.
3. Algorithm Development
Scraped real estate data can serve as input for automated home valuation algorithms using machine learning. With enough historical examples you can make accurate price predictions.
4. Home Search Tool
Scrape all listings in a city updated daily to build a super-charged real estate search tool or investment dashboard.
In short, if you can extract and leverage Redfin's abundant listing data, the possibilities are endless. Now let's see how to do just that.
Overview of Available Redfin Listing Data
Before scraping, it helps to understand exactly what data we can extract from Redfin. Here are some of the main data fields available on a typical listing page:
Address Details
- Full address
- ZIP code
- County
- Lot size
- Parcel ID
Price Info
- Current listing price
- Price history and changes
- Price drop details
- Price per square foot
Property Details
- Square footage
- Number of bedrooms
- Number of bathrooms
- Year built
- Property type (house, condo, multi-family)
Description
- Full listing description
- List of amenities
Photos & Virtual Tour
- Main listing photos
- Additional neighborhood photos
- 3D Scan Tour (if available)
- Drone photos (if available)
Agent Info
- Listing agent details
- Agent photo
- Office details
- Agent profile & contact info
Interactive Maps
- Boundary outlines
- School district shapes
- Commute time layers
- Neighborhood insights
And much more…
As you can see, each Redfin listing contains a treasure trove of data. Now let's see how we can extract it.
Scraping a Redfin Listing Page
There are a couple of approaches to scraping a Redfin listing:
- Parse the rendered HTML – messy and prone to breaking
- Extract the raw JSON data – the future-proof method
I highly recommend scraping the JSON. Redfin uses React to render listing pages, loading data asynchronously from internal JSON APIs.
We can mimic the browser and fetch this data directly. Here is a Python code snippet to extract the raw listing JSON:
import requests
import re
import json

listing_url = "https://www.redfin.com/CA/San-Francisco/123-Main-St-94104/home/123456"

response = requests.get(listing_url)

# locate the embedded JSON payload in the page source
match = re.search(r"window\.__data__ = (.+);", response.text)
json_data = json.loads(match.group(1))

print(json_data.keys())
This reaches into the guts of the Redfin page to pull out the complete listing data. Much easier than parsing the rendered HTML!
The JSON structure is nested, containing every piece of data displayed. Let's see how to parse it into something more usable.
Parsing Key Listing Data Fields
The raw listing JSON contains a ton of unnecessary technical meta-data. We want to extract just the important stuff.
Here's an example parser to pull fields into a clean Python dictionary:
def parse_listing(data):
    return {
        "price": data["homeDetails"]["price"],
        "address": data["homeDetails"]["address"]["line1"],
        "beds": data["homeDetails"]["beds"],
        "baths": data["homeDetails"]["baths"],
        "sqft": data["homeDetails"]["sqft"],
        "built": data["homeDetails"]["yearBuilt"],
        "desc": data["homeDetails"]["description"],
        "photos": [p["highResPath"] for p in data["homeDetails"]["photos"]],
    }

data = parse_listing(json_data)
print(data)
With some simple JSON traversal we can extract key fields into usable formats.
Now a caveat… while the above works, JSON structures can change unexpectedly, breaking scrapers. Best practice is to wrap parsing logic in try/except blocks to gracefully handle any field changes Redfin may push.
This ensures your scraper keeps chugging along, instead of crashing.
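As a minimal sketch of that defensive pattern (the "homeDetails" keys mirror the example parser above and remain assumptions about Redfin's JSON layout):

```python
def safe_parse_listing(data):
    """Parse listing fields, tolerating missing or renamed keys."""
    details = data.get("homeDetails", {})  # assumed top-level key, as in the parser above
    listing = {}
    for field, key in [
        ("price", "price"),
        ("beds", "beds"),
        ("baths", "baths"),
        ("sqft", "sqft"),
    ]:
        try:
            listing[field] = details[key]
        except KeyError:
            listing[field] = None  # field missing: record None instead of crashing
    return listing

# A listing missing "baths" still parses instead of raising KeyError
print(safe_parse_listing({"homeDetails": {"price": 950000, "beds": 2, "sqft": 1100}}))
```

Recording None for absent fields also makes it easy to audit which keys Redfin changed after the fact.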
Finding Redfin Listing Pages to Scrape
Alright, we can scrape individual listing pages… but how do we find listings to feed into our scraper?
Redfin offers a few options for sourcing listing URLs:
Search Pages
Simulate searches through Redfin's website, paginating by adjusting the page parameter:
https://www.redfin.com/city/37449/CA/San-Francisco/filter/property-type=house,max-price=1.5M,page=2
Pros:
- Mirrors user search experience
Cons:
- Easy to get blocked scraping aggressively
- Hard to paginate all listings
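The pagination above can be generated programmatically; this sketch assumes the filter-string and page-parameter format shown in the example URL:

```python
def search_page_urls(base, filters, max_pages):
    """Build paginated Redfin search URLs from a base city URL and a filter string."""
    return [f"{base}/filter/{filters},page={page}" for page in range(1, max_pages + 1)]

urls = search_page_urls(
    "https://www.redfin.com/city/37449/CA/San-Francisco",
    "property-type=house,max-price=1.5M",
    max_pages=3,
)
for u in urls:
    print(u)
```

Capping max_pages conservatively also helps avoid the aggressive-scraping blocks mentioned above.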
Location Sitemaps
Redfin provides detailed sitemaps splitting listings by city, county, neighborhood, ZIP code and more.
For example:
https://www.redfin.com/sitemap-location-CA-San-Francisco-buy.xml
https://www.redfin.com/sitemap-neighborhood-CA-San-Francisco-Mission-District-buy.xml
Pros:
- Structured sitemap format
- Granular segmentation of listings
Cons:
- Sitemaps need to be manually discovered
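Sitemaps are conventionally advertised in a site's robots.txt, which gives one way to discover them automatically (assuming Redfin follows that convention; the helper names here are my own):

```python
import requests

def parse_sitemap_lines(robots_text):
    """Extract the URLs from Sitemap: entries in robots.txt text."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

def discover_sitemaps(robots_url):
    """Fetch robots.txt and return any sitemap URLs it advertises."""
    return parse_sitemap_lines(requests.get(robots_url, timeout=30).text)

# sitemaps = discover_sitemaps("https://www.redfin.com/robots.txt")
```

Sitemap index files found this way often link to the per-city and per-neighborhood sitemaps described above.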
Newest/Latest Sitemaps
Special sitemaps provide newly listed properties as well as recent listing updates:
https://www.redfin.com/newest_listings.xml
https://www.redfin.com/updated_listings.xml
Pros:
- Get fresh data as it appears
Cons:
- Only new or updated listings
Crawling Related Listings
Each listing page contains a "Similar Homes" section whose links we can recursively crawl.
Pros:
- Expand to related listings
Cons:
- Easy to crawl unrelated homes
My recommended approach is to combine sitemaps for breadth with crawling for depth.
Now let's look at continuously collecting the freshest listings.
Tracking New and Updated Listing Data
While scraping historical listing data is useful, we ideally want new and updated listings as they appear to keep our dataset current.
Redfin provides two key sitemaps for this:
Newest Sitemap
This feed lists newly published listings with the publish timestamp:
<url>
<loc>https://www.redfin.com/CA/San-Francisco/456-Park-St-94102/home/4567123</loc>
<lastmod>2023-03-01T22:10:35Z</lastmod>
</url>
Latest Sitemap
The latest sitemap indexes recently updated listings:
<url>
<loc>https://www.redfin.com/CA/San-Francisco/123-Main-St-94104/home/123456</loc>
<lastmod>2023-03-01T22:18:26Z</lastmod>
</url>
Here is sample Python code to continuously scrape these sitemaps:

import requests
from lxml import etree
from datetime import datetime

INIT_TIME = datetime(2023, 2, 1)  # initial run

# standard sitemap XML namespace
NSMAP = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def scrape_sitemaps():
    newest_url = "https://www.redfin.com/newest_listings.xml"
    updated_url = "https://www.redfin.com/updated_listings.xml"
    new_listings = []

    # fetch newest sitemap and parse listings
    response = requests.get(newest_url)
    dom = etree.XML(response.content)
    for url_el in dom.xpath("//s:url", namespaces=NSMAP):
        name = url_el.find("s:loc", namespaces=NSMAP).text
        published = url_el.find("s:lastmod", namespaces=NSMAP).text
        published_dt = datetime.strptime(published, "%Y-%m-%dT%H:%M:%SZ")
        # keep only listings published since the initial run
        if published_dt > INIT_TIME:
            new_listings.append(name)

    # same process for the updated-listings sitemap
    response = requests.get(updated_url)
    dom = etree.XML(response.content)
    for url_el in dom.xpath("//s:url", namespaces=NSMAP):
        name = url_el.find("s:loc", namespaces=NSMAP).text
        updated = url_el.find("s:lastmod", namespaces=NSMAP).text
        updated_dt = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ")
        if updated_dt > INIT_TIME and name not in new_listings:
            new_listings.append(name)

    # scrape each new listing...
    for url in new_listings:
        data = scrape_listing(url)  # your listing scraper from earlier
        print("Scraped new listing:", url)
        # save listing data to a database...
Scheduling this to run every hour ensures you catch the latest real estate data.
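A bare-bones hourly scheduler is just a sleep loop (cron or a task queue works equally well); the job passed in would be the scrape_sitemaps function defined above:

```python
import time

def run_hourly(job, interval_seconds=3600, max_runs=None):
    """Call job() every interval_seconds; max_runs limits iterations (None = run forever)."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)  # wait before the next sitemap check

# run_hourly(scrape_sitemaps)  # check the sitemaps once per hour
```

For production use, a proper scheduler adds retries and logging that a bare loop lacks.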
Now let's look at how to scrape Redfin listings at scale without getting blocked.
Avoiding Blocks at Scale with Proxies
While our scrapers work fine extracting a few listings, scraping aggressively will likely get you blocked.
Redfin employs advanced bot protection and abuse prevention monitoring to prevent large scale scraping.
To scrape safely at scale, we need proxies. Proxies provide new IP addresses to route each request, distributing the load and avoiding patterns.
There are two common approaches:
1. Residential Proxies
Using proxy IPs from residential ISPs like Comcast or Verizon mimics real home users, making you appear less suspicious.
2. Datacenter Proxies
Proxies hosted with cloud providers like Amazon AWS and Google Cloud are cheap and fast, though their IP ranges are easier for sites to flag.
I recommend a mix of both proxy types for optimal performance and avoidance of blocks.
Let's look at an example using the BrightData proxy API (the SDK interface shown here is illustrative – check your provider's documentation for the exact calls):

from brightdata.sdk import BrightData

proxy = BrightData("YOUR_API_KEY")

urls = [
    # list of URLs to scrape...
]

for i, url in enumerate(urls):
    # alternate residential and datacenter proxies to vary our footprint
    with proxy.browser(residential=(i % 2 == 0)) as browser:
        browser.get(url)
        html = browser.page_source
        # parse listing data from the HTML...

The API handles proxy rotation on each request, letting us focus on data extraction.
With the right proxy solution, you can scrape Redfin listings at scale without breakages.
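If you would rather stay with plain requests, the same rotation idea can be sketched by cycling through a pool of proxy endpoints per request (the proxy addresses below are placeholders, not real endpoints):

```python
import itertools
import requests

PROXIES = [  # placeholder proxy endpoints; substitute your provider's addresses
    "http://user:pass@residential-proxy.example.com:8000",
    "http://user:pass@datacenter-proxy.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch_with_rotation(url):
    """Fetch a URL, routing each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```

Cycling a mixed residential/datacenter pool spreads requests across IPs, which is the core of what the managed APIs automate.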
Scraping Redfin Listings: Next Steps
We covered a ton of ground in this guide! Let's recap the key points:
- Redfin provides a wealth of real estate listing data
- We can extract JSON data instead of parsing HTML
- Several techniques exist for finding listings to scrape
- Tracking newest/latest sitemaps identifies fresh listings
- Proxies help avoid blocks when scraping at scale
The code examples provide a template – with more error handling and refinements you can build a robust Redfin listing scraper.
Potential next steps include:
- Storing data in a SQL or NoSQL database
- Building a custom API to serve aggregated listing data
- Analyzing data to reveal insights into pricing, demand and more
- Creating a real estate tool or dashboard on top of scraped listings
The possibilities are endless once you unlock Redfin's data at scale. Just be sure to follow ethical scraping practices and respect reasonable usage limits.
I hope this guide provides a solid starting point for your own Redfin scraping projects. Scrape on!