How to Scrape Data from Realtor.com: A Comprehensive Guide

Realtor.com is one of the most popular real estate listing websites, containing data on millions of properties for sale and rent across the United States. For real estate professionals, investors, and data scientists, the information on realtor.com is a valuable resource. However, manually searching and aggregating data from the site can be extremely time-consuming.

In this in-depth guide, we'll walk through the process of programmatically scraping data from realtor.com using Python and Selenium. We'll also see how to use the ScrapingBee service to solve CAPTCHAs and scale up your realtor.com scraping efforts. By the end, you'll have a fully functional realtor.com scraper that you can adapt for your own data needs.

Setting Up Your Python Environment

Before we start writing code, you'll need to make sure you have Python and a few key libraries installed. This guide assumes you are using Python 3.6+. If you don't already have Python, you can download an installer for Windows, macOS, or Linux at python.org.

With Python ready to go, create a new folder for your scraping project. Inside that folder, create a new Python file, e.g. realtor_scraper.py.

Now open up a terminal or command prompt, navigate to your project folder, and run the following commands to install the necessary Python packages:

pip install selenium
pip install undetected-chromedriver

This will install:

  1. Selenium – A browser automation toolkit we'll use to interact with realtor.com
  2. undetected-chromedriver – An optimized version of Selenium's ChromeDriver that helps avoid bot detection

Fetching Realtor.com Pages with Selenium

Selenium allows us to launch and control a browser through code. We'll be using Google Chrome in this guide. Add the following code to your Python file:

import undetected_chromedriver as uc

driver = uc.Chrome()

driver.get("https://www.realtor.com/realestateandhomes-search/Miami_FL")

This code launches Chrome (using undetected_chromedriver to mask the fact we are automating the browser) and navigates to the realtor.com search results page for Miami, FL.
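If you would rather not have a visible browser window, undetected-chromedriver accepts standard Chrome options. Here is a minimal sketch; note that headless mode can make the automation easier for sites to detect, so test carefully before relying on it:

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = uc.Chrome(options=options)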

You can change the URL to load results for any location. Just replace Miami_FL with the desired city and state abbreviation, e.g. New-York_NY or Chicago_IL.

Running this code, you should see a Chrome window open and load the realtor.com Miami listings page.
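Because the URL pattern is predictable, you can also build search URLs programmatically. Here is a small helper based on the pattern above (the function name search_url is our own, for illustration):

def search_url(city, state):
    # Build a realtor.com search URL from a city name and state abbreviation
    slug = city.replace(' ', '-') + '_' + state.upper()
    return 'https://www.realtor.com/realestateandhomes-search/' + slug

driver.get(search_url('New York', 'NY'))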

Analyzing Realtor.com's Page Structure

To extract data from the page, we need to determine the relevant HTML elements and how to uniquely locate them. The browser Developer Tools make this easy.

In Chrome, right-click on any listing on the page and select "Inspect" to open the Developer Tools. In the Elements panel, you can see the HTML structure of the page.

    [Image: Inspecting listing HTML in Chrome Developer Tools]

By exploring the HTML, we can see that each listing is contained in an li element with the attribute data-testid="result-card". Inside these li elements, the different data points we want to extract (price, beds, baths, etc.) are contained in elements with data-label attributes.

For example, the price is in a span with data-label="pc-price", while the address is in a div with data-label="pc-address".
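Before writing the full extraction loop, you can verify these selectors directly from the Selenium session opened earlier. A quick sanity check (assuming the driver is still on the search results page):

from selenium.webdriver.common.by import By

# Print the price of the first listing card to confirm the selectors work
first_card = driver.find_element(By.CSS_SELECTOR, 'li[data-testid="result-card"]')
print(first_card.find_element(By.CSS_SELECTOR, 'span[data-label="pc-price"]').text)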

Extracting Listing Data with Selenium

Armed with this information about the page structure, we can use Selenium and CSS selectors to locate and extract the data we want.

First, let's grab all the individual listings on the page:

from selenium.webdriver.common.by import By

listings = driver.find_elements(By.CSS_SELECTOR, 'li[data-testid="result-card"]')

find_elements() returns a list of all elements matching the given selector. Here we use a CSS attribute selector to find all li elements with a data-testid of "result-card", which correspond to the individual listings.

Now let's loop through the listings and extract the desired attributes:

from selenium.common.exceptions import NoSuchElementException

data = []
for listing in listings:
    datum = {}

    # Price
    price = listing.find_element(By.CSS_SELECTOR, 'span[data-label="pc-price"]').text
    datum['price'] = price

    # Address
    address = listing.find_element(By.CSS_SELECTOR, 'div[data-label="pc-address"]').text
    datum['address'] = address

    # Beds
    try:
        beds = listing.find_element(By.CSS_SELECTOR, 'li[data-label="pc-meta-beds"]').text
        datum['beds'] = beds
    except NoSuchElementException:
        datum['beds'] = None

    # Baths
    try:
        baths = listing.find_element(By.CSS_SELECTOR, 'li[data-label="pc-meta-baths"]').text
        datum['baths'] = baths
    except NoSuchElementException:
        datum['baths'] = None

    # Square Feet
    try:
        sqft = listing.find_element(By.CSS_SELECTOR, 'li[data-label="pc-meta-sqft"]').text
        datum['sqft'] = sqft
    except NoSuchElementException:
        datum['sqft'] = None

    data.append(datum)

For each listing li, we use find_element() to locate the child elements containing the target data points by their data-label attributes. We extract the text of these elements and store them in a dictionary which is appended to a running data list.

Note the try/except blocks when extracting beds, baths, and sqft. Some listings may not have this information available, so we need to handle the case where no matching element is found to avoid throwing an error.
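Since the same try/except pattern repeats for every optional field, you may prefer to factor it into a small helper. Here is a minimal sketch (the helper name text_or_none is ours, not part of Selenium):

from selenium.common.exceptions import NoSuchElementException

def text_or_none(element, selector):
    # Return the text of the first matching descendant, or None if absent
    try:
        return element.find_element(By.CSS_SELECTOR, selector).text
    except NoSuchElementException:
        return None

With this in place, each optional field reduces to a one-liner, e.g. datum['beds'] = text_or_none(listing, 'li[data-label="pc-meta-beds"]').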

At the end of this, data will be a list of dictionaries, each containing the extracted data for one listing:

[
    {
        'price': '$320,000',
        'address': '9273 SW 227th St, Cutler Bay, FL 33190',
        'beds': '3 beds',
        'baths': '2 baths',
        'sqft': '1,441 sqft'
    },
    {
        'price': '$1,150,000',
        'address': '1024 Biscayne Blvd # UPH07, Miami, FL 33132',
        'beds': '1 bed',
        'baths': '2 baths',
        'sqft': None
    },
    ...
]
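Before moving on, you may want to persist these results. Here is a minimal sketch using Python's built-in csv module (the filename listings.csv is arbitrary):

import csv

# Write the list of listing dictionaries to a CSV file
with open('listings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['price', 'address', 'beds', 'baths', 'sqft'])
    writer.writeheader()
    writer.writerows(data)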

Paginating Through All Search Results

The realtor.com search results are paginated, with a limited number of listings shown per page. To scrape all results, we need to navigate through the pages in our script.

Luckily, realtor.com provides a "Next" link to load the next page of results. By clicking this link in a loop until it no longer appears, we can scrape all paginated results.

Here's how we can modify our script to handle pagination:

import time

data = []

while True:
    listings = driver.find_elements(By.CSS_SELECTOR, 'li[data-testid="result-card"]')

    for listing in listings:
        # Extract data from each listing as shown in previous section
        ...

    try:
        next_link = driver.find_element(By.XPATH, '//a[@aria-label="Go to next page"]')
        next_link.click()
        time.sleep(2)  # give the next page a moment to load
    except NoSuchElementException:
        break

We wrap our existing code to extract data from each listing in a while True loop. After processing the listings on the current page, we attempt to locate the "Next" link (identifiable by its aria-label attribute) and click it to load the next page.

If no "Next" link is found, we assume we‘ve reached the last page of results and break out of the pagination loop.

With this modification, our script will now extract data from all listings across all pages of search results.

Handling CAPTCHAs with ScrapingBee

If you run the realtor.com scraper for long enough, you will likely encounter a CAPTCHA page. This is a common anti-bot measure used by websites to prevent scraping.

Solving CAPTCHAs is a complex topic and often requires paid services. One such service is ScrapingBee, which provides an API to handle interactions with websites, including solving CAPTCHAs.

Here's how you can offload your realtor.com scraping to ScrapingBee and avoid CAPTCHAs and IP blocking:

First, install the ScrapingBee Python SDK:

pip install scrapingbee

Next, sign up for a ScrapingBee account at https://app.scrapingbee.com/signup to get an API key.

Now you can use ScrapingBee to scrape realtor.com as follows:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get(
    'https://www.realtor.com/realestateandhomes-search/Miami_FL',
    params={
        'extract_rules': {
            'listings': {
                'selector': 'li[data-testid="result-card"]',
                'type': 'list',
                'output': {
                    'price': 'span[data-label="pc-price"]',
                    'address': 'div[data-label="pc-address"]',
                    'beds': 'li[data-label="pc-meta-beds"]',
                    'baths': 'li[data-label="pc-meta-baths"]',
                    'sqft': 'li[data-label="pc-meta-sqft"]'
                }
            },
            'next_page': {
                'selector': '//a[@aria-label="Go to next page"]',
                'output': '@href'
            }
        },
        # Wait for the results list to render; this auto-generated class name
        # may change over time, so update it if requests start timing out
        'wait_for': 'ul.jsx-4195823979',
        'premium_proxy': 'true'
    }
)

data = response.json()

This code uses ScrapingBee's extract_rules feature to specify the CSS selectors for the data we want to extract. ScrapingBee will handle rendering the JavaScript on the page (via the wait_for parameter), extract the data according to our rules, and return it as structured JSON.

ScrapingBee also manages proxy rotation and CAPTCHA solving behind the scenes, so we don't need to worry about our scraper getting blocked.

To fetch subsequent pages of results, you can check for the presence of data['next_page'] and, if it exists, make another request to ScrapingBee with that URL to get the next page of listings.
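Putting this together, a full pagination loop against the ScrapingBee API might look like the sketch below. It reuses the same style of extraction rules (condensed to two fields here for brevity) and uses urljoin to resolve the href, since realtor.com's "Next" link is typically a relative URL:

from urllib.parse import urljoin

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# Same style of extraction rules as above, reused for every page request
params = {
    'extract_rules': {
        'listings': {
            'selector': 'li[data-testid="result-card"]',
            'type': 'list',
            'output': {
                'price': 'span[data-label="pc-price"]',
                'address': 'div[data-label="pc-address"]'
            }
        },
        'next_page': {
            'selector': '//a[@aria-label="Go to next page"]',
            'output': '@href'
        }
    },
    'premium_proxy': 'true'
}

all_listings = []
url = 'https://www.realtor.com/realestateandhomes-search/Miami_FL'

while url:
    page = client.get(url, params=params).json()
    all_listings.extend(page.get('listings', []))

    # next_page holds the href of the "Next" link; stop when it is absent
    next_href = page.get('next_page')
    url = urljoin(url, next_href) if next_href else None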

Scaling Up with Scrapy

For more complex scraping projects, you may want to consider using a dedicated scraping framework like Scrapy.

Scrapy is a powerful and extensible Python library for building web scrapers. It provides a lot of useful features out-of-the-box, such as:

  • Built-in support for generating and following links to crawl entire sites
  • Robust selection and extraction using CSS and XPath selectors
  • Feed exports to store scraped data in JSON, CSV, XML and more
  • Throttling and politeness controls to avoid overloading servers
  • Middleware for filtering requests, handling cookies, authentication, compression, etc.
  • Extensions for caching, stats collection, telnet consoles and more

Converting a Selenium-based realtor.com scraper to Scrapy would allow you to scale up your scraping efforts significantly. You could easily parallelize the scraping, persist progress to pause and resume jobs, rotate user agents and IP addresses, and much more.
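To give a concrete feel for the framework, here is a minimal sketch of what a realtor.com spider might look like in Scrapy. It assumes the same selectors from earlier work on the raw HTML that Scrapy fetches; in practice you would likely also pair it with proxy or anti-bot middleware to avoid blocking:

import scrapy

class RealtorSpider(scrapy.Spider):
    name = 'realtor'
    start_urls = ['https://www.realtor.com/realestateandhomes-search/Miami_FL']

    def parse(self, response):
        # Yield one item per listing card on the current results page
        for card in response.css('li[data-testid="result-card"]'):
            yield {
                'price': card.css('span[data-label="pc-price"]::text').get(),
                'address': card.css('div[data-label="pc-address"]::text').get(),
                'beds': card.css('li[data-label="pc-meta-beds"]::text').get(),
                'baths': card.css('li[data-label="pc-meta-baths"]::text').get(),
                'sqft': card.css('li[data-label="pc-meta-sqft"]::text').get(),
            }

        # Follow the "Next" link until it no longer appears
        next_page = response.xpath('//a[@aria-label="Go to next page"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

You could run this with scrapy runspider realtor_spider.py -o listings.json to crawl all pages and export the results in one step.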

Conclusion

In this guide, we've seen how to scrape data from realtor.com using Python and Selenium. The key steps are:

  1. Use Selenium to fetch and render pages
  2. Analyze the page HTML to determine selectors for target data points
  3. Locate and extract the data using Selenium's find_element(s) methods
  4. Handle pagination by finding and clicking "Next" links
  5. Integrate ScrapingBee to avoid CAPTCHAs and IP blocking
  6. Consider Scrapy for larger-scale, production-ready scraping projects

While web scraping can be a powerful tool for gathering data, it's important to respect website terms of service and be a good net citizen by limiting request rates and only scraping publicly available data.

With the techniques outlined here, you should be able to build a realtor.com scraper to extract the data you need. If you have any questions or get stuck along the way, feel free to reach out – we're always happy to help with your web scraping projects!
