
Booking.com is one of the world's leading travel booking websites, with over 28 million listings including hotels, apartments, resorts, and more. For those in the travel industry, gathering data on Booking.com's listings can provide valuable market intelligence and competitive insights. Web scraping allows you to automatically extract this publicly available data at scale.

In this in-depth guide, we'll walk through how to build a robust web scraper for Booking.com search results using Python and Selenium. We'll cover:

  • Setting up your web scraping environment
  • Fetching and parsing search result pages
  • Handling pagination to scrape all results
  • Storing the extracted data
  • Tips for making your Booking.com scraper faster and more reliable

Whether you're new to web scraping or a seasoned pro, this guide will equip you with the skills needed to extract data from Booking.com and adapt these techniques to scrape data from other websites. Let's dive in!

Setting up the Web Scraping Environment

Before we start writing code, you'll need to install a few tools and libraries:

  • Python – We'll be using Python 3. Recent Selenium releases require Python 3.7+, and the code below uses Python 3 features such as f-strings.
  • Selenium WebDriver – Selenium automates interactions with websites in a real browser. We'll use it to load pages and extract data. Install with pip install selenium.
  • ChromeDriver – Selenium needs a driver executable to launch Chrome. Download the version that matches your Chrome installation from the official ChromeDriver downloads page.
  • WebDriver Manager (optional) – Automatically handles downloading the appropriate ChromeDriver version. Install with pip install webdriver-manager.

Once you have the prerequisites ready, create a new Python file, import the required Selenium libraries, and initialize the Chrome WebDriver:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
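Optionally, if you're running on a server or don't want a browser window popping up, Chrome can be started headless via options. A minimal sketch (the flag below targets recent Chrome versions):


from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)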

Fetching and Parsing Booking.com Search Results

With Selenium set up, we're ready to start scraping! Let's begin by loading a Booking.com search results page:


url = "https://www.booking.com/searchresults.html?ss=London&efdco=1"
driver.get(url)

This directs Selenium to fetch the page for hotels in London. Note the efdco=1 parameter in the URL – this tells Booking.com to load all results upfront rather than lazy loading, which makes scraping easier.
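Since the search URL is just a query string, it can also be composed programmatically. A small sketch using urllib.parse — note that the checkin/checkout parameter names reflect what the site used at the time of writing and may change, and the dates are placeholders:


from urllib.parse import urlencode

params = {
    'ss': 'London',           # the search string
    'checkin': '2024-07-01',  # placeholder stay dates
    'checkout': '2024-07-03',
    'efdco': 1,
}
url = f"https://www.booking.com/searchresults.html?{urlencode(params)}"
driver.get(url)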

Now let's parse the loaded page to extract data on the individual properties. We'll fetch the following attributes:

  • Name
  • Address
  • Price
  • Rating
  • Number of reviews

We can locate the elements containing each of these attributes using CSS selectors. Selenium's find_elements() method allows us to find all matching elements.

Here's how to extract a list of property cards from the search results:


property_cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="property-card"]')

To get the name and link for each property:


for card in property_cards:
    name = card.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text

    link = card.find_element(
        By.CSS_SELECTOR, 'a[data-testid="title-link"]'
    ).get_attribute('href')

Similarly, we can fetch the address and price:


address = card.find_element(
    By.CSS_SELECTOR, '[data-testid="address"]'
).text

price = card.find_element(
    By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
).text

Finally, here's how to extract the rating and number of reviews:


reviews = card.find_element(
    By.CSS_SELECTOR, '[data-testid="review-score"]'
).text
rating, num_reviews = reviews.split('\n')
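One caveat: the exact line layout of the review block can vary, and some listings have no reviews at all, so in practice it's safer to guard the unpacking. A defensive sketch:


parts = reviews.split('\n')
rating = parts[0] if parts else ''
num_reviews = parts[1] if len(parts) > 1 else ''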

Putting it all together, we can extract a full list of dictionaries representing each property:


properties = []

for card in property_cards:
    name = card.find_element(By.CSS_SELECTOR, 'div[data-testid="title"]').text
    link = card.find_element(By.CSS_SELECTOR, 'a[data-testid="title-link"]').get_attribute('href')
    address = card.find_element(By.CSS_SELECTOR, '[data-testid="address"]').text
    price = card.find_element(By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]').text

    reviews = card.find_element(By.CSS_SELECTOR, '[data-testid="review-score"]').text
    rating, num_reviews = reviews.split('\n')

    properties.append({
        'name': name,
        'link': link,
        'address': address,
        'price': price,
        'rating': rating,
        'number_of_reviews': num_reviews
    })

Paginating Through Search Results

To scrape properties beyond the first page of search results, we need to click the "Next" button and repeat the parsing process on each page.

We can find the next button using its accessibility label:


next_button = driver.find_element(
    By.XPATH, '//button[@aria-label="Next page"]'
)

To determine how many pages of results there are in total:


page_nums = driver.find_elements(
    By.CSS_SELECTOR, 'nav[aria-label="Pagination"] button[data-testid="pagination-button"]'
)
num_pages = int(page_nums[-1].text) if page_nums else 1

Putting it all together, here's how to loop through all pages and accumulate the scraped properties:


from selenium.common.exceptions import NoSuchElementException

all_properties = []

while True:
    property_cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="property-card"]')

    for card in property_cards:
        ...  # build property_details for the card, as shown above
        all_properties.append(property_details)

    try:
        next_button = driver.find_element(By.XPATH, '//button[@aria-label="Next page"]')
        next_button.click()
    except NoSuchElementException:
        # no "Next page" button means we've reached the last page
        break

Making the Booking.com Scraper More Robust

This basic scraping script works, but it has a few potential issues:

  • It doesn't wait for the next page to load before trying to parse
  • It can't recover from errors or pick up where it left off
  • It's relatively slow since it runs sequentially

Let's look at a few techniques to make our Booking.com scraper faster and more reliable.

Waiting for Page Loads

As is, our script assumes the content from the next page will be available instantly after clicking. However, there's often a delay, especially on slow connections.
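One option that needs no extra dependency is Selenium's built-in explicit waits. A minimal sketch that blocks until property cards are present (the 15-second timeout is an arbitrary choice):


from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for at least one property card in the DOM; on a
# single-page transition you may instead need to wait for the old cards
# to go stale before this check is meaningful
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-testid="property-card"]'))
)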

For a stronger guarantee, we can integrate Selenium Wire, an extension of Selenium that allows inspecting the requests made by the browser. With it, we can wait for the XHR request that loads new results before parsing:


from seleniumwire import webdriver

...

def wait_for_results(driver):
    driver.wait_for_request('https://www.booking.com/fragment.en-us.json', timeout=15)
    del driver.requests

Calling this function after each page click ensures data is available when we try to extract it.
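In the pagination loop, that looks something like this:


next_button.click()
wait_for_results(driver)  # block until the results request completes
property_cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="property-card"]')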

Parallelizing Across Search Result Pages

While waiting for requests reduces errors, it slows down the crawl. To speed things up, we can parallelize the scraper, loading and parsing multiple pages concurrently.

There are a few ways to achieve concurrency in Python; here we'll use ProcessPoolExecutor from the standard library's concurrent.futures module, which keeps the code simple:


from concurrent.futures import ProcessPoolExecutor, as_completed

...

def scrape_page(page_num):
    # WebDriver instances can't be shared across processes, so each
    # worker creates (and cleans up) its own driver
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
    try:
        driver.get(f"{url}&offset={page_num * 25}")
        ...  # parse the property cards as before
        return properties
    except Exception:
        return []
    finally:
        driver.quit()

...

with ProcessPoolExecutor() as executor:
    futures = [executor.submit(scrape_page, i) for i in range(num_pages)]

    for future in as_completed(futures):
        all_properties.extend(future.result())

This divides up the search result pages between multiple worker processes, allowing them to be scraped in parallel.

Persisting the Scraped Data

So far, we've accumulated the scraped property data in a Python list. This is fine for a small, one-off project, but has some drawbacks:

  • The data is lost if the script crashes
  • It consumes increasing memory as the list grows
  • Storing and sharing the data is harder

Instead, we can save each property to a database as it's scraped. This provides a persistent record that can survive crashes and stores data more efficiently.

For example, here's how we could insert each property into a SQLite database:


import sqlite3

conn = sqlite3.connect('booking.db')
c = conn.cursor()

c.execute('''CREATE TABLE IF NOT EXISTS properties
             (name TEXT, address TEXT, rating REAL, num_reviews INTEGER, price TEXT)''')

...

def save_property(prop):
    # "with conn" commits the insert on success and rolls back on error
    with conn:
        c.execute("INSERT INTO properties VALUES (?, ?, ?, ?, ?)", (
            prop['name'],
            prop['address'],
            prop['rating'],
            prop['number_of_reviews'],
            prop['price']
        ))

...

for prop in properties:
    save_property(prop)

With this approach, we have a permanent, structured record of the scraped properties that can be easily queried and shared. Switching to a client-server database like MySQL or PostgreSQL would allow scaling the data store and accessing it from multiple machines.
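For instance, once the crawl has finished, quick summaries can be pulled straight out of SQLite — a small sketch reusing the cursor from above:


# ten highest-rated properties from the scraped set
for row in c.execute(
    "SELECT name, rating, price FROM properties ORDER BY rating DESC LIMIT 10"
):
    print(row)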

Integrating with Scrapy

For large-scale, production web scraping projects, it's often beneficial to leverage an existing scraping framework rather than building from scratch with Selenium.

Scrapy is a popular Python framework that provides a full suite of tools for writing and running robust web crawlers. Some of its key features include:

  • Built-in concurrency and request throttling
  • Automatic retries and error handling
  • Pluggable middleware for extensibility
  • Support for exporting to various formats

While Scrapy spiders typically make HTTP requests directly rather than driving a full browser, it's possible to integrate Scrapy with Selenium for handling sites that require JavaScript.

Here's a simplified example of what a Scrapy spider for Booking.com might look like:


import scrapy
from scrapy_selenium import SeleniumRequest

class BookingSpider(scrapy.Spider):
    name = 'bookingcom'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.booking.com/searchresults.html?ss=London',
            wait_time=5,
            callback=self.parse
        )

    def parse(self, response):
        properties = response.css('div[data-testid="property-card"]')

        for card in properties:
            item = {}  # extract property details from the card, as before
            yield item

        # follow the pagination link, if there is one
        next_link = response.css('#search_results_table .paging-next::attr(href)').get()
        if next_link:
            yield SeleniumRequest(
                url=response.urljoin(next_link),
                wait_time=5,
                callback=self.parse
            )

The spider yields a series of requests, starting with the initial search results page. The parse() callback extracts data from each property card, yields the resulting item, and recursively follows the pagination links.

Scrapy's SeleniumRequest class, provided by the scrapy-selenium plugin, handles rendering the page with Selenium before extracting properties using standard Scrapy selectors. This allows combining the powerful abstractions of Scrapy with Selenium's ability to handle JavaScript-heavy sites.

By default, Scrapy will run requests in parallel and handle retrying on failures, so much of the infrastructure we had to build ourselves with plain Selenium comes out of the box. The framework also provides a variety of built-in options for exporting and storing the scraped data.
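For example, with the spider above, dumping every yielded item to a JSON file takes a single command and no pipeline code:


scrapy crawl bookingcom -o properties.json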

However, integrating Selenium does add significant overhead compared to a pure Scrapy approach. In many cases, it may be preferable to intercept and reverse-engineer the underlying API requests made by Booking.com using your browser's developer tools. Accessing the raw data in JSON format is often much faster and more stable than attempting to parse data from the rendered HTML.
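Selenium Wire, which we used earlier for waiting on requests, can also help with that discovery step by listing the JSON responses a page triggered — a rough sketch:


# after loading a search page with the seleniumwire driver, list
# candidate JSON endpoints the page called
for request in driver.requests:
    if request.response and 'json' in request.response.headers.get('Content-Type', ''):
        print(request.method, request.url)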

Closing Thoughts

Web scraping is a powerful tool for gathering data from Booking.com and other online travel agencies. With some basic Python skills and tools like Selenium and Scrapy, it's possible to build a scraper that extracts large amounts of data on properties, rates, and availability.

However, web scraping also comes with significant responsibilities. It's crucial to respect the rules laid out in the target website's robots.txt file and terms of service. Booking.com's terms explicitly prohibit scraping their site without express permission:

Automated queries (including screen and database scraping, etc.,) are not permitted on Booking.com, unless you have obtained prior written permission from Booking.com.

Violating these terms could lead Booking.com to take legal action or ban your IP address from accessing their site. For many use cases, Booking.com offers an official API that can provide a more reliable and permissible way to access their data.

If you do proceed with scraping Booking.com, take care to throttle your requests to a reasonable rate and avoid disrupting the normal operation of their site. Use a custom user agent string so that Booking.com can identify your crawler and reach out if there are issues. And of course, be sure to comply with all relevant data protection laws such as the GDPR when collecting and processing personal information.
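For instance, you could set an identifying user agent when creating the driver and pause between page loads — the bot name and contact address below are placeholders:


import time
from selenium.webdriver.chrome.options import Options

options = Options()
# identify the crawler so site operators can reach you; placeholder values
options.add_argument('user-agent=MyTravelResearchBot/1.0 (+mailto:you@example.com)')
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)

...
time.sleep(2)  # wait a couple of seconds between page loads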

With those caveats in mind, this guide should provide a solid foundation for building a robust and efficient web scraper for Booking.com. Equipped with an understanding of Selenium, Scrapy, and strategies for responsible scraping, you'll be well on your way to turning Booking.com's vast trove of travel data into valuable insights. Happy scraping!
