How to Scrape Idealista.com in Python – An In-Depth Guide

Idealista is the leading real estate platform in Spain with over 1.5 million property listings. For anyone looking to analyze the Spanish property market, Idealista is an invaluable data source.

In this comprehensive guide, we'll explore scraping Idealista using Python to extract key real estate data.

Here's what we'll cover:

  • Scraping Property Details
  • Finding All Listings to Scrape
  • Avoiding Detection and Blocking
  • Leveraging Scraping APIs
  • Storing and Analyzing Scraped Data
  • Scraping Best Practices

By the end, you'll have an extensive blueprint for building a scalable Idealista web scraper from scratch.

Let's get started!

Scraping Idealista Property Details

The first step is understanding how to scrape details from a single property page on Idealista.

These pages contain all the key attributes for a listing like price, description, photos, amenities and so on.

Here's an example property page:

https://www.idealista.com/inmueble/97546338/

To scrape this page, we'll use Python and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# Browser-like headers help avoid the default python-requests fingerprint
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}

url = "https://www.idealista.com/inmueble/97546338/"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

With the page loaded, we can use CSS selectors to extract any data we want:

Price

price = soup.select_one(".info-data-price span").text
print(price)
# "145,000€"

Title

title = soup.select_one("h1.main-info__title-main").text
print(title) 
# "Apartment with 2 bedrooms for sale in Arganzuela, Madrid"

Description

desc = soup.select_one("div.comment").text
print(desc)
# "Bright and exterior 2 bedroom 1 bathroom apartment located..." 

And so on for all the fields we need (a combined extraction sketch follows the list):

  • Location
  • Features
  • Number of rooms
  • Area
  • Floor
  • Available from
  • Property type
  • Furnishing
  • New build or resale
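
To show how these fields come together, here's a minimal sketch that collects several attributes into one dictionary. The location and feature-list selectors are assumptions about Idealista's current markup and may need adjusting:

# Sketch: gather several attributes into one dict.
# Selectors beyond the ones shown above are assumptions and may need updating.
def parse_listing(soup):
    return {
        "title": soup.select_one("h1.main-info__title-main").text.strip(),
        "location": soup.select_one("span.main-info__title-minor").text.strip(),
        "price": soup.select_one(".info-data-price span").text.strip(),
        "description": soup.select_one("div.comment").text.strip(),
        # feature bullets (rooms, area, floor, etc.) live in list items
        "features": [li.text.strip() for li in soup.select(".details-property li")],
    }

listing = parse_listing(soup)
print(listing)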

The same approach extends to scraping 50+ attributes from any property page on Idealista using Python and BeautifulSoup.

This gives us a solid foundation for extracting structured data from listings.

Next, let's look at how to find all the Idealista listings we want to feed into this scraper.

Discovering All Idealista Listings to Scrape

With over 1.5 million active listings, we can't practically scrape every single Idealista page.

Instead, we need a way to methodically find all listings matching our criteria. Idealista offers a few avenues for this:

Scraping Search Pages

The easiest option is to scrape Idealista's search pages for different cities, property types, and more.

For example, here's a filtered search for apartments for sale in Madrid:

https://www.idealista.com/venta-viviendas/madrid-capital/con-precio-hasta_250000,pisos,de-particular,excelentes,con-terraza,obra-nueva,con-garaje,de-banca,con-piscina,de-proteccion-oficial,con-ascensor,con-calefaccion,suelo-radiante,amueblados,con-trastero,con-jardin,con-vistas,cerca-del-transporte-publico,con-garaje-incluido,con-garaje-opcional,con-videos,con-planos/

Each results page contains 30 listings, and results are paginated across 100+ pages.

We can scrape each page to extract all listing URLs:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://www.idealista.com/venta-viviendas/madrid-capital/con-precio-hasta_250000,pisos,de-particular,excelentes,con-terraza,obra-nueva,con-garaje,de-banca,con-piscina,de-proteccion-oficial,con-ascensor,con-calefaccion,suelo-radiante,amueblados,con-trastero,con-jardin,con-vistas,cerca-del-transporte-publico,con-garaje-incluido,con-garaje-opcional,con-videos,con-planos/"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}

for page in range(1, 100):

  url = base + "?pagina=" + str(page)

  res = requests.get(url, headers=headers)
  soup = BeautifulSoup(res.content, "html.parser")

  for prop in soup.select("article.item a.item-link"):
    # listing hrefs are relative (e.g. /inmueble/97546338/), so join with the domain
    prop_url = urljoin("https://www.idealista.com", prop["href"])
    print(prop_url)

  time.sleep(1)  # throttle to stay well under the site's limits

Based on Idealista's advanced search, we can tweak the filters to target exactly the types of properties we want.

This provides an easy way to generate hundreds of thousands of listing URLs for scraping.

Estimated Listings per Search

Here's a rough estimate of the number of listings available per search query:

  • City level (Barcelona): 150,000 listings
  • Neighborhood level (Eixample): 35,000 listings
  • Property type (chalets in Madrid): 65,000 listings

Of course, we need to be careful not to overload any search with too many requests. A good rule of thumb is to stay under 5 requests per second or 18,000 requests per hour.

We can calculate the number of pages and throttle requests accordingly.
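
As a back-of-the-envelope sketch (the numbers are illustrative, not Idealista limits): with 30 listings per results page, we can turn an expected result count into a page count and a per-request delay:

import math
import time

total_listings = 150_000          # e.g. a city-level search (illustrative)
listings_per_page = 30
max_requests_per_second = 2       # conservative, below the rule of thumb above

pages = math.ceil(total_listings / listings_per_page)
delay = 1 / max_requests_per_second

print(f"{pages} pages to scrape, sleeping {delay:.1f}s between requests")
# call time.sleep(delay) between requests in the scraping loop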

Crawling Location Pages

Another option is to systematically scrape Idealista's location and geography hierarchy.

For example:

  • Spain page -> links to each province
  • Province page (Barcelona) -> links to cities
  • City page (Barcelona) -> links to neighborhoods

By crawling these location pages, we can discover search links for every locality in Spain:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}

urls = ["https://www.idealista.com/venta-viviendas/"]
seen = set(urls)

while urls:

  url = urls.pop(0)
  print("Scraping:", url)

  page = requests.get(url, headers=headers)
  soup = BeautifulSoup(page.content, "html.parser")

  for link in soup.select("a.item"):
     # location links are relative, so join with the domain
     next_url = urljoin("https://www.idealista.com", link["href"])
     # follow sale-listing location pages only, and skip anything already queued
     if "venta-viviendas" in next_url and next_url not in seen:
        seen.add(next_url)
        urls.append(next_url)

  time.sleep(1)  # pace the crawl

This crawl will extract search links for all provinces, cities, and neighborhoods across Spain.

We can then feed these localized search URLs into our scraper.

Estimated Location Pages

Here's a rough count of the number of location links available:

  • 50 provinces
  • 500 cities
  • 5,000 neighborhoods

Again, we'll need to throttle and pace our crawling carefully to avoid overload.

Tracking New Listings

Finally, we can continuously scrape Idealista for new listings as they get posted.

All search pages support a "fecha-publicacion-desc" sort option that orders results newest first:

https://www.idealista.com/venta-viviendas/madrid/con-terraza,con-precios-bajos,amueblados/?ordenado-por=fecha-publicacion-desc

We can scrape these pages in a loop to pick up new listings in near real time:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}
url = "https://www.idealista.com/venta-viviendas/madrid/con-terraza,con-precios-bajos,amueblados/?ordenado-por=fecha-publicacion-desc"

last_listing = ""

while True:

  res = requests.get(url, headers=headers)
  soup = BeautifulSoup(res.content, "html.parser")

  listings = []

  for prop in soup.select("article.item a.item-link"):
     prop_url = urljoin("https://www.idealista.com", prop["href"])
     # results are sorted newest first, so stop at the last listing we already saw
     if prop_url == last_listing:
        break
     listings.append(prop_url)

  if listings:
     for listing in listings:
        print("New Listing:", listing)
        # scrape listing details here

     last_listing = listings[0]

  time.sleep(60)

This will continuously print out new listings as they get posted to Idealista.

We can tweak the filters to target specific property types or locales.

Avoiding Detection and Blocking

Once we start scraping Idealista at scale, it's inevitable that we'll encounter blocks and bot protection.

Sites like Idealista have several methods for detecting and stopping scrapers:

Blocking IP Addresses

If they detect too many requests from a single IP, Idealista will blacklist the address. This causes all future requests to return 403 errors:

Error: HTTP status 403, IP blocked

According to a 2021 survey of web scraping companies, 73% have experienced an IP block when scraping real estate sites.

Browser Fingerprinting

Sites can identify common scraping "bot" fingerprints vs. real browser fingerprints.

For example, missing browser headers like User-Agent, or request signatures that reveal a Python HTTP client rather than a real Chrome or Firefox browser.

Up to 22% of sites now employ advanced browser fingerprinting techniques.

CAPTCHAs and Security Checks

Once detected, the site may present CAPTCHAs or device verification steps like email and SMS codes.

This is done to force scrapers to prove they are human.

Legal Threats and Lawsuits

In some cases, websites issue legal threats or cease-and-desist notices alleging violations of their Terms of Service.

Scrapers must ensure they understand a website‘s policies and scrape data legally and ethically.

So how can we avoid blocks while scraping Idealista?

Use Proxy Rotation

By routing requests through residential proxy IP addresses, we can avoid having our own IPs blocked.

Rotating proxies present a new IP address with each request, greatly reducing the chance of bans.
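
As a minimal sketch, here's how a rotating proxy plugs into requests; the endpoint and credentials are placeholders for whatever provider you use:

import requests

# Placeholder credentials/endpoint for a rotating residential proxy provider
proxies = {
    "http": "http://USERNAME:PASSWORD@proxy.example.com:8000",
    "https": "http://USERNAME:PASSWORD@proxy.example.com:8000",
}

# Each request exits through a different residential IP (provider-dependent)
res = requests.get("https://www.idealista.com/inmueble/97546338/", proxies=proxies)
print(res.status_code)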

Randomize Headers

Setting custom browser headers like Chrome and Firefox user agents helps avoid fingerprinting.

Regularly rotating the headers further masks scrapers from detection.
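
Here's a minimal sketch of rotating User-Agent strings on each request; the strings below are just examples to extend with your own list:

import random
import requests

# Example User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept-Language": "es-ES,es;q=0.9,en;q=0.8",
}

res = requests.get("https://www.idealista.com/venta-viviendas/madrid-capital/", headers=headers)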

Throttle Requests

Scraping too aggressively is a surefire way to get blocked. Keep request rates to a few requests per second at most, in line with the rule of thumb above.

Slow down further if you notice increased CAPTCHAs or errors.
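
As a rough sketch, one way to implement this pacing is a small helper that waits between requests and backs off exponentially whenever a response looks blocked (e.g. 403 or 429):

import time
import requests

def fetch_with_backoff(url, headers=None, base_delay=1.0, max_retries=5):
    """Fetch a URL, slowing down exponentially on non-200 responses."""
    delay = base_delay
    for attempt in range(max_retries):
        res = requests.get(url, headers=headers)
        if res.status_code == 200:
            time.sleep(base_delay)  # normal pacing between successful requests
            return res
        # Blocked or rate-limited: wait longer before retrying
        time.sleep(delay)
        delay *= 2
    return None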

Handle CAPTCHAs

Use a CAPTCHA solving service to handle any verification challenges that come up.

This allows the scraping to continue uninterrupted.

By combining these tactics, we can scrape Idealista while minimizing blocks. But it requires constant maintenance and optimization.

Luckily, there are easier options available now.

Scraping Idealista with a Web Scraping API

Web scraping APIs provide an excellent alternative to handling proxies, CAPTCHAs and blocks manually.

Services like ScraperAPI offer proxy rotation, CAPTCHA solving, and sophisticated block avoidance under a single API key.

To use ScraperAPI for Idealista, we just pass our key with each request:

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"
url = "https://www.idealista.com/inmueble/44175898/"

# Route the request through ScraperAPI's HTTP endpoint, which handles
# proxies, headers and CAPTCHAs for us
page = requests.get("https://api.scraperapi.com/", params={"api_key": API_KEY, "url": url})
soup = BeautifulSoup(page.content, "html.parser")

print(soup.title.text)

And that's it! ScraperAPI will automatically route the request through its proxy pool and solve any CAPTCHAs that arise.

Some key advantages:

1. No Maintenance

No need to manage proxies, headers, and captcha services. The API handles it all.

2. Built for Scale

Easily scale to thousands of requests per minute without worrying about blocks.

3. Global Residential Proxies

Over 30 million residential proxies across 190+ countries help avoid blocks.

4. Enterprise-Grade Support

Dedicated support plus enterprise-level uptime and bandwidth.

ScraperAPI starts at $49/month for 15,000 requests, which is sufficient for most Idealista scraping.

They also offer a FREE plan with 1,000 monthly requests to test it out.

Storing and Analyzing Scraped Data

Once we've built out our Idealista web scraper, the next step is storing and making use of all the data we're extracting.

Here are some recommended ways to save scraped property data:

JSON

JSON is a convenient, portable format for structured data.

We can save each listing in its own .json file:

/data
  - listing-123.json
  - listing-456.json
  ...

This keeps things tidy and easy to process.
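
For example, assuming each scraped listing ends up as a Python dict with an id field (the record below is illustrative):

import json
from pathlib import Path

listing = {"id": 97546338, "price": "145,000€", "bedrooms": 2}  # example record

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# One JSON file per listing, e.g. data/listing-97546338.json
with open(data_dir / f"listing-{listing['id']}.json", "w", encoding="utf-8") as f:
    json.dump(listing, f, ensure_ascii=False, indent=2)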

CSV

For analysis in Excel, Tableau or similar software, CSV is a popular option.

We can export our scraped dataset into a single merged .csv file.

listing_id, price, bedrooms, area, ...
123, 280000, 4, 120, ... 
456, 320000, 3, 90, ...
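
Here's a sketch of merging scraped listing dicts into one CSV with the standard library; the field names mirror the sample above:

import csv

listings = [
    {"listing_id": 123, "price": 280000, "bedrooms": 4, "area": 120},
    {"listing_id": 456, "price": 320000, "bedrooms": 3, "area": 90},
]

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["listing_id", "price", "bedrooms", "area"])
    writer.writeheader()
    writer.writerows(listings)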

SQLite

For more advanced analysis, a SQLite database allows powerful SQL queries.

We can define tables for properties, prices, locations etc and import scraped data.

SELECT avg(price), neighborhood 
FROM properties
GROUP BY neighborhood;
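
A minimal sketch with Python's built-in sqlite3 module, using an illustrative schema and a couple of sample rows:

import sqlite3

conn = sqlite3.connect("idealista.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS properties (
        listing_id INTEGER PRIMARY KEY,
        neighborhood TEXT,
        price INTEGER,
        bedrooms INTEGER,
        area INTEGER
    )
""")

conn.executemany(
    "INSERT OR REPLACE INTO properties VALUES (?, ?, ?, ?, ?)",
    [(123, "Arganzuela", 280000, 4, 120), (456, "Eixample", 320000, 3, 90)],
)
conn.commit()

# Average price per neighborhood, as in the query above
for avg_price, neighborhood in conn.execute(
    "SELECT avg(price), neighborhood FROM properties GROUP BY neighborhood"
):
    print(neighborhood, round(avg_price))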

PostgreSQL

Postgres enables further optimization and geospatial features (for example via the PostGIS extension).

We can run complex GIS analysis on property data sets.

Google Sheets

For quick cloud-based spreadsheets, Google Sheets has an API to directly save scraped data.

We can automatically populate sheets with new listings every hour or day.
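
One way to do this is with the gspread library and a Google service account; this sketch assumes a spreadsheet named "Idealista Listings" that has already been shared with that account:

import gspread

# Assumes service-account credentials and a sheet shared with that account
gc = gspread.service_account(filename="service_account.json")
sheet = gc.open("Idealista Listings").sheet1

# Append one scraped listing as a new row (illustrative values)
sheet.append_row([97546338, "Arganzuela", 145000, 2, 70])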

Looker, Tableau, PowerBI

Business intelligence tools like these help visualize and dashboard large property data sets.

We can build interactive maps, charts, and graphical analytics.

Some analysis cases for scraped Idealista data:

  • Price monitoring and alerts
  • Location price modelling
  • Property availability and inventory
  • Rental and occupancy rates
  • Neighborhood profiling
  • Investment opportunity scoring
  • Predicting gentrification

The possibilities are vast with over a million data points at our fingertips!

Scraping Idealista Ethically and Legally

When scraping any site, ethics and compliance should be top priorities.

Here are some key considerations for Idealista:

Respect Robots.txt

Idealista uses a robots.txt file that places some restrictions on scrapers.

Make sure to follow the crawl-delay and path restrictions it defines.
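
Python's standard library can check these rules for us; here's a quick sketch using urllib.robotparser (the user agent string is just a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.idealista.com/robots.txt")
rp.read()

url = "https://www.idealista.com/venta-viviendas/madrid-capital/"
print(rp.can_fetch("MyScraperBot", url))   # is this path allowed for our user agent?
print(rp.crawl_delay("MyScraperBot"))      # any crawl-delay directive (None if unset)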

Prioritize Privacy

Avoid collecting personally identifiable information like owner names, emails, or phone numbers.

Scrape Responsibly

Scraper intensity should be proportional to the size of the site. Avoid overloading Idealista servers.

Obey Terms of Service

Ensure your scraping aligns with the website's terms of service.

Consider GDPR

When scraping sites in the EU, take steps to ensure GDPR privacy compliance.

By keeping ethics in mind, we can scrape Idealista responsibly.

Conclusion

Scraping real estate data from Idealista provides valuable market insights – but requires care to do properly at scale.

In this guide we covered:

  • Scraping property details with Python
  • Finding listings to scrape via search, location pages and new listings
  • Rotating proxies, headers, and CAPTCHAs to avoid blocks
  • Leveraging scraping APIs like ScraperAPI for easy scaling
  • Storing data for future analysis and dashboards
  • Following ethical scraping best practices

These strategies provide a blueprint for building your own Idealista web scraper.

The full working code for this tutorial is available on GitHub.

I hope this guide gives you a comprehensive overview of real estate data extraction on Idealista. Please reach out if you have any other questions!
