
How to Scrape Websites Protected by Cloudflare with Python and Cloudscraper

If you've ever tried scraping a website only to be blocked by anti-bot measures, you know how frustrating it can be. Over 7.59 million active websites use Cloudflare's bot protection services, so chances are high that your target site is guarded by this formidable adversary.

Fortunately, you don't have to admit defeat. In this in-depth tutorial, I'll show you exactly how to use the open-source Cloudscraper Python library to bypass Cloudflare and scrape the data you need. We'll walk through a practical code example, and I'll explain some of Cloudscraper's advanced features.

But I won't stop there. I'll also highlight the current limitations of Cloudscraper and recommend the most effective alternative approach: using a managed API service. My goal is for you to walk away armed with the knowledge and tools to tackle even the most heavily guarded websites. Let's get started!


Why Cloudflare Makes Scraping So Difficult

First, let's understand the adversary we're up against. Cloudflare is a web infrastructure and security company that provides a content delivery network, DDoS mitigation, and other services to enhance the security, performance, and reliability of websites.

One of Cloudflare's key offerings is bot protection. By analyzing web traffic patterns and using browser fingerprinting techniques, Cloudflare can identify and block requests from suspected bots and scrapers. If you've ever encountered a CAPTCHA challenge or been asked to "click pictures of crosswalks," you've experienced Cloudflare's arsenal firsthand.

Websites protected by Cloudflare present a few obstacles for scrapers:

  1. They may require you to wait 5 seconds before allowing access, disrupting automated scraping.

  2. Frequent/suspicious requests get flagged and blocked with a challenge page.

  3. Advanced fingerprinting identifies scrapers mimicking browsers.

  4. CAPTCHAs are used to prove a human is behind the requests.

The result is that your scraper's requests get denied with an intimidating 403 Forbidden error or CAPTCHA loop. Game over, right? Not necessarily! Enter Cloudscraper.

What is Cloudscraper and How Does it Work?

Cloudscraper is an open-source Python library based on the popular Requests library, designed specifically for scraping Cloudflare-protected websites. It automates the process of solving challenges and proving your scraper is a trustworthy "human" user.

Here's how Cloudscraper pulls it off:

  • Mimics the behavior of real web browsers using headers and SSL/TLS
  • Executes JavaScript challenges using a JS interpreter
  • Waits the required ~5 seconds before solving challenges
  • Can auto-solve CAPTCHAs using plugins for services like 2captcha

Essentially, Cloudscraper convincingly imitates human/browser behavior to get past Cloudflare's defenses. And with 3.8k+ stars on GitHub, it's battle-tested by a large community of developers.
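Because Cloudscraper is built on top of Requests, basic usage looks almost identical to a plain Requests session. Here's a minimal sketch (the target URL is just a placeholder; swap in any Cloudflare-protected page):

import cloudscraper

# create_scraper() returns a session object that behaves like requests.Session
scraper = cloudscraper.create_scraper()

# Challenges are detected and solved transparently when Cloudflare serves one
response = scraper.get('https://example.com/')
print(response.status_code)
print(response.text[:200])  # first 200 characters of the returned HTML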

Now that we understand what we're up against and what weapons we have, let's get our hands dirty with some code! I'll walk you through the process of scraping the Cloudflare-protected namecheap.com domain registrar website to extract domain names and prices.

Step-by-Step Cloudscraper Tutorial

Step 1 – Install Cloudscraper

First make sure you have Python 3.6+ and pip installed. Then run:

pip install cloudscraper

We'll also need the BeautifulSoup library for parsing HTML:

pip install beautifulsoup4

Step 2 – Set Up Cloudscraper

Now in a new Python file, import Cloudscraper and create a scraper instance:

import cloudscraper

scraper = cloudscraper.create_scraper(
    interpreter='nodejs',
    captcha={
        'provider': '2captcha',
        'api_key': 'YOUR_2CAPTCHA_API_KEY'
    }
)

This sets up our scraper with:

  • A Node.js JavaScript interpreter to solve challenges
  • 2captcha as the CAPTCHA solving service (requires signing up for an API key)

There are many other configuration options for user agent, request delays, and more that I encourage you to check out in the docs.
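For instance, at the time of writing Cloudscraper also accepts a browser fingerprint profile and a custom challenge delay. The option names below come from the Cloudscraper README, so double-check them against the version you install; they can be combined with the interpreter and captcha settings shown above:

import cloudscraper

scraper = cloudscraper.create_scraper(
    # Present a consistent desktop Chrome fingerprint (user agent and TLS settings)
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'mobile': False
    },
    # Override the ~5 second wait imposed by Cloudflare's challenge page
    delay=10
)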

Step 3 – Make Request and Parse Response

With our Cloudscraper instance ready, we can make a request just like with the Requests library:

url = 'https://www.namecheap.com/'

response = scraper.get(url)

if response.status_code == 200:
    print('Request successful!')
else:
    print(f'Request failed with status {response.status_code}')

Cloudscraper abstracts away the details of solving the Cloudflare challenges. If all goes well, you should see "Request successful!" meaning we got the HTML content.
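In practice, a challenge does occasionally fail to solve (for example, when your IP address is already flagged), so it can be worth wrapping the request in a simple retry. The helper below is just a sketch that catches exceptions generically; Cloudscraper also raises more specific errors (see its exceptions module) if you want finer-grained handling:

import time

def fetch_with_retries(scraper, url, retries=3):
    # Try a few times, pausing between attempts, before giving up
    for attempt in range(1, retries + 1):
        try:
            response = scraper.get(url, timeout=30)
            if response.status_code == 200:
                return response
            print(f'Attempt {attempt}: got status {response.status_code}')
        except Exception as exc:  # Cloudflare-specific errors surface here
            print(f'Attempt {attempt} failed: {exc}')
        time.sleep(5)  # back off before retrying
    return None

response = fetch_with_retries(scraper, url)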

Now let's parse out the data we want using BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

products = []

for product in soup.select('.domain-tld'):
    name = product.select_one('.domain-tld-name').text
    price = product.select_one('.domain-tld-pricing').text

    products.append({
        'name': name,
        'price': price
    })

print(products)

This code will find all the elements with the "domain-tld" class, extract the name and price, and print them out. You should see a list of domains and prices like:

[
    {'name': '.com', 'price': '$5.98'},
    {'name': '.ai', 'price': '$69.98'}, ...
]

Congrats! You've just scraped your first Cloudflare-protected website. But there's more to consider.

Limitations of Cloudscraper

While Cloudscraper is a powerful tool, it's not a silver bullet. As Cloudflare evolves its bot detection methods, open-source libraries like Cloudscraper can lag behind, leading to more CAPTCHAs and blocks.

For example, the popular OKCupid dating website is protected by the latest version of Cloudflare. Try as you might, Cloudscraper will fail with a 403 Forbidden error after being routed to endless CAPTCHAs.

Cloudscraper blocked by Cloudflare

The issue is that Cloudflare's detection has outsmarted Cloudscraper's browser-mimicking capabilities.

It's an ongoing cat-and-mouse game, and relying solely on open-source tools is risky for production scraping projects. So what's the solution?

Alternative: Using a Managed Scraping API

Rather than trying to reverse-engineer Cloudflare yourself, it's often more reliable and efficient to use a managed scraping API service like ScrapingBee.

These services abstract away the complexities of rendering JavaScript, rotating proxies and user agents, and solving CAPTCHAs. They have teams dedicated to keeping up with the latest Cloudflare updates so you don't have to.

Here's how easy it is to use ScrapingBee's Python SDK:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get(
    'https://www.okcupid.com/profile/some-username',
    params={
        'render_js': 'true'
    }
)

print(response.status_code)
print(response.content)

Just like that, ScrapingBee handles JavaScript rendering, Cloudflare solving, and proxy rotation. For most scraping projects, especially at scale, managed APIs provide the most consistent and maintainable approach.
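And because the SDK returns a Requests-style response object, the BeautifulSoup parsing workflow from earlier carries over unchanged; for example:

from bs4 import BeautifulSoup

# Parse the HTML returned by ScrapingBee exactly as before
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text if soup.title else 'No <title> found')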

Of course, these premium services come with a cost, whereas Cloudscraper is free. It's up to you to weigh the tradeoffs of cost, reliability, and development time.

Putting it All Together

Cloudflare may be a formidable adversary for web scrapers, but with the right tools and techniques, you can still get the data you need.

Open-source libraries like Cloudscraper offer a free and flexible option for coders willing to get their hands dirty. With some configuration and customization, Cloudscraper can handle most Cloudflare-protected websites by mimicking browser behavior.

However, for more challenging websites and large-scale projects, managed APIs like ScrapingBee often provide a more robust and hassle-free solution. By abstracting away the arms race with Cloudflare, they let you focus on your core scraping logic.

Whichever approach you choose, remember that web scraping is a constantly evolving field. What works today may be obsolete tomorrow. But equipped with the right knowledge and tools, you'll be ready to adapt and overcome any anti-bot measures in your way.

Happy scraping!
