Async programming and libraries like aiohttp and httpx have become godsends for modern web scraping due to their speed and scalability. The ability to fire off hundreds of requests simultaneously lets us collect data at blazing fast rates.
However, this newfound scraping power comes with an important responsibility – we must throttle our request rates to avoid damaging websites or getting blocked. Thankfully, Python provides many great options for controlled, responsible scraping.
In this comprehensive guide, we'll cover different methods and best practices for rate limiting async web scrapers. Whether you're looking to scrape respectfully, sustainably, legally, or all of the above, read on for strategies to avoid burning sites!
The Growing Threat of Aggressive Web Scraping
Let's start with some background on why responsible rate limiting has become essential when scraping sites.
Over the past 5 years, the volume of problematic scraping has exploded to staggering levels. Some estimates suggest over 40% of all web traffic now comes from scrapers or bots of some form.
Much of this growth has been fueled by the rising accessibility of tools like Puppeteer, Playwright, and proxy services that make building scalable scrapers easier than ever. But unfortunately, most of these bots show little care for scraping responsibly.
This torrent of scraping activity has overwhelmed many websites. Even large properties like LinkedIn, Amazon, and Craigslist have struggled with scraper abuse:
- LinkedIn reported in 2019 that it was blocking over 95 million scraper bot sessions per day.
- Amazon has been battling fake review scrapers since 2015, with the resulting fraud costing it billions.
- Craigslist has repeatedly sued sites that scraped its listings without permission.
And for smaller websites with limited resources, aggressive scraping can completely break functionality or shut them down. There are sadly many examples of this happening across the internet.
It's clear that unrestrained, reckless web scraping poses a serious threat to the health of the web. The community must self-regulate for the greater good.
Core Principles of Responsible Web Scraping
So what exactly constitutes responsible web scraping in this era of bot proliferation? Here are a few core principles to keep in mind:
- Scraping legally – Stay compliant with sites' Terms of Service and respect copyrights. Get permission where required.
- Scraping ethically – Don't harm website functionality or performance, or cost the owners money.
- Scraping politely – Crawl at reasonable speeds so you don't look like an attack or denial of service.
- Scraping sustainably – Don't get your scraping efforts blocked long-term. Preserve access.
Following these principles means intentionally limiting your scraping speed and volume: obeying posted crawl budgets and rate limits, and paying for commercial scraping tools if needed. It's our duty as coders.
The Role of Rate Limiting for Responsible Scraping
Rate limiting plays an important role in each of the responsible scraping practices above. Let's look at why throttling request rates is so crucial:
- Avoids blocks – Sending requests too quickly is one of the easiest ways for scrapers to get detected and blocked. Slow down to stay under the radar.
- Lightens server load – Fast scraping can easily overload target sites and degrade performance. Limiting RPS avoids taxing their infrastructure.
- Improves reliability – Spamming requests leads to failures and blocks. Careful rate limiting results in higher success rates.
- Allows monitoring – With throttling you can review logs and metrics to identify issues, instead of blindly retrying and piling up failures.
- Enables compliance – Many sites prohibit scraping outright or impose limits like 1 request per 5 seconds in their ToS. Throttling lets you follow their rules.
So in summary, rate limiting is essential for scraping reliably, legally, sustainably, and politely. It's our responsibility as programmers. Now let's explore some techniques for throttling Python scrapers efficiently.
Async Scraping Libraries to Use with Rate Limiting
Since this guide focuses on limiting asynchronous scraper speeds, let's first cover some popular async scraping libraries in Python:
| Name | Description | Strengths |
|---|---|---|
| httpx | Fully-featured async HTTP client | Fast, easy to use |
| aiohttp | Veteran async library | Great docs, support |
| trio-websocket | Async websockets | Fast automation |
| treq | Built on Twisted | Performance |
For general robust scraping, httpx is my personal recommendation, providing excellent performance and usability. However, any of these libraries can benefit from rate limiting.
These tools allow us to make hundreds or thousands of concurrent requests to really reap the speed advantages of async programming. However, we must responsibly limit how fast they scrape with the techniques we'll now cover.
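As a point of reference, here's a minimal sketch of an unthrottled concurrent fetch with httpx and asyncio.gather (the URL list is just a placeholder). Bursts like this are exactly what the techniques below are meant to tame:

```python
import asyncio
from httpx import AsyncClient

async def fetch_all(urls):
    # One shared client reuses connections across all requests
    async with AsyncClient() as client:
        # Fire every request at once: fast, but completely unthrottled
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    return [r.status_code for r in responses]

# Placeholder usage
urls = ["https://example.com/"] * 20
print(asyncio.run(fetch_all(urls)))
```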
Simple Approaches to Limit Python Async Scraper Speed
Before diving into more robust solutions, let's start with a few basic ways to limit asynchronous scraper throughput:
Explicit Sleeps Between Requests
The simplest throttling method is adding a small sleep after each request in your scraping loop. In async code, use await asyncio.sleep() rather than time.sleep(), which would block the entire event loop:

```python
import asyncio

# Sleep 0.1 sec after each request to cap at roughly 10 RPS
await asyncio.sleep(0.1)
```
Pros:
- Easy to implement
Cons:
- Not precise – sleep time adds to each request's own latency, so the actual rate drifts
- Hard to monitor requests
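Put together, the naive sleep-throttled pattern looks something like this minimal sketch (the URL list and delay are placeholders):

```python
import asyncio
from httpx import AsyncClient

async def scrape_slowly(urls, delay=0.1):
    async with AsyncClient() as client:
        for url in urls:
            resp = await client.get(url)
            print(url, resp.status_code)
            # Pause between requests to cap the rate at roughly 1/delay RPS
            await asyncio.sleep(delay)

asyncio.run(scrape_slowly(["https://example.com/"] * 5))
```

Note that sleeping inside a single loop also serializes the requests, giving up most of the speed advantage of async in the first place.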
This works in basic cases, but lacks the precision and visibility of other options we'll cover.
Limiting Concurrent Requests
We can limit how many requests execute concurrently using semaphores:
```python
import asyncio

# Limit to 10 concurrent requests
sem = asyncio.Semaphore(10)

async def request():
    async with sem:
        await fetch()  # fetch() is your actual request coroutine
```
Pros:
- Limits parallel load on target sites
Cons:
- Still no control over overall requests per second
- Hard to monitor/measure throughput
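Here's a minimal runnable sketch of the semaphore approach with httpx and asyncio.gather (the URL list is a placeholder):

```python
import asyncio
from httpx import AsyncClient

async def fetch(client, sem, url):
    # Each task waits here until one of the concurrency slots frees up
    async with sem:
        resp = await client.get(url)
        return resp.status_code

async def main(urls):
    sem = asyncio.Semaphore(10)  # at most 10 requests in flight at once
    async with AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, sem, u) for u in urls))

print(asyncio.run(main(["https://example.com/"] * 30)))
```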
Again usable in simple cases, but lacking precise control. Let's look at more robust libraries that give us exact RPS limiting.
Precise Rate Limiting with Aiometer for smooth Python Scraping
One of the most popular Python libraries for throttling async code is aiometer. With just a few lines, it lets you cap a scraper at a precise requests-per-second rate.
Aiometer works by wrapping async code like this:
```python
import asyncio
import aiometer

async def fetch(url):
    ...  # make the request here

async def main():
    urls = [...]  # your list of URLs
    # Enforce a cap of 5 requests per second across the whole batch
    await aiometer.run_on_each(fetch, urls, max_per_second=5)

asyncio.run(main())
```
We can then set an exact rate limit like 5 RPS that aiometer will enforce for us. This provides accurate control over throughput.
Pros:
- Simple API
- Precise limiting to exact RPS rate
- Visibility into all requests
- Supports max concurrency too
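As a quick illustration of that last point, a concurrency cap and a per-second cap can be combined in the same call. A minimal sketch with arbitrary placeholder values:

```python
import asyncio
import aiometer

async def fetch(url):
    print("fetching", url)  # placeholder for the real request

async def main():
    urls = ["https://example.com/"] * 20  # placeholder URL list
    # At most 10 tasks in flight, and at most 5 new tasks started per second
    await aiometer.run_on_each(fetch, urls, max_at_once=10, max_per_second=5)

asyncio.run(main())
```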
In my experience, aiometer currently provides the best turn-key rate limiting for Python async scraping. The configuration is minimal and it reliably smooths out scrapers.
Let's look at a real example next.
Aiometer Case Study: Scraping Google News Responsibly
To demonstrate aiometer in action, here's an example script for scraping the latest headlines from Google News at a controlled speed:
```python
import asyncio
from httpx import AsyncClient
import aiometer
from bs4 import BeautifulSoup

async def fetch(url):
    async with AsyncClient() as client:
        resp = await client.get(url)
        soup = BeautifulSoup(resp.text, 'html.parser')
        for title in soup.select('h3'):
            print(title.text)

async def main():
    urls = [
        'https://news.google.com/'
        for _ in range(50)
    ]
    await aiometer.run_on_each(
        fetch, urls, max_per_second=2
    )

asyncio.run(main())
```
Here we use httpx to grab the page, BeautifulSoup to parse it, and aiometer to limit the overall scraping speed to 2 requests per second.
This allows us to efficiently scrape all headlines with the full power of async, while still being responsible to Google's servers. If we scraped at full speed, they would likely block our traffic.
For public sites like this, aiometer makes it trivial to find the right balance of performance and responsibility.
Advanced Proxy Rotation for Tougher Targets
For more sophisticated websites like Google, Facebook or LinkedIn, rate limiting alone is often not enough to scrape reliably long-term. These sites have advanced bot protection systems to detect and block scrapers aggressively.
To successfully scrape while both respecting targets AND avoiding blocks, commercial proxy services become essential. Proxies provide fresh IP addresses to distribute requests across multiple identities.
In my experience, residential proxy pools are the most effective for evading blocks when scraping tough sites.
Here are a few prominent services and their key advantages:
- Smartproxy – Over 10M residential IPs with deep integration for sites like Instagram and LinkedIn. Fast performance.
- Luminati (now Bright Data) – The original pioneer of residential proxies, with massive peer-to-peer IP pools. Excellent global coverage.
- ScrapeOps – Modern managed proxies optimized for headless browsers like Playwright. Fixed-IP sticky sessions.
- GeoSurf – Focus on residential proxies in niche countries like South Africa. Good for international targeting.
The integration process varies across providers, but generally involves:
- Rotating the proxy used for each new request
- Segmenting proxy pools by target site requirements
- Setting other proxy parameters like country, user agent, etc.
This advanced proxy handling combined with request throttling provides a potent combination for responsible scraping of nearly any site.
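As an illustration, here's a minimal sketch of per-request proxy rotation with httpx. The proxy pool and credentials are hypothetical placeholders, and the exact setup (endpoints, authentication, session parameters) varies by provider; recent httpx versions take a proxy= argument, while older ones use proxies=.

```python
import asyncio
import itertools
from httpx import AsyncClient

# Hypothetical proxy pool: replace with endpoints from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

async def fetch_via_proxy(url):
    # Pick the next proxy in the rotation for this request
    proxy = next(proxy_cycle)
    async with AsyncClient(proxy=proxy) as client:  # older httpx: proxies=proxy
        resp = await client.get(url)
        return resp.status_code

print(asyncio.run(fetch_via_proxy("https://example.com/")))
```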
Here are some benchmarks showing how much proxies enhance success rates when scraping challenging sites:
| Site | Success Rate with Proxies | Success Rate without Proxies |
|---|---|---|
| – | 95% | 62% |
| – | 89% | 23% |
| Indeed | 90% | 67% |
| – | 98% | 11% |
As you can see, proxies massively boost reliability by defeating bot protection systems. They do come at an added cost, but enable scraping sites that are otherwise very difficult. For commercial scraping, proxies are considered mandatory.
Legal Landscape and Risks Around Web Scraping
In addition to respecting websites technically, we also need to make sure we scrape legally, with proper usage rights and permissions.
The legal landscape around web scraping remains somewhat ambiguous, but there are a few key principles to keep in mind:
- Scraping purely public data likely falls under fair use in the US. However, many other countries prohibit it.
- Violating a site's Terms of Service by scraping can still open you up to civil action such as cease-and-desist letters or damage claims under the CFAA.
- Public government sites offer the clearest legal standing for unfettered scraping, though regulations vary across states.
- Copyrighted data like articles requires a license to scrape legally. News sites will often severely limit or prohibit access.
- Scraping personal info like emails or social media data often runs afoul of privacy laws, depending on usage.
So in summary, while scraping public data likely falls under fair use, the laws remain complex and violations can carry civil or criminal penalties in some cases. Consult an attorney for guidance on your specific use case.
I'd recommend these best practices from my experience:
- Respect robots.txt: The crawl directives indicate whether a site permits scraping (see the sketch after this list). Note that robots.txt is not legally binding, though.
- Read Terms of Service: Understand exactly what is allowed or prohibited by a site's usage terms.
- Research precedent: Check court cases involving scraping similar to what you plan.
- Ask permission: If the legality seems ambiguous, directly asking the website for scraping access can help cover your bases.
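For the robots.txt point above, Python's standard library includes urllib.robotparser for checking crawl directives. A minimal sketch (the URL and bot name are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

bot = "MyScraperBot"
print(rp.can_fetch(bot, "https://example.com/some/page"))  # True if allowed
print(rp.crawl_delay(bot))  # value of any Crawl-delay directive, or None
```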
While rate limiting enables scraping compliantly and sustainably, we must also consult the law and site policies to scrape legally.
Final Thoughts on Scraping the Web Responsibly
In closing, I hope this guide has provided a comprehensive overview of techniques and best practices for throttling async Python scrapers.
Rate limiting ensures we scrape responsibly by:
- Avoiding blocks
- Respecting site resources
- Scraping legally and compliantly
- Enabling reliable data collection
Libraries like aiometer make precise throttling easy to implement. And commercial proxies/browsers provide essential protection when scraping tougher sites.
But while the technical solutions exist, we as a community must also embrace responsible scraping philosophically. With great scraping power comes great responsibility.
By self-regulating and intentionally limiting our impact, we can nurture a sustainable web ecosystem. Please scrape gently out there!