Request blocking is an invaluable but nuanced technique for supercharging Playwright-based web scrapers by minimizing unnecessary downloads. In this guide, we’ll share practical techniques for leveraging request interception to optimize data extraction.
The Curse of Bloated Sites
Modern websites are plagued by page bloat – the average site is over 3MB with 90+ requests across dozens of domains! This explosion has been driven by:
- Proliferation of trackers for analytics and ads
- Heavyweight JavaScript frameworks and libraries
- Ever growing media like high-res images and auto-playing video
- Third party social media and comment integrations
For web scrapers this bloat has a huge performance impact:
- Slow page loads – browsers are forced to download dozens of resources across many domains
- High bandwidth – most bytes are wasted on irrelevant resources like graphics
- Scrape overload – parsing the full HTML pulls in irrelevant data
Playwright's request interception gives us a powerful tool to optimize our scrapers by downloading only what we need. Let's dive in!
Request Interception Basics
Playwright's routing API lets us intercept every network request a page makes, including all assets like images, scripts, and XHR calls.
We can register a request handler to analyze each request and decide whether to allow, block, or modify it:
from playwright.sync_api import sync_playwright

def should_block(request):
    # Placeholder policy: block images; swap in your own analysis logic
    return request.resource_type == 'image'

def request_handler(route):
    # Decide whether to block or allow each intercepted request
    if should_block(route.request):
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Register handler to intercept all requests
    page.route('**/*', request_handler)
    page.goto('https://example.com')
    browser.close()
The route handler receives a Route object; its route.request property exposes details like the request URL, resource type, headers, and post data. We can implement custom logic on top of these to decide whether to block each request.
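For instance, a minimal sketch that just logs each request's type, method, and URL before letting it through can show you what a page actually loads (example.com is a placeholder target):

from playwright.sync_api import sync_playwright

def log_request(route):
    req = route.request
    # Peek at the request details before allowing it through
    print(req.resource_type, req.method, req.url[:80])
    route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route('**/*', log_request)
    page.goto('https://example.com')
    browser.close()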
Blocking Strategies for Common Resource Types
Some resource types are obvious candidates for blocking as they are rarely relevant for web scraping:
Images – The bulk of bytes on most sites. But useless for scraping.
BLOCK_TYPES = ['image']
if route.request.resource_type in BLOCK_TYPES:
    route.abort()
Fonts – Custom fonts for icons and branding. Blocking them may degrade rendering but rarely affects scraped data.
BLOCK_TYPES = ['font']
Media – Videos, audio clips. Heavy bandwidth for no scrape value.
BLOCK_TYPES = ['media']
Beacons – Analytics pings sent back to tracking servers. Playwright doesn't expose a dedicated beacon resource type, so these are usually easier to block by URL pattern (see the next section).
Flash – Mostly obsolete and a security risk. Modern Chromium builds won't load Flash at all, so in practice there is rarely anything left to block here.
Other "safe" options include csp_report, imageset, and texttrack.
Be wary of broadly blocking content types like stylesheet, script, and xhr that may break page functionality.
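Putting this together, here is one way to combine the safer types into a single handler; the exact set is a judgment call you should tune per target site:

from playwright.sync_api import sync_playwright

BLOCK_TYPES = {'image', 'font', 'media'}  # a conservative starting set

def block_by_type(route):
    if route.request.resource_type in BLOCK_TYPES:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route('**/*', block_by_type)
    page.goto('https://example.com')
    browser.close()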
Blocking by URL Pattern
We can also block requests by full or partial URL string matching. This allows blocking specific third party domains.
For example ad networks, social media, and known tracking scripts:
BLOCK_URLS = [
    'google-analytics',
    'twitter',
    'doubleclick',
]

if any(pattern in route.request.url for pattern in BLOCK_URLS):
    route.abort()
Maintaining a URL blocklist requires vigilance as sites change over time. Some tips:
- Analyze network logs to identify frequently requested domains
- Leverage curated blocklists like EasyPrivacy
- Monitor scraper errors in case critical resources are blocked
I recommend blocking obvious cruft while avoiding overly broad domain blocking.
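If you keep the list in a file (blocklist.txt here is a hypothetical path, one domain substring per line), loading it is a short sketch:

from pathlib import Path

# Hypothetical blocklist.txt: one domain substring per line, "#" for comments
BLOCK_URLS = [
    line.strip()
    for line in Path('blocklist.txt').read_text().splitlines()
    if line.strip() and not line.startswith('#')
]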
The Performance Impact
Intelligently blocking unnecessary resource types and tracking domains can significantly optimize page load performance:
- 2-10x reduction in bytes downloaded – directly speeds up page loads
- Lightweight pages – less browser overhead and memory usage
- Lower bandwidth usage – critical if you have data caps
Here are examples from real sites:
Site | Requests | Transfer Size | Blocked | New Requests | New Transfer | Bandwidth Savings
---|---|---|---|---|---|---
news.com | 92 | 8.7 MB | 55% | 41 | 1.9 MB | 78%
forum.com | 124 | 5.1 MB | 63% | 46 | 1.2 MB | 76%
shop.com | 142 | 15.9 MB | 49% | 73 | 6.1 MB | 62%
As you can see, intelligently blocking resources provides massive optimization opportunities!
Diminishing Returns
You can only optimize so much before affecting site functionality:
[Chart showing diminishing bandwidth reduction as more resources are blocked]

There are diminishing returns beyond roughly 75% blocking, and more aggressive blocking increases the risk of breakage.
I recommend targeting a 2-3x reduction, which captures the "low hanging fruit". Measure overall scraper speed to dial this in, as sketched below.
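One rough way to measure this is to time the same page with and without blocking. A sketch along those lines (the block set and URL are placeholders, and results vary with network conditions):

import time
from playwright.sync_api import sync_playwright

BLOCK_TYPES = {'image', 'font', 'media'}  # assumption: a typical "safe" set

def timed_load(url, block=False):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        if block:
            # Abort blocked types, let everything else through
            page.route('**/*', lambda route: route.abort()
                       if route.request.resource_type in BLOCK_TYPES
                       else route.continue_())
        start = time.perf_counter()
        page.goto(url)
        elapsed = time.perf_counter() - start
        browser.close()
        return elapsed

url = 'https://example.com'  # placeholder target
print(f'unblocked: {timed_load(url):.2f}s  blocked: {timed_load(url, block=True):.2f}s')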
Maintaining Blocklists
Sites are dynamic – new trackers and resources are constantly introduced. Blocklists require maintenance to keep them relevant.
Some tips for managing this:
- Re-scrape key pages periodically and analyze new requests
- Monitor scraper errors in case critical resources are blocked
- Leverage curated blocklists like EasyList with periodic updates
- Use tools like Blocklist-Tools to auto-generate domain lists
- Analyze network traffic to identify new patterns
Ideally blocklists are auto-generated from scraping with selective manual review.
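As a starting point for auto-generation, here is a sketch that tallies every third-party domain a page requests so you can review candidates for the blocklist; example.com stands in for your first-party domain:

from collections import Counter
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

FIRST_PARTY = 'example.com'  # assumption: your own domain
seen = Counter()

def record_domains(route):
    domain = urlparse(route.request.url).netloc
    if FIRST_PARTY not in domain:
        seen[domain] += 1  # tally third-party domains for later review
    route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route('**/*', record_domains)
    page.goto('https://example.com')
    browser.close()

for domain, count in seen.most_common():
    print(f'{count:4d}  {domain}')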
Advanced Blocking Patterns
Request blocking logic can get quite advanced. Some additional techniques:
Blocking 3rd party requests only:
if not route.request.is_navigation_request() and 'example.com' not in route.request.url:
    route.abort()
else:
    route.continue_()
Blocking after N requests per domain:
from urllib.parse import urlparse

domains = {}
domain = urlparse(route.request.url).netloc  # Request objects have no .domain attribute
domains[domain] = domains.get(domain, 0) + 1
if domains[domain] > 5:
    route.abort()
This caps each domain at 5 requests; combine it with the third-party check above to limit only external domains.
User agent spoofing:
# Mutating request.headers has no effect; pass header overrides to continue_()
route.continue_(headers={**route.request.headers, 'user-agent': 'SearchBot'})
Some sites serve less bloat to known bots.
Integrating Blocking into a Scraper
Let‘s look at a full scraping script with optimized request blocking:
from playwright.sync_api import sync_playwright

URL = 'https://www.example.com'
BLOCK_TYPES = ['image', 'media']
BLOCK_DOMAINS = ['tracker.com', 'cdn.ads.com']

def block_requests(route):
    # Block unwanted resource types outright
    if route.request.resource_type in BLOCK_TYPES:
        route.abort()
        return
    # Block requests to known ad/tracking domains
    if any(domain in route.request.url for domain in BLOCK_DOMAINS):
        route.abort()
        return
    route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route('**/*', block_requests)  # Configure blocking
    page.goto(URL)
    # Extract data from page
    browser.close()
We intercept all requests to selectively block images, media, and known ad/tracking domains.
Optimizing the Data Extraction
Request blocking optimizes bandwidth and page load speed. But we still need to extract relevant data from the page.
Some tips:
- Parse targeted fragments with a library like BeautifulSoup rather than processing the full raw HTML
- Extract specific elements like product listings rather than the entire DOM (see the sketch below)
- Grab just the parts you need with selectors rather than dumping the complete page HTML
- Block different resources on successive loads to piece together the full data set
Balance request blocking with extraction completeness for your specific scraping needs.
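As one illustration, a targeted-extraction sketch that pulls just product titles and prices; the .product, .title, and .price selectors are hypothetical and assume a page already loaded with blocking configured as above:

# Assumes `page` was loaded with blocking configured as above;
# .product/.title/.price are hypothetical selectors for the target page
for product in page.query_selector_all('.product'):
    title = product.query_selector('.title')
    price = product.query_selector('.price')
    if title and price:
        print(title.inner_text(), price.inner_text())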
Closing Advice
Request blocking is immensely powerful but requires some skill:
- Thoroughly test sites to avoid blocking critical resources
- Measure overall scraper speed to fine-tune gains
- Block just enough rather than blindly blocking all of a type
- Expect to periodically update block rules as sites evolve
Follow the techniques outlined here to make your Playwright scraper as fast and efficient as possible! Feel free to reach out if you have any other request blocking challenges.