The Complete Guide to Blocking Resources for Blazing Fast Web Scraping

Request blocking is an invaluable but nuanced technique for supercharging Playwright based web scrapers by minimizing unnecessary downloads. In this comprehensive expert guide, we’ll share insider techniques to leverage request interception for optimized data extraction.

The Curse of Bloated Sites

Modern websites are plagued by page bloat – the average site is over 3MB with 90+ requests across dozens of domains! This explosion has been driven by:

Proliferation of trackers for analytics and ads
Heavyweight javascript frameworks and libraries
Ever growing media like high-res images and auto-playing video
Third party social media and comment integrations

For web scrapers this bloat has a huge performance impact:

Slow page loads – browsers are forced to download hundreds of resources simultaneously
High bandwidth – most bytes are wasted on irrelevant resources like graphics
Scrape overload – parsing the full HTML pulls in irrelevant data

Playwright‘s request interception gives us a powerful tool to optimize our scrapers by downloading only what we need. Let‘s dive in!

Request Interception Basics

Playwright provides a request API that fires for each network request made by a page. This includes all assets like images, scripts, XHR calls etc.

We can register a request handler to analyze each request and decide whether to allow, block, or modify it:

from playwright.sync_api import sync_playwright

def request_handler(route):
  # Request analysis and blocking logic
  if should_block(route.request):
    route.abort()
  else:
    route.continue_() 

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  # Register handler to intercept requests
  page.route(‘**/*‘, request_handler)

  page.goto(‘https://example.com‘)

The route handler receives a Route object containing details like the request URL, resource type, headers, and postData. We can implement custom logic to decide whether to block the request or not.

Blocking Strategies for Common Resource Types

Some resource types are obvious candidates for blocking as they are rarely relevant for web scraping:

Images – The bulk of bytes on most sites. But useless for scraping.

BLOCK_TYPES = [‘image‘]

if route.resource_type in BLOCK_TYPES:
  route.abort()

Fonts – Custom fonts for icons/branding. Blocking may degrade visuals.

BLOCK_TYPES = [‘font‘]

Media – Videos, audio clips. Heavy bandwidth for no scrape value.

BLOCK_TYPES = [‘media‘]

Beacons – Analytics pings back to parent sites.

BLOCK_TYPES = [‘beacon‘]

Flash – Mostly obsolete. Security risk.

BLOCK_TYPES = [‘object‘]

Other "safe" options include csp_report, imageset, and texttrack.

Be wary of broadly blocking content types like stylesheet, script, and xhr that may break page functionality.

Blocking by URL Pattern

We can also block requests by full or partial URL string matching. This allows blocking specific third party domains.

For example ad networks, social media, and known tracking scripts:

BLOCK_URLS = [
  ‘*google-analytics*‘,
  ‘*twitter*‘,
  ‘*doubleclick*‘ 
]

for url in BLOCK_URLS:
  if re.search(url, route.request.url):
    route.abort()

Maintaining a URL blocklist requires vigilance as sites change over time. Some tips:

Analyze network logs to identity frequently requested domains
Leverage curated blocklists like EasyPrivacy
Monitor scraper errors in case critical resources are blocked

I recommend blocking obvious cruft while avoiding overly broad domain blocking.

The Performance Impact

Intelligently blocking unnecessary resource types and tracking domains can significantly optimize page load performance:

2-10x reduction in bytes downloaded – directly speeds up page loads
Lightweight pages – less browser overhead and memory usage
Lower bandwidth usage – critical if you have data caps

Here are examples from real sites:

Site	Requests	Transfer Size	Blocked	New Requests	New Transfer	Savings
news.com	92	8.7 MB	55% blocked	41	1.9 MB	78% bandwidth reduction
forum.com	124	5.1 MB	63% blocked	46	1.2 MB	76% bandwidth reduction
shop.com	142	15.9 MB	49% blocked	73	6.1 MB	62% bandwidth reduction

As you can see, intelligently blocking resources provides massive optimization opportunities!

Diminishing Returns

You can only optimize so much before affecting site functionality:

[Chart showing diminishing bandwidth reduction as more resources blocked]

There are diminishing returns beyond ~75% blocking. And more aggressive blocking increases risk of breakage.

I recommend targeting a 2-3x reduction which gets the "low hanging fruit". Measure overall scraper speed to dial this in.

Maintaining Blocklists

Sites are dynamic – new trackers and resources are constantly introduced. Blocklists require maintenance to keep them relevant.

Some tips for managing this:

Re-scrape key pages periodically and analyze new requests
Monitor scraper errors in case critical resources are blocked
Leverage curated blocklists like EasyList with periodic updates
Use tools like Blocklist-Tools to auto-generate domain lists
Analyze network traffic to identify new patterns

Ideally blocklists are auto-generated from scraping with selective manual review.

Advanced Blocking Patterns

Request blocking logic can get quite advanced. Some additional techniques:

Blocking 3rd party requests only:

if not route.request.is_navigation_request:
  if ‘example.com‘ not in route.request.url:
    route.abort()

Blocking after N requests per domain:

domains = {}

if domains.get(route.request.domain, 0) >= 5:
  route.abort()

domains[route.request.domain] += 1

This limits each third party domain to 5 requests.

User agent spoofing:

route.request.headers[‘User-Agent‘] = ‘SearchBot‘

Some sites serve less bloat to known bots.

Integrating Blocking into a Scraper

Let‘s look at a full scraping script with optimized request blocking:

import re
from playwright.sync_api import sync_playwright 

URL = ‘https://www.example.com‘

BLOCK_TYPES = [‘image‘, ‘media‘]
BLOCK_DOMAINS = [‘tracker.com‘, ‘cdn.ads.com‘]

def block_requests(route):
  if route.resource_type in BLOCK_TYPES:
    route.abort()

  if re.search(pattern, route.request.url) 
    for domain in BLOCK_DOMAINS:
      route.abort()

  else:
    route.continue_()

with sync_playwright() as p:
  browser = p.chromium.launch()

  page = browser.new_page()
  page.route(‘**/*‘, block_requests) # Configure blocking

  page.goto(URL)

  # Extract data from page

  browser.close()

We intercept all requests to selectively block images, media, and known ad/tracking domains.

Optimizing the Data Extraction

Request blocking optimizes bandwidth and page load speed. But we still need to extract relevant data from the page.

Some tips:

Use a parser like BeautifulSoup vs parsing the full raw HTML
Extract specific elements like product listings rather than entire DOM
Load partial page using a selector rather than complete HTML
Block different resources on each load to piece together data

Balance request blocking with extraction completeness for your specific scraping needs.

Closing Advice

Request blocking is immensely powerful but requires some skill:

Thoroughly test sites to avoid blocking critical resources
Measure overall scraper speed to fine tune gains
Block just enough rather than blindly blocking all of a type
Expect to periodically update block rules as sites evolve

Follow the techniques outlined here to make your Playwright scraper as fast and efficient as possible! Feel free to reach out if you have any other request blocking challenges.

The Curse of Bloated Sites

Request Interception Basics

Blocking Strategies for Common Resource Types

Blocking by URL Pattern

The Performance Impact

Diminishing Returns

Maintaining Blocklists

Advanced Blocking Patterns

Integrating Blocking into a Scraper

Optimizing the Data Extraction

Closing Advice

Join the conversation Cancel reply

Related Posts

What‘s the Difference Between Web Scraping and Crawling?

What are some BeautifulSoup alternatives for HTML parsing in Python?

How to Web Scrape with HTTPX and Python