Skip to content

The Complete Guide to Blocking Resources for Blazing Fast Web Scraping

Request blocking is an invaluable but nuanced technique for supercharging Playwright based web scrapers by minimizing unnecessary downloads. In this comprehensive expert guide, we’ll share insider techniques to leverage request interception for optimized data extraction.

The Curse of Bloated Sites

Modern websites are plagued by page bloat – the average site is over 3MB with 90+ requests across dozens of domains! This explosion has been driven by:

  • Proliferation of trackers for analytics and ads
  • Heavyweight javascript frameworks and libraries
  • Ever growing media like high-res images and auto-playing video
  • Third party social media and comment integrations

For web scrapers this bloat has a huge performance impact:

  • Slow page loads – browsers are forced to download hundreds of resources simultaneously
  • High bandwidth – most bytes are wasted on irrelevant resources like graphics
  • Scrape overload – parsing the full HTML pulls in irrelevant data

Playwright‘s request interception gives us a powerful tool to optimize our scrapers by downloading only what we need. Let‘s dive in!

Request Interception Basics

Playwright provides a request API that fires for each network request made by a page. This includes all assets like images, scripts, XHR calls etc.

We can register a request handler to analyze each request and decide whether to allow, block, or modify it:

from playwright.sync_api import sync_playwright

def request_handler(route):
  # Request analysis and blocking logic
  if should_block(route.request):
    route.abort()
  else:
    route.continue_() 

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  # Register handler to intercept requests
  page.route(‘**/*‘, request_handler)

  page.goto(‘https://example.com‘)

The route handler receives a Route object containing details like the request URL, resource type, headers, and postData. We can implement custom logic to decide whether to block the request or not.

Blocking Strategies for Common Resource Types

Some resource types are obvious candidates for blocking as they are rarely relevant for web scraping:

Images – The bulk of bytes on most sites. But useless for scraping.

BLOCK_TYPES = [‘image‘]

if route.resource_type in BLOCK_TYPES:
  route.abort()

Fonts – Custom fonts for icons/branding. Blocking may degrade visuals.

BLOCK_TYPES = [‘font‘]

Media – Videos, audio clips. Heavy bandwidth for no scrape value.

BLOCK_TYPES = [‘media‘]  

Beacons – Analytics pings back to parent sites.

BLOCK_TYPES = [‘beacon‘]

Flash – Mostly obsolete. Security risk.

BLOCK_TYPES = [‘object‘]

Other "safe" options include csp_report, imageset, and texttrack.

Be wary of broadly blocking content types like stylesheet, script, and xhr that may break page functionality.

Blocking by URL Pattern

We can also block requests by full or partial URL string matching. This allows blocking specific third party domains.

For example ad networks, social media, and known tracking scripts:

BLOCK_URLS = [
  ‘*google-analytics*‘,
  ‘*twitter*‘,
  ‘*doubleclick*‘ 
]

for url in BLOCK_URLS:
  if re.search(url, route.request.url):
    route.abort()

Maintaining a URL blocklist requires vigilance as sites change over time. Some tips:

  • Analyze network logs to identity frequently requested domains
  • Leverage curated blocklists like EasyPrivacy
  • Monitor scraper errors in case critical resources are blocked

I recommend blocking obvious cruft while avoiding overly broad domain blocking.

The Performance Impact

Intelligently blocking unnecessary resource types and tracking domains can significantly optimize page load performance:

  • 2-10x reduction in bytes downloaded – directly speeds up page loads
  • Lightweight pages – less browser overhead and memory usage
  • Lower bandwidth usage – critical if you have data caps

Here are examples from real sites:

SiteRequestsTransfer SizeBlockedNew RequestsNew TransferSavings
news.com928.7 MB55% blocked411.9 MB78% bandwidth reduction
forum.com1245.1 MB63% blocked461.2 MB76% bandwidth reduction
shop.com14215.9 MB49% blocked736.1 MB62% bandwidth reduction

As you can see, intelligently blocking resources provides massive optimization opportunities!

Diminishing Returns

You can only optimize so much before affecting site functionality:

[Chart showing diminishing bandwidth reduction as more resources blocked]

There are diminishing returns beyond ~75% blocking. And more aggressive blocking increases risk of breakage.

I recommend targeting a 2-3x reduction which gets the "low hanging fruit". Measure overall scraper speed to dial this in.

Maintaining Blocklists

Sites are dynamic – new trackers and resources are constantly introduced. Blocklists require maintenance to keep them relevant.

Some tips for managing this:

  • Re-scrape key pages periodically and analyze new requests
  • Monitor scraper errors in case critical resources are blocked
  • Leverage curated blocklists like EasyList with periodic updates
  • Use tools like Blocklist-Tools to auto-generate domain lists
  • Analyze network traffic to identify new patterns

Ideally blocklists are auto-generated from scraping with selective manual review.

Advanced Blocking Patterns

Request blocking logic can get quite advanced. Some additional techniques:

Blocking 3rd party requests only:

if not route.request.is_navigation_request:
  if ‘example.com‘ not in route.request.url:
    route.abort() 

Blocking after N requests per domain:

domains = {}

if domains.get(route.request.domain, 0) >= 5:
  route.abort()

domains[route.request.domain] += 1

This limits each third party domain to 5 requests.

User agent spoofing:

route.request.headers[‘User-Agent‘] = ‘SearchBot‘ 

Some sites serve less bloat to known bots.

Integrating Blocking into a Scraper

Let‘s look at a full scraping script with optimized request blocking:

import re
from playwright.sync_api import sync_playwright 

URL = ‘https://www.example.com‘

BLOCK_TYPES = [‘image‘, ‘media‘]
BLOCK_DOMAINS = [‘tracker.com‘, ‘cdn.ads.com‘]

def block_requests(route):
  if route.resource_type in BLOCK_TYPES:
    route.abort()

  if re.search(pattern, route.request.url) 
    for domain in BLOCK_DOMAINS:
      route.abort()

  else:
    route.continue_()

with sync_playwright() as p:
  browser = p.chromium.launch()

  page = browser.new_page()
  page.route(‘**/*‘, block_requests) # Configure blocking

  page.goto(URL)

  # Extract data from page

  browser.close()

We intercept all requests to selectively block images, media, and known ad/tracking domains.

Optimizing the Data Extraction

Request blocking optimizes bandwidth and page load speed. But we still need to extract relevant data from the page.

Some tips:

  • Use a parser like BeautifulSoup vs parsing the full raw HTML
  • Extract specific elements like product listings rather than entire DOM
  • Load partial page using a selector rather than complete HTML
  • Block different resources on each load to piece together data

Balance request blocking with extraction completeness for your specific scraping needs.

Closing Advice

Request blocking is immensely powerful but requires some skill:

  • Thoroughly test sites to avoid blocking critical resources
  • Measure overall scraper speed to fine tune gains
  • Block just enough rather than blindly blocking all of a type
  • Expect to periodically update block rules as sites evolve

Follow the techniques outlined here to make your Playwright scraper as fast and efficient as possible! Feel free to reach out if you have any other request blocking challenges.

Join the conversation

Your email address will not be published. Required fields are marked *