Intercepting network requests is an invaluable skill for any web scraper. It unlocks the ability to debug traffic, extract additional data, modify requests on the fly, mock responses and handle authentication.
In this comprehensive 2,500+ word guide, we'll dive deep into using Playwright in Python to intercept both frontend and background traffic.
Why Intercepting Requests is Important
Before we get into the code, it's worth stepping back and looking at why capturing network requests matters for building robust web scrapers.
Monitoring Traffic for Debugging
Like any software, scrapers run into issues. Being able to log and analyze all HTTP requests helps identify and fix problems like:
- 4xx/5xx errors
- Unexpected redirects
- Sources of latency
- Pages blocked by filters
Debugging complex sites often requires digging into headers and parameters to see what's really happening.
Extracting Data from Background APIs
Modern sites rely heavily on REST APIs and JavaScript to load data. Important content is often only available by tapping into these backend calls.
For example, the main product data on an ecommerce site might come from an /api/products endpoint. Without intercepting that request, you can't access the full catalog.
Modifying Requests On the Fly
Having total control over requests allows customizing them for your specific needs:
- Adding login credentials or API keys
- Changing user agent or other headers
- Adjusting parameters and filters
- Rerouting to different API versions
This level of customization is required to scrape many modern sites effectively.
Mocking Responses
For scripting complex scenarios, being able to mock responses is invaluable:
- Testing edge cases like 404s
- Developing with fake data
- Simulating rate limiting and throttling
- Isolating flaky endpoints
Mocking also speeds up scraping by avoiding slow requests.
Automating Authentication Flows
Login forms often submit credentials in the background. Intercepting requests allows automating this:
- Capturing login request
- Extracting CSRF tokens
- Filling user/pass dynamically
This removes the need for manual login or hard-coded sessions.
Blocking Unnecessary Traffic
Many sites send extraneous requests to third parties for analytics, marketing, and so on. Being able to block these avoids wasting bandwidth.
Simulating Network Conditions
Testing real world scenarios like flaky connections requires the ability to:
- Delay requests by arbitrary durations
- Retry failed requests
- Abort requests
- Override DNS resolution
This validation ensures your scraper is resilient to different network environments.
These examples demonstrate the importance of complete control over network traffic for robust browser automation. Now let's see how Playwright enables this in Python.
Overview of Network Events in Playwright
Playwright provides a powerful network interception API through the page.route(), page.on("request") and page.on("response") methods.
Some key capabilities enabled by request interception:
- Logging requests/responses for debugging
- Reading data from background requests
- Modifying requests and responses on the fly
- Blocking requests
- Retrying failed requests
- Throttling request speeds
- Mocking responses with fake data
- Collecting detailed performance metrics
- Exporting detailed HAR logs
- Handling authentication
This works consistently for page navigations, XHR/fetch calls, WebSockets and any other request the page makes. Let's look at how to leverage these capabilities.
Logging All Requests and Responses
The simplest way to capture network traffic is attaching event handlers to the page object:
from playwright.sync_api import sync_playwright

def print_request(request):
    print(request.url)
    print(request.headers)

def print_response(response):
    print(response.url)
    print(response.status)
    print(response.headers)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.on("request", print_request)
    page.on("response", print_response)

    page.goto("https://www.example.com")
This will log the URL, headers, and response status for all requests.
A few things to note:
- The request handler fires just before the request is sent. Actually modifying the request requires page.route(), covered below.
- The response handler fires once the response is received and provides access to the response body.
- These handlers run synchronously, with no need for async/await.
You can filter the logging by request type as shown next.
Reading Background XHR Requests
To differentiate background XHR requests from page navigations, check request.resource_type (fetch() calls show up as "fetch"):
def print_xhr_request(request):
    if request.resource_type == "xhr":
        print("Background XHR request:")
        print(request.method + " " + request.url)

def print_xhr_response(response):
    if response.request.resource_type == "xhr":
        print("Background XHR response:")
        print(response.status)
Now only the dynamic XHR requests will be printed, ignoring static resources like images.
This technique can extract data from REST APIs, GraphQL endpoints and JSON responses powering the frontend.
On many modern sites, the bulk of the important data arrives through background XHR/fetch requests rather than the raw HTML. Intercepting them vastly expands scraping possibilities.
Accessing Full Response Bodies
To access the complete response body of a request, use the response.text() method:
def print_response(response):
    print(response.url)
    print(response.text())  # Full body
There are also convenience parsers like response.json() to handle JSON automatically.
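For example, a minimal handler that only parses JSON responses, using the Content-Type header as a simple filter:

def print_json(response):
    content_type = response.headers.get("content-type", "")
    if "application/json" in content_type:
        print(response.url, "->", response.json())

page.on("response", print_json)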
A few caveats around response bodies:
- They buffer fully in memory, so avoid reading giant responses this way.
- The body is only available once the response has finished loading.
- Timeouts can occur if the body takes a long time to arrive.
For very large payloads, consider calling the endpoint directly with an HTTP client that can stream the download.
Modifying Requests on the Fly
Registering a handler with page.route() allows modifying requests before they are sent:

def intercept_request(route, request):
    # Add/override headers
    headers = {**request.headers, "User-Agent": "My Bot 1.0"}

    # Reroute to a different API version
    url = request.url.replace("/v1/", "/v2/")

    # Send the request with the method and post data overridden too
    route.continue_(
        url=url,
        headers=headers,
        method="POST",
        post_data='{"key": "value"}',
    )

page.route("**/*", intercept_request)
Some common use cases for request modification:
- Adding authentication headers for restricted APIs
- Changing the user agent to avoid bot detection
- Rerouting API calls to different endpoints or parameters
- Altering form data for automation
This enables fine-grained customization for demanding scraping jobs.
Make sure to pass your overrides to route.continue_(); changes made to the Request object itself are not applied.
Blocking Requests
To block requests like trackers or unwanted media files, call route.abort() in the route handler:

def block_request(route, request):
    if request.url.endswith(".mp4"):
        print("Blocking video download")
        route.abort()   # Never sent
    else:
        route.continue_()

page.route("**/*", block_request)

Aborted requests are never sent, which avoids wasting bandwidth on unnecessary resources.
Blocking by default and allowing specific resources can optimize scraping performance.
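A minimal sketch of that block-by-default approach, allowing only a few resource types (the allowlist here is just an illustration; tune it for your target site):

ALLOWED_TYPES = {"document", "xhr", "fetch", "script"}

def allowlist(route, request):
    if request.resource_type in ALLOWED_TYPES:
        route.continue_()
    else:
        route.abort()   # Drop images, fonts, media, trackers, etc.

page.route("**/*", allowlist)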
Throttling and Retrying Requests
Network issues like throttling and intermittent failures can be simulated from a route handler:

from time import sleep

def throttle_request(route, request):
    sleep(1.5)          # Simulate network delay
    route.continue_()   # Then let the request through

page.route("**/*", throttle_request)

This delays each request by 1.5 seconds, mimicking a slow connection.
Adding randomness to the sleep() duration simulates inconsistent latency.
Other ways to stress test scrapers:
- Low failure rate (abort 10% of requests)
- Occasional 5xx errors
- Drastic throttling (10 sec delays)
This validation ensures the scraper is resilient to real-world flakiness.
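For instance, a sketch of the first idea, dropping roughly 10% of requests at random:

import random

def flaky_network(route, request):
    if random.random() < 0.1:
        route.abort()       # Simulate a dropped request
    else:
        route.continue_()

page.route("**/*", flaky_network)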
Timing and Performance Statistics
The request object exposes timing data for analyzing performance:

def print_timings(request):
    timing = request.timing            # milestone timings in milliseconds
    print(timing["startTime"])         # epoch time when the request started
    print(timing["responseEnd"])       # total time until the response finished
    print(request.response())          # the full Response object

page.on("requestfinished", print_timings)
Beyond total wall time, the timing data breaks down:
- DNS lookup time
- Connection setup time
- TLS handshake time
- Time to first byte
- Download time
Tracking these metrics helps diagnose bottlenecks and shape traffic for optimal throughput.
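A sketch that derives some of these from request.timing; values other than startTime are milliseconds relative to the request start, with -1 meaning the phase did not occur (for example, when a connection is reused):

def print_breakdown(request):
    t = request.timing
    if t["domainLookupStart"] != -1:
        print("DNS lookup:", t["domainLookupEnd"] - t["domainLookupStart"], "ms")
    if t["connectStart"] != -1:
        print("Connect + TLS:", t["connectEnd"] - t["connectStart"], "ms")
    print("Time to first byte:", t["responseStart"], "ms")
    print("Download:", t["responseEnd"] - t["responseStart"], "ms")

page.on("requestfinished", print_breakdown)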
Exporting Detailed HAR Files
For complete analysis, capturing all the details in a HAR file is invaluable. Playwright can record one natively when the browser context is created:

# Record every request/response for this context into a HAR file
context = browser.new_context(record_har_path="output.har")
page = context.new_page()

# ... perform crawl
page.goto("https://www.example.com")

# Closing the context flushes the HAR to disk
context.close()
This log can be loaded in HAR analysis tools to find optimization opportunities.
Custom HTTP Handlers
For complete control over both the request and the response, pass a custom handler to page.route():

def handle_route(route, request):
    # Request overrides (url, headers, post_data) can be passed to route.fetch()

    # Send the request on and get the server's response
    response = route.fetch()

    # Response parsing or rewriting could go here...
    route.fulfill(response=response)

page.route("**/*", handle_route)  # Match all routes

Inside the handler, route.fetch() sends the (possibly modified) request and returns the response, which route.fulfill() then hands back to the page.
This provides a single interceptor with full access to mutate traffic.
Debugging WebSockets and Webhooks
Playwright's network handling extends to the WebSocket connections a page opens:

page.on("websocket", print_websocket_traffic)

This provides visibility into the non-HTTP connections the page makes; any ordinary HTTP callbacks the page fires are already covered by the request and response events.
WebSockets are especially useful for scraping real-time data pushed from the server.
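A minimal frame-logging sketch; the websocket object emits framesent, framereceived and close events:

def log_websocket(ws):
    print("WebSocket opened:", ws.url)
    ws.on("framesent", lambda payload: print("sent:", payload))
    ws.on("framereceived", lambda payload: print("received:", payload))
    ws.on("close", lambda ws: print("WebSocket closed"))

page.on("websocket", log_websocket)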
Mocking API Responses
For scripting tests, you may want to mock API responses with fake data:
import json

def mock_response(route, request):
    data = {
        "mockKey": "mockValue"
    }
    route.fulfill(
        status=200,
        content_type="application/json",
        body=json.dumps(data),
    )

page.route("https://api.example.com/*", mock_response)
Now calls to that domain will return the mock instead of hitting the real API.
This avoids slow API calls and allows faking edge cases like 500 errors.
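For instance, to exercise your error handling you could fake a failing API in the same way (the URL pattern is just the one used above):

def mock_server_error(route, request):
    # Pretend the API is down
    route.fulfill(status=500, body="Internal Server Error")

page.route("https://api.example.com/*", mock_server_error)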
Automating Authentication
Logging into sites is a common scraping challenge. Request interception allows automating this:
import json

def auto_login(route, request):
    # Fill credentials into the login request before it is sent
    route.continue_(
        method="POST",
        post_data=json.dumps({"username": "myuser", "password": "secret"}),
    )

context = browser.new_context()

# Install auto-login middleware for every page in the context
context.route("**/login", auto_login)

page = context.new_page()  # Pages in this context share the middleware
By sharing the context, all pages get the auto-login capability.
This removes the need for manual intervention or hardcoded credentials in scripts.
Some tips for robust auto-login:
- Extract CSRF tokens from login form
- Check for the 302 redirect to the home page that signals success
- Handle multi-factor/2FA if needed
- Refresh session tokens on expiry
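As a sketch of the first tip, the token could be read from a hidden input on the login page; the selector and field name here are assumptions, so inspect your target form:

# Grab the CSRF token embedded in the login form (hypothetical field name)
page.goto("https://www.example.com/login")
csrf_token = page.get_attribute("input[name=csrf_token]", "value")

# The token can then be included in the credentials your login handler posts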
Common Use Cases for Request Interception
Now that we've covered the basics, let's look at some common use cases for request interception when building scrapers.
Data Extraction from REST APIs
Modern sites rely heavily on REST APIs for dynamic data. Tapping into these endpoints is key for complete scraping.
Use the resource_type and url patterns to target the JSON APIs. Parse them with response.json() for easy data access.
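A sketch of that pattern; the /api/products endpoint and the items key are placeholders for whatever the target site actually returns:

products = []

def capture_products(response):
    if "/api/products" in response.url and response.request.resource_type in ("xhr", "fetch"):
        products.extend(response.json()["items"])

page.on("response", capture_products)
page.goto("https://www.example.com/catalog")
print(len(products), "products captured")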
Scraping JavaScript SPAs
Single page apps load content asynchronously via APIs. Intercept those requests to extract the data.
Use page.expect_response() (or page.wait_for_response()) around clicks to handle the timing correctly.
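For example, assuming a search button that triggers a request to an /api/search endpoint (both the selector and the URL pattern are illustrative):

# Click the button and wait for the background request it triggers
with page.expect_response("**/api/search*") as response_info:
    page.click("button#search")

data = response_info.value.json()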
Handling Logins and Restricted Content
Login forms often submit via XHR in the background. Capture and automate this flow by:
- Finding the login form submit request
- Extracting the CSRF token
- Filling credentials and POSTing data
This allows scraping restricted accounts and content.
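A sketch of capturing the login XHR while driving the form; the selectors and URL pattern are assumptions to adapt to the target site:

with page.expect_response("**/login") as login_info:
    page.fill("input[name=username]", "myuser")
    page.fill("input[name=password]", "secret")
    page.click("button[type=submit]")

print("Login status:", login_info.value.status)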
Downloading Files
To capture files like PDFs, intercept the request and save the response body to disk:
def save_file(response):
    if response.url.endswith(".pdf"):
        with open("output.pdf", "wb") as f:
            f.write(response.body())   # body() returns the raw bytes

page.on("response", save_file)
Collecting Structured Data
APIs return structured data, but HTML pages often lack semantics.
Use CSS selectors or XPath to extract structured data from HTML, then interleave with API responses.
This provides clean, uniform data from different sources.
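For instance, a scraped HTML field could be merged with data captured from an API, reusing the hypothetical products list from the REST API sketch above (the selector is illustrative):

# HTML side: pull a field with a CSS selector
title = page.text_content("h1.product-title")

# API side: reuse JSON captured by a response handler
price = products[0]["price"] if products else None

record = {"title": title, "price": price}
print(record)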
Observing Web Socket Traffic
More sites are using WebSockets for real-time data. Monitor these channels by handling websocket events in Playwright, as in the frame-logging sketch earlier.
This provides scraping access to live data pushed from the server.
Reverse Engineering Apps
Mobile apps rely heavily on REST APIs, and their web frontends often use the same endpoints. Drive the web app through Playwright to intercept those requests.
Analyze the endpoints to understand the app behavior, then directly call APIs as needed.
Limitations and Challenges
While request interception is extremely powerful, there are some caveats to be aware of:
- Performance overhead – intercepting all requests has a performance cost. Use targeted filtering.
- TLS errors – HTTPS errors may occur. Use custom browser contexts.
- Browser differences – syntax may need tweaks for Firefox/Webkit. Test cross-browser.
- No web workers – web workers have dedicated contexts without access to page events.
- False timeouts – requests may time out if handler takes too long. Optimize code.
- Concurrency challenges – shared state across requests can cause race conditions.
- Capturing all requests – infinitely scrolling sites may have thousands of requests. Set reasonable limits.
- Web app integrity – mocking APIs or modifying parameters too aggressively may break functionality.
Proper error handling and testing are advised when intercepting requests to avoid issues.
Conclusion
Intercepting network traffic unlocks game-changing capabilities for web scraping and automation.
Using Playwright's robust request handling APIs, we can:
- Log requests for debugging
- Extract data from backend APIs
- Modify headers/parameters on the fly
- Retry failed requests
- Mock API responses
- Handle authentication
- Block unnecessary traffic
- Stress test flaky connections
- Analyze performance in depth
- Export detailed HAR logs
All through simple event and route handlers on the Page object.
Learning to tap into these events is an indispensable skill for building production-grade scrapers. It provides insight into all network activity and control over it.
With great power comes responsibility. Make sure to use targeted intercepts judiciously to avoid degrading performance or site integrity.
Do you utilize Playwright request interception in your scrapers? What other use cases have you found valuable? Let me know in the comments!