Intercepting network requests is an invaluable skill for any web scraper. It unlocks the ability to debug traffic, extract additional data, modify requests on the fly, mock responses and handle authentication.
In this comprehensive 2,500+ word guide, we'll dive deep into using Playwright in Python to intercept both frontend and background traffic.
Why Intercepting Requests is Important
Before we get into the code, it's worth stepping back and looking at why capturing network requests matters for building robust web scrapers.
Monitoring Traffic for Debugging
Like any software, scrapers run into issues. Being able to log and analyze all HTTP requests helps identify and fix problems like:
- 4xx/5xx errors
- Unexpected redirects
- Sources of latency
- Pages blocked by filters
Debugging complex sites often requires digging into headers and parameters to see what's really happening.
Extracting Data from Background APIs
Modern sites rely heavily on REST APIs and JavaScript to load data. Important content is often only available by tapping into these backend calls.
For example, the main product data on an ecommerce site might come from an /api/products endpoint. Without intercepting that request, you can't access the full catalog.
Modifying Requests On the Fly
Having total control over requests allows customizing them for your specific needs:
- Adding login credentials or API keys
- Changing user agent or other headers
- Adjusting parameters and filters
- Rerouting to different API versions
This level of customization is required to scrape many modern sites effectively.
Mocking Responses
For scripting complex scenarios, being able to mock responses is invaluable:
- Testing edge cases like 404s
- Developing with fake data
- Simulating rate limiting and throttling
- Isolating flaky endpoints
Mocking also speeds up scraping by avoiding slow requests.
Automating Authentication Flows
Login forms often submit credentials in the background. Intercepting requests allows automating this:
- Capturing login request
- Extracting CSRF tokens
- Filling user/pass dynamically
This removes the need for manual login or hard-coded sessions.
Blocking Unnecessary Traffic
Many sites send extraneous requests to third parties for analytics, marketing, and so on. Being able to block these avoids wasting bandwidth.
Simulating Network Conditions
Testing real world scenarios like flaky connections requires the ability to:
- Delay requests by arbitrary durations
- Retry failed requests
- Abort requests
- Override DNS resolution
This validation ensures your scraper is resilient to different network environments.
These examples demonstrate the importance of complete control over network traffic for robust browser automation. Now let's see how Playwright enables this in Python.
Overview of Network Events in Playwright
Playwright provides a powerful network interception API through the page.route(), page.on("request") and page.on("response") methods.
Some key capabilities enabled by request interception:
- Logging requests/responses for debugging
- Reading data from background requests
- Modifying requests and responses on the fly
- Blocking requests
- Retrying failed requests
- Throttling request speeds
- Mocking responses with fake data
- Collecting detailed performance metrics
- Exporting detailed HAR logs
- Handling authentication
This works consistently for page navigations, XHR/fetch calls, WebSockets and any other request the page makes. Let's look at how to leverage these capabilities.
Logging All Requests and Responses
The simplest way to capture network traffic is attaching event handlers to the page object:
from playwright.sync_api import sync_playwright

def print_request(request):
    print(request.url)
    print(request.headers)

def print_response(response):
    print(response.url)
    print(response.status)
    print(response.headers)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.on("request", print_request)
    page.on("response", print_response)

    page.goto("https://www.example.com")
This will log the URL, headers, and response status for all requests.
A few things to note:
- The request handler fires just before the request is sent. Actually modifying the request requires page.route(), covered below.
- The response handler fires once the response is received and provides access to the response body.
- These handlers run synchronously, with no need for async/await.
You can filter the logging by request type as shown next.
Reading Background XHR Requests
To differentiate background XHR requests from page navigations, check request.resource_type (fetch() calls show up as "fetch"):
def print_xhr_request(request):
    if request.resource_type == "xhr":
        print("Background XHR request:")
        print(request.method + " " + request.url)

def print_xhr_response(response):
    if response.request.resource_type == "xhr":
        print("Background XHR response:")
        print(response.status)
Now only the dynamic XHR requests will be printed, ignoring static resources like images.
This technique can extract data from REST APIs, GraphQL endpoints and JSON responses powering the frontend.
On many modern sites, the bulk of the important data arrives through background XHR/fetch requests rather than the raw HTML. Intercepting them vastly expands scraping possibilities.
Accessing Full Response Bodies
To access the complete response body of a request, use the response.text() method:
def print_response(response):
    print(response.url)
    print(response.text())  # Full body
There are also convenience parsers like response.json() to handle JSON automatically.
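For example, a minimal handler that only parses JSON responses, using the Content-Type header as a simple filter:

def print_json(response):
    content_type = response.headers.get("content-type", "")
    if "application/json" in content_type:
        print(response.url, "->", response.json())

page.on("response", print_json)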
A few caveats around response bodies:
- They buffer fully in memory, so avoid reading giant responses this way.
- The body is only available once the response has finished loading.
- Timeouts can occur if the body takes a long time to arrive.
For very large payloads, consider calling the endpoint directly with an HTTP client that can stream the download.
Modifying Requests on the Fly
Registering a handler with page.route() allows modifying requests before they are sent:

def intercept_request(route, request):
    # Add/override headers
    headers = {**request.headers, "User-Agent": "My Bot 1.0"}

    # Reroute to a different API version
    url = request.url.replace("/v1/", "/v2/")

    # Send the request with the method and post data overridden too
    route.continue_(
        url=url,
        headers=headers,
        method="POST",
        post_data='{"key": "value"}',
    )

page.route("**/*", intercept_request)
Some common use cases for request modification:
- Adding authentication headers for restricted APIs
- Changing the user agent to avoid bot detection
- Rerouting API calls to different endpoints or parameters
- Altering form data for automation
This enables fine-grained customization for demanding scraping jobs.
Make sure to pass your overrides to route.continue_(); changes made to the Request object itself are not applied.
Blocking Requests
To block requests like trackers or unwanted media files, call route.abort() in the route handler:

def block_request(route, request):
    if request.url.endswith(".mp4"):
        print("Blocking video download")
        route.abort()   # Never sent
    else:
        route.continue_()

page.route("**/*", block_request)

Aborted requests are never sent, which avoids wasting bandwidth on unnecessary resources.
Blocking by default and allowing specific resources can optimize scraping performance.
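A minimal sketch of that block-by-default approach, allowing only a few resource types (the allowlist here is just an illustration; tune it for your target site):

ALLOWED_TYPES = {"document", "xhr", "fetch", "script"}

def allowlist(route, request):
    if request.resource_type in ALLOWED_TYPES:
        route.continue_()
    else:
        route.abort()   # Drop images, fonts, media, trackers, etc.

page.route("**/*", allowlist)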
Throttling and Retrying Requests
Network issues like throttling and intermittent failures can be simulated from a route handler:

from time import sleep

def throttle_request(route, request):
    sleep(1.5)          # Simulate network delay
    route.continue_()   # Then let the request through

page.route("**/*", throttle_request)

This delays each request by 1.5 seconds, mimicking a slow connection.
Adding randomness to the sleep() duration simulates inconsistent latency.
Other ways to stress test scrapers:
- Low failure rate (abort 10% of requests)
- Occasional 5xx errors
- Drastic throttling (10 sec delays)
This validation ensures the scraper is resilient to real-world flakiness.
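For instance, a sketch of the first idea, dropping roughly 10% of requests at random:

import random

def flaky_network(route, request):
    if random.random() < 0.1:
        route.abort()       # Simulate a dropped request
    else:
        route.continue_()

page.route("**/*", flaky_network)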
Timing and Performance Statistics
The request object exposes timing data for analyzing performance:

def print_timings(request):
    timing = request.timing            # milestone timings in milliseconds
    print(timing["startTime"])         # epoch time when the request started
    print(timing["responseEnd"])       # total time until the response finished
    print(request.response())          # the full Response object

page.on("requestfinished", print_timings)
Beyond total wall time, the timing data breaks down:
- DNS lookup time
- Connection setup time
- TLS handshake time
- Time to first byte
- Download time
Tracking these metrics helps diagnose bottlenecks and shape traffic for optimal throughput.
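A sketch that derives some of these from request.timing; values other than startTime are milliseconds relative to the request start, with -1 meaning the phase did not occur (for example, when a connection is reused):

def print_breakdown(request):
    t = request.timing
    if t["domainLookupStart"] != -1:
        print("DNS lookup:", t["domainLookupEnd"] - t["domainLookupStart"], "ms")
    if t["connectStart"] != -1:
        print("Connect + TLS:", t["connectEnd"] - t["connectStart"], "ms")
    print("Time to first byte:", t["responseStart"], "ms")
    print("Download:", t["responseEnd"] - t["responseStart"], "ms")

page.on("requestfinished", print_breakdown)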
Exporting Detailed HAR Files
For complete analysis, capturing all the details in a HAR file is invaluable. Playwright can record one natively when the browser context is created:

# Record every request/response for this context into a HAR file
context = browser.new_context(record_har_path="output.har")
page = context.new_page()

# ... perform crawl
page.goto("https://www.example.com")

# Closing the context flushes the HAR to disk
context.close()
This log can be loaded in HAR analysis tools to find optimization opportunities.
Custom HTTP Handlers
For complete control over both the request and the response, pass a custom handler to page.route():

def handle_route(route, request):
    # Request overrides (url, headers, post_data) can be passed to route.fetch()

    # Send the request on and get the server's response
    response = route.fetch()

    # Response parsing or rewriting could go here...
    route.fulfill(response=response)

page.route("**/*", handle_route)  # Match all routes

Inside the handler, route.fetch() sends the (possibly modified) request and returns the response, which route.fulfill() then hands back to the page.
This provides a single interceptor with full access to mutate traffic.
Debugging WebSockets and Webhooks
Playwright's network handling extends to the WebSocket connections a page opens:

page.on("websocket", print_websocket_traffic)

This provides visibility into the non-HTTP connections the page makes; any ordinary HTTP callbacks the page fires are already covered by the request and response events.
WebSockets are especially useful for scraping real-time data pushed from the server.
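A minimal frame-logging sketch; the websocket object emits framesent, framereceived and close events:

def log_websocket(ws):
    print("WebSocket opened:", ws.url)
    ws.on("framesent", lambda payload: print("sent:", payload))
    ws.on("framereceived", lambda payload: print("received:", payload))
    ws.on("close", lambda ws: print("WebSocket closed"))

page.on("websocket", log_websocket)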
Mocking API Responses
For scripting tests, you may want to mock API responses with fake data:
import json

def mock_response(route, request):
    data = {
        "mockKey": "mockValue"
    }
    route.fulfill(
        status=200,
        content_type="application/json",
        body=json.dumps(data),
    )

page.route("https://api.example.com/*", mock_response)
Now calls to that domain will return the mock instead of hitting the real API.
This avoids slow API calls and allows faking edge cases like 500 errors.
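For instance, to exercise your error handling you could fake a failing API in the same way (the URL pattern is just the one used above):

def mock_server_error(route, request):
    # Pretend the API is down
    route.fulfill(status=500, body="Internal Server Error")

page.route("https://api.example.com/*", mock_server_error)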
Automating Authentication
Logging into sites is a common scraping challenge. Request interception allows automating this:
import json

def auto_login(route, request):
    # Fill credentials into the login request before it is sent
    route.continue_(
        method="POST",
        post_data=json.dumps({"username": "myuser", "password": "secret"}),
    )

context = browser.new_context()

# Install auto-login middleware for every page in the context
context.route("**/login", auto_login)

page = context.new_page()  # Pages in this context share the middleware
By sharing the context, all pages get the auto-login capability.
This removes the need for manual intervention or hardcoded credentials in scripts.
Some tips for robust auto-login:
- Extract CSRF tokens from login form
- Check for the 302 redirect to the home page that signals success
- Handle multi-factor/2FA if needed
- Refresh session tokens on expiry
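As a sketch of the first tip, the token could be read from a hidden input on the login page; the selector and field name here are assumptions, so inspect your target form:

# Grab the CSRF token embedded in the login form (hypothetical field name)
page.goto("https://www.example.com/login")
csrf_token = page.get_attribute("input[name=csrf_token]", "value")

# The token can then be included in the credentials your login handler posts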
Common Use Cases for Request Interception
Now that we've covered the basics, let's look at some common use cases for request interception when building scrapers.
Data Extraction from REST APIs
Modern sites rely heavily on REST APIs for dynamic data. Tapping into these endpoints is key for complete scraping.
Use the resource_type and url patterns to target the JSON APIs. Parse them with response.json() for easy data access.
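A sketch of that pattern; the /api/products endpoint and the items key are placeholders for whatever the target site actually returns:

products = []

def capture_products(response):
    if "/api/products" in response.url and response.request.resource_type in ("xhr", "fetch"):
        products.extend(response.json()["items"])

page.on("response", capture_products)
page.goto("https://www.example.com/catalog")
print(len(products), "products captured")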
Scraping JavaScript SPAs
Single page apps load content asynchronously via APIs. Intercept those requests to extract the data.
Use page.expect_response() (or page.wait_for_response()) around clicks to handle the timing correctly.
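For example, assuming a search button that triggers a request to an /api/search endpoint (both the selector and the URL pattern are illustrative):

# Click the button and wait for the background request it triggers
with page.expect_response("**/api/search*") as response_info:
    page.click("button#search")

data = response_info.value.json()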
Handling Logins and Restricted Content
Login forms often submit via XHR in the background. Capture and automate this flow by:
- Finding the login form submit request
- Extracting the CSRF token
- Filling credentials and POSTing data
This allows scraping restricted accounts and content.
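A sketch of capturing the login XHR while driving the form; the selectors and URL pattern are assumptions to adapt to the target site:

with page.expect_response("**/login") as login_info:
    page.fill("input[name=username]", "myuser")
    page.fill("input[name=password]", "secret")
    page.click("button[type=submit]")

print("Login status:", login_info.value.status)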
Downloading Files
To capture files like PDFs, intercept the request and save the response body to disk:
def save_file(response):
    if response.url.endswith(".pdf"):
        with open("output.pdf", "wb") as f:
            f.write(response.body())   # body() returns the raw bytes

page.on("response", save_file)
Collecting Structured Data
APIs return structured data, but HTML pages often lack semantics.
Use CSS selectors or XPath to extract structured data from HTML, then interleave with API responses.
This provides clean, uniform data from different sources.
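For instance, a scraped HTML field could be merged with data captured from an API, reusing the hypothetical products list from the REST API sketch above (the selector is illustrative):

# HTML side: pull a field with a CSS selector
title = page.text_content("h1.product-title")

# API side: reuse JSON captured by a response handler
price = products[0]["price"] if products else None

record = {"title": title, "price": price}
print(record)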
Observing Web Socket Traffic
More sites are using WebSockets for real-time data. Monitor these channels by handling websocket events in Playwright, as in the frame-logging sketch earlier.
This provides scraping access to live data pushed from the server.
Reverse Engineering Apps
Mobile apps rely heavily on REST APIs, and their web frontends often use the same endpoints. Drive the web app through Playwright to intercept those requests.
Analyze the endpoints to understand the app behavior, then directly call APIs as needed.
Limitations and Challenges
While request interception is extremely powerful, there are some caveats to be aware of:
- Performance overhead – intercepting all requests has a performance cost. Use targeted filtering.
- TLS errors – HTTPS errors may occur. Use custom browser contexts.
- Browser differences – syntax may need tweaks for Firefox/Webkit. Test cross-browser.
- No web workers – web workers have dedicated contexts without access to page events.
- False timeouts – requests may time out if handler takes too long. Optimize code.
- Concurrency challenges – shared state across requests can cause race conditions.
- Capturing all requests – infinitely scrolling sites may have thousands of requests. Set reasonable limits.
- Web app integrity – mocking APIs or modifying parameters too aggressively may break functionality.
Proper error handling and testing are advised when intercepting requests to avoid issues.
Conclusion
Intercepting network traffic unlocks game-changing capabilities for web scraping and automation.
Using Playwright's robust request handling APIs, we can:
- Log requests for debugging
- Extract data from backend APIs
- Modify headers/parameters on the fly
- Retry failed requests
- Mock API responses
- Handle authentication
- Block unnecessary traffic
- Stress test flaky connections
- Analyze performance in depth
- Export detailed HAR logs
All through simple event and route handlers on the Page object.
Learning to tap into these events is an indispensable skill for building production-grade scrapers. It provides insight into all network activity and control over it.
With great power comes responsibility. Make sure to use targeted intercepts judiciously to avoid degrading performance or site integrity.
Do you utilize Playwright request interception in your scrapers? What other use cases have you found valuable? Let me know in the comments!