How to Capture Background Requests and Responses in Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API for controlling headless Chrome. It enables you to launch a browser instance programmatically and automate browser actions like clicking links and filling out forms.

One extremely useful but less commonly used Puppeteer feature is the ability to intercept both browser requests and responses. By tapping into these events, you can monitor and modify network traffic, which opens up all sorts of possibilities for web scraping and automation.

In this comprehensive guide, you'll learn:

What request and response interception are in Puppeteer and why they're useful for web scraping
How to set up request and response interception
Capturing, monitoring, and modifying requests and responses
Common use cases like blocking resources, mocking responses, and debugging network issues

What is Request and Response Interception in Puppeteer?

Browsers make requests all the time in the background to load page resources like images, stylesheets, scripts, fonts, and HTML documents. Puppeteer gives you access to these requests through request interception.

Likewise, when the browser receives responses for those requests, you can tap into those as well by listening for the page's response event, which this guide calls response interception.

By intercepting requests and responses, you can monitor network traffic, block certain requests, mock responses, modify data in flight, and more. This gives you tremendous power when it comes to web scraping and automation.

Some examples of what you can do:

Log all requests and responses for debugging
Block ads, analytics scripts, and other unnecessary resources to speed up page load
Mock API responses to test your frontend without relying on real backend services
Modify request headers to rotate user agents, bypass bot protection, or authenticate
Alter response data to sanitize content or extract fields through DOM manipulation

So in summary, request and response interception opens up the browser's network stack for inspection and manipulation from your Node.js code. It provides invaluable visibility into what the browser is doing under the hood.

How to Set Up Request and Response Interception

To start intercepting requests and responses in Puppeteer, you first need to launch a browser instance and open a page.

const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable interception and attach listeners here, before navigating

  await page.goto('https://example.com');

  await browser.close();

})();

Before calling page.goto, enable request interception by calling page.setRequestInterception and passing true. Requests issued before interception is enabled can't be paused, blocked, or modified. Once it is enabled, every network request is paused so that you can inspect it and decide what to do with it.

  await page.setRequestInterception(true);

Next, add event listeners for the request and response events:

  page.on('request', interceptedRequest => {
    //...
  });

  page.on('response', interceptedResponse => {
    //...
  });

The request callback will fire whenever a request is about to go out. The interceptedRequest object contains details like the request URL, method, headers, post data, and more.

Likewise, the response callback will fire when a response comes back. The interceptedResponse object exposes the status code and headers through status() and headers(), and the body through asynchronous methods such as text(), json(), and buffer().

Now that the listeners are set up, you need to explicitly call interceptedRequest.continue() to allow each request to proceed. Otherwise it will stay paused indefinitely and the page will never finish loading.

page.on('request', interceptedRequest => {

  // inspect request

  interceptedRequest.continue();

});
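Putting the setup steps together, here is a minimal sketch of a helper that wires everything up in the right order. The name captureTraffic and the shape of the returned log are illustrative assumptions, not part of Puppeteer; the page argument is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: `captureTraffic` is not a Puppeteer API.
// `page` is expected to be a Puppeteer Page created with browser.newPage().
async function captureTraffic(page, url) {
  const log = [];

  // Enable interception before navigating so the first request is paused too.
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    log.push({ type: 'request', method: request.method(), url: request.url() });
    // Another handler may already have resolved this request.
    if (!request.isInterceptResolutionHandled()) {
      request.continue();
    }
  });

  page.on('response', (response) => {
    log.push({ type: 'response', status: response.status(), url: response.url() });
  });

  await page.goto(url);
  return log;
}
```

With a real browser, this could be called as captureTraffic(await browser.newPage(), 'https://example.com') inside the async function shown earlier.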

And that's the basic setup! Next let's look at some examples of capturing, monitoring, and modifying requests and responses.

Logging All Requests and Responses

One simple but useful application is to log all requests and responses to the console. This gives you full visibility into what resources the page is loading and what data is being exchanged.

page.on('request', interceptedRequest => {

  console.log('Request:', interceptedRequest.url());

  interceptedRequest.continue();

});

page.on('response', interceptedResponse => {

  console.log('Response:', interceptedResponse.url(), interceptedResponse.status());

});

This will print something like:

Request: https://example.com/
Request: https://example.com/styles.css
Request: https://example.com/script.js
Response: https://example.com/ 200
Response: https://example.com/styles.css 200
Response: https://example.com/script.js 200

You can see the URL and HTTP status code for every request and response. This helps identify resources being loaded and debug issues.
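For a slightly richer log, the sketch below also records the HTTP method, the resource type, whether a response came from the cache, and requests that failed before any response arrived. The helper name attachNetworkLogger and its write parameter are assumptions made for illustration. It only observes traffic and never calls continue(), so when interception is enabled another handler still has to resolve each request.

```javascript
// Hypothetical helper: `attachNetworkLogger` is not a Puppeteer API.
// `write` defaults to console.log but can be any function that accepts a string.
function attachNetworkLogger(page, write = console.log) {
  page.on('request', (request) => {
    write(`Request:  ${request.method()} ${request.resourceType()} ${request.url()}`);
  });

  page.on('response', (response) => {
    const cached = response.fromCache() ? ' (from cache)' : '';
    write(`Response: ${response.status()} ${response.url()}${cached}`);
  });

  // Requests that never receive a response (DNS errors, aborts, timeouts).
  page.on('requestfailed', (request) => {
    const failure = request.failure();
    write(`Failed:   ${request.url()} ${failure ? failure.errorText : 'unknown error'}`);
  });
}
```

Passing a custom write function makes it easy to send the same lines to a file or a test buffer instead of the console.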

Blocking Requests

Another common use case is blocking certain requests like ads, analytics scripts, and images to speed up page load.

You can select which requests to block based on properties like the URL or resource type:

page.on('request', interceptedRequest => {

  if (interceptedRequest.resourceType() === 'image') {
    interceptedRequest.abort();
  } else {
    interceptedRequest.continue();
  }

});

This will cancel all image requests, preventing them from being loaded.

Or block based on URL patterns:

const blockedUrls = [
  'https://analytics.example.com',
  'https://example.com/ad?'
];

page.on('request', interceptedRequest => {

  // Block the request if its URL contains any of the patterns above
  if (blockedUrls.some(url => interceptedRequest.url().includes(url))) {
    interceptedRequest.abort();
  } else {
    interceptedRequest.continue();
  }

});

Now any request whose URL contains one of those patterns will be blocked.
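The two approaches can be combined. The sketch below blocks by resource type and by exact host name, parsing the URL so that a blocked host appearing inside a query string does not trigger a false match. The type and host lists are placeholder values, and shouldBlock and blockUnwanted are illustrative names rather than Puppeteer functions.

```javascript
// Hypothetical policy: the resource types and host names below are examples only.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);
const BLOCKED_HOSTS = new Set(['analytics.example.com', 'ads.example.com']);

// Pure decision function, easy to unit test without a browser.
function shouldBlock(url, resourceType) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  try {
    return BLOCKED_HOSTS.has(new URL(url).hostname);
  } catch {
    return false; // not a parseable URL: let the request through
  }
}

// Request handler that applies the policy to an intercepted request.
function blockUnwanted(request) {
  if (shouldBlock(request.url(), request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
}

// Usage, once interception is enabled: page.on('request', blockUnwanted);
```

Keeping the decision in a separate function means the blocking rules can be checked in isolation before they are attached to a page.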

Modifying Requests

Request interception also allows you to modify requests in flight.

For example, you can rewrite the URL:

page.on('request', interceptedRequest => {

  const url = interceptedRequest.url();

  if (url.includes('foo=bar')) {
    interceptedRequest.continue({
      url: url.replace('foo=bar', 'foo=baz')
    });
  } else {
    interceptedRequest.continue();
  }

});

This will replace foo=bar with foo=baz in any request URLs.
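Plain string replacement also matches unrelated parameters such as xfoo=bar. A slightly more careful sketch parses the URL and changes only the named query parameter. The helper names rewriteQueryParam and rewriteFoo are assumptions for illustration, and note that the URL API re-serializes the query string, which may normalize its encoding.

```javascript
// Hypothetical helper: returns the rewritten URL, or null when nothing needs to change.
function rewriteQueryParam(rawUrl, name, from, to) {
  const url = new URL(rawUrl);
  if (url.searchParams.get(name) !== from) return null;
  url.searchParams.set(name, to);
  return url.toString();
}

// Request handler: swaps foo=bar for foo=baz and continues everything else untouched.
function rewriteFoo(request) {
  const rewritten = rewriteQueryParam(request.url(), 'foo', 'bar', 'baz');
  request.continue(rewritten ? { url: rewritten } : {});
}

// Usage, once interception is enabled: page.on('request', rewriteFoo);
```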

You can also add, remove, or edit headers:

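page.on('request', interceptedRequest => {

  // headers() returns lower-cased header names
  const headers = Object.assign({}, interceptedRequest.headers());

  delete headers['user-agent'];

  headers['X-Custom-Header'] = 'Scraping';

  interceptedRequest.continue({headers});

});

This drops the user-agent entry from the headers passed to continue() and adds a custom one. Note that Puppeteer reports header names in lower case, so the key to delete is 'user-agent', not 'User-Agent'.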

With the ability to rewrite requests, you can do things like rotate user agents, bypass bot protection, authenticate requests, and more.
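As one illustration of header rewriting, the sketch below rotates through a list of user agent strings on each request. The strings are placeholder values, and buildHeaders and rotateUserAgent are hypothetical names. Whether the browser honors every overridden header can vary, and page.setUserAgent() remains the usual way to set a single user agent for a whole page.

```javascript
// Example values only: substitute user agent strings that suit your use case.
const USER_AGENTS = [
  'ExampleBot/1.0 (+https://example.com/bot)',
  'ExampleBot/2.0 (+https://example.com/bot)',
];
let nextAgent = 0;

// Returns a copy of `headers` with extra headers applied and the next user agent in the list.
function buildHeaders(headers, extra = {}) {
  const userAgent = USER_AGENTS[nextAgent % USER_AGENTS.length];
  nextAgent += 1;
  return { ...headers, ...extra, 'user-agent': userAgent };
}

// Request handler: continues every request with the rewritten headers.
function rotateUserAgent(request) {
  request.continue({
    headers: buildHeaders(request.headers(), { 'x-custom-header': 'Scraping' }),
  });
}

// Usage, once interception is enabled: page.on('request', rotateUserAgent);
```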

Mocking API Responses

Request interception enables you to mock API responses as well. This is extremely powerful for testing user interfaces and eliminating external dependencies.

For example, you could hijack requests to your JSON API and return mock data instead:

const mockData = {
  /* mock API response */ 
};

page.on('request', interceptedRequest => {

  if (interceptedRequest.url().endsWith('/api/data')) {
    interceptedRequest.respond({
      contentType: 'application/json',
      body: JSON.stringify(mockData)
    });
  } else {
    interceptedRequest.continue();
  }

});

Now interactions with the frontend will load instantly using fake data instead of hitting real APIs.
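When several endpoints need mocking, a small route table keeps the handler readable. The sketch below matches on HTTP method and path and lets everything else through. The routes, payloads, and the names MOCK_ROUTES, findMock, and mockApi are all made up for illustration.

```javascript
// Hypothetical route table: paths and payloads are placeholders.
const MOCK_ROUTES = {
  'GET /api/data': { status: 200, body: { items: [1, 2, 3] } },
  'POST /api/data': { status: 201, body: { ok: true } },
};

// Looks up a mock by method and path, ignoring the query string.
function findMock(method, rawUrl) {
  const { pathname } = new URL(rawUrl);
  return MOCK_ROUTES[`${method} ${pathname}`] || null;
}

// Request handler: answers mocked routes locally and continues the rest.
function mockApi(request) {
  const mock = findMock(request.method(), request.url());
  if (!mock) {
    request.continue();
    return;
  }
  request.respond({
    status: mock.status,
    contentType: 'application/json',
    body: JSON.stringify(mock.body),
  });
}

// Usage, once interception is enabled: page.on('request', mockApi);
```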

Modifying Responses

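Responses delivered to the response event are read-only: Puppeteer has no method for editing a response after it arrives, and the response object has no continue(). To change what the page receives, intercept the request instead, fetch the original content yourself, and hand the modified result back with interceptedRequest.respond().

You can rewrite the HTML to remove elements:

page.on('request', async interceptedRequest => {

  if (interceptedRequest.url().endsWith('.html')) {
    // Fetch the original document from Node.js (global fetch requires Node.js 18+)
    const upstream = await fetch(interceptedRequest.url());
    const body = await upstream.text();

    const newBody = body.replace('<div id="ad">', '');

    await interceptedRequest.respond({
      status: upstream.status,
      contentType: 'text/html',
      body: newBody
    });
  } else {
    interceptedRequest.continue();
  }

});

Or parse and transform JSON data:

page.on('request', async interceptedRequest => {

  if (interceptedRequest.url().endsWith('/api/data')) {
    const upstream = await fetch(interceptedRequest.url());
    const json = await upstream.json();

    // modify JSON data (transformData is your own function)
    const newJson = transformData(json);

    await interceptedRequest.respond({
      status: upstream.status,
      contentType: 'application/json',
      body: JSON.stringify(newJson)
    });
  } else {
    interceptedRequest.continue();
  }

});

Note that a fetch made from Node.js does not carry the page's cookies or request headers unless you copy them over. With the ability to manipulate responses, you can extract just the data you need, filter out unwanted content, and more.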
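Often the goal is simply to read the data that background XHR or fetch calls return, without changing it. The following sketch collects the JSON bodies of such responses. The helper name captureJsonResponses and the shape of the collected entries are assumptions for illustration.

```javascript
// Hypothetical helper: collects JSON bodies of XHR/fetch responses into `sink`.
function captureJsonResponses(page, sink = []) {
  page.on('response', async (response) => {
    const type = response.request().resourceType();
    if (type !== 'xhr' && type !== 'fetch') return;

    // Puppeteer reports header names in lower case.
    const contentType = response.headers()['content-type'] || '';
    if (!contentType.includes('application/json')) return;

    try {
      sink.push({ url: response.url(), status: response.status(), data: await response.json() });
    } catch (err) {
      // Bodies can be unavailable (for example on redirects) or not valid JSON.
      sink.push({ url: response.url(), status: response.status(), error: err.message });
    }
  });
  return sink;
}
```

Because the handler is asynchronous, entries may still be arriving shortly after page.goto resolves, so it can help to wait for the network to settle before reading the sink.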

Debugging Web App Issues

Request and response interception is invaluable for debugging complex web apps.

Some ways you can debug network issues:

Log errors and inspect stack traces
Check for failed requests and missing resources
Verify status codes and confirm returned data
Monitor headers like caching policies and compression
Identify loading waterfalls and performance bottlenecks

For example, you may see 404 responses indicating missing assets:

Request: https://example.com/404.js
Response: https://example.com/404.js 404

Or API errors with a 500 status:

Request: https://api.example.com/data
Response: https://api.example.com/data 500

Knowing exactly which requests are failing goes a long way toward debugging problems.

You can also analyze the initiation and timing of requests to optimize loading. Identifying fast vs slow endpoints can help prioritize performance improvements.
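A rough way to find slow endpoints is to time each request from the request event to requestfinished or requestfailed. The sketch below does that with a Map keyed by the request object. The helper name trackTimings and the injectable now parameter are assumptions for illustration, and the figures are wall-clock approximations that include any time a request spends paused by interception.

```javascript
// Hypothetical helper: measures time from the `request` event to
// `requestfinished` / `requestfailed`. `now` is injectable to simplify testing.
function trackTimings(page, now = Date.now) {
  const started = new Map(); // request object -> start timestamp
  const timings = [];

  page.on('request', (request) => started.set(request, now()));

  const finish = (outcome) => (request) => {
    const start = started.get(request);
    if (start === undefined) return; // request began before tracking started
    started.delete(request);
    timings.push({ url: request.url(), outcome, ms: now() - start });
  };

  page.on('requestfinished', finish('finished'));
  page.on('requestfailed', finish('failed'));

  // Returns a snapshot of the timings, slowest requests first.
  return () => [...timings].sort((a, b) => b.ms - a.ms);
}
```

Calling the returned function after navigation gives a list whose first entries are the best candidates for performance work.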

In summary, request and response interception grants complete low-level visibility into network activity. This transparency is invaluable when building and troubleshooting complex browser-based applications.

Conclusion

Being able to intercept browser requests and responses unlocks a whole new level of power and visibility for browser automation and web scraping.

Some key takeaways:

Enable interception with page.setRequestInterception(true)
Listen for request and response events
Inspect interceptedRequest and interceptedResponse objects
Explicitly continue intercepted requests
Log, block, mock, or modify requests and responses
Identify issues, optimize performance, extract data, and more!

The ability to tap into network traffic is extremely useful for web scraping, automating tests, debugging apps, and more. Request and response interception brings Puppeteer automation to the next level.

I hope this guide gives you a comprehensive overview of request and response interception in Puppeteer. Let me know if you have any other questions!
