Block Resources and Speed Up Web Scraping with Puppeteer

Puppeteer is a popular and powerful Node.js library that allows you to control a headless Chrome browser. It's commonly used for web scraping, automated testing, and building bots. One of the challenges when scraping websites with Puppeteer is that web pages often load many resources that aren't needed for your use case, such as images, videos, stylesheets, and scripts. These extraneous resources can significantly slow down your scraping jobs.

Fortunately, Puppeteer provides several ways to block specific resources from loading. By skipping unneeded resources, you can noticeably speed up your scraping scripts and consume less bandwidth in the process. Blocking ads and trackers can also make your traffic resemble that of a user running an ad blocker, although this is not a reliable way to avoid detection on its own.

In this in-depth guide, we'll explore multiple methods you can use to easily block resources in Puppeteer. Whether you're new to Puppeteer or an experienced user looking to optimize your scripts, this article will provide you with the tools and code examples you need. Let's dive in!

Blocking Resources with Puppeteer's Request Interception API

The simplest built-in way to block resources in Puppeteer is by using the page.setRequestInterception() method. This instructs Puppeteer to intercept and inspect every request made by the browser. You can then choose to allow or block each request individually.

Here's a basic example of using request interception to block all image requests:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Intercept every request so we can decide whether to let it through
      await page.setRequestInterception(true);
      page.on('request', (request) => {
        if (request.resourceType() === 'image') {
          request.abort();
        } else {
          request.continue();
        }
      });

      await page.goto('https://example.com');
      await page.screenshot({ path: 'screenshot.png' });
      await browser.close();
    })();

After creating a new page, we enable request interception by calling page.setRequestInterception(true). This allows us to listen for the 'request' event, which is emitted for every request made by the page.

In the request event handler, we check request.resourceType() to determine whether the request is for an image. Puppeteer's resource types include 'document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack', 'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', and 'other'.

If the request is for an image, we block it by calling request.abort(). Otherwise, the request is allowed to continue with request.continue(). Once interception is enabled, every request must be resolved one way or the other, or the page will stall waiting for it. Finally, we navigate to a URL, take a screenshot, and close the browser.
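To block several resource types at once, the same check generalizes to a set lookup. The following is a minimal sketch rather than anything Puppeteer ships: createBlockingHandler and BLOCKED_TYPES are hypothetical names chosen for illustration, and the handler relies only on the resourceType(), abort(), and continue() methods used above.

```javascript
// Hypothetical helper (not part of Puppeteer): builds a 'request' handler
// that aborts any request whose resource type is in the given set.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

function createBlockingHandler(blockedTypes = BLOCKED_TYPES) {
  return (request) => {
    if (blockedTypes.has(request.resourceType())) {
      return request.abort();
    }
    return request.continue();
  };
}

// Usage with a real page (interception must be enabled first):
//   await page.setRequestInterception(true);
//   page.on('request', createBlockingHandler());
```

Keeping the decision in one small function makes it easy to adjust the blocked set per site without touching the rest of the script.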

You can also choose to block resources by inspecting the URL. For example, here's how you could block requests for PNG and JPEG images by looking at the file extension:

    page.on('request', (request) => {
      const url = request.url();
      if (url.endsWith('.png') || url.endsWith('.jpg')) {
        request.abort();
      } else {
        request.continue();
      }
    });
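A suffix check on the full URL misses images served with a query string, such as photo.png?v=2. One way around that, sketched here using Node's standard URL class, is to test only the path component. hasBlockedExtension and BLOCKED_EXTENSIONS are hypothetical names for illustration, not part of Puppeteer.

```javascript
// Hypothetical helper: checks the URL path (ignoring query string and
// fragment) against a list of file extensions.
const BLOCKED_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif', '.webp'];

function hasBlockedExtension(rawUrl, extensions = BLOCKED_EXTENSIONS) {
  let pathname;
  try {
    pathname = new URL(rawUrl).pathname.toLowerCase();
  } catch (err) {
    return false; // unparsable URL: do not block on extension grounds
  }
  return extensions.some((ext) => pathname.endsWith(ext));
}

// Usage inside a 'request' handler:
//   if (hasBlockedExtension(request.url())) request.abort();
//   else request.continue();
```

Extension matching remains a heuristic, since many servers deliver images from URLs with no extension at all; checking request.resourceType() is usually the more dependable test.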

Request interception gives you a lot of control, but it has a cost: every request passes through your handler, and enabling interception disables the page cache. Next, let's look at a more convenient way to apply blocking to every page.

Blocking Resources with Puppeteer Plugins

While Puppeteer's built-in request interception gets the job done, a more streamlined way to block resources is by using a Puppeteer plugin. The excellent puppeteer-extra library provides a plugin called block-resources that makes it a breeze.

First install the necessary packages (puppeteer-extra wraps a regular puppeteer install, so puppeteer itself must be present as well):

    npm install puppeteer puppeteer-extra puppeteer-extra-plugin-block-resources

Then require them in your script and configure the plugin:

    const puppeteer = require('puppeteer-extra');
    const blockResourcesPlugin = require('puppeteer-extra-plugin-block-resources')();

    // Block images, media, and stylesheets
    blockResourcesPlugin.blockedTypes.add('image');
    blockResourcesPlugin.blockedTypes.add('media');
    blockResourcesPlugin.blockedTypes.add('stylesheet');

    puppeteer.use(blockResourcesPlugin);

The plugin supports blocking the same resource types as Puppeteer. Under the hood it still relies on request interception, which it enables on every page the browser opens, so the benefit is less boilerplate rather than a different blocking mechanism. Simply add the resource types you want blocked to blockedTypes.

After configuring the plugin, use Puppeteer as you normally would:

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      await page.goto('https://example.com');
      await page.screenshot({ path: 'screenshot.png' });
      await browser.close();
    })();

Without any images, stylesheets, or media, the page will usually load faster, though the screenshot will show the unstyled page. You can customize which resources are blocked to suit your needs.
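How much faster depends on the site, so it is worth measuring. The sketch below times a single navigation on a page you have already configured; timeNavigation is a hypothetical helper name, and it assumes only the standard page.goto() call with its waitUntil option.

```javascript
// Hypothetical helper: returns how long one navigation took, in milliseconds.
async function timeNavigation(page, url) {
  const start = process.hrtime.bigint();
  await page.goto(url, { waitUntil: 'load' });
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / 1e6; // nanoseconds to milliseconds
}

// Usage: run once with blocking enabled and once without, then compare.
//   const ms = await timeNavigation(page, 'https://example.com');
//   console.log(`Loaded in ${ms.toFixed(0)} ms`);
```

A single run is noisy because of network variance, so averaging several runs per configuration gives a fairer comparison.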

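As far as its documentation shows, the block-resources plugin filters only by resource type; it does not expose options for domains or URL patterns. To block everything except resources from your own domain, fall back to a request handler that checks each request's hostname:

    await page.setRequestInterception(true);
    page.on('request', (request) => {
      const { hostname } = new URL(request.url());
      if (hostname === 'example.com') {
        request.continue();
      } else {
        request.abort();
      }
    });

This lets you implement allow or block lists for particular domains. Use it in place of the plugin on that page rather than alongside it, since each request can be resolved only once.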
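An allow list covering several domains and their subdomains can be sketched as a small pure function plus a request handler built on plain request interception. The names isAllowedHost and createAllowListHandler are hypothetical, chosen for illustration; the handler uses only the request.url(), request.abort(), and request.continue() methods.

```javascript
// Hypothetical helper: true when the URL's host is one of the allowed
// domains or a subdomain of one. Unparsable URLs are treated as not allowed.
function isAllowedHost(rawUrl, allowedDomains) {
  let hostname;
  try {
    hostname = new URL(rawUrl).hostname;
  } catch (err) {
    return false;
  }
  return allowedDomains.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
}

// Hypothetical helper: builds a 'request' handler that lets allowed hosts
// through and aborts everything else.
function createAllowListHandler(allowedDomains) {
  return (request) => {
    if (isAllowedHost(request.url(), allowedDomains)) {
      return request.continue();
    }
    return request.abort();
  };
}

// Usage (interception must be enabled first):
//   await page.setRequestInterception(true);
//   page.on('request', createAllowListHandler(['example.com']));
```

Matching on the leading dot keeps look-alike hosts such as notexample.com from slipping through the subdomain check.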

Applying Request Interception to Every Page

If you would rather not depend on a plugin, you can get the same effect with a small helper. Puppeteer exposes request interception only on pages: there is no setRequestInterception() method on the browser or browser context, and the browser does not emit a 'request' event.

Instead, wrap page creation in a function that enables interception before the page is used:

    const browser = await puppeteer.launch();

    // Create a page that already has image blocking enabled
    async function newBlockingPage(browser) {
      const page = await browser.newPage();
      await page.setRequestInterception(true);
      page.on('request', (request) => {
        if (request.resourceType() === 'image') {
          request.abort();
        } else {
          request.continue();
        }
      });
      return page;
    }

    const page = await newBlockingPage(browser);

Now every page created through newBlockingPage() intercepts requests without the setup being repeated at each call site. This can help reduce boilerplate code.

The handler works the same as the page-level request interception we looked at earlier, letting you allow or block requests by URL or resource type.

Using a Chrome Extension to Block Resources

Finally, let's look at one more way to block resources in Puppeteer – by using an existing Chrome extension. There are many ad blocking and privacy extensions that efficiently block resources inside the Chrome browser. These can easily be used with Puppeteer as well.

Here's an example of configuring Puppeteer to use the popular uBlock Origin extension:

    const browser = await puppeteer.launch({
      headless: false,
      args: [
        '--disable-extensions-except=/path/to/ublock-origin',
        '--load-extension=/path/to/ublock-origin',
      ],
    });

This assumes you have the uBlock Origin extension folder saved locally on your machine. Adjust the path as needed.

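The --disable-extensions-except flag disables all other extensions to avoid conflicts and unnecessary overhead. The --load-extension flag tells Chrome which unpacked extension to load. The browser is launched with headless: false because Chrome's original headless mode does not load extensions.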

Now when you create pages and navigate to URLs, uBlock will automatically block ads, trackers, and other unwanted resources according to its extensive filter lists. This can be a quick and easy way to block resources without having to maintain allow/block lists yourself.

However, using extensions can also cause issues if they are not compatible with the version of Chrome that your Puppeteer install uses. Be sure to test extensively and keep extensions and Puppeteer up to date.

Conclusion

As you can see, Puppeteer provides a variety of ways to block unwanted resources and speed up your web scraping scripts. Whether you choose to use the built-in request interception API, a plugin like puppeteer-extra-plugin-block-resources, or an existing Chrome extension, blocking unneeded resources is an important optimization for any Puppeteer project.

The ability to block resources is not only useful for increasing performance. Blocking ads and trackers can make your traffic resemble that of a real user running an ad blocker, although this alone will not reliably prevent detection. And you'll use less bandwidth by not downloading resources you don't need.

I encourage you to try out the different methods covered in this guide and see which one works best for your specific use case. With a little experimentation, you'll be able to find the optimal configuration that maximizes speed while still allowing your scraper to function correctly.

To learn more, check out the official Puppeteer docs as well as the puppeteer-extra docs. You may also be interested in our other guides on avoiding detection while web scraping and scaling up Puppeteer workflows.

If you have any questions or run into issues, don't hesitate to reach out or leave a comment below. Happy scraping!
