
How to Take Screenshots with Puppeteer for Effective Web Scraping

Puppeteer is a Node.js library that provides a powerful API for controlling headless Chrome and Chromium over the DevTools Protocol. One of its most useful features is the ability to programmatically capture screenshots of web pages and elements.

For web scrapers, being able to take screenshots with Puppeteer unlocks a variety of valuable use cases:

  • Visually debugging scraping issues and test failures.
  • Capturing states of dynamic pages and SPAs.
  • Monitoring for visual regressions and UI changes.
  • Creating tutorials and documentation with screenshots for context.
  • Generating image assets from web pages.

In this comprehensive guide, we’ll explore how to leverage Puppeteer screenshots to improve your web scraping workflows.

The Rise of Puppeteer for Web Scraping

Puppeteer was first released in 2017 and has seen rapid adoption by the web scraping community. Here are a few stats that highlight its popularity:

  • Over 52,000 stars on GitHub, making it one of the top JavaScript projects.
  • Over 3 million weekly downloads on npm.
  • 490% year-over-year growth in Google searches for Puppeteer in 2022.

So what sets Puppeteer apart for web scraping?

Headless Browser Control

Puppeteer provides full control over a headless browser via the Chrome DevTools Protocol. This allows replicating user interactions for automation and scraping dynamic content.

Lightweight and Fast

Running headless means Chromium skips drawing a visible browser UI, which reduces resource usage. This keeps performance fast for at-scale scraping.

Active Development

Backed by the Chrome team at Google, Puppeteer gets frequent updates and new features tailored for automation and scraping use cases.

Simpler than Selenium

Puppeteer focuses solely on controlling Chromium, whereas Selenium supports multiple browsers. Its API is cleaner and more idiomatic, making it easier to use.

For these reasons, many web scrapers are switching over to Puppeteer from Selenium/WebDriver for improved speed, reliability and capability.

Now let's dive into how to leverage Puppeteer's powerful screenshot capabilities.

Capturing Full Page Screenshots

The easiest way to take a screenshot of an entire page is using the page.screenshot() method:

const puppeteer = require('puppeteer');

// Launch browser
const browser = await puppeteer.launch();

// Open page
const page = await browser.newPage();
await page.goto('https://example.com');

// Screenshot the visible viewport
await page.screenshot({
  path: 'fullpage.png'
});

This captures the currently visible viewport. To screenshot the full page height, set the fullPage option to true:

await page.screenshot({
  path: 'longpage.png',
  fullPage: true
});

Specifying Image Options

The screenshot() method accepts options to control the type, quality and more:

  • type – png, jpeg or webp. Default is png.
  • quality – For jpeg/webp, ranges from 0-100. Not applicable to png.
  • omitBackground – Hides default white background and allows transparency.
  • encoding – Can output as base64 instead of saving a file.

For example, to save a high quality jpeg:

await page.screenshot({
  path: 'page.jpeg',
  type: 'jpeg',
  quality: 100
});

Tip: webp usually compresses better at equivalent quality, but not all image viewers and downstream tools support it.

Dealing with Large Screenshots

Full page screenshots can easily reach many megabytes in size. Puppeteer returns the image as an in-memory Buffer, or writes it straight to disk when you pass the path option.

If you need the image as a string – for example to embed in a data URI or send over the network – pass encoding: 'base64'. Note that this does not reduce memory use; the base64 string is roughly a third larger than the raw Buffer.

Here's an example saving a base64 screenshot with fs.writeFile():

const fs = require('fs');

const base64 = await page.screenshot({ encoding: 'base64' });

fs.writeFile('screenshot.png', base64, 'base64', err => {
  if (err) console.error('Failed to save screenshot:', err);
});

Scrolling Tall Pages for Full Page Captures

With fullPage: true, Puppeteer captures the full scrollable height automatically. However, pages that lazy-load content only render it once it scrolls into view – for those, we'll want to scroll the page first.

Here's one approach using page.evaluate():

// Scroll to bottom
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});

// Screenshot the full scrollable area
await page.screenshot({ path: 'longpage.png', fullPage: true });

We can also scroll incrementally taking screenshots, then stitch them together into a single tall screenshot. This prevents having to buffer the entire image in memory.

Alternative: Save as PDF

Another option for capturing full page content – generate a PDF!

// Generates PDF and saves to disk 
await page.pdf({
  path: ‘page.pdf‘,
  printBackground: true
});

Pros of PDFs:

  • Handles multi-page content out of the box.
  • Vector format usually results in smaller file sizes.
  • Print formatting stays intact.

Cons:

  • Less flexible for programmatic processing.
  • Limited styling options compared to images.
  • Might not capture dynamically rendered content.
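As a sketch, page.pdf() also accepts layout options – format and margin among them. The option names below come from Puppeteer's API; the specific values are illustrative assumptions:

```javascript
// Illustrative A4 PDF with margins; the values here are assumptions, not defaults
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
  printBackground: true,
  margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' }
});
```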

Setting Viewport Size

By default Puppeteer uses a viewport of 800px x 600px. To get accurate full page screenshots on different desktop and mobile sizes, we can explicitly set the viewport:

// 1200px wide desktop 
await page.setViewport({
  width: 1200,
  height: 800  
});

// 400px wide mobile
await page.setViewport({
  width: 400,
  height: 1200 
});

Then screenshots will match the specified viewport size.
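The viewport also controls pixel density. Here's a hedged sketch using deviceScaleFactor (a real page.setViewport() option) to capture a high-DPI "retina" screenshot:

```javascript
// Render at 2x pixel density for a crisp, high-DPI screenshot
await page.setViewport({ width: 1200, height: 800, deviceScaleFactor: 2 });

// The resulting image is 2400x1600 pixels
await page.screenshot({ path: 'retina.png' });
```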

Capturing Elements

In addition to full page screenshots, we can capture screenshots of specific elements using element.screenshot().

// Get a reference to the element
const menu = await page.$('.main-menu');

// Screenshot just that element
await menu.screenshot({ path: 'menu.png' });

The element will be scrolled into view before capturing the screenshot. This allows capturing shots of elements that might be offscreen without having to scroll to them.

Some use cases for element screenshots:

  • Capturing screenshots of dynamic components like tickers or animations.
  • Debugging layout issues by taking shots of individual elements.
  • Getting image assets of icons and illustrations.
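A related option is the clip parameter of page.screenshot(), which captures an arbitrary rectangle. Combined with element.boundingBox(), a small helper can include some context around the element. Note padClip is a hypothetical helper, not a Puppeteer API:

```javascript
// Hypothetical helper: expand an element's bounding box by `pad` pixels
// so the clipped screenshot includes a little surrounding context.
function padClip(box, pad) {
  return {
    x: Math.max(0, box.x - pad),
    y: Math.max(0, box.y - pad),
    width: box.width + pad * 2,
    height: box.height + pad * 2,
  };
}

// Usage (assumes `page` and a visible `.main-menu` element):
// const box = await (await page.$('.main-menu')).boundingBox();
// await page.screenshot({ path: 'menu-padded.png', clip: padClip(box, 10) });
```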

Offscreen Element Screenshots

A common issue is elements sitting below the fold, or shifting position during interactions, when we try to capture them.

Because element.screenshot() scrolls its target into view automatically, elements that are merely offscreen can be captured reliably without any manual scrolling:

// Element far down the page – no manual scrolling needed
const footer = await page.$('.site-footer');
await footer.screenshot({ path: 'footer.png' });

Note this only works for elements that are rendered but out of view. An element hidden with display: none (or with zero dimensions) cannot be screenshotted – Puppeteer throws an error instead – so capture such elements before triggering the interaction that hides them.

Waiting for Dynamic Content to Load

When working with dynamic pages, we'll want to wait for content to render before taking screenshots, so we capture the desired state.

Here's an example waiting for an element to appear:

// Click button to trigger ajax call
await page.click('.load-content');

// Wait for the new content to load
await page.waitForSelector('.loaded');

// Screenshot after loading
await page.screenshot({ path: 'loaded.png' });

page.waitForSelector() waits until the selector exists in the DOM before proceeding.

Some other useful waits include:

  • page.waitForFunction() – Wait for a function evaluated in the page to return a truthy value.
  • page.waitForNavigation() – Wait until a navigation completes.
  • page.waitForNetworkIdle() – Wait until network activity goes quiet.

The key is picking the right wait condition for the page update you want to capture in a screenshot.

Waiting for Specific DOM Changes

To synchronize with more discrete DOM changes, we can wait for attributes to update instead of blanket selectors:

// Wait for the text content to change
await page.waitForFunction(() => {
  return document.querySelector('.status').textContent === 'Loaded';
});

// Element updated
await page.screenshot({/*...*/});

This approach works well when you need to wait for specific data to load, not just for an element to exist.

Dealing with Single Page Apps (SPAs)

Waiting for DOM changes can be tricky with complex JavaScript SPAs that update state without reloading.

Some tips for handling these:

  • Wait for network idle after interactions to allow XHRs to complete.
  • Wait for specific components like overlays to disappear instead of blanket selectors.
  • Scroll to needed section to force rendering before taking screenshot.
  • Use incremental waits instead of fixed timeouts.

No single approach works perfectly for all SPAs. You'll have to experiment with the app in question.
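One way to implement incremental waits is to poll a value until it stops changing. Here's a sketch of a hypothetical waitForStable() helper (not a Puppeteer API), with an assumed usage polling the page height:

```javascript
// Hypothetical incremental wait: polls a value and resolves once it stops
// changing for `stableChecks` consecutive samples -- useful for SPAs where
// no single selector signals "done rendering".
async function waitForStable(sample, { interval = 100, stableChecks = 3 } = {}) {
  let last = await sample();
  let stable = 0;
  while (stable < stableChecks) {
    await new Promise(r => setTimeout(r, interval));
    const next = await sample();
    stable = next === last ? stable + 1 : 0;
    last = next;
  }
  return last;
}

// Assumed usage with Puppeteer: wait for the page height to settle, then shoot.
// await waitForStable(() => page.evaluate(() => document.body.scrollHeight));
// await page.screenshot({ path: 'spa.png' });
```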

Scrolling Pages Before Taking Full Page Screenshots

For pages that lazy-load content as you scroll, we'll need to programmatically scroll before taking a full screenshot with fullPage: true.

Here's a reliable approach:

await page.evaluate(() => {
  // Scroll to bottom
  window.scrollTo(0, document.body.scrollHeight);
}); 

// Capture full scrolled screenshot  
await page.screenshot({fullPage: true});

This scrolls the page down to the maximum scroll position before taking the screenshot.

An alternative is using window.scrollBy() to incrementally scroll a certain amount at a time. This allows taking continuous screenshots while scrolling down the full page length.

Handling Long Scrollable Pages

For extremely long pages, scrolling the entire length in one go might still exceed memory or time limits.

A good solution is to break it up into sections, scroll a bit at a time, take screenshots, and stitch them together:

const screenshots = [];
let prevScroll = -1;

// Scroll one viewport at a time, capturing each section
while (true) {
  const scrollY = await page.evaluate(() => {
    window.scrollBy(0, window.innerHeight);
    return window.scrollY;
  });
  if (scrollY === prevScroll) break; // reached the bottom
  prevScroll = scrollY;
  screenshots.push(await page.screenshot());
}

// Stitch the screenshots together into one tall image

This prevents having to buffer the full page height in memory.

Scrolling Horizontally Too

For pages with horizontal scrolling, we can adjust the scroll sequence to also scroll horizontally:

await page.evaluate(() => {
  window.scrollTo(
    document.body.scrollWidth, 
    document.body.scrollHeight
  );
});

await page.screenshot({fullPage: true});

Note that fullPage: true extends the capture vertically but keeps the current viewport width, so horizontally overflowing content can still be cut off. To capture extra width, widen the viewport with page.setViewport() to match the page's scrollWidth.

Best Practices for Reliable Screenshots

Here are a few key tips for taking consistent, reliable screenshots with Puppeteer:

Wait for network idle – Use page.waitForNetworkIdle() after interactions to ensure all async requests complete before capturing state.

Use appropriate waits – Choose conditional waits that synchronize with the desired page state rather than blanket timeouts.

Set viewport size – Explicitly set the viewport to capture accurate device screenshots.

Shield from animations/popups – The mouse position can trigger hovers, tooltips and popups; where possible, interact via page.evaluate() to avoid these side effects.

Allow time for rendering – Wait a few hundred milliseconds after scrolling for pages to finish rendering before screenshots.

Stabilize flaky tests – Set a retry loop with waits around screenshot steps to handle flakes.

Compare against known good – Leverage visual regression testing tools to catch unintended changes.
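The retry idea above can be sketched as a small generic wrapper. Note withRetries is hypothetical, not part of Puppeteer:

```javascript
// Hypothetical retry wrapper for flaky screenshot steps: retries `fn` up to
// `attempts` times with a delay between tries, rethrowing the last error.
async function withRetries(fn, attempts = 3, delayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise(r => setTimeout(r, delayMs));
    }
  }
  throw lastError;
}

// Assumed usage:
// await withRetries(() => page.screenshot({ path: 'shot.png' }));
```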

Conclusion

I hope this guide provided a comprehensive overview of taking full page and element screenshots with Puppeteer for your web scraping needs.

Some key topics we covered:

  • Using page.screenshot() and element.screenshot() to capture screenshots
  • Options for controlling image type, quality, format
  • Scrolling pages and waiting for dynamic content
  • Setting viewport size for responsive pages
  • Best practices for reliable screenshot workflows

Automated screenshots are invaluable for debugging scrapers, visual testing, and capturing dynamic states. Add them to your web scraping toolkit with Puppeteer!
