How to Download Files Using Puppeteer: The Ultimate Guide

If you're looking to automate downloading files from websites, Puppeteer is one of the most powerful and flexible tools at your disposal. Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers. With Puppeteer, you can generate PDFs of pages, crawl a SPA and generate pre-rendered content, automate form submission and UI testing, and much more.

One common task that Puppeteer excels at is downloading files from the web. However, automating file downloads comes with its own set of challenges:

  • By default, files are downloaded to the operating system's downloads folder, which can make it difficult to process the files further in your script
  • Downloading multiple files simultaneously or handling large file downloads can put a strain on memory and performance
  • You need different approaches depending on whether the download is initiated by clicking a button or by fetching a direct file URL

But don't worry! In this guide, we'll walk through several detailed code examples to demonstrate how to robustly download files using Puppeteer and discuss best practices along the way. By the end, you'll be equipped to handle even the trickiest file download scenarios. Let's get started!

Downloading a File by Clicking a Button

First, let's take a look at the simple case of downloading a single file by clicking a download link or button on the page. Here's the HTML for a sample page with a download button:

<a href="/path/to/file.zip" class="download-btn">
  Download File
</a>

To automate clicking this button and saving the file with Puppeteer, we can use the following code:

const puppeteer = require('puppeteer');
const path = require('path');

const downloadPath = path.resolve('downloads');

(async function() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Open a CDP session instead of using the private page._client,
  // which was removed in newer Puppeteer versions
  const client = await page.target().createCDPSession();
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: downloadPath
  });

  await page.goto('https://example.com/download');

  await page.click('.download-btn');

  // Wait for the download to complete
  // (a fixed delay is simple, but imprecise)
  await new Promise(resolve => setTimeout(resolve, 5000));

  await browser.close();
})();

The key steps are:

  1. Launch a browser instance
  2. Create a new page
  3. Enable downloads in headless mode and specify the download directory
  4. Navigate to the page with the download button
  5. Click the button to initiate the download
  6. Wait for the download to finish before closing the browser

By default, Puppeteer downloads files to the operating system's default downloads directory. To change this, we open a Chrome DevTools Protocol (CDP) session and send Page.setDownloadBehavior to allow downloads in headless mode and specify a custom download path.

We wait a few seconds for the download to complete using a fixed timeout. This works, but it is imprecise: Puppeteer has no built-in download-finished event, so for more reliable control you can poll the download directory until the file appears (Chromium writes in-progress downloads with a .crdownload extension).

Downloading Multiple Files via URLs

Sometimes, rather than a single download button, a page simply lists many file links in its HTML source. For example:

<ul class="download-links">
  <li>
    <a href="/files/doc1.pdf">Document 1</a>
  </li>
  <li>
    <a href="/files/doc2.pdf">Document 2</a>
  </li>
  <li>
    <a href="/files/doc3.pdf">Document 3</a>
  </li>
</ul>

To download all the linked files, we can extract the URLs using Puppeteer's page methods and download each file using the native https module:

const fs = require('fs');
const https = require('https');
const path = require('path');
const puppeteer = require('puppeteer');

(async function() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/downloads');

  // Extract file URLs from the page
  const urls = await page.$$eval('.download-links a', links => {
    return links.map(link => link.href);
  });

  // Download each file, waiting for the stream to finish
  for (const url of urls) {
    const fileName = path.basename(url);
    await new Promise((resolve, reject) => {
      const file = fs.createWriteStream(fileName);
      https.get(url, response => {
        response.pipe(file);
        file.on('finish', () => {
          file.close();
          console.log(`Downloaded ${fileName}`);
          resolve();
        });
      }).on('error', reject);
    });
  }

  await browser.close();
})();

We use page.$$eval to run document.querySelectorAll within the page context to extract all the URLs of files we want to download. Then we download each file using https.get, piping the response to a writable file stream.

This approach has a few advantages over clicking download links:

  • The files start downloading immediately without needing to navigate to each download link separately
  • It's easy to customize the destination filename
  • You can parallelize the downloads for better performance (coming up next!)

Parallel File Downloads Using Child Processes

If you need to download many files, doing so sequentially can be prohibitively slow. Node.js runs your JavaScript on a single thread; its asynchronous I/O can overlap network transfers, but any CPU-bound work you do on each file (unzipping, parsing, hashing) still competes for that one thread.

We can work around this by spawning child processes to download files in parallel. Each child process runs its own instance of the V8 JavaScript engine, so multiple downloads, along with any per-file processing, execute concurrently.

Here‘s an example of how to implement parallel file downloads using Node.js child processes and Puppeteer:

// main.js

const { fork } = require('child_process');
const puppeteer = require('puppeteer');

(async function() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/downloads');

  // Extract file URLs
  const urls = await page.$$eval('.download-links a', links => {
    return links.map(link => link.href);
  });

  // Spawn a child process for each file download
  const numProcesses = urls.length;
  const children = [];

  for (let i = 0; i < numProcesses; i++) {
    const child = fork('download-worker.js');
    children.push(child);

    child.send(urls[i]);

    // Count completions on 'exit' rather than 'message', since the
    // IPC channel may still be connected when the message arrives
    child.on('exit', () => {
      console.log(`Child process ${child.pid} finished`);

      const childrenDone = children.filter(c => !c.connected).length;
      if (childrenDone === numProcesses) {
        console.log('All downloads complete');
      }
    });
  }

  await browser.close();
})();

// download-worker.js
const fs = require('fs');
const https = require('https');
const path = require('path');

process.on('message', url => {
  const fileName = path.basename(url);
  const file = fs.createWriteStream(fileName);

  https.get(url, response => {
    response.pipe(file);

    file.on('finish', () => {
      file.close();
      // Exit only after the IPC message has actually been sent
      process.send('Done', () => process.exit(0));
    });
  });
});

The main process spawns a child process for each file download. The file URLs are passed to the child processes via IPC messages. Each child process downloads its assigned file and notifies the main process when it's done via IPC. The main process keeps track of how many child processes have finished and logs a message when all downloads are complete.

Using this pattern, you can parallelize downloads and drastically speed up the overall runtime compared to downloading files sequentially. Just be careful not to spawn too many child processes simultaneously to avoid overwhelming system resources.
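The pattern above spawns one child per URL, which can itself exhaust system resources for large lists. One simple way to cap concurrency is to process the work in fixed-size batches; `runLimited` below is a hypothetical helper of our own design, not a Node.js API:

```javascript
// Run an async task for each item, at most `limit` at a time,
// by processing the list in fixed-size batches.
async function runLimited(items, limit, task) {
  const results = [];
  for (let i = 0; i < items.length; i += limit) {
    const batch = items.slice(i, i + limit);
    // Wait for the whole batch to finish before starting the next one
    results.push(...await Promise.all(batch.map(task)));
  }
  return results;
}
```

For example, `runLimited(urls, 4, url => forkDownload(url))` would keep at most four workers alive at once, where `forkDownload` is whatever function wraps the fork-and-wait logic for a single URL.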

Conclusion

Puppeteer is a versatile and powerful tool for automating all kinds of web-related tasks, including downloading files. Whether you need to download a single file, multiple files, or optimize download performance, Puppeteer has you covered.

The examples covered in this guide should give you a solid foundation for handling a variety of file download scenarios with Puppeteer. Here are a few key takeaways:

  • Use Page.setDownloadBehavior to enable downloads in headless mode and specify a custom download directory
  • Decide if simulating a button click or extracting file URLs from the page HTML is the best approach
  • Take advantage of child processes to parallelize file downloads for better performance
  • Always close the browser instance at the end of your script

Armed with this knowledge, you're well on your way to mastering automated file downloads using Puppeteer and Node.js. Happy downloading!
