How to Download Files with Puppeteer in Node.js

Puppeteer is a popular Node.js library that provides high-level APIs to control headless Chrome. It can be used for web scraping, browser testing, screenshots, PDF generation, and more. A common task is to download files from web pages using Puppeteer. In this comprehensive guide, we'll explore the various approaches to downloading files with Puppeteer in Node.js.

Overview of Approaches

There are a few main ways we can download files using Puppeteer:

Use page.click() to click on a download button or link. This triggers the browser's native download handling.
Get the file URL and use page.evaluate() to fetch the file contents directly.
Set the browser download behavior and path to automatically save files.

Let's look at each approach in more detail.

Click Download Button

The simplest way is to locate the download button or link on the page, and click on it using page.click(). For example:

// Navigate to page
await page.goto('https://example.com');

// Click on download button
await page.click('#download-btn');

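This causes the browser to start a download. In a visible (non-headless) browser, the native download UI takes over; in headless mode there is no dialog to interact with, so whether and where the file is saved depends on the browser's download settings.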

The advantages of this approach are:

Very simple to implement
Works for any type of downloadable file
Handles authentication and cookies properly

The disadvantages are:

Requires manual intervention to save the file from the dialog
Harder to save the file programmatically

Overall, this is a good option if you just need to manually download a file occasionally.

Fetch File Contents Directly

Instead of clicking a link, we can get the file URL and fetch the contents directly using page.evaluate().

For example:

const fs = require('fs');

// Get the file URL from the link's href
const url = await page.$eval('#downloadLink', el => el.href);

// Fetch the file contents as text inside the page context
const file = await page.evaluate(url => {
  return fetch(url).then(res => res.text());
}, url);

// Save the file from Node.js
fs.writeFileSync('file.csv', file);

This approach has some advantages:

Handles cookies and auth automatically
Saves file entirely in Node.js without dialogs
Can download multiple files in parallel

The disadvantages are:

More complex code
As written, only works for textual content; binary files must be encoded (for example as base64) before page.evaluate() can return them

Overall, this is a robust approach for programmatic downloading of files.
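The text-only limitation comes from res.text() and from page.evaluate() only returning serializable values. One way around it, sketched here under assumptions, is to base64-encode the bytes in the page and decode them in Node.js. The helper names fetchAsBase64 and downloadBinary are hypothetical, and `page` is assumed to be an already-open Puppeteer page on the same site as the file.

```javascript
const fs = require('fs');

// Runs inside the page: fetch the URL and return its bytes as a base64 string,
// because page.evaluate() can only hand serializable values back to Node.js.
async function fetchAsBase64(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const bytes = new Uint8Array(await res.arrayBuffer());
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Runs in Node.js: `page` is assumed to be an open Puppeteer page.
async function downloadBinary(page, url, destPath) {
  const base64 = await page.evaluate(fetchAsBase64, url);
  fs.writeFileSync(destPath, Buffer.from(base64, 'base64'));
  return destPath;
}
```

Base64 inflates the payload by roughly a third and the whole file passes through memory, so this sketch is better suited to small and medium files than to very large ones.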

Automatically Save Downloads

We can also configure the browser to automatically save downloads to a specific path.

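const path = require('path');

// Open a Chrome DevTools Protocol (CDP) session for this page
// (older Puppeteer versions: page.target().createCDPSession())
const client = await page.createCDPSession();

// Allow downloads and save them to an absolute path
await client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: path.resolve('./downloads')
});

// Click download button
await page.click('#pdf-download');

// The file is written to ./downloads once the download completes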

The benefits of this approach are:

Very simple to implement
Perfect for downloading many files
Downloads binaries and other formats

The disadvantages are:

Relies on a Chrome DevTools Protocol command, so the browser must be Chromium-based and support the downloadPath option
Harder to handle dynamically-generated filenames

This works great if you just need a simple way to save files on an automated workflow.
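Because page.click() returns before the download has finished, a script usually needs some way to wait for the file. A minimal sketch, assuming Chrome's convention of giving in-progress downloads a .crdownload suffix: poll the download directory until a new, completed file appears. The helper names listFiles and waitForNewDownload are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// List the entries currently in the download directory (empty if it does not exist yet).
function listFiles(dir) {
  return fs.existsSync(dir) ? fs.readdirSync(dir) : [];
}

// Poll `dir` until a file appears that is not in `known` and is not a
// partial Chrome download (those carry a .crdownload suffix).
async function waitForNewDownload(dir, known, { timeoutMs = 30000, pollMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const fresh = listFiles(dir).filter(
      name => !known.has(name) && !name.endsWith('.crdownload')
    );
    if (fresh.length > 0) return path.join(dir, fresh[0]);
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  throw new Error(`No download appeared in ${dir} within ${timeoutMs} ms`);
}
```

To use it, take the snapshot with `new Set(listFiles(dir))` before calling page.click(), then await `waitForNewDownload(dir, snapshot)` afterwards. The returned path also reveals the filename the browser chose.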

Handling Dynamic Filenames

One challenge with automated downloading is handling dynamic filenames that change every time.

We can listen for the page's response event and derive the filename from each response URL:

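const fs = require('fs');

page.on('response', async response => {
  // Use the last path segment of the URL as the filename
  const filename = new URL(response.url()).pathname.split('/').pop();
  if (!filename.endsWith('.pdf')) return;
  try {
    // Read the response body and write it to disk
    const body = await response.buffer();
    fs.writeFileSync(filename, body);
  } catch (err) {
    console.error(`Could not save ${filename}:`, err.message);
  }
});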

This lets us save each file under the name that appears in its URL, even when that name changes between runs.
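Servers often put the real filename in the Content-Disposition header rather than in the URL. The sketch below, with a hypothetical helper named filenameFromResponse, prefers that header and falls back to the last URL segment. It assumes the headers object has lower-case keys, as returned by Puppeteer's response.headers(), and handles only the common filename= and filename*=UTF-8'' forms.

```javascript
const path = require('path');

// Pick a filename for a response: Content-Disposition first, URL path second.
// `headers` is a plain object with lower-case header names.
function filenameFromResponse(url, headers) {
  const disposition = headers['content-disposition'] || '';

  // RFC 5987 style: filename*=UTF-8''percent-encoded-name
  const star = /filename\*\s*=\s*UTF-8''([^;]+)/i.exec(disposition);
  if (star) return path.basename(decodeURIComponent(star[1].trim()));

  // Plain style: filename="name" or filename=name
  const plain = /filename\s*=\s*"?([^";]+)"?/i.exec(disposition);
  if (plain) return path.basename(plain[1].trim());

  // Fallback: last segment of the URL path, ignoring the query string
  const last = new URL(url).pathname.split('/').pop();
  return last ? decodeURIComponent(last) : 'download';
}
```

Inside a response handler this would be called as `filenameFromResponse(response.url(), response.headers())`. Passing the result through path.basename() keeps a server-supplied name from writing outside the target directory.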

Downloading Multiple Files

To download multiple files in parallel, we can call page.evaluate() concurrently:

// Array of URLs
const urls = [
  'https://example.com/file1.csv',
  'https://example.com/file2.csv'
];

// Runs inside the page: fetch one file and return its text
const fetchFile = url => fetch(url).then(res => res.text());

// Start all downloads in parallel
const promises = urls.map(url => page.evaluate(fetchFile, url));

// Wait for every download to finish
const files = await Promise.all(promises);

Because the requests run concurrently, the total time is closer to that of the slowest download than to the sum of all of them.
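With a long list of URLs, starting every request at once can strain the server and the page's memory. A common refinement, sketched here with a hypothetical helper named mapWithLimit, is to cap how many downloads run at the same time while keeping results in input order.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

In a Puppeteer script the worker would be something like `url => page.evaluate(fetchFile, url)`, for example with a limit of 4. If any worker rejects, the returned promise rejects as well, so pair it with error handling where partial results matter.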

Handling Authentication

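One important consideration with downloading files is proper authentication. Because Puppeteer drives a real browser session, requests made from the page (clicks, navigations, and fetch() calls to the same site) automatically carry that session's cookies.

This means we don't need to manage authentication explicitly in most cases: once the page is logged in, the browser sends the correct cookies when downloading restricted files. The exception is a site that authenticates with a token its own scripts attach to each request (for example an Authorization header), which a plain fetch() from page.evaluate() will not include.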
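When the download has to happen in Node.js rather than in the page, the session cookies can be copied across. This is a minimal sketch, assuming Node.js 18+ for the global fetch and a Puppeteer version in which page.cookies() is available. The helper names cookieHeader and downloadWithSession are hypothetical.

```javascript
const fs = require('fs');

// Build a Cookie header value from cookie objects that have `name` and `value`
// fields, which is the shape Puppeteer's page.cookies() returns.
function cookieHeader(cookies) {
  return cookies.map(c => `${c.name}=${c.value}`).join('; ');
}

// Download `url` from Node.js while reusing the cookies of the browser session.
// `page` is assumed to be an open, logged-in Puppeteer page.
async function downloadWithSession(page, url, destPath) {
  const cookies = await page.cookies(url);
  const res = await fetch(url, { headers: { cookie: cookieHeader(cookies) } });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  fs.writeFileSync(destPath, Buffer.from(await res.arrayBuffer()));
  return destPath;
}
```

This only covers cookie-based sessions. Sites that rely on other credentials would need those forwarded in a similar way.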

Conclusion

Downloading files with Puppeteer provides some very useful capabilities for automation and testing. Some key takeaways:

Use page.click() for simple manual downloads
Fetch the file directly when programmatic access is required
Configure downloadPath for seamless saving of files
Handle dynamic filenames by listening for the response
Download multiple files in parallel for performance

With these approaches, you can implement file downloading robustly in your Puppeteer scripts. Add error handling and retries around each download for reliability.
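As one possible shape for the retry logic, the sketch below wraps any async download step in a retry loop with exponential backoff. The helper name withRetry and its option names are hypothetical, and the default values are arbitrary starting points rather than recommendations.

```javascript
// Call `task` until it succeeds or `retries` extra attempts have been used.
// The wait doubles after each failure: baseDelayMs, 2x, 4x, ...
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A download step would then be wrapped as `await withRetry(() => page.evaluate(fetchFile, url))`, so transient network failures are retried while persistent ones still reach the caller.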
