
How to scrape the web with Puppeteer in 2024

Puppeteer is a popular JavaScript library for controlling headless Chrome or Chromium browsers. It provides a high-level API for automating tasks like browsing, clicking elements, filling forms, and extracting data from web pages. In this comprehensive guide, you'll learn how to build a web scraper from scratch with Puppeteer in 2024.

What is Puppeteer?

Puppeteer is developed and maintained by the Chrome DevTools team. It's built on top of the DevTools Protocol, which is the same interface that powers Chrome DevTools.

Some key things to know about Puppeteer:

  • It can control headless Chrome/Chromium or run a full version of Chrome/Chromium.
  • It runs on Node.js and works cross-platform on Windows, Mac, and Linux.
  • The API is promise-based and easy to use with async/await syntax.
  • It provides high-level APIs for automation like page.click() and page.type().
  • It also supports lower-level DOM manipulation and evaluation of JavaScript code in the browser context.

In summary, Puppeteer enables you to automate everything that a human can do manually in the browser. This makes it a versatile tool for web scraping, test automation, or anything that requires browser control.

Puppeteer vs Playwright vs Selenium

There are a few popular browser automation frameworks available:

  • Puppeteer – Focused on Chrome/Chromium only. Lightweight and easy to use.
  • Playwright – Supports Chrome, Firefox and WebKit. API is similar to Puppeteer.
  • Selenium – Supports multiple browsers but requires browser-specific drivers to be installed. More heavyweight.

For most web scraping use cases, Puppeteer and Playwright have very similar capabilities. Puppeteer is somewhat lighter, since it targets only the Chromium family. Playwright supports more browsers, but for web scraping you usually only need Chrome or Chromium.

Overall, Puppeteer is a great choice for building scrapers thanks to its simple API and Chromium-only focus.

Setting up a Puppeteer project

Let's set up a new Node.js project from scratch with Puppeteer.

First, create a new directory for your project:

mkdir puppeteer-scraper
cd puppeteer-scraper

Next, initialize an npm project:

npm init -y

This will create a package.json file with default values.

Then install Puppeteer:

npm install puppeteer

Now create an index.js file where we'll write our scraper code:

touch index.js

That's it for setup! We're ready to start coding.

Launching a browser with Puppeteer

The first thing to do is launch a Puppeteer browser instance. Puppeteer downloads a compatible Chromium build when you install the package, and puppeteer.launch() starts that browser and gives you access to browser pages.

Here's how to launch a headless browser:

// index.js

const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();

})();

By default Puppeteer runs Chromium in headless mode. This means the browser UI will not be visible. For debugging, you can run Puppeteer in headful mode:

const browser = await puppeteer.launch({headless: false});

Now we can open pages in the browser:

const page = await browser.newPage();

Let's navigate to a sample page:

await page.goto('https://example.com');

And close the browser when done:

await browser.close();

Put together, our script looks like:

const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  await browser.close();

})();

This demonstrates the basic browser launch workflow with Puppeteer.
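
In practice, it also helps to make sure the browser always closes, even when something in the scraping logic throws. Here is a minimal sketch of the same workflow with a try/finally guard:

const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();

  try {
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // ... scraping logic goes here ...
  } finally {
    // Always release the browser, even if the code above throws
    await browser.close();
  }

})();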

Clicking elements on a page

Automating clicks is a common need for scrapers. Puppeteer makes this easy with page.click().

For example, to click a link that points to /next-page:

await page.click('a[href="/next-page"]');

This finds the anchor tag with a matching href attribute and clicks it.

You can also click elements by other selectors like ID or class name:

// Click element by ID
await page.click('#submit-button');

// Click element by class name
await page.click('.continue-link');

To click at an arbitrary position on the page, such as a spot inside an image, use the mouse API with x/y coordinates:

await page.mouse.click(100, 150);

In summary, page.click() finds the element matching a selector, scrolls it into view, and triggers a mouse click on it.
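
Clicks that trigger a page navigation are best paired with a navigation wait so the scraper doesn't race ahead of the browser. A small sketch, reusing the illustrative link selector from above:

// Click a link and wait for the navigation it triggers to finish
await Promise.all([
  page.waitForNavigation(),
  page.click('a[href="/next-page"]'),
]);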

Filling and submitting forms

Scrapers often need to fill forms before extracting data from a website. Puppeteer provides a few ways to automate form interaction.

To fill a text input or textarea, use page.type():

await page.type('#first-name', 'John');

await page.type('#bio', 'This is my bio text...');

For checkboxes and radio buttons, a regular click is enough. Dropdowns built with a <select> element are handled by page.select():

// Check a checkbox or radio button by ID
await page.click('#newsletter-signup');

// Select an option from a dropdown by its value
await page.select('#gender', 'male');

To submit a form after filling it:

await page.click('#submit-button');

In summary:

  • page.type() works for text inputs and textareas
  • page.select() works for <select> dropdowns
  • page.click() toggles checkboxes and radio buttons, and submits the form

This covers the basics of automating form entry with Puppeteer.
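
Putting those pieces together, here is a short end-to-end sketch of filling and submitting a form. The selectors and values are illustrative, not taken from a real site:

// Fill text fields
await page.type('#first-name', 'John');
await page.type('#bio', 'This is my bio text...');

// Toggle a checkbox and pick a dropdown option
await page.click('#newsletter-signup');
await page.select('#gender', 'male');

// Submit and wait for the resulting navigation
await Promise.all([
  page.waitForNavigation(),
  page.click('#submit-button'),
]);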

Waiting for elements to load

Dynamic websites load content asynchronously. Before clicking or reading an element, you need to wait for it to load on the page.

Here are some ways to wait in Puppeteer:

// Wait for the selector to appear in the page
await page.waitForSelector('div.loaded');

// Wait for a fixed 1500 ms (page.waitFor was removed in newer Puppeteer versions)
await new Promise(resolve => setTimeout(resolve, 1500));

// Wait for a function to return true
await page.waitForFunction(() => {
  return document.querySelector('div.loaded') !== null;
});

page.waitForSelector() is the most commonly used, but page.waitForFunction() allows writing more complex conditions.

Always use waits before trying to interact with elements to avoid race conditions.
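
page.waitForSelector() also accepts options, which helps when an element exists in the DOM but isn't visible yet, or when you want to fail fast. The values below are illustrative:

// Wait until the element is actually visible, and give up after 10 seconds
await page.waitForSelector('div.loaded', { visible: true, timeout: 10000 });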

Extracting text from elements

A key part of any scraper is extracting text or data from a page. Puppeteer provides a few options to extract text from elements.

Given this HTML:

<div class="title">
  Example Domain
</div>

<p class="description">
  This domain is for use in illustrative examples in documents. 
</p>

We can extract text using:

// Get textContent of element by selector
const title = await page.$eval('.title', el => el.textContent);

// Get innerText of element by selector 
const description = await page.$eval('.description', el => el.innerText);

Both textContent and innerText work: textContent returns the raw text of all text nodes, while innerText returns the rendered text, so they can differ in whitespace, newlines, and hidden elements.
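
To grab text from every element matching a selector, rather than just the first, page.$$eval() maps over all matches. The .item selector here is just an example:

// Get the trimmed text of every matching element as an array
const items = await page.$$eval('.item', els =>
  els.map(el => el.textContent.trim())
);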

Scraping data from multiple pages

Very often, scrapers need to collect data from multiple pages, not just a single URL. Here's one way to scrape multiple pages with Puppeteer:

// Navigate to the first page
await page.goto(URL);

// Extract data from page 1

// Find the link to the next page (useful if you prefer to page.goto() it directly)
const nextPage = await page.$eval('a.next', a => a.href);

// Click the next-page link and wait for the navigation it triggers
await Promise.all([
  page.waitForNavigation(),
  page.click('a.next'),
]);

// Extract data from page 2

// Find link to page 3...

This allows you to click through paginated content and aggregate data from all pages.

To scale this to large numbers of pages, you would want to:

  • Store links to be crawled in a queue
  • Launch multiple browser instances
  • Add delays and timeouts
  • Rotate IPs or use proxies

But the general approach remains the same – navigate, extract, queue next pages.
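
As a concrete starting point, here is a minimal single-browser pagination loop. The startUrl variable and the .item and a.next selectors are placeholders for your target site:

const results = [];
await page.goto(startUrl);

while (true) {
  // Extract data from the current page
  const titles = await page.$$eval('.item', els =>
    els.map(el => el.textContent.trim())
  );
  results.push(...titles);

  // Stop when there is no "next" link
  const next = await page.$('a.next');
  if (!next) break;

  // Click the next link and wait for the new page to load
  await Promise.all([page.waitForNavigation(), next.click()]);
}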

Handling browser dialogs

Sometimes pages trigger alert, confirm or prompt dialogs that need to be handled:

// Override page dialogs
page.on('dialog', dialog => {

  // Accept prompts and confirms
  if (dialog.type() === 'prompt' || dialog.type() === 'confirm') {
    dialog.accept();
  }

  // Dismiss alerts
  if (dialog.type() === 'alert') {
    dialog.dismiss();
  }

});

Now any dialogs that appear during scraping will be automatically handled.

Executing custom JavaScript in the browser

For advanced cases, you may need to inject and execute custom JavaScript directly in the browser context.

Puppeteer provides a few ways to run JS in the page:

// Execute JS expression in page
await page.evaluate(() => {
  // Can access DOM here
  const title = document.querySelector('title').textContent;
  return title;
});

// Pass arguments from Node.js to browser JS  
const title = await page.evaluate(selector => {
  return document.querySelector(selector).textContent;
}, 'title');

// Execute function defined in Node.js
const getTitle = () => {
  return document.querySelector('title').textContent;
};

const title = await page.evaluate(getTitle);

page.evaluate() allows powerful flexibility to compute data directly in the browser.

Scraping best practices

Here are some best practices to keep in mind when building scrapers:

  • Use headless mode and disable images/media for efficiency (see the sketch after this list)
  • Implement random delays between actions to mimic human behavior
  • Limit scrape rate to avoid overwhelming target sites
  • Rotate proxies/IPs to avoid blocks
  • Use OCR if scraping complex data like charts or graphs
  • Store extracted data incrementally to avoid data loss
  • Use middleware to handle cookies, caching, retries, etc

And most importantly, respect robots.txt rules and do not overload sites you scrape!
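
To illustrate the first two points, here is a minimal sketch of blocking heavy resources via request interception and adding a randomized delay between actions. The blocked resource types and delay range are arbitrary choices to tune for your target:

// Block images, media and fonts to speed up page loads
await page.setRequestInterception(true);
page.on('request', request => {
  const blocked = ['image', 'media', 'font'];
  if (blocked.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});

// Sleep for a random 1-3 seconds between actions to mimic human pacing
const delay = 1000 + Math.random() * 2000;
await new Promise(resolve => setTimeout(resolve, delay));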

Handling dynamic content

Modern sites rely heavily on JavaScript to render content. Puppeteer runs a real browser, so that JavaScript does execute, but by default page.goto() resolves on the page's load event, which is often before asynchronous requests have finished populating the page. Scrape too early and you only see the initial HTML returned from the server.

To allow the page to load dynamically:

// Wait until the network is idle - no requests in flight for at least 500 ms
await page.goto(URL, {waitUntil: 'networkidle0'});

// Optional extra wait after the last network request
await new Promise(resolve => setTimeout(resolve, 500));

The networkidle0 option waits for asynchronous actions like XHR/fetch requests to finish after the initial HTML has loaded.

For highly complex apps like React SPAs, you may need to wait until a certain component renders:

await page.waitForFunction(() => {
  return document.querySelector('#main-content') !== null;
});

The key is identifying an element unique to the fully loaded state and waiting for it before scraping.

Browser automation with Crawlee

Crawlee is a web crawling framework that integrates with Puppeteer to make browser automation easier.

It provides a high-level API and handles cross-page state management:

// Import Crawlee's Puppeteer crawler and default dataset (CommonJS shown, as above)
const { PuppeteerCrawler, Dataset } = require('crawlee');

// Create a crawler with an async handler that runs for every page
const crawler = new PuppeteerCrawler({

  async requestHandler({ page, request }) {

    // Full access to Puppeteer APIs
    await page.waitForSelector('#result');

    // Extract data into JSON
    const data = {url: request.url, title: await page.title()};

    // Save extracted data to the default dataset
    await Dataset.pushData(data);

  },

});

// Queue the list of URLs to crawl
await crawler.addRequests(urls);

// Run the crawler across multiple pages
await crawler.run();

Crawlee removes boilerplate like queue management, state handling and error retries. It makes it easier to focus on writing page logic.

For complex scraping projects, Crawlee is worth exploring to simplify browser automation at scale.

Deploying a Puppeteer scraper

While great for development, running a scraper on your own machine is not ideal for production. The scraper will go offline when your computer shuts down.

For reliability, it's best to deploy the scraper to a server or a scraping platform designed for automation. Here are some good options:

  • Scrapinghub – Popular platform focused on Scrapy, but supports deploying Node.js scrapers
  • Puppeteer Sandbox – Hosted service specialized for Puppeteer/Playwright scrapers
  • Apify – Robust web scraping platform with browser automation support
  • Scaleless – Scraping-oriented cloud platform with integrated proxies and results storage

These platforms provide hosted infrastructure to run scrapers 24/7 without maintenance overhead. They typically include helpful tools like visual debugging, built-in proxies, and storage for scraped data.

Deploying to a platform removes headaches around reliability and scaling, allowing you to focus on building the best scraper.

Conclusion

This guide covered core Puppeteer concepts like launching a browser, automating interactions, handling dynamic content, extracting data, and deploying scrapers.

With these fundamentals, you can use Puppeteer to build everything from simple one-off scrapers to large scale web crawlers.

The key is starting with well-structured, maintainable code. Use async/await for readability. Break code into reusable modules. Leverage tools like Crawlee for cross-page orchestration. Stay organized and plan for scale from the beginning.

Web scraping is an iterative process. As the web evolves, browsers and scrapers have to evolve along with it. But the fundamental techniques of extracting, parsing and structuring data remain consistent.

Hopefully this provides a solid foundation for using Puppeteer to scrape and understand the modern web. Scraping can be challenging, but also extremely rewarding. By automating access to the wealth of knowledge online, scrapers enable data analysis at unprecedented scale.

Happy scraping!
