How to scrape the web with Playwright in 2024

Playwright is a powerful browser automation library that makes web scraping exceptionally easy in 2024. Developed by Microsoft and open-sourced in 2019, Playwright provides a first-class API to control Chromium, Firefox and WebKit.

If you've used Puppeteer for web scraping before, then you'll feel right at home with Playwright. The APIs are very similar, but Playwright builds on Puppeteer and introduces many improvements.

In this comprehensive tutorial, you'll learn how to:

  • Set up a Playwright project
  • Launch browsers and open pages
  • Interact with elements on a page
  • Extract data from websites
  • Handle pagination and crawl between pages
  • Manage errors and retries
  • Store scraped data
  • Deploy Playwright scrapers to the cloud

We'll be building a scraper for GitHub repository data, but the concepts apply to any web scraping project.

Getting Started with Playwright

Let's start by setting up a new project directory and installing Playwright:

npm init -y
npm install playwright

To confirm that Playwright is installed correctly, create an index.js file and add the following code:

// index.js

const { chromium } = require('playwright');

(async () => {

  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  await browser.close();

})();

This launches Chromium, opens a new tab, navigates to example.com, and then closes the browser.

Run the code with:

node index.js

You should see Chromium flash open and navigate to the website!

Launching Browsers

The first thing any Playwright script needs to do is launch a browser instance using chromium.launch(), firefox.launch() or webkit.launch().
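For example, to drive Firefox instead of Chromium (WebKit works the same way):

const { firefox } = require('playwright');

const browser = await firefox.launch();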

You can configure options such as headless mode and a slow-motion delay in the launch options:

const browser = await chromium.launch({
  headless: false,
  slowMo: 50 // Slow down Playwright by 50ms per action
});
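Settings such as user agent and geolocation live on a browser context rather than in the launch options. A minimal sketch with illustrative values (pages opened from this context inherit these settings):

const context = await browser.newContext({
  userAgent: 'MyScraper/1.0',                        // illustrative user agent
  geolocation: { latitude: 52.52, longitude: 13.4 }, // illustrative coordinates
  permissions: ['geolocation']
});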

Then open pages in the browser with browser.newPage():

const page = await browser.newPage();

Make sure to close the browser when finished:

await browser.close();

To make Playwright open a URL, use the page.goto() method:

await page.goto('https://example.com');

You can navigate back and forth in history with page.goBack() and page.goForward().
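For example, after visiting two pages you can step back and forward through the history (the second URL here is just an illustration):

await page.goto('https://example.com');
await page.goto('https://example.com/page-2'); // illustrative second URL

await page.goBack();    // back to https://example.com
await page.goForward(); // forward to /page-2 again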

Interacting with Page Elements

Playwright makes it easy to interact with elements on a page, like clicking buttons, filling out forms and more.

To click an element with text "Sign up", use:

await page.click('text="Sign up"');

This will wait until the element is visible before clicking.

Other actions like typing text work similarly:

await page.fill('input[name="email"]', '[email protected]');

This locates the email input by name attribute, waits for it to appear, clears any existing value and types the text.

Playwright has first-class support for selecting elements with CSS selectors and XPath expressions. For example:

const submit = await page.$('#submit');
await submit.click();

This gets the submit button by its ID, saves it to a variable, and clicks it.
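The same click can be expressed with an XPath selector by using the xpath= prefix (selectors that start with // are also treated as XPath automatically):

// Equivalent of #submit, written as an XPath expression
const submitByXPath = await page.$('xpath=//*[@id="submit"]');
await submitByXPath.click();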

Extracting Data

To extract text or attributes from elements, use the element.textContent() and element.getAttribute() methods:

const heading = await page.$('h1');
const title = await heading.textContent();

const image = await page.$('img');
const src = await image.getAttribute('src');

Playwright also provides powerful API methods like page.textContent() and page.$$eval() to extract data directly:

// Get the text content of the whole page body
const pageText = await page.textContent('body');

// Extract text from all .paragraph elements
const paragraphs = await page.$$eval('.paragraph', elems => {
  return elems.map(elem => elem.textContent);
});

Waiting for Elements

One difficulty with browser automation is waiting for elements or pages to load fully before taking the next action.

Playwright has useful built-in wait functions like page.waitForSelector():

// Wait for #login to exist before clicking 
await page.waitForSelector('#login');
await page.click('#login');

There are also methods like page.waitForTimeout() and page.waitForFunction() to wait for arbitrary conditions.

// Wait 5 seconds
await page.waitForTimeout(5000);

// Wait for page title to contain "Dashboard"
await page.waitForFunction(
  () => document.title.includes('Dashboard')
);

This ensures actions are taken at the right time and prevents flaky scripts.

Handling Pagination

Many websites have pagination where data is spread across multiple pages. Playwright makes it straightforward to automate clicking "Next" buttons and extracting data from all pages.

For example:

let currentPage = 1;
let nextButton; 

do {

  // Extract data from current page
  // ...

  // Click next button if it exists
  nextButton = await page.$('text="Next"');
  if (nextButton) {
    await nextButton.click();
    currentPage++; 
  }

} while (nextButton); 

This paginates until the "Next" button no longer exists. The same logic works for an infinite scroll website by scrolling down and waiting for more data to load.
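For infinite scroll, a rough sketch of that approach could look like this (the fixed one-second pause is a placeholder; waiting for a specific element is usually more reliable):

let previousHeight = 0;

while (true) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break; // nothing new was loaded

  previousHeight = currentHeight;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1000); // give the next batch time to load
}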

Managing Errors and Retries

Any scraper that runs for a long time is bound to encounter intermittent errors and failures. Playwright has tools to help handle errors gracefully.

Wrap code in try/catch blocks to handle specific errors:

try {
  await page.click('button');
} catch (error) {
  // Handle error
}

You can also build a retry loop that pauses before each new attempt:

const { chromium } = require('playwright');

const maxRetries = 5;
let retries = 0;

const browser = await chromium.launch();

while (retries <= maxRetries) {
  try {
    // Open page and scrape
    await doScrape(browser);
    break;
  } catch (error) {
    retries++;
    console.warn(`Retry #${retries}`);
    // Pause before the next attempt (simple linear backoff)
    await new Promise((resolve) => setTimeout(resolve, 1000 * retries));
  }
}

await browser.close();

async function doScrape(browser) {
  const context = await browser.newContext();
  const page = await context.newPage();

  // Scrape page ..

  await context.close(); 
}

This allows retries up to a max while isolating each attempt in a new browser context.

Storing Scraped Data

To store scraped data, you can write results to a JSON file:

const { writeFileSync } = require('fs');

// Scrape data...
const results = [];

writeFileSync('data.json', JSON.stringify(results));

For larger datasets, insert into a database like MySQL, MongoDB or Elasticsearch. The Apify SDK provides easy data exporters for various databases and cloud storage services.
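With the Apify SDK, for instance, pushing results to the default dataset looks roughly like this (a sketch based on the version 3 Actor API):

const { Actor } = require('apify');

await Actor.init();

// Scrape data...
const results = [];

await Actor.pushData(results);
await Actor.exit();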

Deploying to the Cloud

While Playwright scripts can run locally, for serious scraping projects you'll want to deploy them to the cloud. This brings scalability, reliability and efficiency.

Popular cloud platforms like Apify provide all the tools needed to run Playwright at scale, including proxies, job queues, computing power and storage.

To deploy on Apify, you bundle your Playwright script into an Apify Actor Docker container. This container can be run on the Apify platform via the API or web UI.

The platform automatically scales up scraper concurrency, manages failures, enforces limits, and stores results. No servers to manage!

Scraping GitHub Topics

Now let's put our Playwright knowledge to work by building a scraper for GitHub repository data.

The goal is to scrape information like repository name, URL, description, stars count and more for the top JavaScript repositories on GitHub.

We'll use the GitHub Topics page as our starting point.

Project setup

Create a new project directory:

mkdir github-scraper
cd github-scraper
npm init -y
npm install playwright

Launching the browser

First we'll launch a Playwright Chromium browser instance and open a new page.

Create index.js:

// index.js

const { chromium } = require('playwright');

(async () => {

  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://github.com/topics/javascript');

  await browser.close();

})(); 

Run this with node index.js and Chromium should open to the JavaScript topics page.

Paginating through repositories

There are only 30 repositories shown initially. We need to click the "Load more" button at the page bottom to reveal more.

To automate this, we'll check for the button's existence with page.$(selector) and click it for as long as it exists:

// Click "Load more" button until it disappears
let loadMoreButton;
do {
  loadMoreButton = await page.$('text="Load more repositories"');
  if (loadMoreButton) {
    await loadMoreButton.click();
  }
} while (loadMoreButton);

We also need to give the page time to update after each click. page.waitForSelector() can help here:

await page.waitForSelector('text="Load more repositories"');

This waits until the "Load more repositories" button has been rendered again before the next iteration.
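One way to make the wait more robust is to wait until the number of repository cards actually increases after each click. A sketch, assuming the .repo-list-item selector used in the next step:

let loadMoreButton;
do {
  const countBefore = await page.$$eval('.repo-list-item', (items) => items.length);

  loadMoreButton = await page.$('text="Load more repositories"');
  if (loadMoreButton) {
    await loadMoreButton.click();
    // Wait until more repository cards are rendered than before the click
    await page.waitForFunction(
      (count) => document.querySelectorAll('.repo-list-item').length > count,
      countBefore
    );
  }
} while (loadMoreButton);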

Extracting repository data

Now we can extract details for each repository card shown on the page.

First get all the .repo-list-item elements:

const items = await page.$$('.repo-list-item');

Then use element.textContent() and element.getAttribute() to extract text and attribute values from each:

const results = [];

for (const item of items) {
  const titleElement = await item.$('.repo-list-name a');
  const title = await titleElement.textContent();
  const url = await titleElement.getAttribute('href');

  const descriptionElement = await item.$('.repo-list-description');
  const description = await descriptionElement.textContent();

  // Get other data like stars count (see the sketch after this loop)

  results.push({
    title, 
    url,
    description
    // ... other properties 
  });
}
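Inside the same loop, the stars count can be read the same way. The .repo-list-stars selector below is hypothetical, so inspect the live page to find the real one:

// Hypothetical selector for the star count element
const starsElement = await item.$('.repo-list-stars');
const stars = starsElement ? (await starsElement.textContent()).trim() : null;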

This data can then be logged, saved to a file, uploaded to a database, etc.

Retrying on errors

GitHub occasionally fails with a 429 "Too Many Requests" error. We can retry on this error specifically:

// Wrap in async function
const scrape = async () => {

  let retries = 0;

  while (retries < 3) {
    try {
      // Scrape page

      break; 
    } catch (error) {
      if (error.message.includes('429')) {
        retries++;
        console.log('Retrying...');
        continue;
      }
      throw error;
    }
  }

}

scrape();

This retries up to 3 times on the 429 status code while allowing other errors to bubble up.

Result output

For this basic example, we'll log the results to the console:

console.table(results.slice(0, 10)); 

The full code so far is available on GitHub.

While Playwright handles the browser automation, for real-world scraping projects you'll want a framework like Apify to simplify storage, proxy rotation, job queues and more.

What's next?

This tutorial covered the fundamentals of web scraping with Playwright, including:

  • Launching browsers and navigating pages
  • Interacting with page elements
  • Extracting data
  • Waiting for loading
  • Paginating through data
  • Managing errors and retries
  • Storing scraped results

Playwright is perfect for automating complex sites and building robust scrapers. Paired with a cloud scraping platform, it can power full-scale data collection projects.

To dive deeper into Playwright, check out the official documentation at playwright.dev.

Happy scraping!
