Playwright is a powerful browser automation library that makes web scraping remarkably straightforward. Developed by Microsoft and open-sourced in 2020, Playwright provides a first-class API to control Chromium, Firefox, and WebKit.
If you've used Puppeteer for web scraping before, then you'll feel right at home with Playwright. The APIs are very similar, but Playwright builds on Puppeteer and introduces many improvements.
In this comprehensive tutorial, you'll learn how to:
- Set up a Playwright project
- Launch browsers and open pages
- Interact with elements on a page
- Extract data from websites
- Handle pagination and crawl between pages
- Manage errors and retries
- Store scraped data
- Deploy Playwright scrapers to the cloud
We'll be building a scraper for GitHub repository data, but the concepts apply to any web scraping project.
Getting Started with Playwright
Let's start by setting up a new project directory and installing Playwright:
npm init -y
npm install playwright
To confirm that Playwright is installed correctly, create an index.js file and add the following code:
// index.js
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: false }); // show the browser window
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
This launches Chromium, opens a new tab, navigates to example.com, and then closes the browser.
Run the code with:
node index.js
You should see Chromium flash open and navigate to the website!
Launching Browsers
The first thing any Playwright script needs to do is launch a browser instance using chromium.launch(), firefox.launch(), or webkit.launch().
You can configure things like headless mode and execution speed in the launch options (settings such as user agent and geolocation belong on a browser context instead, shown below):
const browser = await chromium.launch({
  headless: false,
  slowMo: 50 // Slow down Playwright by 50ms per action
});
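As mentioned, user agent and geolocation are set when creating a browser context; here is a minimal sketch (the user agent string and coordinates are illustrative):

const context = await browser.newContext({
  userAgent: 'MyScraper/1.0 (example user agent)', // illustrative UA string
  geolocation: { latitude: 40.7128, longitude: -74.006 }, // example coordinates
  permissions: ['geolocation'] // let pages read the mocked location
});
const page = await context.newPage();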
Then open pages with browser.newPage(), which creates a fresh browser context for you:
const page = await browser.newPage();
Make sure to close the browser when finished:
await browser.close();
Navigating to Pages
To make Playwright open a URL, use the page.goto() method:
await page.goto('https://example.com');
You can navigate back and forth in history with page.goBack() and page.goForward().
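For example:

await page.goBack(); // return to the previous URL in history
await page.goForward(); // move forward to where we were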
Interacting with Page Elements
Playwright makes it easy to interact with elements on a page, like clicking buttons, filling out forms and more.
To click an element with text "Sign up", use:
await page.click('text="Sign up"');
This will wait until the element is visible before clicking.
Other actions like typing text work similarly:
await page.fill('input[name="email"]', 'user@example.com');
This locates the email input by name attribute, waits for it to appear, clears any existing value and types the text.
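Other common interactions follow the same wait-then-act pattern; the selectors below are illustrative:

await page.check('input[type="checkbox"]'); // tick a checkbox
await page.selectOption('select#country', 'US'); // pick a <select> option by value
await page.press('input[name="email"]', 'Enter'); // press a key inside an element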
Playwright has first-class support for selecting elements with CSS selectors and XPath expressions. For example:
const submit = await page.$('#submit');
await submit.click();
This gets the submit button by its ID, saves it to a variable, and clicks it.
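XPath works too: selectors prefixed with xpath= (or starting with //) are evaluated as XPath expressions. For example:

// Select the first <h1> on the page via XPath
const heading = await page.$('xpath=//h1');
console.log(await heading.textContent());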
Extracting Data
To extract text or attributes from elements, use the element.textContent() and element.getAttribute() methods:
const heading = await page.$('h1');
const title = await heading.textContent();

const image = await page.$('img');
const src = await image.getAttribute('src');
Playwright also provides powerful API methods like page.textContent() and page.$$eval() to extract data directly. Note that page.textContent() takes a selector:

// Get the text of the whole page body
const pageText = await page.textContent('body');
// Extract text from all .paragraph elements
const paragraphs = await page.$$eval('.paragraph', elems => {
  return elems.map(elem => elem.textContent);
});
Waiting for Elements
One difficulty with browser automation is waiting for elements or pages to load fully before taking the next action.
Playwright has useful built-in wait functions like page.waitForSelector():
// Wait for #login to exist before clicking
await page.waitForSelector('#login');
await page.click('#login');
There are also methods like page.waitForTimeout() and page.waitForFunction() to wait for arbitrary conditions:
// Wait 5 seconds
await page.waitForTimeout(5000);
// Wait for page title to contain "Dashboard"
await page.waitForFunction(
  () => document.title.includes('Dashboard')
);
This ensures actions are taken at the right time and prevents flaky scripts.
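You can also wait for the page to reach a particular load state after navigating; for example:

// Resolves once there have been no network connections for at least 500ms
await page.waitForLoadState('networkidle');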
Handling Pagination
Many websites have pagination where data is spread across multiple pages. Playwright makes it straightforward to automate clicking "Next" buttons and extracting data from all pages.
For example:
let currentPage = 1;
let nextButton;

do {
  // Extract data from the current page
  // ...

  // Click the "Next" button if it exists
  nextButton = await page.$('text="Next"');
  if (nextButton) {
    await nextButton.click();
    await page.waitForLoadState('networkidle'); // wait for the next page to load
    currentPage++;
  }
} while (nextButton);
This paginates until the "Next" button no longer exists. The same logic works for an infinite scroll website by scrolling down and waiting for more data to load.
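As a rough sketch of the infinite-scroll variant (the 2-second pause is an illustrative assumption; tune it for the target site):

let previousHeight = 0;
let currentHeight = await page.evaluate(() => document.body.scrollHeight);

while (currentHeight > previousHeight) {
  previousHeight = currentHeight;
  // Scroll to the bottom to trigger loading of the next batch
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  // Give the site a moment to fetch and render new items
  await page.waitForTimeout(2000);
  currentHeight = await page.evaluate(() => document.body.scrollHeight);
}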
Managing Errors and Retries
Any scraper that runs for a long time is bound to encounter intermittent errors and failures. Playwright has tools to help handle errors gracefully.
Wrap code in try/catch blocks to handle specific errors:
try {
  await page.click('button');
} catch (error) {
  // Handle the error
}
You can also pause execution before retrying on failure:

const { chromium } = require('playwright');

const maxRetries = 5;
let retries = 0;

const browser = await chromium.launch();

while (retries <= maxRetries) {
  try {
    // Open page and scrape
    await doScrape(browser);
    break;
  } catch (error) {
    retries++;
    console.warn(`Retry #${retries}`);
    // Pause for a second before the next attempt
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}

await browser.close();

async function doScrape(browser) {
  const context = await browser.newContext();
  const page = await context.newPage();
  // Scrape page ...
  await context.close();
}
This allows retries up to a max while isolating each attempt in a new browser context.
Storing Scraped Data
To store scraped data, you can write results to a JSON file:
const { writeFileSync } = require('fs');

// Scrape data...
const results = [];

writeFileSync('data.json', JSON.stringify(results));
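For long-running scrapes, one simple variation is appending each record as a line of JSON (often called JSON Lines) so partial results survive a crash; the file name and record shape here are illustrative:

const { appendFileSync } = require('fs');

// Append one JSON object per line as soon as it is scraped
const saveRecord = (record) =>
  appendFileSync('data.jsonl', JSON.stringify(record) + '\n');

saveRecord({ title: 'example', url: 'https://example.com' });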
For larger datasets, insert into a database like MySQL, MongoDB or Elasticsearch. The Apify SDK provides easy data exporters for various databases and cloud storage services.
Deploying to the Cloud
While Playwright scripts can run locally, for serious scraping projects you'll want to deploy them to the cloud. This brings scalability, reliability and efficiency.
Popular cloud platforms like Apify provide all the tools needed to run Playwright at scale, including proxies, job queues, computing power and storage.
To deploy on Apify, you bundle your Playwright script into an Apify Actor Docker container. This container can be run on the Apify platform via the API or web UI.
The platform automatically scales up scraper concurrency, manages failures, enforces limits, and stores results. No servers to manage!
Scraping GitHub Topics
Now let's put our Playwright knowledge to work by building a scraper for GitHub repository data.
The goal is to scrape information like repository name, URL, description, stars count and more for the top JavaScript repositories on GitHub.
We'll use the GitHub Topics page as our starting point.
Project setup
Create a new project directory:
mkdir github-scraper
cd github-scraper
npm init -y
npm install playwright
Launching the browser
First, we'll launch a Playwright Chromium browser instance and open a new page.
Create index.js:
// index.js
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://github.com/topics/javascript');

  await browser.close();
})();
Run this with node index.js and Chromium should open to the JavaScript topics page.
Paginating through repositories
Only 30 repositories are shown initially. We need to click the "Load more" button at the bottom of the page to reveal more.
To automate this, we'll use page.click() and check for the button's existence with page.$(selector):
// Click the "Load more" button until it disappears
let loadMoreButton;

do {
  loadMoreButton = await page.$('text="Load more repositories"');
  if (loadMoreButton) {
    await loadMoreButton.click();
    // Give the next batch of repositories a moment to load
    await page.waitForTimeout(1000);
  }
} while (loadMoreButton);

We also need to wait for the page to update after each click; a short page.waitForTimeout() is the simplest option here. This gives new repositories time to load before the loop checks for the button again.
Extracting repository data
Now we can extract details for each repository card shown on the page.
First get all the .repo-list-item elements:

const items = await page.$$('.repo-list-item');
Then use element.textContent() and element.getAttribute() to extract text and attribute values from each:
const results = [];

for (const item of items) {
  const titleElement = await item.$('.repo-list-name a');
  const title = await titleElement.textContent();
  const url = await titleElement.getAttribute('href');

  const descriptionElement = await item.$('.repo-list-description');
  const description = await descriptionElement.textContent();

  // Get other data like stars count...

  results.push({
    title,
    url,
    description
    // ... other properties
  });
}
This data can then be logged, saved to a file, uploaded to a database, etc.
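For example, reusing the file-writing pattern from earlier to persist the results array (the file name is illustrative):

const { writeFileSync } = require('fs');

// Persist the scraped repositories as pretty-printed JSON
writeFileSync('repos.json', JSON.stringify(results, null, 2));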
Retrying on errors
GitHub occasionally fails with a 429 "Too Many Requests" error. We can retry on this error specifically:
// Wrap in an async function
const scrape = async () => {
  let retries = 0;

  while (retries < 3) {
    try {
      // Scrape page
      break;
    } catch (error) {
      if (error.message.includes('429')) {
        retries++;
        console.log('Retrying...');
        continue;
      }
      throw error;
    }
  }
};

scrape();
This retries up to 3 times on the 429 status code while allowing other errors to bubble up.
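If 429s persist, a small exponential backoff between attempts can help; a sketch, with an illustrative base delay of one second:

// Pause with exponential backoff: 1s, 2s, 4s, ... per retry attempt
const backoff = (attempt) =>
  new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt));

// Inside the catch block above: await backoff(retries);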
Result output
For this basic example, we'll log the results to console:
console.table(results.slice(0, 10));
The full code so far is available on GitHub.
While Playwright handles the browser automation, for real-world scraping projects you‘ll want a framework like Apify to simplify storage, proxy rotation, job queues and more.
What's next?
This tutorial covered the fundamentals of web scraping with Playwright, including:
- Launching browsers and navigating pages
- Interacting with page elements
- Extracting data
- Waiting for loading
- Paginating through data
- Managing errors and retries
- Storing scraped results
Playwright is perfect for automating complex sites and building robust scrapers. Paired with a cloud scraping platform, it can power full-scale data collection projects.
To dive deeper into Playwright, check out:
- Playwright Documentation – Official API reference and guides.
- Playwright Examples – Demos showing various Playwright features.
- Apify Academy – Free web scraping courses for beginners to experts.
Happy scraping!