
Playwright Web Scraping Tutorial for 2024

Web scraping is the automated process of extracting data from websites. With the rise of dynamic websites built with JavaScript, static HTML parsers like BeautifulSoup in Python cannot see content that is only rendered in the browser. Playwright is a browser automation library from Microsoft that solves this by driving real Chromium, Firefox and WebKit browsers.

In this comprehensive tutorial, we will cover the fundamentals of web scraping with Playwright step by step.

What is Playwright?

Playwright is a browser automation library developed by Microsoft for controlling Chromium, Firefox and WebKit via code. Some key features:

  • Official bindings for JavaScript/TypeScript, Python, C# and Java for cross-platform web automation
  • Drives real browser instances, headless or headed, so it can render fully dynamic websites
  • Allows interacting with elements, clicking buttons, filling forms, navigating between pages and more
  • Enables cross-browser testing across Chromium, Firefox and WebKit with the same API
  • Provides a flexible, un-opinionated API compared to competitors

In addition to web testing, these browser automation capabilities make Playwright well-suited for scraping modern, JavaScript-heavy sites.

Setup and Installation

To install Playwright, you will first need Node.js installed on your system.

Node.js Installation

Install a recent LTS release of Node.js from the official website (recent Playwright versions require Node.js 18 or newer). This will also install the node package manager npm, which we'll use to install Playwright.

Next, create a new Node.js project folder and initialize npm:

mkdir playwright-scraper
cd playwright-scraper
npm init -y

This will generate a package.json file with default configurations for your project.

Install Playwright Node.js Package

Now install the Playwright npm package:

npm install playwright

This will download the Playwright library into a node_modules folder within your project.

Install Browser Drivers

The final step is to download the browser binaries that Playwright will control:

npx playwright install

This will install the Chromium, Firefox and WebKit browser binaries on your machine. Playwright is now ready to launch and control these browser instances programmatically!

Install Python Package

For Python, install the playwright package via pip:

pip install playwright

And install the browser drivers:

playwright install

We are now ready to write some code to scrape with Playwright!

Scraping Basics with Playwright

Let's look at a simple scraping script to fetch the title from a web page.

Node.js

Here is an async Playwright script to scrape a page title in Node.js:

const playwright = require('playwright');

(async () => {
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example.com/');

  // Extract the page title
  const title = await page.title();
  console.log(title);

  await browser.close();
})();

The script launches a Chromium browser, opens a new page, navigates to a URL, extracts the page title using page.title(), prints it, and closes the browser.
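By default, launch() starts the browser in headless mode. While developing a scraper it can help to watch the browser work; launching headed with a slowMo delay is a standard Playwright option:

const browser = await playwright.chromium.launch({
  headless: false, // show the browser window while the script runs
  slowMo: 250      // slow each action down by 250 ms so it is easy to follow
});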

Python

Equivalent Python script:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.example.com/")

    title = page.title()
    print(title)

    browser.close()

Playwright provides a synchronous API in Python (sync_playwright) to keep scripts simple; an asyncio-based API is also available via playwright.async_api.

The key steps are:

  1. Launch new browser instance
  2. Open a new page
  3. Navigate page to target URL
  4. Use page methods like title() to extract data
  5. Close browser when done

This allows our scripts to automate actions normally performed manually in a browser, such as navigating between pages and clicking elements.
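For instance, here is a minimal sketch of automating a search form before scraping the results; the selectors (input[name="q"], button[type="submit"], div.results) are hypothetical and depend on the target page:

// Fill a search box, submit, and wait for results to render
await page.goto('https://www.example.com/');
await page.fill('input[name="q"]', 'playwright scraping');
await page.click('button[type="submit"]');
await page.waitForSelector('div.results');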

Locating Page Elements

To extract data from a page, we first need to locate the HTML elements that contain our target data.

Playwright offers two options to find elements:

CSS Selectors

CSS selectors use syntax such as element#id or element.class to match elements by a specific ID or class name.

For example, to locate all <div> elements with class product:

const products = await page.$$('div.product');

XPath Selectors

XPath is a query language for selecting nodes in XML and HTML documents. Playwright treats selectors that start with // as XPath, or you can use the explicit xpath= prefix:

const products = await page.$$('xpath=//div[@class="product"]');

Once we have selected the elements, we can extract data or interact with them in the browser.
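Newer Playwright releases also offer the page.locator() API, which accepts both CSS and XPath and waits for elements automatically; a brief sketch (the XPath selector below is hypothetical):

const products = page.locator('div.product');           // CSS
const headings = page.locator('//h2[@class="title"]');  // XPath

console.log(await products.count()); // number of matched elements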

Extracting Text

After identifying elements on a page, we can extract text or attributes from them in Playwright.

page.textContent()

The textContent() method returns the full inner text of an element.

For example, to extract the title of a single product element:

const title = await product.textContent(); 

page.innerText()

innerText() reads the rendered text shown to the user, ignoring hidden elements.
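For example, to read the visible text of the first <h1> heading on the page:

const headline = await page.innerText('h1');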

page.inputValue()

The inputValue() method returns the entered value for form input fields like textboxes.
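For example, to read what is currently typed into a search box (the selector is hypothetical):

const query = await page.inputValue('input#search');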

page.getAttribute()

We can get attribute values from elements like href for links using getAttribute():

const link = await page.$('a.product-url');
const url = await link.getAttribute('href');

Extracting Multiple Elements

To extract data from multiple elements, we can use page.$$eval() to evaluate within the browser context:

const titles = await page.$$eval('div.product', elements => {
  return elements.map(item => item.innerText);
});

The callback runs inside the browser context and maps the matched elements to the desired data.
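The same pattern can return structured objects. Here is a sketch that collects a title and price per product, assuming child elements with classes .title and .price (hypothetical markup):

const products = await page.$$eval('div.product', elements => {
  return elements.map(item => ({
    title: item.querySelector('.title')?.innerText,  // hypothetical child element
    price: item.querySelector('.price')?.innerText   // hypothetical child element
  }));
});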

Downloading Files

Beyond extracting text, Playwright can also download files like PDFs, images, zip files etc.

Here is an example to download all PDFs from a page:

const fs = require('fs');
const path = require('path');

// Match all PDF links on the page
const pdfUrls = await page.$$eval('a[href$=".pdf"]', links => {
  return links.map(a => a.href);
});

// Download each PDF with Playwright's request API
for (const pdfUrl of pdfUrls) {
  const response = await page.request.get(pdfUrl);
  const buffer = await response.body();

  // Derive a local filename from the URL path
  const filename = path.basename(new URL(pdfUrl).pathname);
  await fs.promises.writeFile(filename, buffer);
}

We first extract all PDF links, then request each URL and write the response body to a local file.

The same approach works for images, zips, MP3s etc.

Scraping Images

Here is how to scrape and download all images from a page:

// Extract image URLs
const imgSrcs = await page.$$eval('img', imgs => {
  return imgs.map(img => img.src);
});

// Download each image (reusing the fs and path imports from above)
for (const src of imgSrcs) {
  const res = await page.request.get(src);
  const buffer = await res.body();

  const filename = path.basename(new URL(src).pathname);
  await fs.promises.writeFile(filename, buffer);
}

We grab the image src attributes first, then make requests to download each one.

Handling Secure Sites

Many sites use HTTPS, and Playwright validates SSL certificates automatically. If you need to ignore certificate errors (for example, for self-signed certificates), set the ignoreHTTPSErrors option when creating a browser context:

const browser = await chromium.launch();
const context = await browser.newContext({
  ignoreHTTPSErrors: true
});
const page = await context.newPage();

To avoid sites blocking your scraper, it's also recommended to use proxies and, where logins are required, throwaway accounts.

Configuring Proxies

Proxies are essential for web scraping to prevent blocks and bans. Here is how to configure proxies in Playwright:

const browser = await chromium.launch({
  proxy: {
    server: 'http://192.168.1.1:8080',
    username: 'user',
    password: 'pass'
  }
});

The proxy setup accepts a server URL, with optional username and password for authentication.

You can also configure a proxy per browser context, so different scraping sessions go out through different proxies:

const context = await browser.newContext({
  proxy: {
    server: 'http://192.168.1.1:8080'
  }
});

All pages created from this context send their requests through the defined proxy server. (Note that on some platforms Chromium may require a proxy to also be specified at launch for per-context proxies to take effect.)

I recommend using residential proxies that rotate IP addresses to reduce the chance of getting blocked while scraping.
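A minimal rotation sketch, assuming a pool of proxy addresses (the values below are placeholders); residential providers usually expose a single rotating gateway instead:

// Pick a random proxy from a pool for each browser session
const proxies = [
  'http://192.168.1.10:8080', // hypothetical proxy addresses
  'http://192.168.1.11:8080',
  'http://192.168.1.12:8080'
];
const server = proxies[Math.floor(Math.random() * proxies.length)];
const browser = await chromium.launch({ proxy: { server } });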

Scraping Best Practices

Here are some tips for avoiding detection and captchas while scraping:

  • Use proxies; rotating them is ideal
  • Add random delays of 1 to 5 seconds between page actions
  • Disable images and media to reduce bandwidth and footprint
  • Mimic real human behavior such as scrolling and mouse movements
  • Start fresh browser contexts rather than reusing long-lived sessions, to limit fingerprinting

Also monitor for HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses to detect when you are being rate-limited or blocked.
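Here is a short sketch combining two of these tips, blocking images and media with request interception and adding random 1 to 5 second delays between actions (the navigation selector is hypothetical):

// Abort requests for images and media before navigating
await page.route('**/*.{png,jpg,jpeg,gif,webp,mp4}', route => route.abort());

// Wait a random 1-5 seconds between actions
const randomDelay = () =>
  new Promise(resolve => setTimeout(resolve, 1000 + Math.random() * 4000));

await page.goto('https://www.example.com/');
await randomDelay();
await page.click('a.next-page'); // hypothetical selector
await randomDelay();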

Playwright vs Puppeteer vs Selenium

How does Playwright compare with other popular browser automation tools like Puppeteer and Selenium?

Puppeteer is a Node.js-only library focused primarily on Chromium and headless Chrome. It provides fast and stable performance.

Selenium supports many languages and browsers, but tends to be slower due to the overhead of the WebDriver protocol.

Playwright combines the strengths of both. It has native language bindings for Node.js, Python, C# and Java, and it supports Firefox and WebKit in addition to Chromium. Performance is fast and generally on par with Puppeteer.

In summary, Playwright offers the best of both worlds: support for multiple browsers and languages while remaining fast and lightweight.

Conclusion

Playwright is a powerful tool for robust web scraping due to its browser automation capabilities and flexibility. This tutorial covered the basics of using Playwright to scrape text, images and files from websites programmatically.

Some key topics included:

  • Installing Playwright for Node.js and Python
  • Locating elements with CSS and XPath selectors
  • Extracting text, attributes and multimedia files
  • Handling secure sites and proxy configurations
  • Comparison to similar tools like Puppeteer and Selenium

I hope this provides a good foundation for you to start building scrapers with Playwright. The documentation and community resources are great places to learn more. Let me know if you have any other questions!
