The Ultimate Guide to Web Scraping with Cheerio in Node.js

Web scraping is a powerful technique that allows you to extract data from websites. It has many useful applications, from collecting data for research to tracking prices for comparison shopping. While there are many ways to scrape web pages, using Node.js with the Cheerio library is a popular and effective method, especially for scraping static websites.

In this comprehensive tutorial, we'll cover everything you need to know to start web scraping using Cheerio in Node.js. We'll go over what Cheerio is, its advantages, and how to set it up. Then we'll walk through a complete example of scraping a website step by step. By the end, you'll be ready to scrape websites on your own using this powerful tool.

What is Cheerio?

Cheerio is a Node.js library that allows you to parse and manipulate HTML just like you would with jQuery in the browser. With Cheerio, you can load in HTML and then use familiar jQuery syntax and methods to select, traverse, and extract data from it.

The key thing to understand is that Cheerio does not actually render the HTML or execute any JavaScript on the page. It simply parses the HTML string and provides an interface to interact with it. This makes Cheerio very lightweight and fast compared to using a full browser environment.
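For example, here's a minimal sketch of that idea, parsing a hard-coded HTML string rather than a fetched page:

const cheerio = require('cheerio');

// Cheerio parses the string; nothing is rendered and no scripts run
const $ = cheerio.load('<ul><li class="item">One</li><li class="item">Two</li></ul>');

console.log($('.item').first().text()); // => 'One'
console.log($('.item').length);         // => 2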

Cheerio is perfect for web scraping when the data you need is contained directly in the page HTML. For dynamic sites that load data via JavaScript, you'll need a tool that can execute JS, like Puppeteer. But for simple static sites, Cheerio is a great choice.

Advantages of Cheerio for Web Scraping

There are a few key advantages that make Cheerio a top choice for many web scraping projects:

  • Familiarity: If you've used jQuery before, you'll feel right at home with Cheerio's syntax. This makes it very approachable, especially for front-end devs.

  • Performance: Cheerio doesn't render the HTML or run JavaScript, making it very fast and efficient compared to headless browser tools. When scraping large numbers of pages, this performance boost is significant.

  • Flexibility: Cheerio can parse both HTML and XML, so it can scrape a wide variety of page types and structures. It provides all the power and flexibility of jQuery for manipulating the parsed page data.

  • Simplicity: Cheerio has a single job, parsing and manipulating HTML/XML data. This focus makes it simple and easy to work with for scraping. You don't have to worry about the complexities of a rendering engine.

Of course, Cheerio isn't the right tool for every situation. If you need to scrape single-page apps, parse JavaScript-rendered content, or interact with complex user flows, you're better off with a tool like Puppeteer. But for many scraping tasks, Cheerio is the perfect balance of capability and simplicity.

Setting Up Your Cheerio Web Scraper

Before we dive into actually scraping a site, let's go over how to set up a new Node.js project and install Cheerio and its dependencies. You'll need to have Node.js installed on your machine first.

Create a new directory for the project and initialize it:

mkdir cheerio-scraper 
cd cheerio-scraper
npm init -y

Then install Cheerio and Axios, the promise-based HTTP client we'll use to fetch web pages:

npm install cheerio axios

We'll also use Node.js's built-in fs module to save the data to a file; since it ships with Node.js, there's nothing extra to install.

Now create a new file for the code:

touch scraper.js

Open this file in your code editor and add the following to the top to import the dependencies:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

With the setup out of the way, we're ready to start scraping!

Scraping a Static Website with Cheerio

For this example, we'll be scraping books from the site books.toscrape.com. This site is purposely built for practicing web scraping and contains book listings with titles, prices, ratings, and more.

Our scraper will:

  1. Fetch the book listing page
  2. Extract the title, price, rating, and URL for each book
  3. Follow pagination links to scrape each page of listings
  4. Save the extracted data to a CSV file

Here's the code, with explanations of each step:

const BASE_URL = 'http://books.toscrape.com/';

const bookData = [];

async function scrapeBooks(url) {
  try {
    // Fetch the HTML of the page we want to scrape
    const { data } = await axios.get(url);
    // Load the HTML we fetched in the previous line
    const $ = cheerio.load(data);
    // Select all the book list items (each one has the product_pod class)
    const listItems = $('.product_pod');

    // Iterate over each book list item
    listItems.each((idx, el) => {
      // Select the title, price, rating, and URL,
      // store them in an object, and push it to the bookData array
      bookData.push({
        title: $(el).find('h3 a').attr('title'),
        price: $(el).find('.price_color').text(),
        rating: $(el).find('p.star-rating').attr('class').split(' ')[1],
        // The hrefs are relative, so resolve them against the current page URL
        url: new URL($(el).find('h3 a').attr('href'), url).href
      });
    });

    // Recursively scrape the next page
    if ($('.pager .next a').length > 0) {
      const nextUrl = new URL($('.pager .next a').attr('href'), url).href;
      await scrapeBooks(nextUrl);
    } else {
      // Once all pages have been scraped, save the data to CSV
      exportResults();
    }
  } catch (err) {
    console.error(err);
  }
}

function exportResults() {
  // Quote every value so commas inside titles don't break the CSV
  const rows = bookData.map(row =>
    Object.values(row).map(val => `"${String(val).replace(/"/g, '""')}"`).join(',')
  );
  rows.unshift(Object.keys(bookData[0]).join(','));

  fs.writeFileSync('./books.csv', rows.join('\n'));
}

scrapeBooks(BASE_URL);

Let's go over the key parts in more detail.

Fetching and Parsing the HTML

First we fetch the HTML using Axios:

const { data } = await axios.get(url);

Then we load this HTML into Cheerio:

const $ = cheerio.load(data);

The $ variable is now our Cheerio instance that we can use to parse and extract data from the page HTML. It works just like jQuery.
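For instance, once a page is loaded, any familiar jQuery-style call works (the selectors below are generic illustrations, not ones from books.toscrape.com):

$('title').text();           // text of the page's <title> element
$('a').first().attr('href'); // href of the first link on the page
$('.product').length;        // number of elements matching a selector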

Extracting the Book Data

Cheerio allows us to select elements from the page using familiar jQuery syntax. First we select all the book list items:

const listItems = $('.product_pod');

We can now iterate over each book and extract the title, price, rating, and URL:

listItems.each((idx, el) => {
  bookData.push({
    title: $(el).find('h3 a').attr('title'),
    price: $(el).find('.price_color').text(),
    rating: $(el).find('p.star-rating').attr('class').split(' ')[1],
    url: new URL($(el).find('h3 a').attr('href'), url).href
  });
});

The rating is stored as a class name (for example, star-rating Three), so we split the class attribute on spaces and take the second word. The book URLs in the listing are relative, so we resolve them against the current page URL using the built-in URL class.

Paginating Through Listings

To scrape all books, not just the first page of listings, we need to check if there's a next page link and recursively call our scrapeBooks function:

if ($('.pager .next a').length > 0) {
  const nextUrl = new URL($('.pager .next a').attr('href'), url).href;
  await scrapeBooks(nextUrl);
}

We select the next page link, extract its href attribute, resolve it against the current page URL (the pagination hrefs are relative), and pass the absolute URL to scrapeBooks again. Awaiting the recursive call keeps the pages scraping in order and lets errors propagate. This continues until we reach the final page, which has no "next" link.
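As a quick illustration of why resolving matters, the built-in URL class handles both forms of relative href (the example hrefs below are illustrative of what paginated sites typically return):

// Resolving relative hrefs against the page they appear on:
new URL('catalogue/page-2.html', 'http://books.toscrape.com/').href;
// => 'http://books.toscrape.com/catalogue/page-2.html'

new URL('page-3.html', 'http://books.toscrape.com/catalogue/page-2.html').href;
// => 'http://books.toscrape.com/catalogue/page-3.html'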

Saving Data to CSV

After all pages have been scraped, we convert our bookData array to CSV format and write it to a file:

function exportResults() {
  const rows = bookData.map(row =>
    Object.values(row).map(val => `"${String(val).replace(/"/g, '""')}"`).join(',')
  );
  rows.unshift(Object.keys(bookData[0]).join(','));

  fs.writeFileSync('./books.csv', rows.join('\n'));
}

We use map to extract the values from each book object, wrapping each value in double quotes (and escaping any embedded quotes) so commas inside titles don't corrupt the output, then join the values into a comma-separated row. We add the keys as a header row using unshift. Finally, we join the rows with newline characters and write the result to books.csv using fs.writeFileSync.

And with that, our Cheerio web scraper is complete! Running this script with Node.js will fetch all the book listings from books.toscrape.com, extract the key information, and save it to a CSV file.
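To run the scraper, execute the script from the project directory:

node scraper.js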

Additional Tips and Best Practices

Here are a few more things to keep in mind when web scraping with Cheerio and Node.js:

  • Respect website terms of service and robots.txt. Always check if a site allows scraping before doing so.

  • Use proxies to avoid IP blocking when scraping large numbers of pages. Rotating proxies, like those offered by services such as Bright Data, IPRoyal, Proxy-Seller, SOAX, Smartproxy, Proxy-Cheap, and HydraProxy, can help avoid detection.

  • Add delays between requests to avoid overwhelming servers. A pause of a few seconds between requests is generally sufficient; see the sketch after this list.

  • Handle errors gracefully. Use try/catch and log errors so you can debug issues.

  • Consider using a headless browser like Puppeteer if you need to scrape dynamic content. Cheerio is great for static HTML but can't execute JavaScript.

  • Regularly verify your scraper's selectors, as website structures can change and break your script. Targeting attributes that are less likely to change (like IDs and data attributes) is best.
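As a minimal sketch of the delay tip above (the helper names and the 2000 ms value are illustrative choices, not part of the tutorial's code):

// Pause between requests to be polite to the server
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeGet(url) {
  await sleep(2000); // arbitrary example delay of 2 seconds
  return axios.get(url);
}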

Conclusion

Web scraping with Cheerio in Node.js is a powerful and efficient way to extract data from static web pages. Cheerio's familiar jQuery-like syntax and lightweight implementation make it easy to parse and manipulate HTML for scraping.

In this guide, we covered what Cheerio is, why it's great for web scraping, and walked through a complete example of scraping book listings from a static website. We also discussed some tips and best practices to keep in mind.

Armed with this knowledge, you're ready to start using Cheerio for your own web scraping projects. Of course, scraping is just the first step: there's much more to learn about storing, analyzing, and using the data you collect. But Cheerio and Node.js provide a solid foundation for building all sorts of useful web scraping applications.

Happy scraping!
