How to Web Scrape with Cheerio and Node.js

Web scraping allows you to extract data from websites, which can be extremely useful for gathering datasets, monitoring prices, automating repetitive tasks, and more. With the Cheerio Node.js library, you can easily parse and extract data from any webpage using familiar jQuery-like syntax.

In this guide, I'll walk you through everything you need to know to start web scraping with Cheerio and Node.js. You'll learn how to set up a scraping project, load webpages, select elements to extract, handle pagination and infinite scroll, and output the scraped data to a structured format. I'll also share some tips and best practices I've learned to help you scrape reliably and efficiently.

By the end of this article, you'll be able to scrape any static website and transform the raw HTML into clean, structured data ready for analysis and use in your applications. Let's get started!

What is Cheerio?

Cheerio is a Node.js library that allows you to parse and traverse HTML documents using syntax very similar to jQuery. It's an implementation of jQuery's core API designed specifically for web scraping.

With Cheerio, you can load HTML into a virtual DOM, and then use methods like find(), parent(), siblings(), etc. to navigate the document and extract pieces of data. You can use standard CSS selector syntax as well as jQuery extensions like :has(), :eq(), and :contains() to precisely select the elements you want.

Here are a few examples of what you can do with Cheerio:

  • Extract all links from a webpage
  • Scrape an HTML table into a CSV or JSON
  • Get the author, title, and body of articles
  • Collect product data like name, price, description from ecommerce listings
  • Download images whose filenames match a certain pattern

Cheerio is quite fast for web scraping since it doesn't download external resources, render CSS, or execute JavaScript. It just parses the raw HTML string into a document you can query and traverse with code.
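
For example, here is a minimal sketch that loads a hardcoded HTML snippet and pulls every link out of it, without touching the network at all (the markup is made up purely for illustration):

const cheerio = require('cheerio');

// Parse a hardcoded HTML string into a queryable document
const $ = cheerio.load('<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>');

// Collect the href attribute of every link into a plain array
const links = $('a').map((i, el) => $(el).attr('href')).get();

console.log(links); // [ '/a', '/b' ]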

Getting Started

Before you can start using Cheerio, you'll need to have Node.js and npm installed. If you don't already have them set up, you can download an installer from the official Node.js website.

Once you have Node.js ready, create a new directory for your web scraping project:

mkdir scraping-project
cd scraping-project

Inside the project directory, initialize a new Node.js project:

npm init -y

This will create a package.json file with some default configuration.

Next, install Cheerio and Axios, a popular HTTP client we'll use for downloading webpages:

npm install cheerio axios

Now you're ready to write your first script using Cheerio! Create a new file called index.js and open it in your code editor.

Loading and Parsing HTML

The first step to scraping a webpage with Cheerio is to load the HTML content into a virtual DOM. There are a few different ways you can get the HTML – by downloading it from a URL, reading from a local file, or using a hardcoded HTML string.
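
For instance, if you already have the page saved on disk, you can read the file and hand the string to Cheerio directly (page.html here is just a placeholder filename):

const fs = require('fs');
const cheerio = require('cheerio');

// Read previously downloaded HTML from disk and parse it
const html = fs.readFileSync('page.html', 'utf8');
const $ = cheerio.load(html);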

For this example, we'll use Axios to download the HTML from a Wikipedia article. Here's what the code looks like:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://en.wikipedia.org/wiki/Web_scraping';

axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);

    // TODO: Scrape data here
  })
  .catch(console.error);

This code fetches the HTML content of the https://en.wikipedia.org/wiki/Web_scraping URL using axios.get(). When the request completes, it takes the HTML string from response.data and passes it to cheerio.load().

The cheerio.load() function parses the HTML and returns a cheerio instance that we assign to the $ variable. This $ object works similarly to jQuery, allowing you to select and manipulate elements from the parsed HTML document.

Selecting Elements to Extract

With the HTML loaded into Cheerio, you can now use CSS selectors to pick out the elements and attributes that you want to extract.

Cheerio selectors work the same as in jQuery, so you can use element names, IDs, classes, attributes, and pseudo-classes to precisely target elements. If you're already familiar with CSS selectors from web development, you'll feel right at home. If not, you can consult any CSS selector reference to learn the syntax.

To test out different selectors on a live webpage, open the browser developer tools and use the element inspector. Hovering over parts of the page highlights the corresponding elements and shows their tags and classes, and right-clicking a node in the Elements panel lets you copy a CSS selector for it. You can also use the Console tab to run jQuery-like queries against the current page by typing $('selector'), which in most browsers is a built-in shorthand for document.querySelector().

Let's try selecting a few elements from the web scraping Wikipedia article.

To get the main title of the article:

const title = $('h1').text();
console.log(title);

This selects the page's <h1> tag and extracts its text content. Simple!

To get the article summary at the top of the page:

const summary = $('.shortdescription').text();
console.log(summary);

This looks for an element with the class "shortdescription" and gets its text. Classes are a great way to select specific elements when there are no unique IDs.

To extract all the image URLs in the article:

const imageUrls = $('img').map((i, img) => $(img).attr('src')).get();
console.log(imageUrls);

This selects all <img> tags on the page, then uses Cheerio's map() method to extract the "src" attribute from each one into an array. get() converts the Cheerio object into a regular array.
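
One thing to watch out for: the src values you get back are often relative or protocol-relative paths rather than full URLs. A quick way to normalize them is to resolve each one against the page URL (the url variable defined earlier) using the built-in URL class:

// Resolve relative and protocol-relative paths against the page URL
const absoluteImageUrls = imageUrls.map(src => new URL(src, url).href);
console.log(absoluteImageUrls);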

To extract a list of items:

const techniques = [];

$('dt').each((i, el) => {
  techniques.push($(el).text());
});

console.log(techniques);  

This loops through all the <dt> elements (definition terms) and pushes their text strings into an array.

As you can see, Cheerio provides a lot of flexibility in how you locate and extract pieces of data from the DOM. You can adapt these techniques to scrape whatever elements you need.
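
You can also combine text and attribute extraction to build structured records in one pass. As a rough sketch, here is how you might collect the external links from the article's reference list into objects (the .references and .external class names are an assumption about Wikipedia's markup, so confirm them in the inspector first):

// Build an array of { text, href } objects for the external links in the references section
const references = $('.references a.external')
  .map((i, el) => ({
    text: $(el).text().trim(),
    href: $(el).attr('href')
  }))
  .get();

console.log(references);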

In addition to directly selecting elements, you can also navigate between elements based on their relative positions in the DOM tree. This is useful for drilling into deeply nested structures.

Here are a few of the most common traversal methods:

  • find(selector) – get descendant elements matching the selector
  • parent() – get the parent element
  • closest(selector) – get the closest ancestor matching selector
  • siblings() – get sibling elements
  • prev() – get the previous element
  • next() – get the next element
  • first() – get the first element
  • last() – get the last element

For example, to get the link URL inside the first table cell:

const url = $('td').first().find('a').attr('href');

Or to get the text of the next paragraph after an h2:

$('h2').each((i, el) => {
  const title = $(el).text();
  const nextParagraph = $(el).next('p').text();
  console.log(title + ":", nextParagraph);
});

By chaining these traversal methods together, you can precisely target elements based on their context, without needing to know the exact markup structure. This helps make scrapers more robust.
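
For example, given hypothetical product-listing markup where each table row holds the product name in one cell and the price in a sibling cell with a price class, you could hop from the name to the price like this:

// Hypothetical markup: start at the cell containing the product name,
// climb up to its row, then drill back down to the price cell
const price = $('td:contains("Widget")')
  .closest('tr')
  .find('.price')
  .text()
  .trim();

console.log(price);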

Scraping Multiple Pages

Many websites divide their content across multiple pages. To scrape all the data, you'll need to visit each page and combine the results.

The simplest approach is to construct the URL for each page and request each one in a loop:

const baseUrl = 'https://example.com/products?page=';

for(let page = 1; page <= 5; page++) {
  const url = baseUrl + page;

  axios.get(url)
    .then(response => {
      const html = response.data;
      const $ = cheerio.load(html);

      $('.product').each((i, el) => {
        // Extract data, push to array or database  
      });
    });
}
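
One caveat with the loop above: all five requests fire at once, their responses can arrive in any order, and there is no error handling or delay between them. If you would rather fetch the pages one at a time with a polite pause, an async/await version looks roughly like this (the two-second delay is an arbitrary example):

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeAllPages() {
  const products = [];

  for (let page = 1; page <= 5; page++) {
    const response = await axios.get(baseUrl + page);
    const $ = cheerio.load(response.data);

    $('.product').each((i, el) => {
      // Extract whatever fields you need; here we just keep the text
      products.push($(el).text().trim());
    });

    // Pause between requests so we don't hammer the server
    await delay(2000);
  }

  return products;
}

scrapeAllPages().then(console.log).catch(console.error);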

Some websites use AJAX to load additional content when the user scrolls to the bottom of the page (infinite scroll). For these, you'll need to use the browser developer tools to inspect the Network tab and replicate the API requests in your code.

Here's an example using Axios to fetch paginated data from an API endpoint:

const url = 'https://example.com/api/products';

async function fetchAllProducts() {
  let currentPage = 1;
  let totalPages = 1;

  while (currentPage <= totalPages) {
    const params = {
      page: currentPage,
      page_size: 100
    };

    // Wait for each page to finish before requesting the next one
    const response = await axios.get(url, { params });

    // Assumes the API responds with a body like { total_pages, products: [...] }
    totalPages = response.data.total_pages;
    currentPage++;

    response.data.products.forEach(item => {
      // Process each item
    });
  }
}

fetchAllProducts().catch(console.error);

Because each request is awaited, the loop fetches one page of products at a time until it reaches the total number of pages reported by the API. You can find the names of the pagination query parameters by inspecting the API requests in your browser's Network tab.

Outputting Scraped Data

The final step is to save the data you've scraped into a structured format like CSV or JSON.

To write the results to a CSV file:

const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const csvWriter = createCsvWriter({
  path: 'output.csv',
  header: [
    {id: 'name', title: 'Name'},
    {id: 'price', title: 'Price'},
    {id: 'url', title: 'URL'},
  ]
});

const records = [
  {
    name: 'Product 1',
    price: '$10',
    url: 'https://example.com/product1'
  },
  {
    name: 'Product 2',
    price: '$20',
    url: 'https://example.com/product2'
  }
];

csvWriter.writeRecords(records)
  .then(() => console.log('The CSV file was written successfully'));

This uses the csv-writer package to define the CSV columns and rows. Install it first with npm install csv-writer.

To save the scraped data as a JSON file:

const fs = require('fs');

const results = [
  {
    title: "Example 1", 
    url: "https://example.com/1"
  },
  {  
    title: "Example 2",
    url: "https://example.com/2"  
  }
];

fs.writeFile('output.json', JSON.stringify(results), err => {
  if(err) {
    console.error(err);
    return;
  }

  console.log("Successfully wrote to output.json");
});

This converts the array of result objects to a JSON string using JSON.stringify() and writes it to an output.json file using the built-in fs module.

You can also save your scraped data to a database like MongoDB or MySQL for further analysis and querying. The process is similar – create an array of records and use an appropriate database library to bulk insert them.
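
For instance, with the official MongoDB Node.js driver (npm install mongodb), a bulk insert might look roughly like this; the connection string, database name, and collection name are placeholders to swap for your own:

const { MongoClient } = require('mongodb');

async function saveRecords(records) {
  // Placeholder connection string for a local MongoDB instance
  const client = new MongoClient('mongodb://localhost:27017');

  try {
    await client.connect();
    const collection = client.db('scraping').collection('products');

    // Insert all scraped records in one round trip
    await collection.insertMany(records);
    console.log(`Inserted ${records.length} records`);
  } finally {
    await client.close();
  }
}

saveRecords(records).catch(console.error);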

Tips & Best Practices

Here are a few tips and best practices to keep in mind when web scraping with Cheerio:

  • Respect websites' terms of service and robots.txt files. Don't scrape websites that prohibit it.
  • Use a reasonable crawl rate and add delays between requests to avoid overloading servers. A few seconds is usually sufficient.
  • Set the User-Agent request header to identify your scraper. Some websites may block requests with blank/default user agents.
  • Handle errors and retries gracefully. Use try/catch and exponential backoff to deal with failed requests (see the sketch after this list).
  • Cache the downloaded HTML and scraped data to avoid unnecessary requests.
  • Periodically check that your selectors are still working, as websites may change their markup. Log the scraped data and set up alerts if things look wrong.
  • Consider using a headless browser like Puppeteer if you need to scrape websites that heavily rely on JavaScript.
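
Here is a rough sketch of a request helper that combines several of these tips: a descriptive User-Agent, retries, and a growing delay between attempts (the header value and timings are arbitrary examples):

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function politeGet(url, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await axios.get(url, {
        headers: { 'User-Agent': 'my-scraper/1.0 (contact@example.com)' }
      });
    } catch (err) {
      if (attempt === retries) throw err;

      // Exponential backoff: wait 1s, 2s, 4s, ... before the next attempt
      await delay(1000 * 2 ** attempt);
    }
  }
}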

Limitations of Web Scraping

While web scraping is a powerful technique, it's important to be aware of its limitations and alternatives:

  • Scraping is brittle – even minor changes to a website's HTML can break your selectors.
  • Websites may deliberately block scraping, either through user agent detection, CAPTCHAs, or IP banning.
  • Scraped data may be inconsistent or low-quality compared to data collected via an API.
  • Large-scale scraping may violate a website's terms of service or even be illegal in some cases.

If a website provides a public API, it's usually better to use that instead of scraping. APIs tend to be faster and more stable than HTML parsing. Some websites offer paid API access with higher rate limits.

For personal use, scraping is generally fine as long as you limit the rate of requests and don't republish the data. For commercial use, read the website's terms carefully and get permission if needed.

Conclusion

Web scraping with Cheerio and Node.js is a powerful way to extract data from websites that don't provide a convenient API. With a little bit of JavaScript knowledge and some familiarity with CSS selectors, you can collect large amounts of data from any public web page.

In this guide, we covered the basics of setting up a Cheerio project, loading HTML from URLs, selecting elements to extract, navigating the DOM, scraping multiple pages, and outputting the results to structured formats. We also went over some best practices for responsible scraping.

To learn more about Cheerio, consult the official docs and API reference. There are dozens of other methods and options not covered here. You can build some really complex and powerful scrapers by combining Cheerio with other Node.js libraries.

I encourage you to practice web scraping by finding some websites relevant to your interests and writing code to extract meaningful insights from them. Pretty soon you'll be able to quickly gather data on any topic from around the web!
