How to Scrape Tables with Cheerio: The Ultimate Guide

If you've ever needed to extract data from HTML tables on a website, you know it can be a tedious and time-consuming task to copy and paste the data manually. Fortunately, there's a better way! With the help of the Cheerio library, you can easily scrape and parse table data using JavaScript and Node.js.

In this ultimate guide, we'll dive deep into scraping HTML tables with Cheerio. Whether you're new to web scraping or an experienced pro, you'll learn everything you need to know to extract data from even the most complex tables. Let's get started!

What is Cheerio?

Cheerio is a popular and powerful Node.js library that allows you to parse and manipulate HTML documents using a jQuery-like syntax. It provides a simple and intuitive way to navigate and extract data from HTML by leveraging CSS selectors.

Unlike browser automation tools like Puppeteer or Selenium, Cheerio doesn't actually render the HTML or execute any JavaScript on the page. Instead, it parses the raw HTML string and creates an in-memory DOM tree that you can traverse and manipulate. This makes Cheerio very fast and efficient for scraping tasks.

Loading HTML Content

Before we can start scraping tables, we need to load our HTML content into a Cheerio object. Here's how to do it:

const cheerio = require('cheerio');

const html = `
  <html>
    <body>
      <h1>Hello World</h1>
      <div class="content">
        <p>This is a paragraph.</p>
      </div>
    </body>
  </html>
`;

const $ = cheerio.load(html);

In this example, we first import the cheerio module. Then we define our HTML content as a template literal string.

Finally, we call the cheerio.load() function, passing it our HTML string. This loads the HTML into a Cheerio object which gets assigned to the $ variable (by convention).

We can now use the $ object to navigate and extract data from our parsed HTML just like we would with jQuery in the browser.
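
In a real scraping script you'll usually fetch the HTML from a live page rather than hard-coding it. Here's a minimal sketch using Node's built-in fetch (available in Node 18+); the URL is just a placeholder for whatever page you actually want to scrape:

const cheerio = require('cheerio');

// Placeholder URL -- swap in the page you actually want to scrape
const url = 'https://example.com/page-with-a-table';

async function loadPage() {
  const response = await fetch(url); // global fetch ships with Node 18+
  const html = await response.text(); // raw HTML as a string
  return cheerio.load(html); // same $ object as before
}

loadPage().then(($) => {
  console.log($('title').text());
});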

Traversing & Manipulating Elements

Cheerio provides a full suite of methods for traversing and manipulating elements, much like jQuery. Here are a few key methods you'll use frequently:

  • find(selector) – Get the descendants of the current selection that match the given selector
  • parent() – Get the parent of the current selection
  • next() – Get the immediately following sibling of the current selection
  • prev() – Get the immediately preceding sibling of the current selection
  • first() – Reduce the current selection to the first element
  • last() – Reduce the current selection to the last element
  • eq(index) – Reduce the current selection to the element at the given index
  • text() – Get the combined text contents of the current selection
  • attr(name) – Get the value of an attribute for the first element in the selection

Here's a quick example, run against the HTML we loaded above, to illustrate some of these methods:

$('h1').text(); // "Hello World"
$('div').find('p').text(); // "This is a paragraph."
$('h1').parent().attr('class'); // undefined (the <body> has no class attribute)
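
A few more of the listed methods in action, run against the same HTML (just a quick sketch):

$('div').find('p').first().text(); // "This is a paragraph."
$('p').eq(0).text(); // "This is a paragraph."
$('h1').next().attr('class'); // "content"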

Scraping a Basic Table

Now let's see how we can apply these concepts to scraping an HTML table. Consider the following simple table markup:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th> 
    <th>Occupation</th>
  </tr>
  <tr>
    <td>John Doe</td>
    <td>35</td>
    <td>Software Engineer</td>
  </tr>
  <tr>  
    <td>Jane Doe</td>
    <td>32</td>
    <td>Data Scientist</td>  
  </tr>
</table>

To scrape this table with Cheerio, we can use the following code:

const tableData = [];

$('table tr').each((index, element) => {
  const tds = $(element).find('td');
  const row = {};

  $(tds).each((i, element) => {
    row[`column${i}`] = $(element).text().trim();
  });

  if (Object.keys(row).length > 0) {
    tableData.push(row);
  }
});

console.log(tableData);

Let's break this down step-by-step:

  1. First, we initialize an empty tableData array to store our extracted table rows.

  2. Next, we use $('table tr') to select all the <tr> elements inside any <table> on the page. We iterate over these rows using Cheerio's each() method.

  3. For each <tr>, we find all its <td> child elements using $(element).find('td'). We assign this collection of cells to the tds variable.

  4. We create an empty row object to represent the data for the current row.

  5. We iterate over the tds collection, again using each(). For each <td> cell, we extract its trimmed text content using $(element).text().trim(). We assign this text to the row object using a dynamic key like column0, column1, etc.

  6. After iterating through all the cells, we check if the row object has any keys. If so, that means it's a data row (not a header row), so we push it into our tableData array.

  7. Finally, we log out our tableData array which contains an object for each table row.

Here's what the output would look like:

[
  { column0: 'John Doe', column1: '35', column2: 'Software Engineer' },
  { column0: 'Jane Doe', column1: '32', column2: 'Data Scientist' }
]
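
The column0, column1 keys work, but they aren't very descriptive. As a variation, you could read the header row's <th> text and use it as the object keys instead. Here's a hedged sketch, building on the same $ and table as above:

// Collect the header labels from the <th> cells
const headers = [];
$('table tr th').each((i, el) => {
  headers.push($(el).text().trim());
});

const rows = [];
$('table tr').each((index, element) => {
  const tds = $(element).find('td');
  if (tds.length === 0) return; // skip the header row

  const row = {};
  tds.each((i, el) => {
    // Fall back to a generic key if a header is missing
    row[headers[i] || `column${i}`] = $(el).text().trim();
  });

  rows.push(row);
});

console.log(rows);
// [
//   { Name: 'John Doe', Age: '35', Occupation: 'Software Engineer' },
//   { Name: 'Jane Doe', Age: '32', Occupation: 'Data Scientist' }
// ]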

Handling Edge Cases

Of course, real-world tables are rarely as straightforward as the example above. Here are a few best practices to keep in mind when scraping tables with Cheerio:

  • Check if elements exist before extracting data from them to avoid errors. Use methods like has() and is(), or check the length property, to verify presence.

  • Handle colspan and rowspan attributes, which make cells span multiple rows or columns. You may need to adjust your logic to correctly associate data with rows/columns (see the colspan sketch after the example below).

  • Be aware of nested tables inside <td> elements. You may need to recursively traverse and extract data from them.

  • Beware of inconsistent rows/columns where certain cells are missing. Ensure your code can handle these cases gracefully without breaking.

  • Look out for headers, footers and caption elements that may not follow the regular <th>/<td> structure. Treat them as special cases in your code if needed.

  • Anticipate variation in styling, formatting, whitespace, etc. Use helper functions for cleaning and transforming extracted text as needed.

Here's an example of defensively checking for existence before extracting cell data:

$('table tr').each((index, element) => {
  const tds = $(element).find('td');

  if (tds.length > 0) {
    const row = {};

    $(tds).each((i, element) => {
      const cell = $(element);

      if (cell.length > 0) { 
        row[`column${i}`] = cell.text().trim();
      }
    });

    tableData.push(row);
  }
});
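
The colspan case mentioned above deserves its own example. One common approach is to repeat a spanning cell's value once per column it covers, so that cell indexes stay aligned across rows. A rough sketch (rowspan needs extra bookkeeping across rows and isn't handled here):

$('table tr').each((index, element) => {
  const cells = [];

  $(element).find('th, td').each((i, el) => {
    const cell = $(el);
    const span = parseInt(cell.attr('colspan'), 10) || 1; // default to a single column
    const text = cell.text().trim();

    // Push the value once per spanned column so indexes line up
    for (let c = 0; c < span; c++) {
      cells.push(text);
    }
  });

  console.log(cells);
});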

Cheerio vs Browser Automation

You may be wondering how scraping tables with Cheerio compares to using browser automation tools like Puppeteer. Here's a quick rundown:

Advantages of Cheerio:

  • Faster since it doesn't require rendering the full page and executing JS
  • Simpler API that's easier to get started with, especially if you're familiar with jQuery
  • Cheaper since you don't need to run an actual browser, so less memory/CPU required

Advantages of Browser Automation:

  • Can scrape content that's dynamically loaded via JavaScript
  • Supports interacting with page elements like clicking, typing, etc.
  • Renders the page visually which can be helpful for debugging

In general, if the table data you need to scrape is available in the raw HTML source, Cheerio is the way to go. But if the data gets added dynamically after page load, you'll likely need to use a browser automation tool.
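
If you do end up needing a browser, the two approaches combine nicely: let Puppeteer render the page, then hand the resulting HTML to Cheerio for the actual parsing. A minimal sketch, assuming Puppeteer is installed and the URL is a placeholder:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeDynamicTable(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // wait for JS-loaded content
  const html = await page.content(); // fully rendered HTML
  await browser.close();

  // From here on, scrape the table exactly as shown earlier
  const $ = cheerio.load(html);
  return $('table tr').length;
}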

Additional Resources

Cheerio is usually paired with a few companion libraries for scraping tasks: an HTTP client such as axios or Node's built-in fetch for downloading pages, and a browser automation tool like Puppeteer (covered above) when the content is rendered with JavaScript.

Be sure to check them out if you need additional capabilities beyond what plain Cheerio provides.

Conclusion

Hopefully this guide has given you a thorough understanding of how to scrape tables using Cheerio. With the techniques covered here, you should be able to extract data from virtually any HTML table.

As you put these concepts into practice, be sure to keep the legal and ethical aspects of web scraping in mind. Most websites have terms of service that govern if and how you're allowed to scrape them. Be a good citizen and respect these rules to avoid any trouble.

Now you have all the knowledge you need to become a table scraping pro with Cheerio. So what are you waiting for? Get out there and liberate that table data!
