How to Scrape Tables with Cheerio: The Ultimate Guide
If you've ever needed to extract data from HTML tables on a website, you know it can be a tedious and time-consuming task to copy and paste the data manually. Fortunately, there's a better way! With the help of the Cheerio library, you can easily scrape and parse table data using JavaScript and Node.js.
In this ultimate guide, we'll dive deep into scraping HTML tables with Cheerio. Whether you're new to web scraping or an experienced pro, you'll learn everything you need to know to extract data from even the most complex tables. Let's get started!
What is Cheerio?
Cheerio is a popular and powerful Node.js library that allows you to parse and manipulate HTML documents using a jQuery-like syntax. It provides a simple and intuitive way to navigate and extract data from HTML by leveraging CSS selectors.
Unlike browser automation tools like Puppeteer or Selenium, Cheerio doesn't actually render the HTML or execute any JavaScript on the page. Instead, it parses the raw HTML string and creates an in-memory DOM tree that you can traverse and manipulate. This makes Cheerio very fast and efficient for scraping tasks.
Loading HTML Content
Before we can start scraping tables, we need to load our HTML content into a Cheerio object. Here's how to do it:
const cheerio = require('cheerio');
const html = `
<html>
<body>
<div class="content">
<p>This is a paragraph.</p>
</div>
</body>
</html>
`;
const $ = cheerio.load(html);
In this example, we first import the cheerio module. Then we define our HTML content as a template literal string. Finally, we call the cheerio.load() function, passing it our HTML string. This loads the HTML into a Cheerio object, which gets assigned to the $ variable (by convention). We can now use the $ object to navigate and extract data from our parsed HTML just like we would with jQuery in the browser.
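In practice, you'll usually fetch the HTML from a live page rather than hard-coding it. Here's a minimal sketch of a fetching helper, assuming Node 18+ where fetch is available globally (on older versions you'd reach for axios or node-fetch instead); fetchHtml and the example URL are illustrative names, not part of Cheerio itself:

```javascript
// Fetch raw HTML over HTTP so it can be handed to cheerio.load().
// Assumes Node 18+, which ships a global fetch.
async function fetchHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage sketch:
// const $ = cheerio.load(await fetchHtml('https://example.com'));
```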
Traversing & Manipulating Elements
Cheerio provides a full suite of methods for traversing and manipulating elements, much like jQuery. Here are a few key methods you'll use frequently:
- find(selector) – Get the descendants of the current selection that match the given selector
- parent() – Get the parent of the current selection
- next() – Get the immediately following sibling of the current selection
- prev() – Get the immediately preceding sibling of the current selection
- first() – Reduce the current selection to the first element
- last() – Reduce the current selection to the last element
- eq(index) – Reduce the current selection to the element at the given index
- text() – Get the combined text contents of the current selection
- attr(name) – Get the value of an attribute for the first element in the selection
Here's a quick example to illustrate some of these methods, using the HTML we loaded earlier:
$('p').text(); // "This is a paragraph."
$('div').find('p').text(); // "This is a paragraph."
$('p').parent().attr('class'); // "content"
Scraping a Basic Table
Now let's see how we can apply these concepts to scraping an HTML table. Consider the following simple table markup:
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Occupation</th>
</tr>
<tr>
<td>John Doe</td>
<td>35</td>
<td>Software Engineer</td>
</tr>
<tr>
<td>Jane Doe</td>
<td>32</td>
<td>Data Scientist</td>
</tr>
</table>
To scrape this table with Cheerio, we can use the following code:
const tableData = [];
$('table tr').each((index, element) => {
  const tds = $(element).find('td');
  const row = {};
  $(tds).each((i, element) => {
    row[`column${i}`] = $(element).text().trim();
  });
  if (Object.keys(row).length > 0) {
    tableData.push(row);
  }
});
console.log(tableData);
Let's break this down step-by-step:
- First, we initialize an empty tableData array to store our extracted table rows.
- Next, we use $('table tr') to select all the <tr> elements inside any <table> on the page. We iterate over these rows using Cheerio's each() method.
- For each <tr>, we find all its <td> child elements using $(element).find('td'). We assign this collection of cells to the tds variable.
- We create an empty row object to represent the data for the current row.
- We iterate over the tds collection, again using each(). For each <td> cell, we extract its trimmed text content using $(element).text().trim(). We assign this text to the row object using a dynamic key like column0, column1, etc.
- After iterating through all the cells, we check if the row object has any keys. If so, that means it's a data row (not a header row), so we push it into our tableData array.
- Finally, we log our tableData array, which contains an object for each table row.
Here's what the output would look like:
[
  { column0: 'John Doe', column1: '35', column2: 'Software Engineer' },
  { column0: 'Jane Doe', column1: '32', column2: 'Data Scientist' }
]
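The generic column0, column1 keys aren't very descriptive. One option is to scrape the table's <th> cells for header names first, then rename the keys. Here's a small sketch of that renaming step in plain JavaScript; applyHeaders is a hypothetical helper name introduced here for illustration:

```javascript
// Rename generic columnN keys to real header names.
// headers would typically come from scraping the table's <th> cells.
function applyHeaders(rows, headers) {
  return rows.map((row) =>
    Object.fromEntries(
      Object.entries(row).map(([key, value]) => {
        const index = Number(key.replace('column', ''));
        // Fall back to the original key if no header exists at that index.
        return [headers[index] ?? key, value];
      })
    )
  );
}

const headers = ['Name', 'Age', 'Occupation'];
const rows = [{ column0: 'John Doe', column1: '35', column2: 'Software Engineer' }];
console.log(applyHeaders(rows, headers));
// → [ { Name: 'John Doe', Age: '35', Occupation: 'Software Engineer' } ]
```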
Handling Edge Cases
Of course, real-world tables are rarely as straightforward as the example above. Here are a few best practices to keep in mind when scraping tables with Cheerio:
- Check if elements exist before extracting data from them to avoid errors. Use methods like has(), is() and length to verify presence.
- Handle colspan and rowspan attributes, which make cells span multiple columns or rows. You may need to adjust your logic to correctly associate data with rows/columns.
- Be aware of nested tables inside <td> elements. You may need to recursively traverse and extract data from them.
- Beware of inconsistent rows/columns where certain cells are missing. Ensure your code can handle these cases gracefully without breaking.
- Look out for header, footer and caption elements that may not follow the regular <th>/<td> structure. Treat them as special cases in your code if needed.
- Anticipate variation in styling, formatting, whitespace, etc. Use helper functions for cleaning and transforming extracted text as needed.
Here's an example of defensively checking for existence before extracting cell data:
$('table tr').each((index, element) => {
  const tds = $(element).find('td');
  if (tds.length > 0) {
    const row = {};
    $(tds).each((i, element) => {
      const cell = $(element);
      if (cell.length > 0) {
        row[`column${i}`] = cell.text().trim();
      }
    });
    tableData.push(row);
  }
});
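For the inconsistent-rows case mentioned above, a simple normalization pass can pad short rows to a fixed width before further processing. This is a plain-JavaScript sketch; normalizeRows is a hypothetical helper name:

```javascript
// Pad (or truncate) each row of cell text to a fixed column count so
// downstream code can rely on a consistent shape.
function normalizeRows(rows, width, filler = '') {
  return rows.map((row) => {
    const normalized = row.slice(0, width);
    while (normalized.length < width) {
      normalized.push(filler);
    }
    return normalized;
  });
}

console.log(normalizeRows([['a', 'b'], ['c']], 2));
// → [ [ 'a', 'b' ], [ 'c', '' ] ]
```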
Cheerio vs Browser Automation
You may be wondering how scraping tables with Cheerio compares to using browser automation tools like Puppeteer. Here's a quick rundown:
Advantages of Cheerio:
- Faster since it doesn't require rendering the full page and executing JS
- Simpler API that's easier to get started with, especially if you're familiar with jQuery
- Cheaper since you don't need to run an actual browser, so less memory/CPU required
Advantages of Browser Automation:
- Can scrape content that's dynamically loaded via JavaScript
- Supports interacting with page elements like clicking, typing, etc.
- Renders the page visually which can be helpful for debugging
In general, if the table data you need to scrape is available in the raw HTML source, Cheerio is the way to go. But if the data gets added dynamically after page load, you'll likely need to use a browser automation tool.
Additional Resources
There are several companion libraries that are often used alongside Cheerio for scraping tasks:
- axios or node-fetch for making HTTP requests to fetch the HTML content
- tabletojson for directly converting HTML tables to JSON data
- cheerio-tableparser for parsing tables into a 2D array format
Be sure to check them out if you need additional capabilities beyond what plain Cheerio provides.
Conclusion
Hopefully this guide has given you a thorough understanding of how to scrape tables using Cheerio. With the techniques covered here, you should be able to extract data from virtually any HTML table.
As you put these concepts into practice, be sure to keep the legal and ethical aspects of web scraping in mind. Most websites have terms of service that govern if and how you're allowed to scrape them. Be a good citizen and respect these rules to avoid any trouble.
Now you have all the knowledge you need to become a table scraping pro with Cheerio. So what are you waiting for? Get out there and liberate that table data!