Websites often contain valuable data presented in HTML tables. Whether it's financial data, product information, sports statistics, or other structured datasets, being able to programmatically extract that tabular data opens up many possibilities for analysis, aggregation, and building new applications.
In this guide, we'll walk through how to scrape table data from a webpage using NodeJS. By the end, you'll have a working script that fetches a webpage, locates a table, extracts its contents, and saves the data in a structured format. Let's get started!
Setting Up the Project
First, create a new directory for the project and initialize a NodeJS project inside it:
mkdir table-scraper
cd table-scraper
npm init -y
Next, install the dependencies we'll need:
npm install axios cheerio
We'll use Axios to fetch the webpage HTML and Cheerio to parse and extract data from it. Cheerio provides a jQuery-like syntax for traversing and manipulating the HTML DOM.
Fetching the Webpage
Create a new file scraper.js and add the following code to fetch the demo webpage containing the table we want to scrape:
const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://demo.scrapingbee.com/table_content.html';

async function fetchPage(url) {
  const response = await axios.get(url);
  return cheerio.load(response.data);
}

fetchPage(url).then($ => {
  // We'll parse the page here
});
This code fetches the HTML of the specified URL using Axios and passes the loaded Cheerio instance to the promise callback, where we'll do the actual scraping.
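Note that Axios rejects its promise on network errors and on non-2xx responses, so failures surface as exceptions. As a purely optional hardening step (the rest of this guide keeps the simpler version above), here's a minimal sketch with a try/catch and a request timeout; the 10-second value is an arbitrary choice, not something the demo page requires:

async function fetchPage(url) {
  try {
    // Fail fast instead of hanging indefinitely on a slow server
    const response = await axios.get(url, { timeout: 10000 });
    return cheerio.load(response.data);
  } catch (err) {
    console.error(`Failed to fetch ${url}:`, err.message);
    throw err; // let the caller decide whether to retry
  }
}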
Parsing the Table
To extract the table data, we first need to locate the <table> element on the page. Using your browser's developer tools, inspect the target table and find a suitable CSS selector for it. In this case, the table has the class .BasicTable-table, which we can use.
fetchPage(url).then($ => {
  const table = $('.BasicTable-table');
  // We'll extract data from the table here
});
Now that we have a reference to the table element, we can extract its data. To get the table headers, we'll find all <th> elements in the table:
const headers = [];
table.find('th').each((i, el) => {
  headers.push($(el).text().trim());
});
To get the table rows, we'll find all <tr> elements, convert them to a plain array, and map over it to extract each row's <td> cell values. We also filter out any rows without <td> cells, which skips the header row:
// Use .toArray() so each row keeps its own array of cells
// (Cheerio's .map would flatten the nested arrays into one list)
const rows = table.find('tr').toArray().map(el => {
  return $(el).find('td').map((j, cell) => $(cell).text().trim()).get();
}).filter(cells => cells.length > 0); // drop the header row, which has no <td> cells
We now have arrays of the table headers and row data. We can combine them into an array of objects where each object represents a row:
const tableData = rows.map(row => {
  const rowObject = {};
  row.forEach((cell, i) => {
    rowObject[headers[i]] = cell;
  });
  return rowObject;
});
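As an aside, if you're on Node 12 or newer, Object.fromEntries offers a more concise way to build the same row objects. This is just an equivalent alternative, not a required change:

const tableData = rows.map(row =>
  Object.fromEntries(row.map((cell, i) => [headers[i], cell]))
);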
Saving the Data
Finally, let's save our extracted table data to a JSON file:
const fs = require('fs');

fs.writeFile('table-data.json', JSON.stringify(tableData, null, 2), err => {
  if (err) {
    console.error('Error writing file', err);
  } else {
    console.log('Successfully wrote file');
  }
});
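If you'd rather avoid callbacks for file I/O as well, the built-in fs/promises module offers an equivalent, optional alternative:

const fsp = require('fs/promises'); // Node 14+; on older versions use require('fs').promises

fsp.writeFile('table-data.json', JSON.stringify(tableData, null, 2))
  .then(() => console.log('Successfully wrote file'))
  .catch(err => console.error('Error writing file', err));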
Here's the full code for our scraper.js file:
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

const url = 'https://demo.scrapingbee.com/table_content.html';

async function fetchPage(url) {
  const response = await axios.get(url);
  return cheerio.load(response.data);
}

fetchPage(url).then($ => {
  const table = $('.BasicTable-table');

  const headers = [];
  table.find('th').each((i, el) => {
    headers.push($(el).text().trim());
  });

  // Use .toArray() so each row keeps its own array of cells
  const rows = table.find('tr').toArray().map(el => {
    return $(el).find('td').map((j, cell) => $(cell).text().trim()).get();
  }).filter(cells => cells.length > 0); // drop the header row, which has no <td> cells

  const tableData = rows.map(row => {
    const rowObject = {};
    row.forEach((cell, i) => {
      rowObject[headers[i]] = cell;
    });
    return rowObject;
  });

  fs.writeFile('table-data.json', JSON.stringify(tableData, null, 2), err => {
    if (err) {
      console.error('Error writing file', err);
    } else {
      console.log('Successfully wrote file');
    }
  });
}).catch(err => {
  console.error('Scraping failed', err);
});
Run the script with node scraper.js and it will save the extracted table data to table-data.json. In this case, with the demo page, you should end up with an array of 100 objects representing NASDAQ stock prices:
[
  {
    "SYMBOL": "AMD",
    "NAME": "Advanced Micro Devices Inc",
    "PRICE": "94.82",
    "CHANGE": "-3.98",
    "%CHANGE": "-4.03"
  },
  ...
]
Next Steps
This example provides a great starting point for extracting tabular data from webpages using NodeJS. To adapt it for your own projects, you'll need to swap out the URL and CSS selectors to match your target webpage.
Some additional considerations and techniques to explore:
- Pagination: Many data tables are split across multiple pages. You'll need to find the pagination links and navigate through each page to get the full dataset. Watch out for URL patterns like ?page=1, ?p=1, etc. (see the sketch after this list).
- Inconsistent Table Structures: Not all HTML tables follow a neat <table>/<th>/<tr>/<td> structure. Be prepared to handle missing elements, colspan/rowspan attributes, and nested tables.
- Rate Limiting: Be considerate when scraping data and limit your request rate to avoid overwhelming servers. Consider adding delays between requests, as the sketch below does.
- Rotating User Agents and IP Addresses: Some websites attempt to block web scrapers. Rotating your User-Agent header and IP address can help circumvent those protections.
- Robust Error Handling: Network requests and HTML parsing can fail for many reasons. Make sure to implement proper error handling, logging, and retrying of failed requests.
- Data Cleanup: Table cells often contain inconsistent formatting, extra whitespace, or combining characters. You may need to postprocess your extracted data with regular expressions or other string manipulation (like the toNumber helper in the sketch below) to get it into a consistent format.
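To make the pagination, rate-limiting, and cleanup points concrete, here's a minimal sketch. The paginated URL pattern (https://example.com/stocks?page=N), the page count, and the one-second delay are all hypothetical placeholders; the row-extraction logic is the same pattern used above, so adapt the URL, selector, and page count to your own target:

const axios = require('axios');
const cheerio = require('cheerio');

// Hypothetical paginated target -- swap in your real URL pattern,
// selector, and page count
const BASE_URL = 'https://example.com/stocks?page=';
const TOTAL_PAGES = 5;

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Example cleanup helper: keep only digits, minus signs, and decimal points
const toNumber = s => parseFloat(s.replace(/[^0-9.-]/g, ''));

async function scrapeAllPages() {
  const allRows = [];
  for (let page = 1; page <= TOTAL_PAGES; page++) {
    const response = await axios.get(`${BASE_URL}${page}`);
    const $ = cheerio.load(response.data);

    // Same row-extraction pattern as the single-page scraper above
    $('table tr').toArray().forEach(el => {
      const cells = $(el).find('td').map((j, cell) => $(cell).text().trim()).get();
      if (cells.length > 0) allRows.push(cells);
    });

    await sleep(1000); // pause between requests to avoid hammering the server
  }
  return allRows;
}

scrapeAllPages().then(rows => {
  console.log(`Scraped ${rows.length} rows`);
  // e.g. toNumber('94.82') === 94.82, toNumber('-4.03%') === -4.03
});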
Extracting data from websites can be a powerful technique for building new datasets and applications. With the NodeJS ecosystem and libraries like Axios and Cheerio, you can build scrapers to extract data from all kinds of webpages. Just be sure to respect website terms of service and don't abuse your scraping abilities.
Happy scraping!