The web is filled with endless amounts of valuable data – from stock prices to weather forecasts to real estate listings. Much of this data is open and accessible, if you know how to properly extract it.
That's where web scraping comes in.
Web scraping (also called web harvesting or web data extraction) is the process of gathering data from websites automatically through code. It allows you to harvest online data at scale for all kinds of useful purposes.
In this comprehensive guide, you'll learn:
- Why web scraping with Node.js is so powerful – The key benefits of using JS for your web scraping projects
- How to fetch any web page with Axios – Making GET requests for HTML using a handy Node module
- How to parse HTML and extract data with Cheerio – Querying the DOM and scraping page content
- Techniques for handling pagination, forms, APIs, and more – Moving beyond basic scraping to more complex sites
- Guidelines for avoiding getting blocked – The precautions you need to scrape responsibly
I'll share code snippets and examples for each concept so you can quickly apply these techniques in the real world.
Let's dive in!
Why Node.js is Built for Web Scraping
Before we look at the tools and techniques, it's important to understand why Node.js has become such a popular platform for building scrapers.
Here are some key advantages JavaScript and Node provide:
Asynchronous – Node uses asynchronous, event-driven architecture. All the I/O in Node is non-blocking. This allows you to make multiple requests in parallel very efficiently.
Scalable – Node's event loop can handle thousands of concurrent connections with minimal resource usage. Scrapers built on Node scale very well across servers.
Fast – Node is very high-performance for I/O-bound tasks like web scraping. The asynchronous requests make it faster than threaded platforms.
Ecosystem – With NPM, Node has access to hundreds of useful libraries like Axios, Cheerio and Puppeteer for easy web scraping.
According to the State of JavaScript 2020 survey, Node.js usage has grown over 3X in the last 5 years. It now powers numerous production web scraping services processing millions of requests per day.
If that's not enough proof, a number of benchmark tests have shown Node scrapers outperforming their Python and PHP counterparts for large-scale, I/O-bound data harvesting.
So if you're looking to build anything beyond a hobby project, Node is likely your best bet.
Meet Axios – The Ultra-Flexible HTTP Client
Making HTTP requests is obviously core to any web scraping project.
And for JavaScript, Axios is hands-down the most popular library for handling HTTP calls.
Here are just some of the reasons Axios is so highly-regarded:
- Lightweight – At only 12kb minified, it's very browser-friendly
- Isomorphic – Works on Node and in browsers, which is great for universal JS apps
- Promise-based – Uses modern async/await instead of callbacks
- Extensible – Very easy to modify requests and responses
- Well-documented – Everything is clearly explained in the docs
According to NPM, Axios sees over 17 million weekly downloads, making it the most used JS library of its kind.
And using it for web scraping couldn't be easier.
Let's walk through a quick example to fetch a webpage and extract the HTML:
const axios = require('axios');

async function scrapePage() {
  const response = await axios.get('https://example.com');
  const html = response.data;
  // Now we can pass the HTML to Cheerio etc...
}

scrapePage();
With Axios, pages are fetched effortlessly with minimal code.
You can also kick off multiple requests at once and await them together with Promise.all, so the pages download in parallel rather than one after another:
const [html1, html2] = await Promise.all([
  axios.get('https://page1.com').then((res) => res.data),
  axios.get('https://page2.com').then((res) => res.data)
]);
This asynchronous benefit starts to really add up when scraping larger sites.
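For example, here's a quick sketch (the URLs are placeholders) that pulls down a whole batch of pages concurrently:
const urls = [
  'https://example.com/page/1',
  'https://example.com/page/2',
  'https://example.com/page/3'
];

// All of the requests are in flight at the same time
const responses = await Promise.all(urls.map((url) => axios.get(url)));
const pages = responses.map((res) => res.data);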
According to benchmarks by ScrapingBee, Axios can fetch over 13,000 URLs in a minute on average, compared to just 400 with Python's Requests library – more than a 30X difference!
Of course, Axios has many more features we could discuss like custom headers, timeouts, error handling, request cancellation, and interceptors.
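As a quick taste, here's a small sketch (the header value and timeout are just examples) showing a custom User-Agent, a timeout, and basic error handling on a single request:
try {
  const response = await axios.get('https://example.com', {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)' },
    timeout: 10000 // give up after 10 seconds
  });
  console.log(response.status, response.data.length);
} catch (err) {
  // Axios throws for network errors and non-2xx responses
  console.error('Request failed:', err.message);
}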
But for now, we have our HTML – it's time to start data extraction with Cheerio.
Cheerio – Fast HTML Scraping with jQuery Power
To query and traverse our freshly scraped HTML documents, we'll need a DOM manipulation library.
For JavaScript, that library is Cheerio – essentially a stripped down version of jQuery designed specifically for use on the server.
According to NPM, Cheerio boasts over 4 million downloads a week, making it the most popular DOM toolkit for Node.
Let's first install it:
npm install cheerio
Now we can load HTML and start querying:
const cheerio = require('cheerio');
const $ = cheerio.load(html);
This loads the HTML into a Cheerio object that we can now manipulate.
Cheerio selectors work exactly like jQuery:
// Get all H1 tags
const h1s = $('h1');

// Get first paragraph
const para = $('p').first();
You can find elements, loop through nodes, access attributes like hrefs, and extract or modify text:
// Get text
const headingText = $('h1').text();

// Get attributes (chain .get() to turn the Cheerio result into a plain array)
const linkUrls = $('a').map((i, link) => $(link).attr('href')).get();

// Modify DOM
$('p').append('<span>New text</span>');
As you can see, Cheerio allows you to quickly query, parse, and transform scraped HTML using very terse and familiar syntax.
A Real-World Cheerio Scraping Example
Let's see a real-world example of using Cheerio to scrape product data from an ecommerce page.
We'll fetch the HTML with Axios first:
const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://www.example.com/products/widget123';

// (inside an async function, as in the earlier example)
const { data: html } = await axios.get(url);
Now we can use Cheerio on the HTML:
// Load HTML
const $ = cheerio.load(html);

// Extract product data
const name = $('#productName').text();
const price = $('#productPrice').text();
const image = $('#productImg').attr('src');
const description = $('#productDescription').text();

// Output as object
const product = {
  name,
  price,
  image,
  description
};

console.log(product);
And there we have it – all the core product data extracted!
As you can imagine, you can use these exact same techniques to scrape practically any data from any site.
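The same approach extends naturally to pages that list many items. Continuing with the same Cheerio instance, here's a sketch (the .product-card, .name, and .price selectors are hypothetical) that collects every product on a category page into an array:
const products = [];

$('.product-card').each((i, el) => {
  products.push({
    name: $(el).find('.name').text().trim(),
    price: $(el).find('.price').text().trim(),
    url: $(el).find('a').attr('href')
  });
});

console.log(`Scraped ${products.length} products`);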
Cheerio + Axios Recipes for Common Scraping Tasks
Let's quickly summarize some useful code "recipes" for common scraping cases:
Fetch a page and get HTML
const { data: html } = await axios.get(url);
Load in Cheerio
const $ = cheerio.load(html);
Loop through elements
$('div.myClass').each((i, el) => {
  // $(el) is the current node
});
Get text
const text = $('h1').text();
Get attribute
const href = $('a').attr('href');
Extract embedded JSON data (e.g. from a JSON-LD script tag)
const data = JSON.parse($('script[type="application/ld+json"]').first().text());
This covers some of the most used patterns when pairing Axios and Cheerio for web scraping.
Now let's look at some more advanced topics.
Handling Pagination
Often you'll need to scrape dozens or hundreds of pages from a site.
Let's look at an example approach to handle pagination in a scalable way.
First we'll write a scrapePage function:
const scrapePage = async (pageNum) => {
  const url = `https://www.site.com/products?page=${pageNum}`;
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // Scrape data from this page here...

  // Check for a next page link
  const nextPage = $('.pagination .nextLink');
  if (nextPage.length) {
    return scrapePage(pageNum + 1);
  }
};
We dynamically build the page URL with the changing page number parameter.
After scraping each page, we check if the next page link exists – if so, we recursively call scrapePage again to continue pagination.
This allows us to scrape thousands of pages with minimal effort!
To start the process, we just call:
scrapePage(1);
Recursive async patterns like this are immensely powerful when dealing with multi-page sites.
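In a real scraper you'll also want to collect results as you go. One simple option (a sketch, using a hypothetical .product .name selector) is to thread an array through the recursive calls:
const scrapeAllPages = async (pageNum = 1, results = []) => {
  const { data: html } = await axios.get(`https://www.site.com/products?page=${pageNum}`);
  const $ = cheerio.load(html);

  // Collect whatever you need from this page
  $('.product .name').each((i, el) => results.push($(el).text().trim()));

  // Keep going while a next page link exists
  if ($('.pagination .nextLink').length) {
    return scrapeAllPages(pageNum + 1, results);
  }
  return results;
};

scrapeAllPages().then((items) => console.log(`Scraped ${items.length} items`));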
Submitting Forms and Logging In
At times, you'll need to submit forms to access data behind logins.
Axios has a flexible axios.post() method that allows easy form submission:
const params = {
  username: 'myuser',
  password: 'mypass'
};

// In Node you need an absolute URL (or a configured baseURL)
await axios.post('https://example.com/login', params);
To handle logins, you would follow these steps:
- Fetch the page with the login form
- Use Cheerio to extract the CSRF token
- Construct the params object based on the form fields
- Make a POST request to the form action URL
- Save cookies from the response
- Re-use cookies for any future requests
This allows you to automate any form-based logins and access gated data.
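Here's a rough sketch of that flow with Axios and Cheerio. The form field names, the CSRF selector, and the URLs are all hypothetical – you'd adapt them to the actual login form you're targeting:
const axios = require('axios');
const cheerio = require('cheerio');

async function login() {
  // 1. Fetch the page containing the login form
  const loginPage = await axios.get('https://example.com/login');

  // 2. Extract the CSRF token from a hidden input (selector is hypothetical)
  const $ = cheerio.load(loginPage.data);
  const csrfToken = $('input[name="csrf_token"]').val();

  // 3 & 4. Build the form body and POST it to the form action URL
  const form = new URLSearchParams({
    username: 'myuser',
    password: 'mypass',
    csrf_token: csrfToken
  });
  const response = await axios.post('https://example.com/login', form);

  // 5. Save session cookies from the response
  // (if the site redirects after login, you may need maxRedirects: 0 to see them)
  const cookies = (response.headers['set-cookie'] || [])
    .map((c) => c.split(';')[0])
    .join('; ');

  // 6. Re-use the cookies on later requests to access gated pages
  const profile = await axios.get('https://example.com/account', {
    headers: { Cookie: cookies }
  });
  return profile.data;
}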
For example, Apify was able to use these techniques to build a scalable web scraper for LinkedIn capable of logging in and extracting user profiles behind the login wall.
Filling out Forms with Puppeteer
For trickier forms, an alternative is using Puppeteer – a Node library that provides a high-level API for controlling headless Chrome.
With Puppeteer, you can programmatically fill out forms and click buttons just like a real user:
const puppeteer = require('puppeteer');

// Open the login page
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://website.com/login');

// Enter credentials
await page.type('#username', 'myuser');
await page.type('#password', 'pass1234');

// Click submit and wait for the post-login navigation
await Promise.all([
  page.waitForNavigation(),
  page.click('.submit-btn')
]);

// We're now logged in, with session cookies stored in the browser context
This allows you to automate pretty much any action a user can take manually in the browser.
The downside is that Puppeteer scripts run significantly slower than raw Axios requests.
So it's best to save headless browsers for special cases and stick with Axios for general crawling.
Conquering Client-Side JavaScript Sites
Increasingly, websites rely on client-side JavaScript to render content.
The data is embedded in scripts and rendered dynamically in the browser only after full page load.
For these modern sites, basic HTTP requests and DOM parsing won't reveal the underlying data.
To extract data from JavaScript web apps, you need to use a headless browser like Puppeteer to evaluate the scripts in a runtime environment.
For example, to scrape data from a React app:
const puppeteer = require('puppeteer');

// Launch headless Chrome
const browser = await puppeteer.launch();

// Create a page and load the URL
const page = await browser.newPage();
await page.goto('https://app.com');

// Wait for the client-side JavaScript to run and render the HTML
// (page.waitFor was removed in newer Puppeteer versions – prefer
// page.waitForSelector on an element the app renders, or a fixed delay)
await new Promise((resolve) => setTimeout(resolve, 2000)); // experiment with timing

// Get the rendered HTML source
const html = await page.content();

// Parse with Cheerio as normal!
Now the rendered HTML will contain the data injected by React, Angular, and other frameworks, ready for you to extract.
So in summary:
- Use Puppeteer to load the JS pages
- Wait for the page to fully render
- Extract generated HTML source
- Parse with Cheerio as you normally would!
This pattern opens up the entirety of the modern web to your scraper.
Avoiding Getting Blocked – 6 Pro Tips
When scraping at scale, you may encounter bot blocking measures like IP bans, blocked endpoints, or CAPTCHAs.
Here are six tips to help you avoid common pitfalls:
1. Check for red flags – Monitor for error responses (403, 429, etc) and CAPTCHAs to detect blocks.
2. Limit request rate – Add delays and don't send more than a few requests per second. Mimic organic browsing patterns.
3. Randomize user agents – Rotate a set of realistic user agents instead of using the default.
4. Use proxies – Route requests through multiple residential proxy IPs to distribute load.
5. Retry blocked requests – Website blocks are often temporary. Retry later with a delay.
6. Analyze headers – Inspect response headers and browser fingerprints to understand a site's protections.
With proper care, most sites can be scraped responsibly without triggering bot defenses. Monitor closely and adjust your methods as needed.
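Here's a small sketch that combines a few of these tips: a random delay, a rotating User-Agent, and a simple retry on 403/429 responses. The user-agent strings, timing values, and the politeGet helper name are just illustrative:
const axios = require('axios');

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
];

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeGet(url, attempt = 1) {
  // Throttle: wait 1-3 seconds between requests to mimic organic browsing
  await sleep(1000 + Math.random() * 2000);

  try {
    return await axios.get(url, {
      headers: { 'User-Agent': USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)] }
    });
  } catch (err) {
    const status = err.response && err.response.status;
    // Blocks are often temporary – back off and retry a couple of times
    if ((status === 403 || status === 429) && attempt < 3) {
      await sleep(5000 * attempt);
      return politeGet(url, attempt + 1);
    }
    throw err;
  }
}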
For larger projects, commercial proxy services like BrightData can be worthwhile to hide your scraping activity through millions of IPs.
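If you do go the proxy route, Axios can send a request through a proxy with its proxy option. The host, port, and credentials below are placeholders:
const response = await axios.get('https://example.com', {
  proxy: {
    protocol: 'http',
    host: 'proxy.example.com', // placeholder proxy host
    port: 8080,
    auth: { username: 'proxyUser', password: 'proxyPass' }
  }
});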
Final Thoughts
And that wraps up our complete guide to web scraping with Node.js, Axios and Cheerio!
We covered everything from core concepts to advanced techniques including:
- Benefits of using Node.js for web scraping
- Making fast requests with Axios
- Powerful DOM parsing with Cheerio
- Recipes for common scraping tasks
- Strategies for pagination, forms, logins, and JavaScript sites
- Guidelines for avoiding bot blocks
The web is a goldmine of data if you know how to extract it properly.
I hope you now feel empowered to build scrapers for any purpose using the concepts and tools we discussed.
With a bit of practice, you'll be scraping like a pro in no time!
Let me know if you have any other questions. And happy scraping!