
Web Scraping with JavaScript and Node.js

Web scraping is the process of extracting data from websites automatically. With the rise of dynamic web content and client-side rendering, JavaScript has become an essential tool for scraping the modern web. In this comprehensive guide, we'll explore how to scrape websites using JavaScript and Node.js.

Why Use JavaScript for Web Scraping?

Here are some of the key advantages of using JavaScript for web scraping:

  • Native to the web: JavaScript is the language of the browser, so the same skills and code carry over whether you parse HTML in Node.js or run logic inside a rendered page via browser automation.

  • Asynchronous: JavaScript is non-blocking and asynchronous by default, allowing for concurrent requests and fast parallel scraping.

  • npm ecosystem: Access to hundreds of useful scraping packages on npm.

  • Popular language: JavaScript skills are widespread among web developers, so for most teams there is no new language to learn.

  • Node.js: Allows running JavaScript code outside of the browser for scalability.

  • Headless browsers: Tools like Puppeteer and Playwright allow headless browser automation for JavaScript rendering.

  • Works across platforms: Runs on Windows, Mac, Linux and more.

Overall, JavaScript provides the tools needed to scrape modern JavaScript-heavy sites at scale. The vibrant ecosystem makes it a flexible and powerful choice.

Scraping Basics with Node.js

Before we get into specific libraries, let's walk through the fundamentals of web scraping using core Node.js modules.

HTTP Requests

The http and https modules in Node.js allow making HTTP/HTTPS requests to download web pages.

Here's an example fetching the Google homepage:

const https = require('https');

https.get('https://www.google.com', (res) => {
  let data = '';

  res.on('data', (chunk) => {
    data += chunk;
  });

  res.on('end', () => {
    console.log(data);
  });

}).on('error', (err) => {
  console.log('Error: ', err.message);
});

This makes a GET request to the URL and registers callbacks to accumulate the response data chunks into a single variable.

The same can be done with http.get() for non-SSL URLs.

However, this simple approach has some limitations:

  • https.get() only issues GET requests; other methods require the more verbose https.request().
  • Adding headers, cookies and other options takes extra boilerplate.
  • Handling redirects, compression and binary data/buffers requires more work.
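
For comparison, here is a minimal sketch of a POST with custom headers using the core https.request() API (the URL, header values and payload are placeholders):

const https = require('https');

const payload = JSON.stringify({ query: 'javascript' });

const req = https.request('https://example.com/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(payload),
    'User-Agent': 'my-scraper/1.0'
  }
}, (res) => {
  let data = '';
  res.on('data', (chunk) => { data += chunk; });
  res.on('end', () => console.log(res.statusCode, data));
});

req.on('error', (err) => console.error(err.message));
req.write(payload);
req.end();

It works, but the amount of ceremony for a single request is exactly what higher-level clients aim to remove.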

That's where the request module can help…

request – Simplified HTTP Requests

The request module provides an easier way to make HTTP calls in Node.js with more options. (Note that request was deprecated in February 2020, so for new projects Axios or node-fetch, covered below, are better long-term choices, but it still appears in countless existing scrapers.)

const request = require('request');

request('https://www.google.com', (error, response, body) => {
  if (error) {
    console.error('Failed: ', error);
  } else {
    console.log('Response status code:', response.statusCode);
    console.log(body);
  }
});

Now we can easily:

  • Make any HTTP request – GET, POST, PUT, DELETE, etc.
  • Add headers – User-Agent, Referer, Content-Type etc.
  • Send form data, JSON and other types of request bodies
  • Handle binary data like images without extra effort
  • Have redirects, gzip decompression and (with a cookie jar) cookies handled automatically

Overall, request provides an easy way to download web pages programmatically.
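
For example, here is a sketch of a POST with a custom header and a JSON body using request's options object (the URL and field names are placeholders):

const request = require('request');

request({
  url: 'https://example.com/api/search',
  method: 'POST',
  headers: { 'User-Agent': 'my-scraper/1.0' },
  json: { query: 'javascript' } // serializes the request body and parses the JSON response
}, (error, response, body) => {
  if (error) {
    console.error('Failed:', error);
  } else {
    console.log(response.statusCode, body);
  }
});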

Parsing HTML

After downloading the raw HTML, we need to parse it to extract the data we want. This can be done with regular expressions, but a better way is to use a DOM parser like cheerio which implements jQuery selectors for easy DOM querying and manipulation.

Here's an example parsing a page and extracting all image src attributes:

const request = require('request');
const cheerio = require('cheerio');

request('https://example.com', (err, resp, body) => {
  if (!err && resp.statusCode === 200) {
    const $ = cheerio.load(body);

    const images = $('img').map((i, el) => $(el).attr('src')).get();

    console.log(images);
  }
});

Cheerio makes it really easy to extract data from HTML using CSS selectors and jQuery methods like text(), attr(), html() etc.

Some key things you can do:

  • Query elements by ID, class, tag name etc.
  • Traverse the DOM to find related elements.
  • Extract text, hrefs and other attributes
  • Build a list by iterating over elements.
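
For instance, a few of these operations applied to a small document (the markup and selectors are illustrative):

const cheerio = require('cheerio');

const $ = cheerio.load('<div class="post"><h2>Title</h2><a href="/read-more">More</a></div>');

// Query by class/tag and extract text
const title = $('.post h2').text();

// Traverse to a sibling element and read an attribute
const href = $('.post h2').next('a').attr('href');

// Build a list by iterating over matched elements
const titles = $('.post').map((i, el) => $(el).find('h2').text()).get();

console.log(title, href, titles);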

This covers the basics of downloading and parsing pages. Next let's look at some more advanced scraping packages.

Axios – Promise-Based HTTP Client

Axios is a popular HTTP client for node and the browser that provides some nice improvements over native requests:

  • Promise API
  • Automatic transforms for JSON data
  • Concurrency and cancellation
  • Form and URL-encoded data support
  • Powerful configuration defaults
  • Wide browser support

Here is how making a request with Axios compares:

const axios = require('axios');

axios.get('https://example.com')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });

The promise-based approach allows using async/await for concise asynchronous code:

async function getPageHTML() {
  try {
    const response = await axios.get('https://example.com');
    return response.data;
  } catch (error) {
    console.error(error);
  }
}

getPageHTML()
  .then(html => {
    // parse HTML
  });

Axios simplifies many aspects of HTTP requests and integrates seamlessly with promises.
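
Because every request returns a promise, fetching several pages concurrently is straightforward with Promise.all; a minimal sketch (the URLs are placeholders):

const axios = require('axios');

async function getPages(urls) {
  // Start all requests at once and wait for every response
  const responses = await Promise.all(urls.map(url => axios.get(url)));
  return responses.map(res => res.data);
}

getPages(['https://example.com/page1', 'https://example.com/page2'])
  .then(pages => console.log(`Fetched ${pages.length} pages`))
  .catch(err => console.error(err.message));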

cheerio-httpcli – jQuery for Remote Pages

While cheerio is great for parsing HTML you have already downloaded, sometimes you want fetching and jQuery-style querying rolled into a single step when scraping pages server-side.

This is where cheerio-httpcli comes in. It wraps cheerio around an HTTP client so you can run jQuery-style selectors directly against remote web pages:

const client = require('cheerio-httpcli');

client.fetch('https://example.com')
  .then(result => {
    const $ = result.$;

    const pageTitle = $('title').text();
  });

With cheerio-httpcli you can directly query elements on remote pages using the same selectors and methods as regular cheerio.

Some examples, using the promise interface inside an async function (the form field name is illustrative):

// Get the page title
const { $ } = await client.fetch(url);
const title = $('title').text();

// Extract all image URLs from the same result
const imgUrls = $('img').map((_, el) => $(el).attr('src')).get();

// Submit a search form found on the page
const next = await $('form').submit({ q: 'javascript' });
console.log(next.$('title').text());

Because fetching and parsing happen in a single step, this helps avoid nesting callbacks when querying remote pages.

Scraping JavaScript Pages

A problem with the methods so far is they don't execute JavaScript, so dynamically loaded content will be missing.

To scrape JavaScript-rendered pages, we need to use a headless browser like Puppeteer or Playwright to render the full page before parsing.

Here's an example with Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Wait for JavaScript to render
  await page.waitForSelector('h1');

  // Get rendered HTML
  const html = await page.content();

  // Parse HTML and extract data
  // ...

  await browser.close();
})();

This launches a headless Chrome instance, loads the page, waits for a selector to appear, then retrieves the rendered HTML which can then be parsed.
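
A common next step is to hand the rendered HTML to cheerio for the actual extraction, for example:

const cheerio = require('cheerio');

function extractHeading(html) {
  // html is the rendered markup returned by page.content()
  const $ = cheerio.load(html);
  return $('h1').first().text();
}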

Playwright works very similarly with a few extra benefits:

  • Support for Firefox and WebKit as well as Chromium.
  • Automatic waiting on actions like click and fill before they run.
  • Lightweight, isolated browser contexts that pages can share for session reuse.

Here is the same operation with Playwright:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // page.goto() already waits for the load event; add a
  // page.waitForSelector('h1') only if content renders after load.

  const html = await page.content();

  await browser.close();
})();

Playwright's auto-waiting applies to actions and assertions, which removes most manual waits; content injected well after the load event may still need an explicit waitForSelector().
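
For example, interaction methods such as fill() and click() wait for the target element to be attached, visible and stable before acting (continuing with the page from the example above; the selectors are placeholders):

// Playwright waits for each element automatically before acting on it
await page.fill('input[name="q"]', 'web scraping');
await page.click('button[type="submit"]');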

Both libraries also support evaluating JavaScript in the browser context directly:

// Puppeteer
const data = await page.evaluate(() => {
  // Runs in the page context – collect the text of every h2
  return Array.from(document.querySelectorAll('h2')).map(el => el.textContent);
});

// Playwright
const data = await page.evaluate(() => {
  // Same API – return any JSON-serializable value back to Node.js
  return document.title;
});

This allows scraping data directly instead of through the HTML.

In summary, headless browsers like Puppeteer and Playwright are essential for scraping JavaScript-rendered sites. They also enable automating interactions like clicking buttons and filling forms.
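
For instance, here is a hedged sketch of submitting a login form with Puppeteer (the URL, selectors and credentials are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/login');

  // Fill in the form fields
  await page.type('#username', 'my-user');
  await page.type('#password', 'my-password');

  // Click submit and wait for the resulting navigation
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]')
  ]);

  console.log(await page.title());
  await browser.close();
})();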

Honorable Mentions

Here are some other useful Node.js modules for scraping:

  • Bottleneck – Rate limiter for throttling requests
  • web-scraping – Downloads pages handling encoding and binary data.
  • scrape-it – Scrapes based on CSS selectors with caching support.
  • node-crawler – Web crawler/spider framework for Node.js.
  • node-fetch – Alternative HTTP request module with modern interface.
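
As a quick taste of the last one, a minimal node-fetch request (using the CommonJS v2 API; the header value is a placeholder):

const fetch = require('node-fetch');

fetch('https://example.com', {
  headers: { 'User-Agent': 'my-scraper/1.0' }
})
  .then(res => res.text())
  .then(html => console.log(html.length))
  .catch(err => console.error(err.message));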

Putting It All Together

Now that we've covered the key techniques and packages, let's walk through a complete web scraping script for a real-world example.

We'll scrape HNTracker – a site for viewing historical Hacker News stats.

The goal is to extract the top monthly posters on the site.

Outline

  • Fetch the All Time Top users page
  • Extract links to each user's stats page
  • Iterate through the user links asynchronously
  • For each user, extract the top monthly posts count
  • Save results to a JSON file

Implementation

First we'll pull in the packages needed – axios for requests, cheerio for parsing, and Bottleneck to limit concurrency:

const axios = require('axios');
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({ maxConcurrent: 5 });

Next, a function to fetch and parse the main page:

async function getTopUsers() {
  const { data } = await axios.get('https://hnprofile.com/?page=alltime');

  const $ = cheerio.load(data);

  return $('#content table td.username a')
    .map((i, el) => $(el).attr('href'))
    .get();
}

Then a helper to scrape the stats from each user link:

async function getUserStats(url) {
  const { data } = await axios.get(url);

  const $ = cheerio.load(data);

  const topMonth = $('#content table tr:contains("Top Month") td.sdata');

  return {
    name: $('#userinfo h1').text().trim(),
    topMonth: topMonth.text().trim()
  };
}

Finally, put it all together:

(async () => {
  const userLinks = await getTopUsers();

  // limiter.schedule() queues each call, keeping at most 5 requests in flight
  const results = await Promise.all(
    userLinks.map(link => limiter.schedule(() => getUserStats(link)))
  );

  console.log(results);
})();

The full code makes the requests concurrently while limiting them to 5 at a time.
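
The outline also calls for saving the results to a JSON file; a minimal way to do that with the built-in fs module is to write the array out once Promise.all has resolved:

const fs = require('fs');

// After results is populated inside the async IIFE:
fs.writeFileSync('results.json', JSON.stringify(results, null, 2));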

After running, we have extracted a list of the top all-time posters and their best months!

The techniques covered enable building robust web scrapers with Node.js and JavaScript.

Conclusion

Here are some key takeaways:

  • JavaScript is well suited for modern web scraping thanks to its async capabilities and headless browser support.

  • Libraries like Axios, Cheerio and Bottleneck make it easy to handle requests, parse HTML and manage concurrency.

  • Use Puppeteer or Playwright to render pages fully before parsing for JavaScript-heavy sites.

  • Follow conventions for throttling, caching and user-agents to scrape responsibly, as sketched after this list.

  • With Node.js you can build fast and scalable scrapers on the platform you already know!
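
As one small example of responsible defaults, an axios instance can attach an identifying User-Agent and a timeout to every request it makes (the header value here is a placeholder):

const axios = require('axios');

// A shared client with polite defaults applied to every request
const client = axios.create({
  headers: { 'User-Agent': 'my-scraper/1.0 (contact@example.com)' },
  timeout: 10000
});

client.get('https://example.com').then(res => console.log(res.status));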

Web scraping in Node.js does have a learning curve compared to beginner-friendly, batteries-included tools like Apify and Scrapy. However, the flexibility and performance gains make it worth the effort for advanced scraping projects.
