A JavaScript Developer's Comprehensive Guide to Web Scraping with Curl

If you're a JavaScript developer looking to extract data from websites, you've likely heard of curl. This powerful command-line tool allows you to make HTTP requests and is a popular choice for testing APIs and downloading web pages. But did you know you can also use curl for web scraping by integrating it with Node.js?

In this in-depth guide, we'll explore multiple ways to leverage curl in your JavaScript web scraping workflows. Whether you want to execute curl commands in a Node.js child process, convert curl syntax to native JavaScript code, or access the full functionality of libcurl, this article has you covered. Let's dive in!

Why Use Curl for Web Scraping?

Before we get to the code, let's discuss why you might choose curl over other tools for web scraping as a JavaScript developer. Here are a few key benefits:

  • Curl is pre-installed on most systems, so there's nothing extra to download
  • It supports a wide range of protocols, from HTTP to FTP to IMAP
  • Curl offers granular control over requests, including custom headers and authentication
  • The command-line interface is simple yet flexible with dozens of options
  • It's a lightweight alternative to headless browsers for basic scraping jobs

These features make curl a worthy addition to any web scraper's toolkit. Now let's look at how to actually use it in Node.js.

Executing Curl Commands in Node.js

The simplest way to use curl with Node.js is to spawn a child process to execute curl commands. You can do this using the built-in child_process module:

const { exec } = require('child_process');

exec('curl https://example.com', (err, stdout, stderr) => {
  if (err) {
    console.error(`exec error: ${err}`);
    return;
  }

  console.log(`Response: ${stdout}`);
});

Here the curl command is passed as a string to exec(), which spawns a shell and runs the command. The response body is available in stdout, while any errors are passed to the callback or written to stderr.
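
If you prefer async/await over callbacks, you can wrap exec() with Node's built-in util.promisify. A minimal sketch (the -s flag just silences curl's progress output):

const { exec } = require('child_process');
const { promisify } = require('util');

const execAsync = promisify(exec);

(async () => {
  // stdout holds the response body; curl writes progress and errors to stderr
  const { stdout } = await execAsync('curl -s https://example.com');
  console.log(`Response: ${stdout}`);
})();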

This approach works for basic GET requests, but what about more complex cases like sending POST data or handling redirects? Curl's extensive options cover almost any scenario:

const command = 'curl -L -X POST -d "key1=value1&key2=value2" -H "Content-Type: application/x-www-form-urlencoded" https://httpbin.org/post';

exec(command, (err, stdout, stderr) => {
  if (err) {
    console.error(err);
    return;
  }

  const response = JSON.parse(stdout);
  console.log(response.form);
});

The -L flag tells curl to follow redirects, -X POST sets the HTTP method, -d sends URL-encoded data, and -H adds a custom header. By building up the command string, you can fine-tune the request to your needs.
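
One caveat: exec() runs the whole string through a shell, so any dynamic values (URLs, form data) would need careful escaping. A safer sketch of the same request passes each flag as a separate argument via execFile(), which bypasses the shell:

const { execFile } = require('child_process');

const args = [
  '-L',                                                      // follow redirects
  '-X', 'POST',                                              // HTTP method
  '-d', 'key1=value1&key2=value2',                           // URL-encoded body
  '-H', 'Content-Type: application/x-www-form-urlencoded',   // custom header
  'https://httpbin.org/post',
];

execFile('curl', args, (err, stdout) => {
  if (err) {
    console.error(err);
    return;
  }
  console.log(JSON.parse(stdout).form);
});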

Converting Curl to JavaScript Code

Another option is to convert curl commands to native Node.js code using the https or http modules. To demonstrate, let's recreate the POST request from the previous example, this time sending the payload as JSON:

const https = require('https');

const data = new TextEncoder().encode(
  JSON.stringify({
    key1: 'value1',
    key2: 'value2',
  })
);

const options = {
  hostname: 'httpbin.org',
  port: 443,
  path: '/post',
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': data.length,
  },
};

const req = https.request(options, (res) => {
  console.log(`statusCode: ${res.statusCode}`);

  res.on('data', (chunk) => {
    process.stdout.write(chunk);
  });
});

req.on('error', (error) => {
  console.error(error);
});

req.write(data);
req.end();

This code performs a POST request equivalent to the earlier curl command by configuring the https.request() options and writing the JSON payload to the request body. The tradeoff is more verbose syntax compared to curl.

If you find yourself frequently converting curl commands, tools like the Curl Converter can generate Node.js code automatically.
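
If you're on Node.js 18 or newer, the built-in fetch API is another option that reads closer to the curl original; here's a rough equivalent sketch of the same POST request:

// Requires Node.js 18+ for the global fetch API
(async () => {
  const res = await fetch('https://httpbin.org/post', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ key1: 'value1', key2: 'value2' }),
  });

  console.log(`statusCode: ${res.status}`);
  console.log(await res.json());
})();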

Accessing the Full Power of Libcurl

For the most control over your curl-powered web scraping, consider using the node-libcurl package. This lets you tap into the full feature set of the underlying libcurl library, including support for proxies, cookies, and multi-threading.

After installing node-libcurl, you can make a basic GET request like this:

const { Curl } = require('node-libcurl');

const curl = new Curl();
curl.setOpt('URL', 'https://www.example.com');
curl.setOpt('FOLLOWLOCATION', true);

curl.on('end', (statusCode, body) => {
  console.log(body);
  curl.close();
});

curl.on('error', curl.close.bind(curl));
curl.perform();

The setOpt function lets you configure libcurl options like the URL, while the on function registers callbacks for events like receiving the response body. The perform function executes the request.

For a POST request, you can use the POSTFIELDS option to set the request body:

curl.setOpt('URL', 'https://httpbin.org/post');
curl.setOpt('POST', true);
curl.setOpt('POSTFIELDS', JSON.stringify({ key1: 'value1' }));
curl.setOpt('HTTPHEADER', ['Content-Type: application/json']);

With access to the full libcurl API, you can implement more advanced web scraping functionality. For instance, you can set the PROXY option to scrape websites through an HTTP proxy:

curl.setOpt('PROXY', 'http://127.0.0.1:1080');

Or impersonate different browsers by customizing the User-Agent header:

curl.setOpt('USERAGENT', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36');

Refer to the node-libcurl docs for the full list of supported options.
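
Cookies are handled the same way. As a sketch (assuming the standard libcurl COOKIEFILE and COOKIEJAR options, which node-libcurl exposes by name), you can enable the cookie engine and persist session cookies between runs:

// Enable libcurl's cookie engine; an empty string means no initial cookie file
curl.setOpt('COOKIEFILE', '');

// Write any received cookies to disk when the handle is closed
curl.setOpt('COOKIEJAR', 'cookies.txt');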

Parsing HTML with Cheerio

Once you've fetched an HTML document using curl, the next step is to extract structured data from it. The simplest way to do this is with the Cheerio package, which provides jQuery-like syntax for navigating and manipulating the DOM.

For example, here's how you can scrape Google search results:

const { Curl } = require('node-libcurl');
const cheerio = require('cheerio');

const curl = new Curl();
curl.setOpt('URL', 'https://www.google.com/search?q=web+scraping');

curl.on('end', (statusCode, body) => {
  const $ = cheerio.load(body);

  const links = [];
  $('.yuRUbf > a').each((i, link) => {
    links.push({
      text: $(link).text(),
      href: $(link).attr('href'),
    });
  });

  console.log(links);
  curl.close();
});

curl.on('error', curl.close.bind(curl));
curl.perform();

After loading the HTML into Cheerio, we can use a CSS selector to find the search result links and iterate over them to extract the link text and URLs. You can use the full power of CSS selectors to drill down to specific page elements.
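
For instance, the snippet below (selectors are hypothetical, purely for illustration) shows a few common Cheerio patterns once the document is loaded into $:

// Assumes: const $ = cheerio.load(body);
const title = $('h1').first().text();              // text of the first <h1>
const prices = $('.product .price')                // descendant class selector
  .map((i, el) => $(el).text().trim())
  .get();                                          // convert to a plain array
const nextPage = $('a[rel="next"]').attr('href');  // attribute selector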

When Curl Is Not Enough

While curl is great for extracting data from static HTML pages, it has limitations. If you need to scrape single-page apps that heavily rely on JavaScript to render content, curl will only see the initial HTML payload. Dynamic content loaded via XHR or fetch requests will be invisible.

For scraping JavaScript-heavy websites like social media sites or e-commerce stores, you'll need a tool that can execute JavaScript and wait for the page to fully render. Headless browsers like Puppeteer are perfect for this:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://developer.chrome.com/');

  // Type into the search box
  await page.type('.search-box__input', 'automate beyond recorder');

  // Wait for the suggest overlay to appear and click "show all results".
  const allResultsSelector = '.devsite-suggest-all-results';
  await page.waitForSelector(allResultsSelector);
  await page.click(allResultsSelector);

  // Wait for the results page to load and display the results.
  const resultsSelector = '.gsc-results .gs-title';
  await page.waitForSelector(resultsSelector);

  // Extract the results from the page.
  const links = await page.evaluate((resultsSelector) => {
    return [...document.querySelectorAll(resultsSelector)].map(anchor => {
      const title = anchor.textContent.split('|')[0].trim();
      return `${title} - ${anchor.href}`;
    });
  }, resultsSelector);

  // Print all the results.
  console.log(links.join('\n'));

  await browser.close();
})();

Here Puppeteer launches a headless Chrome browser, navigates to a URL, types into a search box, waits for the search overlay to appear, clicks a button to display all results, and finally extracts the search result links from the rendered HTML.

This is just scratching the surface of what's possible with Puppeteer. Check out the docs for more examples and guidance.

Avoiding the Scraping Arms Race

Web scraping is a bit of an arms race with websites deploying a myriad of countermeasures like rate limiting, IP blocking, User-Agent fingerprinting, and CAPTCHAs. This often leads to diminishing returns as you spend more and more time tweaking your scraper to avoid detection instead of extracting data.

If you prefer to focus on data rather than infrastructure, consider using a dedicated web scraping API like ScrapingBee. For a small fee, you can offload the low-level scraping work and extract data from any website with a single API call:

const response = await fetch('https://app.scrapingbee.com/api/v1/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    'api_key': 'YOUR_API_KEY',
    'url': 'https://techcrunch.com/',
    'extract_rules': {
      'articles': {
        'selector': '.post-block',
        'type': 'list',
        'output': {
          'title': '.post-block__title',
          'titleLink': {
            'selector': '.post-block__title',
            'attr': 'href',
          },
          'author': '.river-byline__authors',
          'postDate': {
            'selector': 'time.river-byline__time',
            'attr': 'datetime',
          },
        }
      },
      'articlesCount': "return $('.post-block').length",
    },
    'js_scroll': '2500',
  }),
});

const articles = await response.json();
console.log(`Found ${articles.articlesCount} new articles:`);
console.log(articles.articles);

ScrapingBee handles proxies, retries, and JavaScript rendering behind the scenes so you can get the data in a predictable schema.

Wrapping Up

You now have a comprehensive overview of how to use curl for web scraping with Node.js. Whether you choose to execute raw curl commands, use a library like node-libcurl, or delegate the entire process to an API, the basic principles are the same.

The most important thing is to respect the websites you scrape and follow their robots.txt rules. Avoid hitting them too frequently and consider caching responses when possible.
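
A simple way to be polite is to pause between requests. Here's a minimal sketch using the promisified exec() approach from earlier, with an arbitrary one-second delay between pages:

const { exec } = require('child_process');
const { promisify } = require('util');

const execAsync = promisify(exec);
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  // Hypothetical URLs, just to illustrate the pattern
  const urls = ['https://example.com/page1', 'https://example.com/page2'];

  for (const url of urls) {
    const { stdout } = await execAsync(`curl -s ${url}`);
    console.log(`${url}: ${stdout.length} bytes`);
    await sleep(1000); // wait a second before the next request
  }
})();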

I recommend experimenting with each of the methods covered in this guide to get a feel for their tradeoffs. In many cases, a simple curl command or axios request is all you need. But for larger scraping workloads, investing in a battle-tested tool like Puppeteer or ScrapingBee can save you time and headaches.

What creative ways have you used curl in your web scraping projects? Share your experiences in the comments below!
