The Complete Guide to Using Proxies with Node-Fetch for Web Scraping

Node-fetch is one of the most popular JavaScript HTTP client libraries, with over 20 million downloads per week. It provides a simple way to make HTTP requests from Node.js applications, similar to the Fetch API available in web browsers.

One of the most common use cases for node-fetch is web scraping – the process of programmatically extracting data from websites. However, many websites employ various techniques to detect and block web scraping activity. Using proxies is one effective way to circumvent these anti-scraping measures.

In this guide, we'll take an in-depth look at how to use proxies with node-fetch for web scraping. We'll cover why proxies are necessary, how to configure node-fetch to use proxies, the different types of proxies available, and additional tips for scraping websites successfully. Let's dive in!

Why Use Proxies for Web Scraping?

Web scraping is often done at scale by sending a high volume of requests to a website in order to extract its data. Websites don't take kindly to this type of activity, as it can put strain on their servers and potentially allow competitors to access their data.

To defend against unwanted scraping, websites use various methods to detect scraper bots and block their IP addresses. Some common signs that give away scrapers include:

  • High request rate from a single IP
  • Lack of typical browser headers like User-Agent
  • Accessing pages in an unusual pattern
  • Not executing JavaScript

When a scraper's IP gets blocked, it can no longer access the site and extract data. This is where proxies come in. A proxy server acts as an intermediary that forwards requests from the scraper to the target website. The website then sees the request as coming from the proxy's IP address instead of the scraper's real IP.

By rotating through a pool of proxy IP addresses and sending requests through them, scrapers can avoid having any single IP blocked and continue scraping. Next, we'll look at how to actually use proxies with node-fetch.

Configuring Node-Fetch to Use a Proxy

Unfortunately, node-fetch does not have built-in support for proxies. However, by using an external library called https-proxy-agent, we can easily configure node-fetch to route requests through a proxy server.

Here's an example of how to use https-proxy-agent with node-fetch:

const fetch = require('node-fetch'); // node-fetch v2 (v3 is ESM-only and cannot be require()'d)
const { HttpsProxyAgent } = require('https-proxy-agent'); // named export as of v7

(async () => {
  // Proxy URL format: protocol://user:pass@host:port
  const proxyAgent = new HttpsProxyAgent('http://user:pass@proxy_host:proxy_port');

  const response = await fetch('https://api.ipify.org?format=json', {
    agent: proxyAgent
  });

  const data = await response.json();
  console.log(data.ip); // Outputs the proxy server's IP
})();

Let's break this down step-by-step:

  1. We import the node-fetch and https-proxy-agent libraries. You'll need to install them with npm install node-fetch@2 https-proxy-agent first (node-fetch v2 is the last version that supports CommonJS require).

  2. We create a new HttpsProxyAgent instance, passing in the full URL of the proxy server to use. This should include the protocol (http or https), credentials if required, the hostname, and port number.

  3. When calling fetch(), we pass an options object as the second argument. The agent property allows specifying a custom agent to handle the HTTP connection, which is where we pass the proxyAgent.

  4. After getting the response, we parse the JSON data and output the IP property, which will show the IP address of the proxy server that made the request.

By default, every request that is given this agent goes through the same proxy. To rotate proxies, you'd need to create a new HttpsProxyAgent with a different proxy URL for each request, as in the sketch below.
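Here's a minimal rotation sketch. The proxy URLs below are placeholders (substitute your provider's real endpoints), and nextProxyAgent simply cycles through the pool round-robin:

const fetch = require('node-fetch'); // node-fetch v2
const { HttpsProxyAgent } = require('https-proxy-agent');

// Placeholder proxy URLs -- substitute your provider's endpoints.
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];

let next = 0;

// Round-robin: each call returns an agent for the next proxy in the pool.
function nextProxyAgent() {
  const agent = new HttpsProxyAgent(proxies[next % proxies.length]);
  next += 1;
  return agent;
}

(async () => {
  const urls = ['https://example.com/page/1', 'https://example.com/page/2'];
  for (const url of urls) {
    const response = await fetch(url, { agent: nextProxyAgent() });
    console.log(url, response.status);
  }
})();

With this pattern, consecutive requests leave from different IP addresses, so no single proxy accumulates enough traffic to trip a rate limit on its own.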

Types of Proxies for Web Scraping

Not all proxies are created equal. There are a few main types of proxies used for web scraping, each with its own characteristics:

  • Public free proxies
  • Shared datacenter proxies
  • Dedicated datacenter proxies
  • Residential proxies
  • Mobile proxies

Free public proxies are the lowest quality. While they don't cost anything, they are often slow, unreliable, and can even steal your data. Avoid using free proxies for anything beyond light testing.

Datacenter proxies come from servers in commercial datacenters. They are very fast and cheap, but are easier for websites to detect and block. Shared datacenter proxies, where multiple users share the same IP, are the least reliable. Dedicated proxies perform better but still have high block rates.

Residential proxies originate from real home internet connections. They are much harder to detect as they look like real user traffic. The downside is they are more expensive and can be slower than datacenter proxies.

Mobile proxies route traffic through 3G/4G mobile connections. They are the most trusted by websites but also the most costly. Use mobile proxies for high-value targets.

Additional Web Scraping Tips

Using proxies is an important part of successful web scraping, but there are additional techniques you should employ as well:

  • Respect robots.txt: Check if the site allows scraping before you start. Some sites may even have an API that is easier to use than scraping.
  • Use real browser headers: Mimic the headers sent by a real web browser, including User-Agent, Accept-Language, and Referer. Don't use the default headers provided by your HTTP client.
  • Randomize access patterns: Don't crawl the site in sequential order. Introduce random delays between requests and shuffle the order of pages you visit (see the sketch after this list).
  • Avoid honeypot traps: Some sites create hidden links that only scrapers will find. If you follow them, you expose yourself as a bot. Be careful about which links you choose to crawl.
  • Execute JavaScript: Many modern sites require JavaScript to render. Use a real browser like Puppeteer if you need to execute scripts.
  • Solve CAPTCHAs: CAPTCHAs are the ultimate anti-bot test. Look into CAPTCHA solving services if you encounter them.

Using a Web Scraping API

Building a robust web scraping tool is a lot of work. In addition to proxies, you need to set up browsers, maintain servers, solve CAPTCHAs, and much more. Doing it all yourself is costly and time-consuming.

An alternative is to use a web scraping API that handles all of this for you behind the scenes. You simply send a request specifying the URL you want to scrape and the API returns the page's HTML.

For example, the ScrapingBee API makes it easy to scrape any site with a single GET request, passing your parameters in the query string:

const fetch = require('node-fetch');

(async () => {
  // ScrapingBee takes its parameters as URL query string values.
  const params = new URLSearchParams({
    api_key: 'YOUR_API_KEY',
    url: 'https://example.com',
    render_js: 'false', // set to 'true' to render the page in a headless browser
  });

  const response = await fetch(`https://app.scrapingbee.com/api/v1/?${params}`);

  const html = await response.text();
  console.log(html);
})();

The ScrapingBee API handles proxies, browsers, CAPTCHAs, and JavaScript rendering for you. It's a great option if you want to scrape sites without building complex infrastructure yourself.

Conclusion

Web scraping is a powerful way to extract data from websites, but it comes with many challenges. Using proxies is essential for avoiding IP blocks when scraping at scale.

As we've seen, node-fetch doesn't support proxies by itself, but adding proxy support is relatively straightforward with the https-proxy-agent library. When choosing a proxy provider, consider the different types of proxies and select one that fits your needs and budget.

Beyond proxies, there are many other techniques required for successful web scraping. Setting headers, randomizing patterns, handling CAPTCHAs, and executing JavaScript are all important.

If you want to avoid the hassle of managing all of this yourself, consider using a web scraping API. ScrapingBee provides a simple API for scraping websites without having to worry about the low-level details.

Armed with this knowledge, you're well on your way to scraping websites effectively using node-fetch and proxies. So get out there and liberate that data!
