Hey there!
So you built an awesome web scraper with Node.js, but sites keep blocking your requests? We've all been there!
In this comprehensive 3000+ word guide, you'll learn how to leverage Node-Unblocker to evade blocks and scrape any site.
What is Node-Unblocker and Why Use It?
Node-Unblocker is an open-source Node.js module that allows you to run a proxy server with custom middlewares for modifying requests and responses.
Here are some of the key reasons why Node-Unblocker is so useful for web scraping:
- Avoid IP Blocks – By routing your scraper through proxies, you can avoid getting IP banned for making too many requests from one address. This is critical for successful large-scale scraping.
- Bypass Geographic Blocks – Many sites restrict content access based on location. With Node-Unblocker, you can proxy through IPs in specific regions to bypass these blocks.
- Scrape JavaScript Sites – Node-Unblocker proxies websockets, which keeps JavaScript-heavy pages working when they are loaded through the proxy (pair it with a headless browser for actual rendering). Many scrapers struggle with JS-heavy sites.
- Modify Requests – Custom middlewares let you change request headers to mimic browsers, auto-handle login, encode characters and more.
- Transform Responses – Reshape and process responses using the response middlewares before they reach your scraper code.
- Lightweight and Fast – Node-Unblocker is built entirely in JavaScript, so it's fast and easy to integrate into your JS scraper.
Many websites now employ anti-scraping measures like IP blocking, CAPTCHAs and bot detection. Node-Unblocker is your secret weapon for getting past these roadblocks.
Installing and Configuring Node-Unblocker
Let's first go over how to install and configure Node-Unblocker. We'll also create a sample proxy server to try it out.
Step 1: Install Node-Unblocker
Assuming you already have a Node.js project, install the unblocker package (the npm name for Node-Unblocker) along with Express:
npm install unblocker express
This will add both modules to your package.json.
Step 2: Initialize Express Server
In your server file, initialize an Express app and Node-Unblocker instance:
// server.js
const express = require('express');
const Unblocker = require('unblocker');

const app = express();
const unblocker = new Unblocker({ prefix: '/proxy/' });

// Mount the proxy middleware so requests to /proxy/<url> get forwarded
app.use(unblocker);
// Node-Unblocker is now ready to use!
Step 3: Add Proxy Route
To proxy requests, add a route that redirects through the Unblocker middleware:
app.get('/scrape', (req, res) => {
  const url = req.query.url;
  res.redirect(`/proxy/${url}`);
});
Now we can make requests to /scrape?url=http://example.com and Node-Unblocker will proxy them.
Step 4: Start Server
Finally, start the proxy server:
const PORT = 3000;
app.listen(PORT, () => {
  console.log(`Proxy server running on port ${PORT}`);
});
Our basic Node-Unblocker server is now up and running!
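To sanity-check the setup, here is a quick test script using Node's built-in http module. It assumes the server above is running locally on port 3000 and uses example.com as a stand-in target:
const http = require('http');

// Request example.com through the local Node-Unblocker instance
http.get('http://localhost:3000/proxy/http://example.com/', (res) => {
  let body = '';
  res.on('data', (chunk) => (body += chunk));
  res.on('end', () => console.log(res.statusCode, body.slice(0, 200))); // print status and first 200 chars
});
If the proxied page's HTML comes back, the proxy is working.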
Next let's look at how we can leverage it in our web scrapers.
Using Node-Unblocker for Web Scraping
Here are some of the most common use cases for using Node-Unblocker to scrape sites:
Rotating Proxies
One of the biggest challenges in web scraping is avoiding getting blocked by the target site after making too many requests from one IP.
Node-Unblocker provides an easy way to implement a rotating proxy solution.
The steps are:
- Get access to a pool of proxy servers – You can use a provider like Luminati (now Bright Data) or Oxylabs to get hundreds of proxy IPs.
- Add the proxies to a list – For example:

const proxies = [
  'http://proxy1.com',
  'http://proxy2.com',
  // etc.
];
- Before each request, randomly select a proxy:

function getRandomProxy() {
  return proxies[Math.floor(Math.random() * proxies.length)];
}
- Make the web scraping request through the proxy:

const proxy = getRandomProxy();
// Route the call through the chosen proxy (here via the request library's proxy option)
request({ url: targetUrl, proxy });
By rotating proxies on each request, you can scrape at scale without getting blocked. Pro tip: use a proxy pool at least 10x larger than your requests-per-second rate.
Bypassing Geographic Blocks
Some websites restrict content access based on the visitor's geographic location.
For example, the site http://usanews.com only allows traffic from United States IPs. Using Node-Unblocker, we can easily bypass this restriction.
The steps are:
- Obtain residential proxy IPs for your target region, for example the US.
- Add these region-specific proxies to Node-Unblocker as upstream proxies.
- Route your scraper's traffic through Node-Unblocker.
Now all requests will appear to come from the required region and can access the geo-blocked content successfully!
This technique also works for simulating mobile traffic from a specific country, which is useful for scraping region-targeted mobile apps.
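As a rough sketch of what the region-pinned leg can look like, here is a request routed through a US proxy using axios and https-proxy-agent (both assumed installed, v7+ export style for the agent; the proxy URL and credentials are placeholders, and the target is the example site from above). Whether you wire such an agent into your scraper directly or in front of Node-Unblocker depends on your setup:
const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

// Placeholder US residential proxy from your provider
const usProxy = 'http://user:pass@us.proxy.example.com:8000';
const agent = new HttpsProxyAgent(usProxy);

// proxy: false disables axios' built-in proxy handling so the agent is used instead
axios.get('https://usanews.com/', { httpsAgent: agent, proxy: false })
  .then((res) => console.log(res.status))
  .catch((err) => console.error(err.message));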
Scraping JavaScript Websites
Modern sites rely heavily on JavaScript to render content. Conventional scrapers that only download HTML have difficulty parsing these interactive pages.
Thankfully, Node-Unblocker can proxy websockets once the upgrade handler is attached to the HTTP server, which keeps JS-heavy pages working when they are loaded through the proxy:
// Enable websocket proxying: attach the handler to the server returned by app.listen()
app.listen(PORT).on('upgrade', unblocker.onUpgrade);
(Chain this onto the existing app.listen call from Step 4 rather than calling listen twice.) Note that the proxy itself does not execute JavaScript; it only relays the traffic, including websockets, that a real browser generates when loading a page through it. For robust JS rendering, pair Node-Unblocker with a headless browser like Puppeteer.
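For example, here is a hedged sketch of rendering a JavaScript-heavy page with Puppeteer while loading it through the local Node-Unblocker instance (Puppeteer is assumed installed, and the proxy server from the setup section is assumed to be running on port 3000; example.com is a stand-in target):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Load the target through the proxy so its requests (including websockets) are relayed
  await page.goto('http://localhost:3000/proxy/https://example.com/', { waitUntil: 'networkidle2' });
  const html = await page.content(); // fully rendered HTML after the page's JS has run
  console.log(html.length);
  await browser.close();
})();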
Applying Custom Request Middlewares
One of the most powerful features of Node-Unblocker is its custom middleware support. Middlewares are passed to the Unblocker constructor via the requestMiddleware and responseMiddleware config arrays, and each middleware receives a single data object describing the proxied request or response. We can use these to modify both requests and responses.
Some examples of how request middlewares can help web scraping:
Rotate User-Agents
Many sites block scrapers that send the same User-Agent on every request. We can automatically rotate it:
// Pick a random User-Agent for every proxied request
function rotateUserAgent(data) {
  const userAgents = ['UA1', 'UA2', 'UA3']; // substitute real browser UA strings
  data.headers['user-agent'] = userAgents[Math.floor(Math.random() * userAgents.length)];
}
Add rotateUserAgent to the requestMiddleware array and each request will carry a different User-Agent, defeating this blocking tactic.
Auto Login
For sites requiring login, we can append the auth credentials without changing our scraper code:
// Attach credentials for requests to a specific site
function autoLogin(data) {
  if (data.url.includes('mysite.com')) {
    data.headers['authorization'] = 'Bearer xxx';
  }
}
Any request to that site will automatically carry the credentials, effectively logging the user in.
Encode Special Characters
Some sites reject requests containing unescaped special characters such as emojis. We can run custom encoding on requests:
// Percent-encode the outgoing URL (headers or body could be encoded here too)
function encodeUrl(data) {
  data.url = encodeURI(data.url);
}
This allows our scraper to use special characters without the request being rejected.
As you can see, the possibilities with request middlewares are endless!
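Since Node-Unblocker takes its middlewares at construction time, here is a sketch of wiring the three request middlewares above into the proxy; the function names are the ones used in the examples, so adapt the constructor call from the setup section accordingly:
const unblocker = new Unblocker({
  prefix: '/proxy/',
  requestMiddleware: [rotateUserAgent, autoLogin, encodeUrl], // run in this order on every outgoing request
});
app.use(unblocker);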
Handling Responses
We can also transform response data using the response middlewares:
Parse and Extract Data
Rather than doing data extraction in our scraper, we can do it directly in a response middleware, so the scraper receives the extracted data and saves code on its side. One caveat: Node-Unblocker hands the body to response middleware as a stream (data.stream) rather than a pre-buffered string, so extraction means piping that stream through a transform.
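Here is a hedged sketch of such a middleware, assuming cheerio is installed and that .result is the selector of interest; it buffers the HTML stream, extracts the text, and forwards only that to the scraper:
const cheerio = require('cheerio');
const { Transform } = require('stream');

// Response middleware: replace the HTML body with the text of .result elements
function extractResults(data) {
  if (!/html/.test(data.contentType || '')) return; // only touch HTML responses
  delete data.headers['content-length']; // the body length changes, so drop the original header
  const chunks = [];
  data.stream = data.stream.pipe(new Transform({
    transform(chunk, encoding, callback) {
      chunks.push(chunk); // buffer the body
      callback();
    },
    flush(callback) {
      const $ = cheerio.load(Buffer.concat(chunks).toString());
      this.push($('.result').text()); // send only the extracted text downstream
      callback();
    },
  }));
}

// Registered via: new Unblocker({ responseMiddleware: [extractResults] })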
Filter Sensitive Data
Some sites return cookies, headers and other metadata we don't need. We can clean this up:
// Strip cookies and rate-limit headers before the response reaches the scraper
function stripMetadata(data) {
  delete data.headers['set-cookie'];
  delete data.headers['x-rate-limit'];
}
Register stripMetadata in the responseMiddleware array and only the useful response data comes through.
Cache Common Requests
For sites with frequently accessed endpoints, a cache avoids re-fetching the same URL and burning through rate limits. Because Node-Unblocker streams responses straight through to the client, a response middleware is not a natural place to short-circuit a request, so the simplest approach is to cache on the scraper side (or in a layer in front of the proxy), as in the sketch below. Repeated requests are then served from the cache instead of hitting the target site again.
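A minimal sketch of that scraper-side cache, assuming Node 18+ (for the global fetch) and the proxy from earlier running on port 3000; there is no TTL or size limit, so treat it as a starting point:
// Remember bodies fetched through the proxy so repeated URLs skip the network
const cache = new Map();

async function fetchViaProxy(targetUrl) {
  if (cache.has(targetUrl)) return cache.get(targetUrl);
  const res = await fetch(`http://localhost:3000/proxy/${targetUrl}`);
  const body = await res.text();
  cache.set(targetUrl, body);
  return body;
}

// Usage: const html = await fetchViaProxy('https://example.com/');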
As you can see, the response middlewares are extremely powerful for processing data right inside Node-Unblocker before it reaches your scraper.
Node-Unblocker vs Other Proxies
Node-Unblocker provides a lightweight in-process proxy for Node.js scrapers. However, there are also dedicated proxy services available. Let's compare the pros and cons:
Node-Unblocker
- Pros
  - Lightweight and fast
  - Customizable middleware
  - Integrates directly into Node scraper
- Cons
  - Need to manage own proxies
  - Limited capabilities
  - Not optimized for scale
Luminati
- Pros
  - Huge proxy pool
  - Advanced proxy manager
  - Made for web scraping
- Cons
  - Overkill for smaller scrapers
  - Separate tool to integrate
Smartproxy
- Pros
  - Affordable proxy plans
  - Dedicated IPs available
  - Integrates via REST API
- Cons
  - Need separate account
  - Limited customization
For large scale production scraping, a commercial proxy service like Luminati or Smartproxy is highly recommended. They handle proxy management and make integration easy via APIs.
For small to medium scrapers, Node-Unblocker offers a great in-process option. The ability to customize it as needed makes it really powerful.
Common Issues and How to Fix Them
When using Node-Unblocker, here are some common issues you may run into and how to troubleshoot them:
Site blocking Node-Unblocker IP
This can happen if you use the same Node-Unblocker server for too many requests. The solution is to frequently rotate your upstream proxy IPs that feed into Node-Unblocker.
Websockets not working
Ensure that you have attached the upgrade handler, i.e. app.listen(PORT).on('upgrade', unblocker.onUpgrade), so that websocket proxying is enabled.
Too many open file handles
Node-Unblocker can hit the open file limit when handling thousands of requests. Increase the max open files in Linux or use a reverse proxy like Nginx for better socket handling.
Errors when scraping sites
Enable request logging so you can see exactly which request is failing, for example with a small middleware that logs every proxied URL, as sketched below.
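A minimal logging middleware (the function name is illustrative, not part of the library) can be added to the requestMiddleware array while debugging:
// Log every proxied URL with a timestamp; remove once the failing request is found
function logRequests(data) {
  console.log(new Date().toISOString(), data.url);
}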
High memory usage
Buffering whole response bodies in middleware (for example to parse them) can cause memory spikes under load. Prefer streaming transforms and avoid holding large responses in memory where possible.
Proxied requests are slow
Node-Unblocker is not optimized for ultra-high throughput. Use a dedicated proxy service like Smartproxy if you need to maximize speed.
Middleware execution order
Keep middleware execution order in mind: request middlewares run, in the order listed, before the remote request is sent, and response middlewares run, in the order listed, once the response comes back.
Properly configuring Node-Unblocker takes some trial and error. Refer to the docs for advanced configuration options.
Deploying Node-Unblocker at Scale
To run Node-Unblocker in production, you need to properly host it on servers designed for high loads.
Here is one recommended architecture. It consists of the following components:
- Node-Unblocker App Servers – These contain the main proxy app logic. For high loads, use at least 2-4 servers (a cluster sketch for getting more out of each box follows this list).
- Reverse Proxy (Nginx) – Fronts the Node-Unblocker fleet and balances load across the servers. Also handles SSL and other edge routing logic.
- Database – Stores any persisted app data like caches and stats. Redis works well.
- Upstream Proxy Servers – The external proxy IPs that feed traffic into Node-Unblocker. Use at least 50-100+ proxies here.
- Cloud Hosting – Use a provider like AWS or GCP to manage the servers, load balancing, failover and scalability.
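As a rough sketch of squeezing more throughput out of each app server, Node's cluster module can run one proxy worker per CPU core (Node 16+ for cluster.isPrimary). This only scales a single host; the multi-server layout above still sits behind Nginx:
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  // Fork one worker per CPU core; each worker runs the full proxy app
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
  cluster.on('exit', () => cluster.fork()); // restart crashed workers
} else {
  require('./server'); // the Express + Unblocker app from the setup section (server.js)
}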
Properly architecting a Node-Unblocker deployment can support 100,000+ requests per day without issues. Make sure to stress test the system at scale before launch.
For even larger loads, utilize a dedicated proxy service like Oxylabs which can handle millions of requests easily through their global proxy infrastructure.
Best Practices for Productive Web Scraping
Here are some general tips for maximizing success when web scraping through Node-Unblocker:
- Use Random Time Intervals – Scrape sites at random intervals, not at a fixed constant pace. This helps avoid traffic patterns that might trigger blocks (a small jitter helper is sketched after this list).
- Limit Requests Per IP – Restrict Node-Unblocker requests per upstream proxy IP to a reasonable limit like 5 RPM to avoid burning IPs.
- Match Target Geography – Use proxy IPs that originate from the same region as your target site's audience. This helps avoid geo-based blocks.
- Debug with Logging – Implement request logging so you can identify and reproduce errors easily.
- Learn from Blocks – When you do get blocked, study the exact blocking approach used and tweak your strategy to avoid it in the future.
- Regularly Rotate Servers – Rotate your proxy servers and infrastructure every few months to refresh all external-facing IPs.
- Utilize Proxy Services – Maintaining your own proxy infrastructure is complex. Leverage an enterprise proxy service instead for reliability.
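For the random-interval tip, a tiny helper like this adds jitter between requests (a sketch only; the 2-8 second window is arbitrary and should match your target's tolerance):
// Wait a random 2-8 seconds before the next request to avoid a fixed traffic pattern
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeDelay() {
  await sleep(2000 + Math.random() * 6000);
}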
Web scraping can definitely be challenging. But by intelligently leveraging tools like Node-Unblocker and following best practices, you can extract data from virtually any site successfully.
Key Takeaways
Here are the key things we covered in this comprehensive Node-Unblocker web scraping guide:
- Node-Unblocker provides an in-process proxy server to route web scraping requests through
- It allows implementing critical features like proxy rotation, custom middlewares, and websocket support
- Properly configuring Node-Unblocker takes trial and error – use debugging to identify issues
- For large scale scraping, a dedicated proxy service like Luminati or Smartproxy is recommended
- Following web scraping best practices helps avoid blocks and extract data reliably
Node-Unblocker is a versatile tool that gives you more control over proxying logic compared to external services. Integrating it directly into your scrapers unlocks next-level possibilities.
I hope this guide helped demystify Node-Unblocker and how it can help you successfully scrape and extract data at scale from virtually any website! Let me know if you have any other questions.
Happy (unblocked) scraping!