
How Javascript Is Used to Block Web Scrapers: An In-Depth Guide

Javascript-based browser fingerprinting has become one of the most powerful tools for identifying and blocking web scrapers. Unlike IP blocking or captcha checks, fingerprinting allows sites to execute scraping detection code directly within the user's browser.

In this comprehensive 3000+ word guide, we'll take a deep dive into the modern Javascript fingerprinting techniques used to detect scrapers. We'll explore common methods for extracting unique browser signatures, and how you can camouflage your scraper to avoid triggering anti-bot systems.

The Rising Threat of Browser Fingerprinting

Over the past five years, browser fingerprinting has exploded in popularity. According to a recent report by ScrapingHub, over 50% of websites now use advanced client-side fingerprinting techniques to identify scrapers and bots.

This is likely because Javascript fingerprinting provides immense power to detect scrapers with minimal false positives. By leveraging thousands of unique browser signals, sites can construct highly accurate digital fingerprints to track browsers across visits.

Some key statistics on the rise of browser fingerprinting:

  • 15x increase in sites using fingerprinting since 2016 [source: ScrapeHero]
  • 90%+ of scraping bots detected via fingerprinting by sites using sophisticated anti-bot services [source: ScrapingHub]
  • 3000+ signals can be used to uniquely identify browsers [source: EFF]

The effectiveness of fingerprinting comes down to the sheer number of useful identity signals the browser exposes:

Example Fingerprinting Signals

  • Screen resolution
  • Installed system fonts
  • GPU/Hardware specs
  • Browser window dimensions
  • Timezone
  • Language
  • Audio setup
  • CPU performance benchmarks
  • Canvas rendering artifacts
  • WebGL renderer strings
  • DOM rendering performance

And thousands more exotic signals.

With over 3000 trackable attributes, even basic statistical models can reliably identify browsers with minimal false positives. This makes Javascript fingerprinting one of the most powerful options for detecting scrapers.

How Browser Fingerprinting Works

Now that we understand the scale of browser fingerprinting, let's look at how sites technically extract all these identity signals using Javascript:

1. Fingerprint Probe Script

Websites serve fingerprinting code hidden within normal scripts sent to visitors. This includes probes that attempt to extract identity signals from the browser:

// Example fingerprinting probe
// Grab a WebGL context so GPU details can be queried
const gl = document.createElement('canvas').getContext('webgl');
const glInfo = gl.getExtension('WEBGL_debug_renderer_info');

const fingerprint = {

  userAgent: window.navigator.userAgent,

  screenSize: {
    width: screen.width,
    height: screen.height
  },

  gpu: {
    vendor: gl.getParameter(glInfo.UNMASKED_VENDOR_WEBGL),
    renderer: gl.getParameter(glInfo.UNMASKED_RENDERER_WEBGL)
  },

  installedFonts: getInstalledFonts(), // site-specific helper that measures font rendering

  benchmark: performance.now() // high-resolution timer used for CPU/timing benchmarks

  // ... etc
}

These probes extract identifiable information like screen resolution, GPU specs, installed fonts, and performance benchmarks.

2. Generate Browser Signature

The site combines probed fingerprint data to generate a unique ID hash for the browser:

const browserFingerprint = generateFingerprintId(fingerprintData) 

// example browserFingerprint:
// 7y893y4h9fh399hf83hf9h3f...  

This hash acts as a persistent identifier for the browser. Advanced systems even use machine learning to weigh thousands of signals.
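
The article leaves generateFingerprintId undefined. A minimal sketch of such a helper (an illustrative assumption, not any site's actual code) could simply serialize the probed signals and hash them with SHA-256 via the Web Crypto API:

// Hypothetical hashing helper: serialize the probed signals and digest them
// with SHA-256 to produce a stable hex identifier.
async function generateFingerprintId(fingerprintData) {
  const bytes = new TextEncoder().encode(JSON.stringify(fingerprintData));
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  // Convert the raw ArrayBuffer digest into a hex string
  return [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}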

3. Track Browser Across Visits

The site can now track browsers across visits by re-calculating the fingerprint hash each time and comparing values:

Visit 1: 
  browserFingerprint = 7y893y4h9fh399hf83hf9h3f

Visit 2:
  browserFingerprint = 7y893y4h9fh399hf83hf9h3f // Match!

-> Browser recognized, flag as potential scraper  

Matching fingerprint hashes allow sites to identify scrapers making too many automated requests.

Why Javascript Fingerprinting is So Powerful

What makes Javascript fingerprinting uniquely powerful compared to other anti-bot techniques is:

  • Hard to evade – Unlike IP checks, fingerprinting analyzes thousands of unique browser signals, making it very hard to spoof.
  • Works instantly – Sites can extract a fingerprint in under 500 ms, allowing instant scraper identification.
  • Runs client-side – No need to correlate server logs; all analysis happens in the browser.
  • Difficult to detect – Fingerprinting code looks like any normal script.
  • Low false positives – Very high accuracy in identifying scrapers, thanks to the diversity of browser signals.

For these reasons, Javascript fingerprinting has emerged as one of the most potent anti-scraping tools, allowing even inexperienced sites to detect advanced scrapers with high accuracy.

Patching Browser Automation Leaks

Now that we understand how powerful Javascript fingerprinting is, let's discuss techniques we can use to avoid detection when building browser-based scrapers.

The first critical step is fixing dead giveaways that immediately identify our scraper browser as an automation bot rather than a real user. These leaks allow sites to flag scrapers instantly on page load.

Fixing navigator.webdriver

One notorious browser leak is the navigator.webdriver flag:

// Real browser 
navigator.webdriver // false

// Selenium browser
navigator.webdriver // true ! 

Browser automation tools like Selenium and Puppeteer cause navigator.webdriver to be exposed. This instantly signals to the page that it's running in an automated browser.

We can patch it by overriding the value:

// Fix navigator.webdriver leak
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
})

Now navigator.webdriver will always return false, masking the leak.

Here is how to apply it in Puppeteer:

// Puppeteer fix
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Apply the override before any page scripts run
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
  });
})();

This covers the navigator.webdriver leak in Puppeteer, allowing us to avoid instant flagging.

Fixing Chrome Runtime Leaks

Another common issue is leaks from the Chrome runtime. For example:

// Normal Chrome browser
chrome.runtime == undefined

// Puppeteer browser 
chrome.runtime.id // exposed!

Puppeteer and ChromeDriver expose their underlying Chrome runtime, which contains telltale automation objects like chrome.runtime.

We can prevent access to the Chrome runtime altogether:

// Patch chrome runtime leak
Object.defineProperty(window, 'chrome', {
  get: () => {
    // Prevent chrome runtime access
    return undefined
  }
})

This patches Puppeteer browsers to mask the chrome object completely.

Prevent Headless Flagging

In addition to patching specific leaks, we want to generally mask signs that the browser is running headlessly:

  • Set realistic navigator.languages and navigator.platform for OS
  • Fake realistic timezones
  • Use plugins/mimeTypes from real Chrome installs
  • Set navigator.webdriver to undefined instead of false
  • Limit window outer dimensions to < screen size

Tools like puppeteer-extra-plugin-stealth automate these fixes for Puppeteer and are an essential starting point when running headless browsers, as sketched below.
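
As a rough sketch of wiring this up (the target URL is a placeholder), the stealth plugin hooks into Puppeteer via puppeteer-extra:

// Sketch: enable puppeteer-extra-plugin-stealth before launching the browser
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin()); // patches webdriver, the chrome object, plugins, etc.

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder target
  await browser.close();
})();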

By patching leaks and masking headless signatures, our scraper will avoid being flagged instantly as an automation bot on page load. However, there are still thousands of fingerprintable factors, so we need additional evasion techniques.

Resisting Browser Fingerprinting

Once we've patched obvious leaks, the next step is modifying other aspects of our scraper's fingerprint to appear more natural:

Use Common Browser Profiles

Most desktop users are on Windows 10 and macOS Big Sur. Our scraper browser should mimic signals of these common platforms:

  • Use Windows 10 or macOS values for navigator.platform and navigator.oscpu
  • Apply the corresponding browser configs for rendering, fonts, plugins, etc.
  • Set viewport, screen resolution and browser chrome dimensions to common values

Matching a common OS fingerprint is essential to avoid standing out; a minimal sketch of overriding these values follows below.
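
A hedged sketch of what these overrides might look like in Puppeteer (the specific Windows 10 values are illustrative assumptions; real profiles should be copied from an actual install):

// Sketch: present a common Windows 10 Chrome profile to fingerprinting scripts.
// `page` is a Puppeteer Page, as in the earlier examples; call before navigating.
async function applyWindowsProfile(page) {
  await page.evaluateOnNewDocument(() => {
    // Illustrative values; ideally copied from a real Windows 10 Chrome install
    Object.defineProperty(navigator, 'platform', { get: () => 'Win32' });
    Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
    Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
  });
}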

Mimic Real Browser Configurations

Our scraper should apply random configurations modeled after real browser data:

  • Set viewport to a common resolution like 1920×1080
  • Use a random but realistic timezone
  • Choose a locale from common options like en-US, en-GB, es-ES etc.
  • Use languages loaded from a real browser install
  • Pull other configurations like fonts and plugins from real browser data

The goal is to blend in with normal traffic as much as possible.
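
In Puppeteer, several of these knobs can be set per page. A rough sketch, with illustrative viewport, timezone, and locale values:

// Sketch: apply a realistic viewport, timezone, and locale to a Puppeteer page.
// `page` is a Puppeteer Page; the specific values are examples only.
async function applyRealisticConfig(page) {
  await page.setViewport({ width: 1920, height: 1080 });  // common desktop resolution
  await page.emulateTimezone('America/New_York');         // plausible timezone
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' }); // common locale
}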

Introduce Calibrated Randomness

Certain fingerprint vectors like User-Agent and WebGL renderer can be randomized:

  • Rotate random common User-Agent strings
  • Generate unique WebGL renderer fingerprint each session
  • Apply slight randomness to time and performance benchmarks

The key is introducing just enough randomness to appear unique while still mimicking real browser data. Too much randomness is also suspicious.
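
For example, a simple sketch of rotating the User-Agent per session (the strings below are illustrative and should be sourced from current real-browser data):

// Sketch: pick a random but realistic User-Agent for each scraping session.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

async function applyRandomUserAgent(page) {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  await page.setUserAgent(ua); // `page` is a Puppeteer Page
}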

Limit Identifier Persistence

Fingerprint vectors like WebGL renderer and canvas image digests can be reset:

  • Reset WebGL renderer string on each new page or after some time
  • Generate new canvas image fingerprint every few minutes

This prevents these volatile identifiers from being used to persistently track our scraper across a site.
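
One hedged way to handle the canvas side of this is to perturb the canvas output slightly before it is serialized, so the resulting digest changes over time. A minimal sketch (injected via evaluateOnNewDocument, for example):

// Sketch: add imperceptible per-session noise to canvas fingerprints by
// nudging one random pixel before the canvas is serialized.
const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function (...args) {
  const ctx = this.getContext('2d');
  if (ctx && this.width > 0 && this.height > 0) {
    const x = Math.floor(Math.random() * this.width);
    const y = Math.floor(Math.random() * this.height);
    ctx.fillStyle = 'rgba(0, 0, 0, 0.01)'; // visually imperceptible change
    ctx.fillRect(x, y, 1, 1);
  }
  return originalToDataURL.apply(this, args);
};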

Proxy Fingerprints

Regularly rotating residential and mobile proxies helps mask geographical patterns:

  • Rotate IPs frequently (e.g. every 30 mins)
  • Use proxies matching the target site's geo-location
  • Never reuse the same proxy on a site

With enough proxies, the site cannot easily tie scraping activity to a persistent fingerprint ID.
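
In Puppeteer, a proxy is typically supplied at launch time. A sketch, where the proxy host, port, and credentials are placeholders for your provider's values:

// Sketch: route a Puppeteer browser through a rotating residential proxy.
// Host, port, and credentials below are placeholders.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8000'],
  });
  const page = await browser.newPage();
  await page.authenticate({ username: 'proxy-user', password: 'proxy-pass' });
  await page.goto('https://example.com');
  await browser.close();
})();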

Leveraging Scraping Services

Implementing robust browser evasion requires a huge investment in engineering and maintaining scraping infrastructure.

Commercial scraping services like Scrapfly, ScraperAPI and ProxyCrawl handle detection evasion internally, so you can scrape undetected without managing the underlying complexity:

# Scrapfly python example

from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(api_key='XXX')

config = ScrapeConfig(
  url='https://target-site.com',
  render_js=True,  # Enable JS rendering
  asp=True  # Enable anti-bot bypass
)

html = client.scrape(config).html

Benefits of using a paid scraping service include:

  • Works instantly – No need to build browser evasion infrastructure
  • Scales easily – Services handle scaling to any level of requests
  • Stays up to date – Fingerprint handling is constantly tuned as techniques evolve
  • Develop faster – Focus on value-add scraping logic instead of plumbing

For serious commercial scraping, offloading the heavy lifting to a service with existing scale and evasion expertise often makes sense.

Conclusion

Javascript-based device fingerprinting has rapidly emerged as one of the most effective options for identifying scrapers and bots. By leveraging the thousands of distinct signals exposed by browsers, even amateur sites can now detect advanced scraping bots with high accuracy.

Thankfully, with enough engineering investment, scrapers can avoid triggering anti-bot services by:

  • Patching common automation leaks
  • Mimicking normal browser configurations
  • Introducing calibrated randomness
  • Frequently rotating proxies

However, implementing fingerprint evasion at scale is extremely complex. For teams focused on commercial scraping, leveraging established scraping providers can help sidestep evasion challenges, allowing your engineers to focus on value-add scraping logic and data extraction.
