JavaScript-based browser fingerprinting has become one of the most powerful tools for identifying and blocking web scrapers. Unlike IP blocking or CAPTCHA checks, fingerprinting lets sites execute arbitrary detection code directly within the user's browser.
In this guide, we'll take a deep dive into the modern JavaScript fingerprinting techniques used to detect scrapers. We'll explore common methods for extracting unique browser signatures, and how you can camouflage your scraper to avoid triggering anti-bot systems.
The Rising Threat of Browser Fingerprinting
Over the past five years, browser fingerprinting has exploded in popularity. According to a recent report by ScrapingHub, over 50% of websites now use advanced client-side fingerprinting techniques to identify scrapers and bots.
This is likely because JavaScript fingerprinting provides immense power to detect scrapers with minimal false positives. By leveraging thousands of unique browser signals, sites can construct highly accurate digital fingerprints to track browsers across visits.
Some key statistics on the rise of browser fingerprinting:
- 15x increase in sites using fingerprinting since 2016 [source: ScrapeHero]
- 90%+ of scraping bots detected via fingerprinting by sites using sophisticated anti-bot services [source: ScrapingHub]
- 3000+ signals can be used to uniquely identify browsers [source: EFF]
The effectiveness of fingerprinting comes down to the sheer number of identity signals exposed by the browser:
Example Fingerprinting Signals
- Screen resolution
- Installed system fonts
- GPU/Hardware specs
- Browser window dimensions
- Timezone
- Language
- Audio setup
- CPU performance benchmarks
- Canvas rendering artifacts
- WebGL renderer strings
- DOM rendering performance
And thousands more exotic signals.
With over 3000 trackable attributes, even basic statistical models can reliably identify browsers with minimal false positives. This makes JavaScript fingerprinting one of the most powerful options for detecting scrapers.
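Canvas rendering artifacts, one of the signals listed above, are a good illustration of how much entropy a single probe can yield. The sketch below shows the general technique (not any particular vendor's script): draw hidden text and shapes, then read the pixels back; tiny differences between GPU, driver, and font stacks make the resulting data highly identifying.

```javascript
// Minimal canvas fingerprint sketch: the rendered pixels differ subtly
// per GPU/driver/font stack, so their hash identifies the machine
function canvasFingerprint() {
  const canvas = document.createElement('canvas');
  canvas.width = 200;
  canvas.height = 50;
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '16px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(0, 0, 100, 30);
  ctx.fillStyle = '#069';
  ctx.fillText('fingerprint probe 😃', 2, 15);
  // The base64 pixel dump is then hashed into a compact identifier
  return canvas.toDataURL();
}
```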
How Browser Fingerprinting Works
Now that we understand the scale of browser fingerprinting, let's look at how sites technically extract all these identity signals using JavaScript:
1. Fingerprint Probe Script
Websites serve fingerprinting code hidden within normal scripts sent to visitors. This includes probes that attempt to extract identity signals from the browser:
```javascript
// Example fingerprinting probe
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl');
const glInfo = gl.getExtension('WEBGL_debug_renderer_info');

const fingerprint = {
  userAgent: window.navigator.userAgent,
  screenSize: {
    width: screen.width,
    height: screen.height
  },
  gpu: {
    vendor: gl.getParameter(glInfo.UNMASKED_VENDOR_WEBGL),
    renderer: gl.getParameter(glInfo.UNMASKED_RENDERER_WEBGL)
  },
  installedFonts: getInstalledFonts(), // site-defined font detection helper
  benchmark: performance.now()         // high-resolution timer used for CPU/timing benchmarks
  // ... etc
};
```
These probes extract identifiable information like screen resolution, GPU specs, fonts, benchmarks etc.
2. Generate Browser Signature
The site combines probed fingerprint data to generate a unique ID hash for the browser:
```javascript
const browserFingerprint = generateFingerprintId(fingerprintData)

// example browserFingerprint:
// 7y893y4h9fh399hf83hf9h3f...
```
This hash acts as a persistent identifier for the browser. Advanced techniques even use ML to analyze thousands of signals.
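The `generateFingerprintId` helper above is a placeholder. A minimal sketch of what such a function might do, assuming the simplest possible approach, is to serialize the probed signals and hash them with the Web Crypto API; real services typically weight and normalize signals rather than hashing them blindly.

```javascript
// Sketch of a naive fingerprint hash: serialize the probed signals
// and digest them with SHA-256
async function generateFingerprintId(fingerprintData) {
  const bytes = new TextEncoder().encode(JSON.stringify(fingerprintData));
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  // Render the digest as a hex string identifier
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}
```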
3. Track Browser Across Visits
The site can now track browsers across visits by re-calculating the fingerprint hash each time and comparing values:
```
Visit 1:
browserFingerprint = 7y893y4h9fh399hf83hf9h3f

Visit 2:
browserFingerprint = 7y893y4h9fh399hf83hf9h3f   // Match!

-> Browser recognized, flag as potential scraper
```
Matching fingerprint hashes allows sites to identify scrapers making too many automated requests.
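On the server side, the comparison can be as simple as counting how often each fingerprint hash appears within a time window. The helper and threshold below are purely illustrative:

```javascript
// Hypothetical server-side check: count requests per fingerprint hash
const requestCounts = new Map(); // fingerprintHash -> request count

function classifyRequest(fingerprintHash) {
  const count = (requestCounts.get(fingerprintHash) || 0) + 1;
  requestCounts.set(fingerprintHash, count);
  // The threshold is arbitrary here; real systems combine request rate,
  // signal consistency and behavioral scoring
  return count > 100 ? 'flag_as_scraper' : 'allow';
}
```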
Why Javascript Fingerprinting is So Powerful
What makes JavaScript fingerprinting uniquely powerful compared to other anti-bot techniques is:
- Hard to evade – Unlike IP checks, fingerprinting analyzes thousands of unique browser signals making it very hard to spoof.
- Works instantly – Sites can extract a fingerprint in under 500ms, allowing near-instant scraper identification.
- Runs client-side – No need to correlate server logs, all analysis happens in the browser.
- Difficult to detect – Fingerprinting code looks like any normal script.
- Low false positives – Very high accuracy identifying scrapers due to browser signal diversity.
For these reasons, JavaScript fingerprinting has emerged as one of the most potent anti-scraping tools, allowing even unsophisticated sites to detect advanced scrapers with high accuracy.
Patching Browser Automation Leaks
Now that we understand how powerful JavaScript fingerprinting is, let's discuss techniques we can use to avoid detection when building browser-based scrapers.
The first critical step is fixing dead giveaways that immediately identify our scraper browser as an automation bot rather than a real user. These leaks allow sites to flag scrapers instantly on page load.
Fixing navigator.webdriver
One notorious browser leak is the `navigator.webdriver` flag:
```javascript
// Real browser
navigator.webdriver // false

// Selenium browser
navigator.webdriver // true!
```
Browser automation tools like Selenium and Puppeteer cause `navigator.webdriver` to be exposed. This instantly signals to the page that it's running in an automated browser.
We can patch it by overriding the value:
```javascript
// Fix navigator.webdriver leak
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});
```
Now `navigator.webdriver` will always return `false`, masking the leak.
Here is how to apply it in Puppeteer:
```javascript
// Puppeteer fix
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Runs before any page script, so the patch is in place on load
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
  });
})();
```
This patches the `navigator.webdriver` leak in Puppeteer, allowing us to avoid instant flagging.
Fixing Chrome Runtime Leaks
Another common issue is leaks from the Chrome runtime. For example:
```javascript
// Normal Chrome browser
chrome.runtime // undefined

// Puppeteer browser
chrome.runtime.id // exposed!
```
Puppeteer and ChromeDriver can expose the underlying Chrome runtime, which contains telltale automation objects like `chrome.runtime`.
We can prevent access to the Chrome runtime altogether:
```javascript
// Patch chrome runtime leak
Object.defineProperty(window, 'chrome', {
  get: () => {
    // Prevent chrome runtime access
    return undefined;
  },
});
```
This patches Puppeteer browsers to mask the `chrome` object completely.
Prevent Headless Flagging
In addition to patching specific leaks, we want to generally mask signs that the browser is running headlessly:
- Set realistic `navigator.languages` and `navigator.platform` values for the OS
- Fake realistic timezones
- Use plugins/mimeTypes from real Chrome installs
- Set `navigator.webdriver` to `undefined` instead of `false`
- Limit window outer dimensions to less than the screen size
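A minimal manual sketch of some of these masks in Puppeteer is shown below; the user agent, timezone, and viewport values are illustrative examples, not a complete profile.

```javascript
// Manual headless-masking sketch in Puppeteer (example values only)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );
  await page.emulateTimezone('America/New_York');        // realistic timezone
  await page.setViewport({ width: 1366, height: 768 });  // smaller than screen

  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
    Object.defineProperty(navigator, 'platform', { get: () => 'Win32' });
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });

  // ... navigate and scrape as usual
  await browser.close();
})();
```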
Tools like puppeteer-extra-plugin-stealth automate these fixes for Puppeteer and are an essential starting point when running headless browsers:
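A typical setup with puppeteer-extra and the stealth plugin looks like this (the target URL is a placeholder):

```javascript
// puppeteer-extra applies the stealth evasions automatically
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // The plugin patches navigator.webdriver, chrome.runtime, permissions,
  // plugins and other well-known headless tells before page scripts run
  await browser.close();
})();
```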
By patching leaks and masking headless signatures, our scraper will avoid being flagged as an automation bot the instant the page loads. However, there are still thousands of fingerprintable factors, so we need additional evasion techniques.
Resisting Browser Fingerprinting
Once we've patched obvious leaks, the next step is modifying other aspects of our scraper's fingerprint to appear more natural:
Use Common Browser Profiles
Most desktop users are on Windows 10 and macOS Big Sur. Our scraper browser should mimic signals of these common platforms:
- Use Windows 10/macOS values for `navigator.platform` and `navigator.oscpu`
- Apply corresponding browser configs for rendering, fonts, plugins etc.
- Set viewport, screen resolution and browser chrome dimensions to common values
Matching a common OS fingerprint is essential to avoid standing out.
Mimic Real Browser Configurations
Our scraper should apply random configurations modeled after real browser data:
- Set viewport to a common resolution like 1920×1080
- Use a random but realistic timezone
- Choose a locale from common options like `en-US`, `en-GB`, `es-ES` etc.
- Use languages loaded from a real browser install
- Pull other configurations like fonts and plugins from real browser data
The goal is to blend in with normal traffic as much as possible.
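One simple way to do this, assuming a Puppeteer setup, is to sample from a small pool of profiles captured from real browsers and apply the chosen profile consistently for the whole session. The profiles below are hand-written placeholders, not real captured data:

```javascript
// Illustrative profile pool -- in practice, capture these from real
// browser installs rather than writing them by hand
const profiles = [
  {
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
    locale: 'en-US',
    timezone: 'America/Chicago',
  },
  {
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
    locale: 'en-GB',
    timezone: 'Europe/London',
  },
];

// Pick one profile per session and apply it consistently
async function applyRandomProfile(page) {
  const profile = profiles[Math.floor(Math.random() * profiles.length)];
  await page.setUserAgent(profile.userAgent);
  await page.setViewport(profile.viewport);
  await page.emulateTimezone(profile.timezone);
  await page.setExtraHTTPHeaders({ 'Accept-Language': profile.locale });
}
```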
Introduce Calibrated Randomness
Certain fingerprint vectors like User-Agent and WebGL renderer can be randomized:
- Rotate random common User-Agent strings
- Generate unique WebGL renderer fingerprint each session
- Apply slight randomness to time and performance benchmarks
The key is introducing just enough randomness to appear unique while still mimicking real browser data. Too much randomness is also suspicious.
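For example, the WebGL renderer strings can be swapped for a common value chosen once per session. Below is a sketch of the idea, run in the page context (e.g. via `evaluateOnNewDocument`); the GPU strings are illustrative examples:

```javascript
// Report a common GPU chosen once per session instead of the real one
const gpuPool = [
  { vendor: 'Intel Inc.', renderer: 'Intel Iris OpenGL Engine' },
  { vendor: 'Google Inc. (NVIDIA)', renderer: 'ANGLE (NVIDIA GeForce GTX 1060)' },
];
const gpu = gpuPool[Math.floor(Math.random() * gpuPool.length)];

const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function (param) {
  // 37445 / 37446 are UNMASKED_VENDOR_WEBGL / UNMASKED_RENDERER_WEBGL
  // from the WEBGL_debug_renderer_info extension
  if (param === 37445) return gpu.vendor;
  if (param === 37446) return gpu.renderer;
  return getParameter.call(this, param);
};
```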
Limit Identifier Persistence
Fingerprint vectors like WebGL renderer and canvas image digests can be reset:
- Reset WebGL renderer string on each new page or after some time
- Generate new canvas image fingerprint every few minutes
This prevents these volatile identifiers from being used to persistently track our scraper across a site.
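One common way to keep canvas digests from persisting, again run in the page context, is to add imperceptible noise to canvas readouts so each read hashes differently. A minimal sketch:

```javascript
// Add tiny noise to canvas readouts so the resulting digest is not a
// stable identifier; a few flipped low bits are invisible to the eye
const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function (...args) {
  const ctx = this.getContext('2d');
  if (ctx && this.width > 0 && this.height > 0) {
    const image = ctx.getImageData(0, 0, this.width, this.height);
    for (let i = 0; i < 10; i++) {
      const px = Math.floor(Math.random() * image.data.length);
      image.data[px] ^= 1; // flip the lowest bit of a random channel
    }
    ctx.putImageData(image, 0, 0);
  }
  return origToDataURL.apply(this, args);
};
```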
Proxy Fingerprints
Regularly rotating residential and mobile proxies helps mask geographical patterns:
- Rotate IPs frequently (e.g. every 30 mins)
- Use proxies matching the target site's geo-location
- Never reuse the same proxy on a site
With enough proxies, the site cannot easily tie scraping activity to a persistent fingerprint ID.
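At the browser level, a proxy can be applied per launch via Chromium's `--proxy-server` flag, with credentials supplied through `page.authenticate`. The rotation helper and proxy endpoints below are placeholders:

```javascript
// Illustrative proxy rotation: each session launches through a different
// proxy from a pool (endpoints and credentials are placeholders)
const puppeteer = require('puppeteer');

const proxies = [
  { server: 'res-proxy-1.example.com:8000', username: 'user', password: 'pass' },
  { server: 'res-proxy-2.example.com:8000', username: 'user', password: 'pass' },
];

async function newProxiedSession(sessionIndex) {
  const proxy = proxies[sessionIndex % proxies.length];
  const browser = await puppeteer.launch({
    args: [`--proxy-server=http://${proxy.server}`], // proxy host via launch flag
  });
  const page = await browser.newPage();
  // Credentials are supplied per page rather than in the proxy URL
  await page.authenticate({ username: proxy.username, password: proxy.password });
  return { browser, page };
}
```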
Leveraging Scraping Services
Implementing robust browser evasion requires significant investment in engineering and maintaining scraping infrastructure.
Commercial scraping services like Scrapfly, ScraperAPI and ProxyCrawl handle detection evasion internally, so you can scrape without managing the underlying complexity:
```python
# Scrapfly python example (illustrative -- based on the scrapfly-sdk
# package; check its docs for exact parameter and attribute names)
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key='XXX')

config = ScrapeConfig(
    url='https://target-site.com',
    render_js=True,  # Enable JS rendering
    asp=True,        # Enable anti-bot bypass
)

html = client.scrape(config).content
```
Benefits of using a paid scraping service include:
- Works instantly – No need to build browser evasion infrastructure
- Scales easily – Services handle scaling to any level of requests
- Stays up to date – Fingerprint handling is constantly tuned as techniques evolve
- Develop faster – Focus on value-add scraping logic instead of plumbing
For serious commercial scraping, offloading the heavy lifting to a service with existing scale and evasion expertise often makes sense.
Conclusion
JavaScript-based browser fingerprinting has rapidly emerged as one of the most effective options for identifying scrapers and bots. By leveraging the thousands of distinct signals exposed by browsers, even amateur sites can now detect advanced scraping bots with high accuracy.
Thankfully, with enough engineering investment, scrapers can avoid triggering anti-bot services by:
- Patching common automation leaks
- Mimicking normal browser configurations
- Introducing calibrated randomness
- Frequently rotating proxies
However, implementing fingerprint evasion at scale is extremely complex. For teams focused on commercial scraping, leveraging established scraping providers can sidestep the evasion challenges, allowing your engineers to focus on value-add scraping logic and data extraction.