How TLS Fingerprinting Is Used to Block Web Scrapers

Transport Layer Security (TLS) has become essential to internet security. It encrypts communication between our browsers and websites to protect sensitive user data in transit. However, TLS comes with an emerging downside as well – it can inadvertently reveal clues that identify your traffic as a web scraper.

In this guide, we’ll explore how websites are analyzing TLS handshake data to detect and block scrapers. I’ll walk you through how these TLS handshakes work, where they leak identifying information, and how to overcome these challenges.

Buckle up, and let’s dive in!

How TLS Handshakes Establish Secure Connections

TLS is what makes secure HTTPS connections possible. Before any data is exchanged between a client (like your browser) and a server, they need to establish an encrypted tunnel for that data to flow through privately.

This is accomplished via a TLS handshake process. The handshake is like an introduction between the client and server where they negotiate how exactly they’ll encrypt data between each other.

Here is how a typical TLS handshake plays out:

  1. ClientHello: Client reaches out with a "hello" and shares supported TLS versions, cipher suites, and other configuration data.

  2. ServerHello: Server responds with its chosen TLS version, cipher suite, and configuration values.

  3. Certificate Exchange: Server provides its certificate to authenticate itself, and the client validates it.

  4. Key Exchange: Client and server exchange key material from which both sides derive the shared session keys used for encryption.

  5. Finished: Final confirmation messages exchanged to complete handshake.

After this process completes successfully, encrypted application data can start flowing between client and server!
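The first step can be observed without touching the network. The sketch below assumes Python's standard ssl module (backed by OpenSSL); it starts a handshake against an in-memory buffer and captures the bytes the client would send first. The helper name capture_client_hello is ours, chosen for illustration.

```python
import ssl


def capture_client_hello(server_name: str = "example.com") -> bytes:
    """Start a TLS handshake in memory and return the client's first message."""
    ctx = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # bytes "from the server" (stays empty here)
    outgoing = ssl.MemoryBIO()   # bytes the client wants to send
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and is now
        # waiting for a ServerHello that never arrives.
        pass
    return outgoing.read()


hello = capture_client_hello()
print("record type:", hex(hello[0]))      # 0x16 marks a handshake record
print("handshake type:", hex(hello[5]))   # 0x1 marks a ClientHello
print("total bytes:", len(hello))
```

Because the client always speaks first, everything in this message is visible to the server before it has decided anything about the connection.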

This handshake happens behind the scenes every time you connect to an HTTPS website. For example, if you open https://example.com in Chrome:

  1. Chrome will send a ClientHello with its supported TLS versions, cipher suites, and extensions.

  2. The example.com server will choose a TLS version and cipher suite, and respond with a ServerHello.

  3. The server’s certificate is validated by Chrome.

  4. Encryption keys are exchanged.

  5. A secure TLS connection is established!

Now Chrome and example.com can exchange encrypted data through the newly created TLS tunnel.

TLS ClientHellos Reveal Identifying Handshake Data

The ClientHello message sent by clients has historically been ignored. Servers typically don’t care how a client wants to establish encryption, as long as they can agree on a secure configuration.

However, security researchers realized there are subtle differences in the ClientHello data across different HTTP clients and libraries:

  • The TLS version and cipher suites offered
  • The priority order of cipher suites
  • Enabled TLS extensions

For example, here is a snapshot of a ClientHello from the Python Requests library:

TLS Version: TLS 1.2

Cipher Suites:
  TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256  
  TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
  ...

Extensions: 
  server_name
  extended_master_secret
  ...

Now compare this to Chrome on Windows 10:

TLS Version: TLS 1.2  

Cipher Suites:
  TLS_AES_128_GCM_SHA256
  TLS_AES_256_GCM_SHA384
  TLS_CHACHA20_POLY1305_SHA256
  ...

Extensions:
  supported_versions
  server_name
  signed_certificate_timestamp
  ...  

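You’ll notice the TLS versions match (modern clients put TLS 1.2 in this legacy field and advertise TLS 1.3 through the supported_versions extension), but the cipher suites, their order, and the enabled extensions differ.

Browsers ship their own TLS stacks (Chrome uses BoringSSL, Firefox uses NSS) tuned for broad compatibility. Most languages and libraries simply inherit the defaults of whatever TLS library they link against, such as OpenSSL.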

These differences open the door to fingerprinting and identifying clients based on TLS handshakes.
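To make those fields concrete, here is a minimal sketch that captures the ClientHello Python's own ssl module would send and pulls out the values a fingerprinting system looks at. It assumes an OpenSSL-backed Python and a ClientHello that fits in a single TLS record; the function names are hypothetical helpers, not part of any library.

```python
import ssl
import struct


def capture_client_hello(server_name: str = "example.com") -> bytes:
    """Return the raw ClientHello record Python's ssl module would send."""
    ctx = ssl.create_default_context()
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: no server is attached to answer
    return outgoing.read()


def parse_client_hello(record: bytes) -> dict:
    """Pull the fingerprint-relevant fields out of a ClientHello record."""
    pos = 5 + 4                                  # record header + handshake header
    (version,) = struct.unpack_from("!H", record, pos)
    pos += 2 + 32                                # legacy version + client random
    pos += 1 + record[pos]                       # session id (1-byte length prefix)
    (cipher_bytes,) = struct.unpack_from("!H", record, pos)
    pos += 2
    ciphers = list(struct.unpack_from(f"!{cipher_bytes // 2}H", record, pos))
    pos += cipher_bytes
    pos += 1 + record[pos]                       # compression methods
    (ext_bytes,) = struct.unpack_from("!H", record, pos)
    pos += 2
    end = pos + ext_bytes
    extensions, curves, point_formats = [], [], []
    while pos < end:
        ext_type, ext_len = struct.unpack_from("!HH", record, pos)
        body = record[pos + 4 : pos + 4 + ext_len]
        extensions.append(ext_type)              # order matters for fingerprinting
        if ext_type == 10:                       # supported_groups: uint16 list
            curves = list(struct.unpack_from(f"!{(ext_len - 2) // 2}H", body, 2))
        elif ext_type == 11:                     # ec_point_formats: uint8 list
            point_formats = list(body[1:])
        pos += 4 + ext_len
    return {
        "version": version,
        "ciphers": ciphers,
        "extensions": extensions,
        "curves": curves,
        "point_formats": point_formats,
    }


fields = parse_client_hello(capture_client_hello())
for name, value in fields.items():
    print(f"{name}: {value}")
```

The exact numbers printed depend on the Python and OpenSSL versions installed, which is precisely why they work as a fingerprint.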

JA3 Brings TLS Fingerprinting to Life

In 2017, researchers at Salesforce published a fingerprinting technique called JA3 (named after the initials of its creators, John Althouse, Jeff Atkinson, and Josh Atkins) to make TLS client identification easy and consistent.

JA3 defines a string format that captures the key varying ClientHello values:

JA3 Fingerprint String = TLSVersion,CipherSuites,Extensions,EllipticCurves,EllipticCurvePointFormats

The five fields are separated by commas, and the values within each field are written as decimal numbers joined by hyphens. For example, here is the JA3 fingerprint for the Requests library based on our handshake snapshot above:

771,49195-49199-52393-52392...,0-23-65281...,29-23-24,0 

And for Chrome:

771,4865-4866-4867...,0-11-10-16...,29-23-30-25-24,0

We can immediately see the differing cipher suites and enabled extensions between Requests and Chrome reflected in the JA3 fingerprints.

JA3 fingerprints create an easy and consistent way to look up, analyze, and compare TLS handshakes.

To enable quick database lookups, the JA3 strings are usually hashed into a 32-character MD5 digest like:

Requests JA3 MD5: 
dbe0907495f5e986a232e2405a67bed1

Chrome JA3 MD5:
fe8290f5183e36d2bcc2bcbeb0f05221 

Having a fast fingerprint comparison mechanism opened the doors for TLS analysis at scale.
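As a rough sketch, the string construction and hashing can be written in a few lines of Python. The function names and the sample field values below are made up for illustration and do not correspond to any real browser; the GREASE filter reflects the JA3 convention of ignoring the randomized placeholder values some browsers insert.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE placeholders look like 0x0a0a, 0x1a1a, ... and are skipped by JA3."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Join the five ClientHello fields into the comma-separated JA3 string."""
    def join(values):
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_md5(fingerprint: str) -> str:
    """Hash the JA3 string into the 32-character hex digest used for lookups."""
    return hashlib.md5(fingerprint.encode("ascii")).hexdigest()


# Illustrative values only; 0x0A0A is a GREASE entry that gets dropped.
fp = ja3_string(
    version=771,
    ciphers=[0x0A0A, 4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(fp)           # 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
print(ja3_md5(fp))  # 32 hex characters
```

Note that order is preserved, not sorted: two clients offering the same cipher suites in a different order produce different JA3 strings and therefore different hashes.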

TLS Fingerprint Databases Identify Scrapers

With JA3, it became feasible to build massive databases of TLS fingerprint profiles grouped by client type and user agent.

Various commercial services and open source projects now maintain fingerprint databases containing millions to tens of millions of entries.

For example, ja3er.com provides an open JA3 database with user agent breakdowns.

Services like this use crowd-sourced fingerprint submissions to build whitelists and blacklists:

  • Whitelists: Fingerprints of popular browsers and devices. These are allowed.

  • Blacklists: Known scraper fingerprints. These are blocked.

When your Python script or custom client connects to a website protected by these tools, the site performs TLS analysis:

  1. The ClientHello data from your connection is parsed.

  2. A JA3 fingerprint string is constructed.

  3. This fingerprint is hashed and checked against their databases.

If your JA3 fingerprint is known to belong to a browser, the connection proceeds. But if it matches a known scraper fingerprint, the connection is blocked.

Targeted TLS Blocking Shuts Out Scrapers

With access to fingerprint databases, websites can easily filter scraper traffic without impacting real browsers.

For example, tools like Fingerprint Protector provide browser-like whitelists and targeted blacklists aimed at scraping clients:

Label    | Description                      | Action
Browser  | Chrome, Firefox, Edge, Safari,…  | Allow
Scraper  | Python, Scrapy, Selenium,…       | Block
Unknown  | Unrecognized Fingerprint         | Challenge

If your Python Requests script connects, its fingerprint would match "Scraper" and be instantly blocked.

But some sites want to allow general Python access and only block specific abusive scrapers, so they blacklist individual fingerprints instead:

Label                | Description                      | Action
Browser              | Chrome, Firefox, Edge, Safari,…  | Allow
MyCompetitorScraper  | JA3: dbe0907495f5e986…           | Block
Unknown              | Unrecognized Fingerprint         | Allow

Now only connections matching your competitor’s fingerprint are selectively blocked, while general Python traffic proceeds.
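The two policies above differ only in which fingerprints carry a label and what happens to unlabeled ones. A minimal sketch of that decision logic might look like the following; the names (Action, decide) are hypothetical, and the two digests are the sample hashes quoted earlier in this article, reused purely as placeholders.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CHALLENGE = "challenge"


def decide(ja3_hash: str, labels: dict, actions: dict, unknown_action: Action):
    """Map a JA3 hash to a label, then map the label to an action."""
    label = labels.get(ja3_hash, "Unknown")
    return label, actions.get(label, unknown_action)


# Placeholder digests taken from the examples earlier in the article.
CHROME = "fe8290f5183e36d2bcc2bcbeb0f05221"
REQUESTS = "dbe0907495f5e986a232e2405a67bed1"

# Policy 1: block everything labeled as a scraper, challenge the unknown.
strict_labels = {CHROME: "Browser", REQUESTS: "Scraper"}
strict_actions = {"Browser": Action.ALLOW, "Scraper": Action.BLOCK}

# Policy 2: block one specific fingerprint, allow the unknown.
targeted_labels = {CHROME: "Browser", REQUESTS: "MyCompetitorScraper"}
targeted_actions = {"Browser": Action.ALLOW, "MyCompetitorScraper": Action.BLOCK}

unseen = "0" * 32  # a hash that is in neither table
print(decide(REQUESTS, strict_labels, strict_actions, Action.CHALLENGE))
print(decide(unseen, strict_labels, strict_actions, Action.CHALLENGE))
print(decide(unseen, targeted_labels, targeted_actions, Action.ALLOW))
```

Real products will layer more signals on top (IP reputation, headers, behavior), but the fingerprint lookup itself is this cheap, which is part of why it is applied so widely.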

As databases grow, TLS fingerprint blocking is only getting more precise.

Python TLS Handshakes Have Limited Configurability

Now that websites can easily identify scraper TLS handshakes, the next step is to make your clients look less suspicious during the handshake process.

Unfortunately, configuring TLS handshakes in Python is very limited:

The standard library and libraries like Requests use the OpenSSL library that Python was built against for TLS. The ssl module doesn’t expose a way to modify the ClientHello extensions sent.

You only have control over:

  • TLS Version (e.g. TLS 1.2, 1.3)
  • Cipher Suites List

Extensions such as supported_groups, key_share, etc. cannot be changed.

This means Python TLS handshakes inherently look different from browsers that can fully configure extensions.

But you can still reduce your fingerprint surface area by controlling the versions and cipher suites.

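Here is an example of changing the cipher suites in Requests to match the order of Chrome’s TLS 1.2 suites:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

# Chrome-like order for the TLS 1.2 suites, in OpenSSL naming.
# TLS 1.3 suites (TLS_AES_128_GCM_SHA256, ...) cannot be set through this string.
CIPHER_SUITES = ":".join([
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-ECDSA-CHACHA20-POLY1305",
    "ECDHE-RSA-CHACHA20-POLY1305",
])

class CipherAdapter(HTTPAdapter):
    """Transport adapter that builds its connection pool with a custom SSL context."""

    def init_poolmanager(self, *args, **kwargs):
        # Create SSL context to configure the cipher list
        kwargs["ssl_context"] = create_urllib3_context(ciphers=CIPHER_SUITES)
        return super().init_poolmanager(*args, **kwargs)

# Send the request through a session that uses the adapter
session = requests.Session()
session.mount("https://", CipherAdapter())
response = session.get("https://example.com")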

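This makes your Python TLS handshake look less anomalous and can avoid some blacklist matches. The extension list still differs from Chrome’s, though, so the resulting JA3 hash will not equal a real Chrome fingerprint.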
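The two knobs Python does expose can also be inspected with the standard library alone. The following is a small sketch, assuming an OpenSSL-backed Python; build_context and tls12_cipher_names are illustrative helper names, and the two suites are just an example ordering.

```python
import ssl


def build_context(cipher_string: str) -> ssl.SSLContext:
    """Create a client context with pinned TLS versions and a custom cipher order."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_3
    ctx.set_ciphers(cipher_string)   # affects TLS 1.2 and older suites only
    return ctx


def tls12_cipher_names(ctx: ssl.SSLContext) -> list:
    """List the configured pre-TLS-1.3 suites in preference order."""
    # TLS 1.3 suite names start with "TLS_" and are not controlled by set_ciphers.
    return [c["name"] for c in ctx.get_ciphers() if not c["name"].startswith("TLS_")]


ctx = build_context("ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384")
print(tls12_cipher_names(ctx))
print([c["name"] for c in ctx.get_ciphers() if c["name"].startswith("TLS_")])
```

The second line printed shows the TLS 1.3 suites that remain enabled regardless of the string passed to set_ciphers, which is one reason a Python client cannot fully reproduce a browser's cipher list this way.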

Go Provides More Control Over TLS Handshakes

Go provides significantly more control over TLS handshakes than Python.

Libraries like utls and cycletls allow crafting customized ClientHellos in Go with full modification of cipher suites, extensions, and other TLS options.

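For example, with utls you could replicate a Chrome handshake using one of its built-in browser presets:

package main

import (
	"fmt"
	"net"

	tls "github.com/refraction-networking/utls"
)

func main() {
	// Open the underlying TCP connection
	tcpConn, err := net.Dial("tcp", "example.com:443")
	if err != nil {
		panic(err)
	}
	defer tcpConn.Close()

	// Generate TLS config
	config := &tls.Config{ServerName: "example.com"}

	// Create a TLS connection that sends a Chrome-like ClientHello;
	// cipher suites, extensions, and their order come from the preset
	uconn := tls.UClient(tcpConn, config, tls.HelloChrome_Auto)

	// Run the handshake
	if err := uconn.Handshake(); err != nil {
		panic(err)
	}
	fmt.Printf("negotiated TLS version: 0x%04x\n", uconn.ConnectionState().Version)
}

For a hand-built ClientHello, utls also accepts tls.HelloCustom together with uconn.ApplyPreset and a tls.ClientHelloSpec that lists the cipher suites and extensions explicitly.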

This allows modifying the ClientHello to appear like any browser or device fingerprint.

Browser Automation Tools Have Genuine TLS Handshakes

When using browser automation tools like Puppeteer, Playwright, or Selenium, the TLS handshakes are performed by the actual browser’s networking stack.

For example, if you are controlling Chrome with Puppeteer:

  • The ClientHello will use the real Chrome TLS configuration.
  • The JA3 fingerprint will match normal Chrome browsers.

This avoids having to spoof handshakes from Python or Go at all: the browser produces a genuine handshake on its own.

That being said, I recommend using a diverse set of browser versions and platforms when scraping through browser automation.

For example, rotate between:

  • Chrome on Windows 10
  • Chrome on macOS
  • Firefox on Linux
  • Edge on Windows 10

Rather than repeat the same Chrome TLS fingerprints on every request, use a mix of fingerprints.

Do Headless Browsers Have Different TLS Handshakes?

Nope! Running a browser "headless" (without UI) does not affect the TLS handshake.

For example, a headless Chrome Puppeteer script will have the same handshake as a full Chrome browser.

The headless mode only impacts browser rendering. The networking stack is shared.

Curl-Based Clients Can Impersonate Browsers

Many programming languages provide HTTP clients built around libcurl, like Python’s PycURL.

curl-impersonate is a patched build of curl and libcurl that reproduces the TLS handshakes of real browsers.

When using curl-impersonate, you pick a browser target, and it sets the ClientHello cipher suites, extensions, and other values to match that browser.

For example, this Python script uses curl_cffi, a Python binding for curl-impersonate, to spoof a Chrome handshake (stock PycURL has no impersonation option):

from curl_cffi import requests

# "chrome110" is one of the browser targets shipped with curl-impersonate;
# the targets available depend on the installed curl_cffi version.
# The matching User-Agent and default headers are set by the target.
response = requests.get("https://example.com", impersonate="chrome110")

print(response.status_code)

The same concept applies in other languages like Go, Rust, PHP, etc. when the client is built on curl-impersonate.

Countermeasures to TLS Fingerprint Blocking

Now that you understand how TLS handshakes can be fingerprinted and used to block scrapers, here are some countermeasures to consider:

Use Proxy Services Handling TLS Fingerprinting

The easiest option is to use a commercial proxy service like Smartproxy that applies TLS fingerprint spoofing on your behalf.

By routing requests through their residential IPs, you rely on the service to manage the TLS configuration for you.

Route Traffic Through Diverse Proxy Networks

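Distribute your scraping traffic across a mix of residential and mobile proxy servers. Keep in mind that a standard forwarding proxy tunnels your ClientHello unchanged, so this only diversifies TLS fingerprints when the service re-originates the TLS connection on your behalf.

Services like GeoSurf provide proxy IPs from hundreds of networks globally. Combined with varied client configurations, this avoids repeating the same IP and fingerprint pairing.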

Implement Custom TLS Handshake Values

In languages like Go, you can programmatically generate a unique and rotating list of custom TLS handshake attributes to maximize fingerprint diversity.

Customize the cipher suites, extensions, curve orders, and TLS versions.

Deploy a Scraping API Handling TLS for You

Tools like ScrapingBee and ProxyCrawl offer smart scraping APIs where TLS handshake challenges are abstracted away from you.

The APIs manage fingerprint configuration, rotation, and residential proxies under the hood.

Understand TLS Fundamentals

While proxies and APIs are easier, it’s still beneficial to learn TLS handshake basics.

Knowing how to analyze and configure handshakes helps strengthen your overall evasion abilities if needed.

Hopefully this guide provided a solid overview of how TLS handshakes are identified and used to block scrapers. While handshake fingerprinting poses challenges, with the right tools and techniques you can still reach most targets.

Please reach out if you have any other questions! I’m always happy to help the web scraping community.
