How to Scrape Hidden Web Data

Scraping data from the modern web can often feel like a game of hide and seek. While a decade ago most of the information was readily available in the HTML, nowadays developers love to hide and obfuscate data, rendering it dynamically with JavaScript.

This presents an interesting challenge for scrapers. While we can no longer rely on parsing the raw HTML content, a plethora of data is still there in the page – we just have to know where to look.

In this comprehensive guide, we’ll explore various methods, tools and techniques that can be used to extract hidden web data.

What is Hidden Web Data?

Hidden web data is any data that does not appear as ordinary markup in the page’s initial HTML. This includes:

Data loaded dynamically via JavaScript after page load. For example, rendering the contents of a <div> tag by inserting dynamically created HTML elements.
Data stored in JavaScript variables and objects embedded in <script> tags. Often JSON objects containing entire datasets.
HTML content generated on user action via AJAX requests. For example, expanding comment threads or infinite scroll pagination.
Internal API request data and metadata used by the frontend to function. For example, CSRF tokens, user info, temporary caches.
Obfuscated and encrypted data aiming to stop scrapers from accessing it.

The common theme is that this data is missing from the visible markup of the original HTML returned by the server. It is either tucked away inside scripts or generated later by the JavaScript running on the page.

Modern dynamic websites rely heavily on this technique to build fast frontend experiences. All the data can be hidden away and rendered gracefully in small chunks as needed.

Unfortunately, this means scrapers have to work a little harder to get to that data. Let’s look at some ways we can do that efficiently.

Finding Hidden Data in HTML

The first step is confirming whether the data we want is indeed hidden somewhere in the page’s JavaScript.

Here is a simple method to check:

Load the target page in the browser and locate a unique data identifier we want to scrape. For example, a product name or ID.
Disable JavaScript in the browser and reload the page. This can be done in the developer tools.
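Check whether the unique identifier still appears on the page, and search the raw HTML source for it as well.

If the data disappears from the page, it is most likely rendered dynamically by JavaScript on page load. If it still shows up somewhere in the source, it is probably embedded in a script.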

Now we have to dig through the HTML source to find where and how it is generating that content.
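The same check can be scripted. The sketch below is only an illustration: it uses Python's standard html.parser module, and the TextLocator class, the locate_identifier() helper and the sample page are all hypothetical names made up for this example. It reports whether an identifier sits in visible markup, inside a script, or nowhere in the raw HTML.

```python
from html.parser import HTMLParser


class TextLocator(HTMLParser):
    """Collects text inside <script> tags separately from all other text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []
        self.visible_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        target = self.script_text if self.in_script else self.visible_text
        target.append(data)


def locate_identifier(html, identifier):
    """Return 'markup', 'script' or 'missing' for a unique identifier."""
    parser = TextLocator()
    parser.feed(html)
    parser.close()
    if identifier in "".join(parser.visible_text):
        return "markup"
    if identifier in "".join(parser.script_text):
        return "script"
    return "missing"


# hypothetical sample page: the product name only exists inside a script
sample_html = """
<html><body>
  <div id="product"></div>
  <script>var data = {"product": {"name": "Super Product"}};</script>
</body></html>
"""
print(locate_identifier(sample_html, "Super Product"))
```

A result of "script" suggests the data can be parsed straight out of the source, while "missing" hints that it arrives later through a separate request.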

Extracting Data from <script> Tags

One of the most common places for hidden data to reside is inside <script> tags.

This can be JSON objects, JavaScript variables, entire datasets or code that manipulates the page.

For example:

<html>
<body>

  <div id="product"></div>

  <script>
    // product data as javascript object 
    var data = {
      "product": {
        "name": "Super Product",
        "price": 99.99
      }
    }

    // data rendered on page load
    document.getElementById("product").innerHTML = data.product.name + ": £" + data.product.price;

  </script>

</body>  
</html>

Here the actual product data is stored in a JavaScript object variable called data.

The product <div> is empty to start with and populated dynamically on page load.

So to extract this data, we first have to find the relevant <script> tag in the raw HTML. This can be done with any HTML parsing library like BeautifulSoup or Parsel:

# extract scripts from HTML with BeautifulSoup
from bs4 import BeautifulSoup

html = "..."  # page HTML, e.g. the body of an HTTP response
soup = BeautifulSoup(html, 'html.parser')

# every <script> tag in the document
scripts = soup.find_all('script')

Next we have to extract the data from the script content specifically.

Method 1: Load as JSON

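If the entire script body is a valid JSON document, we can simply load it directly with Python’s json module:

import json

# find a script tag whose body is pure JSON
script = soup.find('script', type='application/json')

# load json directly
data = json.loads(script.string)

print(data['product'])
# {'name': 'Super Product', 'price': 99.99}

This works only when the script contains nothing but JSON, which is typically the case for tags that specify type="application/json". A script like the example above, where the object sits between other JavaScript statements, would make json.loads() fail and needs one of the following methods.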
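Pages sometimes carry several JSON script tags. As a rough sketch using only the standard library, the snippet below collects and decodes all of them so the one holding the target data can be picked afterwards. The JsonScriptCollector class and the sample markup are hypothetical and exist only for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collects and decodes every <script type="application/json"> body."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self.capturing = True
            self.buffer = []

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.capturing:
            self.capturing = False
            self.documents.append(json.loads("".join(self.buffer)))


# hypothetical page that ships its state in JSON script tags
sample_html = """
<script type="application/json">{"user": null}</script>
<script id="state" type="application/json">
  {"product": {"name": "Super Product", "price": 99.99}}
</script>
"""

collector = JsonScriptCollector()
collector.feed(sample_html)
collector.close()

# keep only the documents that contain the key we care about
products = [doc["product"] for doc in collector.documents if "product" in doc]
print(products)
```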

Method 2: Regex Matching

For more complex data, we’ll have to parse the raw JavaScript code ourselves. This is where regular expressions come in handy.

We can scan the code and extract parts that match a pattern – like our data object.

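import re
import json

# t is None for script tags without inline content, so guard against it
script = soup.find('script', string=lambda t: t and 'data =' in t)

# match the data object by surrounding syntax
# re.DOTALL lets "." span the line breaks inside the object
match = re.search(r'data = (\{.+\})', script.string, re.DOTALL)

# load matched json
data = json.loads(match.group(1))

print(data['product'])
# {'name': 'Super Product', 'price': 99.99}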

The key is carefully crafting a regex pattern that uniquely identifies the dataset we want from the rest of the code.
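A greedy pattern can overshoot when more braces follow the object. One possible alternative, sketched here, uses the regex only to find where the object starts and lets json.JSONDecoder.raw_decode() work out where it ends. The extract_json_after() helper and the sample script are hypothetical, and the approach assumes the object literal is valid JSON.

```python
import json
import re


def extract_json_after(source, marker_pattern):
    """Decode the first JSON value that directly follows a regex marker."""
    match = re.search(marker_pattern, source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever JavaScript comes after it
    value, _end = json.JSONDecoder().raw_decode(source, match.end())
    return value


# hypothetical script: a later function body also contains braces
script_text = """
var data = {
  "product": {"name": "Super Product", "price": 99.99}
};
function render() { return {}; }
"""

data = extract_json_after(script_text, r"data\s*=\s*")
print(data["product"])
```

The marker pattern has to consume any whitespace before the opening brace, because raw_decode() expects the value to begin exactly at the index it is given.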

Method 3: JavaScript Parsing

For advanced scraping, we may want to parse the full JavaScript code – including variables, functions and objects.

This allows extracting any data while maintaining the original structure and context.

We can use a library such as Js2Py to interpret JavaScript in Python.

For example, with Js2Py:

import re
import js2py

script = soup.find('script', string=lambda t: t and 'data =' in t)

# the interpreter has no DOM, so isolate the declaration
# and leave out the code that renders to the page
declaration = re.search(r'var data = \{.+\}', script.string, re.DOTALL).group(0)

# init JavaScript interpreter
js = js2py.EvalJs()

# run the declaration to define the data variable
js.execute(declaration)

# access parsed data object
print(js.data.to_dict()['product'])
# {'name': 'Super Product', 'price': 99.99}

This allows us to tap into the entire JavaScript environment, beyond just the datasets we want.

Scraping API Data from JavaScript

APIs power most of the dynamic behavior on modern websites. JavaScript makes requests to load data, submit forms or trigger interactions.

By digging into the page code, we can find these API endpoints and mimic the requests to extract data.

For example, here is a simple script that loads product data from a /api/products/123 endpoint:

async function loadProduct(){

  let response = await fetch('/api/products/123');

  let product = await response.json();

  // render product data to page
  document.getElementById("product").innerHTML = product.name;

}

loadProduct();

Once we locate this script in the HTML, we can:

Extract the API URL from the fetch() call
Analyze the AJAX request and response formats
Replicate the API request directly in Python with libraries like Requests

This allows scraping data from APIs the JavaScript relies on without executing any browser code.
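The steps above could be sketched roughly as follows, using only the standard library (Requests would work just as well). The find_fetch_urls() and get_json() helpers are hypothetical, and https://example.com is a placeholder for the page the script came from. Real endpoints may also expect cookies or extra headers that are not shown here.

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def find_fetch_urls(script_text, page_url):
    """Return absolute URLs for string literals passed directly to fetch()."""
    paths = re.findall(r"""fetch\(\s*['"]([^'"]+)['"]""", script_text)
    return [urljoin(page_url, path) for path in paths]


def get_json(url):
    """Request an endpoint much like the page would and decode the JSON body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


script_text = "let response = await fetch('/api/products/123');"

# resolve the relative API path against the (placeholder) page address
urls = find_fetch_urls(script_text, "https://example.com/products/123")
print(urls)
```

Calling get_json(urls[0]) would then perform the actual request. The regex only catches URLs written as plain string literals, so URLs assembled from variables still need manual inspection.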

Finding Data in JavaScript Variables

Page data is also commonly stored directly in JavaScript variables.

For example:

// javascript data
var products = [
  {name: "Product 1", price: 19.99}, 
  {name: "Product 2", price: 24.99}
];

function renderProducts(){
  // loop through products and render HTML
}

Here the full products list is stored in a variable called products.

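To extract this, we first have to find the variable that holds our target data structure. The regex approach is similar, but note that the keys here are unquoted, so this is a JavaScript literal rather than valid JSON and json.loads() would reject it. A JavaScript-aware parser such as chompjs can handle it:

import re
import chompjs

# find products variable
script = soup.find('script', string=lambda t: t and 'var products =' in t)

# match the products array, including its line breaks
match = re.search(r'var products = (\[.+?\]);', script.string, re.DOTALL)

# parse the JavaScript literal into Python objects
data = chompjs.parse_js_object(match.group(1))

print(data)
# [{'name': 'Product 1', 'price': 19.99}, {'name': 'Product 2', 'price': 24.99}]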

If the data structure is more complex, we can parse the entire JavaScript environment to access any in-scope variables.

Scraping Content Loaded via AJAX

Websites frequently load content dynamically via AJAX after page load.

For example, expanding comment threads, infinite scroll pagination or tabs.

This content is not present in the initial HTML but requested from the server as needed.

We can scrape these AJAX snippets by:

Monitoring network requests on the page to identify AJAX URLs.
Reconstructing AJAX requests and sending them directly from Python code.
Parsing the AJAX responses which contain HTML/JSON data.

For example, consider this script that loads paginated data on scroll:

// initially loaded page data
var results = [ /* initial page of data */]; 

// paginate on scroll
window.addEventListener('scroll', function() {

  var page = results.length / 20 + 1;

  // request next page
  fetch('/data?page=' + page)
    .then(res => res.json())
    .then(data => {
      results.push(...data);

      // render new data
    });

});

Here we can see that the script requests pages from the /data endpoint and appends each response to the results variable.

We can replicate these requests and scrape the data directly, avoiding having to parse the full rendered HTML.
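A minimal sketch of that replication is shown below. It assumes the endpoint accepts page numbers starting from 1 and returns an empty JSON list once the data runs out. Neither is confirmed by the script above, so both need checking against the real responses. The fetch_page() and scrape_all_pages() helpers are hypothetical names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, page):
    """Request one page of results and decode its JSON list."""
    url = f"{base_url}?{urlencode({'page': page})}"
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape_all_pages(get_page, max_pages=100):
    """Collect results page by page until an empty page or the page cap."""
    results = []
    for page in range(1, max_pages + 1):
        items = get_page(page)
        if not items:  # assumed end-of-data signal
            break
        results.extend(items)
    return results


# offline stand-in for the endpoint: three pages, then nothing
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
results = scrape_all_pages(lambda page: fake_pages.get(page, []))
print(results)
```

Against a live site the call would look like scrape_all_pages(lambda page: fetch_page("https://example.com/data", page)), with the placeholder address swapped for the real one. Passing the page getter in as a function keeps the loop easy to test without network access.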

Executing JavaScript with Headless Browsers

For the ultimate in dynamic content scraping, we can spin up a full headless browser, load the page and directly access the JavaScript environment.

This allows evaluating code, loading dynamic content and accessing any data, functions or DOM elements available to the live page.

Here is an example with Playwright in Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.goto('https://targetpage.com')

    # evaluate an expression in the page context to get the data
    data = page.evaluate('window.products')

    browser.close()

print(data)

The key is using page.evaluate() to run custom JavaScript code in the context of the loaded page.

This gives us complete access to scrape any otherwise hidden data.

The downside is having to launch a full browser instance, which is slower and more resource-hungry than direct HTTP requests. So this method is best reserved for complex pages where the lighter techniques fail.

Obfuscated and Encrypted Data

Websites often deliberately obfuscate their JavaScript to prevent scraping.

Some examples include:

Minifying variable and function names to meaningless characters like a, b, fn1()
Splitting datasets across multiple variables and scripts
Encrypting/encoding data so it is not human-readable
Dynamically assembling data at run-time from fragmented pieces
Code protection techniques like packing, obfuscation, anti-debugging, VM execution

This can make parsing the JavaScript very tricky. Small code changes can easily break our scrapers.

There are a few ways to handle heavily obfuscated pages:

Use headless browsers like Playwright or Puppeteer to run the code and read the resulting data, rather than analyzing the obfuscated source directly.
Trace code execution to understand how data is assembled – for example using browser developer tools or proxying browser traffic.
Analyze how real users interact with the page to identify data sources.
Pattern match known data structures – like product names, prices, IDs – to locate relevant code parts even if variables are obfuscated.
For encryption, try locating encryption keys or reverse engineer decryption algorithms.

Over time we can build resilience by evolving scrapers to adapt to obfuscation changes.
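As a small illustration of handling encoded data, the sketch below covers just one lightweight case: JSON that has been Base64-encoded into a string literal. The decode_embedded_json() helper and the sample script are hypothetical. Genuinely encrypted data would need the key and algorithm recovered first, which this does not attempt.

```python
import base64
import json
import re


def decode_embedded_json(script_text):
    """Return JSON values hidden in Base64-looking string literals."""
    decoded = []
    # quoted runs of at least 16 Base64 characters, with optional padding
    pattern = r"""['"]([A-Za-z0-9+/]{16,}={0,2})['"]"""
    for candidate in re.findall(pattern, script_text):
        try:
            raw = base64.b64decode(candidate, validate=True)
            decoded.append(json.loads(raw.decode("utf-8")))
        except ValueError:
            # covers invalid Base64, invalid UTF-8 and invalid JSON
            continue
    return decoded


# build a hypothetical obfuscated script for the demonstration
product = {"name": "Super Product", "price": 99.99}
payload = base64.b64encode(json.dumps(product).encode("utf-8")).decode("ascii")
script_text = f'var a = "{payload}"; var b = "notbase64atallbutlongenough";'

print(decode_embedded_json(script_text))
```

Strings that merely look like Base64 are skipped silently, so the helper can be run over every script on a page without prior knowledge of the variable names.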

Scraping Hidden APIs with Proxies

Hidden web APIs often employ anti-scraping techniques like IP rate limiting, CAPTCHAs and bot detection to prevent automated access.

This is where proxies come in very handy for scraping. By routing requests through residential IPs, we can bypass many protections and access APIs at scale.

Some tips for scraping with proxies:

Use regular proxy rotation to prevent getting blocked on specific IPs
Enable proxy rotation based on regions or ISPs for wide diversity
Use backconnect proxies which provide thousands of unique IPs to cycle through
Limit request rates per proxy to mimic real user behavior
Pair proxies with realistic request headers and browser fingerprints so traffic resembles real devices, not just anonymous IPs
Monitor for and handle common blocks like captchas, blocking pages, 429s

With the right proxy setup, we can access practically any target site or hidden API.
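The rotation and rate-limiting tips can be sketched as a small helper. The ProxyRotator class and the proxy addresses are hypothetical placeholders. A real pool would come from a proxy provider, and the selected proxy would then be passed to whichever HTTP client is in use.

```python
import itertools
import time


class ProxyRotator:
    """Cycles through proxies while enforcing a minimum delay per proxy."""

    def __init__(self, proxies, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._cycle = itertools.cycle(proxies)
        self._last_used = {proxy: None for proxy in proxies}
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep

    def next_proxy(self):
        """Return the next proxy, waiting first if it was used too recently."""
        proxy = next(self._cycle)
        last = self._last_used[proxy]
        if last is not None:
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy


# placeholder proxy addresses; no delay so the demo finishes instantly
rotator = ProxyRotator(["http://proxy-a:8000", "http://proxy-b:8000"],
                       min_interval=0.0)
picked = [rotator.next_proxy() for _ in range(3)]
print(picked)
```

The clock and sleep functions are injectable so the pacing logic can be tested without real waiting. Handling blocks such as 429 responses would sit in the calling code, for example by retiring a proxy for a while after it is refused.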

Scraping Services for Hidden Data

There are also managed scraping services that are purpose-built for extracting JavaScript rendered data.

These provide browser automation, proxy management and JavaScript execution capabilities.

Some examples include:

ScrapingBee – Browser and proxy API that can evaluate JS in pages.
ScraperAPI – Headless browser API with auto proxy rotation.
Apify – Actor runtime for browser automation at scale.
ScrapeOps – Visual browser automation builder with JS extraction.
ScrapFly – Unblockable scraping API with millions of backconnect residential proxies.

These services handle all the complexities of dynamic page rendering and make scraping hidden data easy.

Key Takeaways

Here are the key points for scraping hidden website data:

Inspect pages without JavaScript to confirm data is loaded dynamically
Extract and parse scripts, variables and JSON objects from HTML
Analyze and replicate AJAX requests to access hidden APIs
Use proxies and headless browsers when needed for heavy JS sites
Pattern match and reverse engineer obfuscated code
Adapt scrapers to handle anti-bot protections

With the right techniques, practically any public website data can be extracted. We just have to know where to look!