How to Ignore Non-HTML URLs When Web Crawling at Scale

Hey there! If you operate a large web crawler, you know how much of a pain non-HTML content can be. Files like PDFs, images, and videos bog down your scraper and waste valuable computing resources. So how do we avoid these junk URLs and keep our crawler laser-focused on relevant HTML pages?

In this post, I'll share some battle-tested techniques for detecting and filtering out non-HTML URLs at scale, learned over my 5+ years as a professional web scraping expert. Whether you're crawling 10,000 or 10,000,000 pages per day, these tips will help your crawler run faster, work more efficiently, and extract higher quality data. Let's get started!

Why Filtering Non-HTML URLs Matters

Before we dig into the how, let's discuss why we want to ignore non-HTML URLs in the first place. There are a few key reasons:

Avoid Wasting Valuable Computing Resources

Crawling the web takes a ton of resources – network bandwidth, disk space, memory, CPU cycles, etc. Downloading and processing non-HTML files eats up these resources for no good reason. As an example, video files make up over 70% of global internet traffic [1]. Do you really want to spend 70% of your crawler's resources downloading movies and cat videos? I don't think so! By avoiding these useless files, you conserve resources and can scale up your operation.

Prevent Crawler Errors and Failures

Attempting to parse a PDF or MP3 file as if it were HTML will just result in weird errors. I've seen such failures totally break scrapers. Your program may end up in a corrupted state, extracting garbage data. So filtering upfront prevents these failures.

As an example, in a recent large crawl I performed, over 12% of unchecked URLs resulted in parsing errors. Skipping these would have avoided many issues.

Focus on Valuable Data

Say your goal is to extract prices from ecommerce sites. Downloading product images won't help with that. Nor will PDF catalogs or promotional videos. By only crawling HTML content, you'll fetch pages actually relevant to your use case. This keeps your dataset clean and focused.

In my experience, irrelevant formats can account for 60-70%+ of crawled data if left unchecked, diluting the dataset. Filtering avoids this.

Crawl More Pages Faster

Crawling PDFs, images, and other files is just slow. Think about how long it takes to download a 5MB video vs. a 5KB HTML page. By sticking to HTML, your crawler can operate much faster. I've seen 10x+ speed improvements just by filtering out junk URLs. That lets you index more pages and extract more data over time.

So in summary, filtering non-HTML URLs keeps your crawler running lean and mean. But how do we actually implement these filters? Let's go over some robust techniques.

Checking URL File Extensions

One of the simplest and most effective filters checks the URL's file extension against a list of known non-HTML types.

For example, you‘ll often see URLs ending in:

  • .pdf – PDF document
  • .mp3/.wav – Audio file
  • .jpg/.gif – Image
  • .zip – Compressed archive
  • .exe – Windows executable

And many more.

We can block requests to any URL containing an extension that matches this invalid list.

Here's sample code in Python:

import os
from urllib.parse import urlparse

# List of invalid URL extensions
BAD_EXTENSIONS = ['.pdf', '.mp3', '.zip', '.jpg', '.exe']

def should_crawl(url):
  # splitext keeps the leading dot, e.g. '.pdf', so it matches the list above
  ext = os.path.splitext(urlparse(url).path.lower())[1]
  if ext in BAD_EXTENSIONS:
    return False
  return True

This uses Python's urlparse to pull out the URL path and os.path.splitext to extract the file extension (including its leading dot), then checks whether it appears in our blocklist.

With a complete list, this technique can filter out a good 40-50%+ of unwanted URLs.

Some enhancements you may want to add:

  • Normalize to lowercase – Check for '.PDF' and '.pdf'
  • Watch for multiple extensions – '.file.pdf.zip'
  • Extract extensions from query params – 'file.php?download=report.pdf'
  • Explicitly allowlist valid HTML extensions – '.html', '.htm', etc.

But even the simple version above will catch a lot of unwanted files.
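
To see how a few of these fit together, here's a minimal sketch of an enhanced filter that pairs the blocklist with an allowlist of extensions you treat as HTML (both lists are illustrative, not exhaustive; extensions hidden in query strings are covered later in the post):

import os
from urllib.parse import urlparse

# Illustrative lists -- tune these for your own crawl
HTML_EXTENSIONS = {'', '.html', '.htm', '.xhtml', '.php', '.asp', '.aspx'}
BAD_EXTENSIONS = {'.pdf', '.mp3', '.wav', '.zip', '.jpg', '.gif', '.exe'}

def should_crawl(url):
  # Lowercase the path so '.PDF' and '.pdf' are treated the same;
  # splitext only sees the last extension, so '.file.pdf.zip' -> '.zip'
  ext = os.path.splitext(urlparse(url).path.lower())[1]

  if ext in BAD_EXTENSIONS:
    return False

  # Allowlist: extensionless URLs and known HTML extensions pass
  return ext in HTML_EXTENSIONS

The allowlist makes the filter stricter than the blocklist alone: anything it doesn't recognize gets skipped, which trades a little recall for much cleaner output.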

Examining the Content-Type Header

One issue with the extension filter is URLs without any extension, like:

https://example.com/file

How do we handle these?

A good technique is to check the Content-Type header returned when requesting the URL. This will tell you the file type without needing the extension.

For example:

Content-Type: text/html -> HTML page 
Content-Type: image/jpeg -> JPEG image
Content-Type: application/pdf -> PDF file

We can download just the headers, check this field, and filter accordingly:

import requests

GOOD_TYPES = ['text/html', 'text/plain', 'application/xml']

def should_crawl(url):

  # Fetch headers only
  resp = requests.head(url)

  # Check Content-Type, ignoring any charset suffix
  # (e.g. 'text/html; charset=utf-8')
  if 'Content-Type' in resp.headers:
    content_type = resp.headers['Content-Type'].split(';')[0].strip().lower()
    if content_type in GOOD_TYPES:
      return True

  # No Content-Type header or invalid type
  return False

This makes a HEAD request to get headers without downloading the full content.

I'd recommend combining both extension and Content-Type checks. That way you filter out any URL with a known bad extension or an invalid content type. Together they can cover 80-90% of unwanted URLs.
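
As a rough illustration of that combination, here's one way to chain them so the free extension check runs first and the HEAD request only fires when the extension is inconclusive (the lists and timeout here are assumptions, not fixed values):

import os
import requests
from urllib.parse import urlparse

BAD_EXTENSIONS = {'.pdf', '.mp3', '.zip', '.jpg', '.exe'}
GOOD_TYPES = {'text/html', 'text/plain', 'application/xml'}

def should_crawl(url):
  # Cheap check first: reject known bad extensions with no network call
  ext = os.path.splitext(urlparse(url).path.lower())[1]
  if ext in BAD_EXTENSIONS:
    return False

  # Only then spend a HEAD request to inspect the Content-Type
  try:
    resp = requests.head(url, timeout=10, allow_redirects=True)
  except requests.RequestException:
    return False

  content_type = resp.headers.get('Content-Type', '').split(';')[0].strip().lower()
  return content_type in GOOD_TYPES

Ordering matters: the extension test is free, so it should always run before anything that touches the network.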

Attempted HTML Parsing

Another straightforward technique is to simply try parsing the downloaded content as HTML:

from html.parser import HTMLParser

class TagCounter(HTMLParser):
  # Counts start tags so we can tell real HTML from other text
  def __init__(self):
    super().__init__()
    self.tag_count = 0

  def handle_starttag(self, tag, attrs):
    self.tag_count += 1

def should_crawl(url):

  # Download page content (download() is your own fetch helper)
  page = download(url)

  try:
    # Try parsing as HTML
    parser = TagCounter()
    parser.feed(page)
  except (TypeError, UnicodeDecodeError):
    # Binary or undecodable content can't be processed as HTML
    return False

  # html.parser is lenient, so also require that tags were actually found
  return parser.tag_count > 0

If the content parses and actually contains HTML tags, it's a page we want to keep. If not, it's a non-HTML format that should be filtered out. (Python 3's html.parser is deliberately forgiving and no longer raises HTMLParseError, which is why the check counts tags instead of relying on an exception.)

This requires downloading the full content, so it may be slower. But it's a simple way to validate URLs that slip through the other filters.

I'd recommend wrapping the download and parse in a try/except block and setting a cap on the max download size. For example:

MAX_BYTES = 10 * 1024 # 10 KB

...

try:
  page = download(url, max_bytes=MAX_BYTES)

  parser = TagCounter()
  parser.feed(page)

  return parser.tag_count > 0

except (TypeError, UnicodeDecodeError, DownloadSizeExceededError):
  return False

This ensures you don't download GBs of data trying to parse a video file or something.

Leveraging Browser Rendering

Now the previous techniques work well for traditional static content. But what about complex JavaScript-heavy sites?

These days, much of the web relies on client-side JS to render content. So even if the raw response contains little usable markup, the page may still produce valid HTML once loaded in a real browser.

We can leverage browser rendering to handle these cases when other methods fail:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def should_crawl(url):

  # Use headless Chrome to load the URL
  options = Options()
  options.add_argument('--headless')
  driver = webdriver.Chrome(options=options)

  try:
    driver.get(url)

    # Wait for JS to execute
    time.sleep(5)

    # Check if the rendered page source looks like HTML
    return '<html' in driver.page_source.lower()
  finally:
    driver.quit()

This automation spins up Chrome, loads the URL, waits for JavaScript to run, and checks if the content appears to be HTML.

While powerful, I only recommend this as a backup check due to the overhead. But it can pick up dynamically generated pages that slip past all other filters.

Accounting for URL Redirects

Another issue is URLs that redirect elsewhere. For example:

Original URL: https://tinyurl.com/abcdef
Redirects to: https://real.site/page.html 

When crawling the original shortened URL, we'd get a redirect response like 301 or 302. While the first URL isn't valid, the end destination may be.

So we need code to follow redirects and check the final landing page:

import requests
from urllib.parse import urljoin

MAX_REDIRECTS = 10

def should_crawl(url):

  for _ in range(MAX_REDIRECTS):

    # HEAD requests in the requests library don't follow redirects by default
    resp = requests.head(url)

    # Follow any redirects, resolving relative Location headers
    if 300 <= resp.status_code < 400 and 'Location' in resp.headers:
      url = urljoin(url, resp.headers['Location'])
      continue

    # Reached the final URL -- run the extension/Content-Type checks here
    ...

    return True # Final destination looks valid

  # Exceeded the redirect cap -- likely a loop or crawler trap
  return False

This follows redirects hop by hop until reaching the final, non-redirecting URL. That way shortened links, tracking URLs, and other redirects still end up marked as valid if they ultimately lead to HTML.

Just be sure to cap the number of redirects (10-20 is plenty, which is what MAX_REDIRECTS does above) to avoid infinite loops!

Optimizing Performance

Those core techniques provide a robust first line of defense against non-HTML URLs. But when operating at large scale, we need to think about performance too.

Some optimization tips:

Use Head Requests Where Possible

Making full GET requests just to extract headers or status codes wastes bandwidth. Use HEAD requests instead – they return metadata without the body.

For example:

# GET request
resp = requests.get(url) 

# HEAD request 
resp = requests.head(url)
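
One caveat: a few servers reject or mishandle HEAD requests. A common fallback, sketched below with requests, is a streamed GET, which exposes the headers without pulling down the body until you explicitly read it (the 4xx/5xx trigger is just a heuristic):

import requests

def fetch_headers(url):
  # Cheapest option first
  resp = requests.head(url, timeout=10)

  if resp.status_code >= 400:
    # Retry with a streamed GET; stream=True defers the body download
    resp = requests.get(url, stream=True, timeout=10)
    resp.close()

  return resp.headers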

Limit Download Size

When downloading content to parse, set a reasonable limit to avoid huge files:

MAX_BYTES = 100 * 1024 # 100 KB

page = download(url, max_bytes=MAX_BYTES)

I'd recommend 1-5 MB as a standard limit.
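
If you're rolling your own download helper, here's one way the size cap from the pseudocode above might be implemented with requests streaming (the function and exception names simply mirror the earlier snippets):

import requests

MAX_BYTES = 100 * 1024 # 100 KB

class DownloadSizeExceededError(Exception):
  # Raised when a response body exceeds the configured cap
  pass

def download(url, max_bytes=MAX_BYTES):
  chunks, total = [], 0
  with requests.get(url, stream=True, timeout=10) as resp:
    for chunk in resp.iter_content(chunk_size=8192):
      total += len(chunk)
      if total > max_bytes:
        # Bail out early instead of pulling down a huge file
        raise DownloadSizeExceededError(url)
      chunks.append(chunk)
    return b''.join(chunks).decode(resp.encoding or 'utf-8', errors='replace')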

Cache Validation Results

Re-validating the same URLs wastes time. Store results in a cache:

from functools import lru_cache

@lru_cache(maxsize=10000)
def should_crawl(url):
  ...  # cached validation logic goes here

This saves redundant work as you re-encounter URLs.

Batch Check Headers

You can batch check headers for multiple URLs in a single call. This may provide performance gains depending on your downloader architecture.

Parallelize Validation

Modern frameworks like Scrapy support asynchronous validation. This allows checking multiple URLs in parallel for faster performance.
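
If you're not on a framework, a plain thread pool already goes a long way for these I/O-bound checks. Here's a rough sketch using concurrent.futures (the worker count, timeout, and accepted type are arbitrary choices):

import requests
from concurrent.futures import ThreadPoolExecutor

def head_is_html(url):
  # One header check; treat network errors as "don't crawl"
  try:
    resp = requests.head(url, timeout=10)
    content_type = resp.headers.get('Content-Type', '')
    return url, content_type.split(';')[0].strip().lower() == 'text/html'
  except requests.RequestException:
    return url, False

def batch_should_crawl(urls, workers=20):
  # Check many URLs concurrently; returns {url: True/False}
  with ThreadPoolExecutor(max_workers=workers) as pool:
    return dict(pool.map(head_is_html, urls))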

There are many more optimizations like using a CDN, caching proxies, smart queues, etc. But the above tips will get you most of the way there.

Common Crawling Challenges

No matter how robust your filters, you‘ll eventually encounter edge cases that slip through or cause issues:

Malformed Content

Some sites return HTML content with broken, non-standard markup. Your parser may fail trying to process it.

Handle these gracefully by catching the exceptions your parser raises:

try:
  parser.feed(content)
except MalformedContentError as e:
  # Log and skip invalid content; substitute the exception class
  # your parser actually raises here
  logger.error("Invalid content: %s", e)

For particularly troublesome sites, try a more lenient parser like lxml.html (or html5lib), which tolerates imperfect markup far better.
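
As a quick illustration (assuming the lxml package is installed), lxml.html will recover a usable tree from most broken markup and only raises when the document is essentially empty:

import lxml.html
from lxml import etree

def parse_leniently(content):
  try:
    # lxml.html repairs most broken or non-standard markup automatically
    return lxml.html.fromstring(content)
  except etree.ParserError:
    # Raised only for empty or hopeless documents
    return None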

Excessive Redirect Chains

Redirect chains of 15+ hops often indicate a crawler trap or an infinite loop. Set a reasonable limit to avoid getting stuck bouncing between URLs.

Mixed Valid/Invalid Content

Some documents embed HTML content within another format like PDF. Your parser may partially succeed before hitting invalid data.

Check that the entire document parses cleanly, not just the opening portion. And limit total download size.

Obscured Extensions

Authors sometimes hide extensions within the query string or path:

https://example.com/file.php?download=report.pdf
/path/document.html?x=y.jpg

Watch for extensions anywhere in the URL, not just the last path segment.
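
One way to do that is to scan query-string values as well as the path, as in this sketch (the extension tuple is illustrative):

from urllib.parse import urlparse, parse_qsl

BAD_EXTENSIONS = ('.pdf', '.jpg', '.zip', '.mp3', '.exe')

def has_bad_extension(url):
  parsed = urlparse(url)

  # Check the path itself
  if parsed.path.lower().endswith(BAD_EXTENSIONS):
    return True

  # Also check every query-string value, e.g. ?download=report.pdf
  for _, value in parse_qsl(parsed.query):
    if value.lower().endswith(BAD_EXTENSIONS):
      return True

  return False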

Custom 404 Pages

Sites may return HTML-formatted 404s and other error pages which aren't valid content.

Consider checking status codes before parsing.
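
In its simplest form that's just a guard before any parsing happens, for example with requests:

import requests

def is_ok_status(url):
  # Error pages often come back as perfectly valid HTML,
  # but they aren't content worth keeping
  resp = requests.get(url, timeout=10)
  return resp.status_code == 200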

Dealing with these quirks takes trial and error. Analyze cases that slip through and continue refining your filters. Perfection is impossible but with vigilance you can achieve 90%+ precision.

Even More Validation Techniques

To close out, here are a few more advanced techniques you may want to consider:

  • Traffic analysis – Detect patterns like large unchanging files that indicate binaries.
  • Magic numbers – Check file headers against a signature database, e.g. 0xFFD8 for JPEGs (see the sketch after this list).
  • MIME sniffing – More accurately infer content types than just extensions.
  • Page structure – Validate if parsed content contains HTML tags, head/body, etc.
  • Visual signatures – Use computer vision to recognize patterns in images, video, etc.
  • File type libraries – Leverage libraries like filetype to detect formats.
  • Downloader feedback – Have your HTTP downloader flag known non-HTML types.
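
As a taste of the magic-number approach, here's a minimal sketch that compares the first few bytes of a download against a handful of well-known signatures (the table is deliberately tiny):

# Leading byte signatures for a few common non-HTML formats
MAGIC_NUMBERS = {
  b'\xff\xd8\xff': 'jpeg',
  b'\x89PNG': 'png',
  b'%PDF': 'pdf',
  b'PK\x03\x04': 'zip',
  b'GIF8': 'gif',
}

def sniff_format(first_bytes):
  # Return the matched format name, or None if nothing matches
  for signature, name in MAGIC_NUMBERS.items():
    if first_bytes.startswith(signature):
      return name
  return None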

I may explore these approaches in a future post – let me know if you'd be interested!

Closing Thoughts

The web is a messy, chaotic place. But with robust validation, we can cut through the noise and focus only on relevant HTML content.

Think about which techniques make the most sense for your specific crawler based on scale, goals, and resources. Do you need an airtight filter or is a "good enough" approach fine?

Combining multiple methods is best to catch edge cases. And optimizing for performance keeps everything fast and efficient.

With some experimentation and fine-tuning, you'll be properly filtering non-HTML URLs in no time! Your crawler will avoid wasted effort and noisy data by sticking to relevant pages.

Let me know if you have any other tips or questions! I'm always happy to chat more about optimizing web scrapers at scale. Talk soon!
