How to Ignore Non-HTML URLs When Web Crawling Using C#

Introduction

Web crawling is the process of programmatically visiting web pages, extracting their content and outgoing links, and following those links to cover a website recursively. When building a web crawler, it's important to fetch and parse only actual HTML pages while ignoring other types of files like images, videos and PDFs. This article explains two effective methods for filtering out non-HTML URLs in a C# web crawler:

Checking the URL suffix against a list of unwanted file extensions
Performing a lightweight HEAD request and inspecting the Content-Type response header

By implementing these techniques, you can increase the efficiency of your crawler and avoid unnecessary requests and processing. Let's dive into the details of each method and see how to apply them in practice.

Method 1: Check URL Suffixes for Unwanted File Extensions

One simple way to ignore non-HTML URLs is to look at the file extension at the end of the URL string. You can maintain a list of common extensions used for images, media files, documents and other non-webpage assets. Then before fetching a URL, extract its suffix and skip it if it matches one of those extensions.

Here's a code snippet showing how to filter URLs using this approach in C#:

using System;
using System.Collections.Generic;
using System.IO;

HashSet<string> UnwantedExtensions = new HashSet<string> {
  "jpg", "jpeg", "png", "gif", "bmp", "tiff",  // Images
  "mp4", "avi", "mov", "wmv", "flv", "webm",   // Videos
  "mp3", "wav", "aac", "flac", "ogg",          // Audio
  "pdf", "doc", "docx", "xls", "xlsx", "ppt",  // Documents
  "zip", "tar", "gz", "rar", "7z"              // Archives
};

Uri uri = new Uri("http://example.com/image.jpg");
// AbsolutePath excludes the query string, so "?size=large" does not affect the result.
// ToLowerInvariant avoids culture-specific casing surprises.
string extension = Path.GetExtension(uri.AbsolutePath).ToLowerInvariant().TrimStart('.');

if (UnwantedExtensions.Contains(extension))
{
   Console.WriteLine($"Skipping URL with extension '{extension}': {uri}");
}
else
{
   Console.WriteLine($"Fetching URL: {uri}");
   // Crawl the HTML page...
}

This code defines a HashSet of unwanted file extensions, then for a given URL, it extracts the extension using Path.GetExtension(). If the lowercase, dot-trimmed extension exists in the UnwantedExtensions set, the URL is skipped. Otherwise, it will be fetched and crawled as an HTML page.

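The big advantage of this method is speed: it costs one string operation per URL and no network traffic. The downside is that it relies on URLs carrying an accurate file extension, which is not always the case. A URL with no extension, or a dynamic one ending in .asp or .php, passes a denylist like the one above whether it returns HTML or some other file type. Those same .asp and .php pages would be wrongly dropped if you switched to a strict allowlist of extensions such as .html and .htm.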
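The behavior of the extension check on less tidy URLs can be sketched as a small helper. The `UrlFilter` class and `HasUnwantedExtension` method are illustrative names, not part of any library, and the shortened extension list is only for brevity. This version uses a case-insensitive set instead of lowercasing the extension, which is an alternative to the approach shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class UrlFilter
{
    // Hypothetical, shortened denylist. The comparer makes "JPG" and "jpg" equivalent.
    private static readonly HashSet<string> UnwantedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "jpg", "jpeg", "png", "gif", "pdf", "zip", "mp4", "mp3"
        };

    public static bool HasUnwantedExtension(Uri uri)
    {
        // AbsolutePath excludes the query string and fragment, so only the
        // path portion of the URL is inspected.
        string extension = Path.GetExtension(uri.AbsolutePath).TrimStart('.');

        // An empty extension (for example "/download") is never in the set.
        return UnwantedExtensions.Contains(extension);
    }
}
```

With this helper, http://example.com/photo.JPG?size=large is rejected, while http://example.com/download?file=report.pdf is accepted because its path has no extension. The second case is exactly the kind of URL that the extension check alone cannot catch.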

Method 2: Perform a HEAD Request and Inspect Content-Type Header

A more robust, but slightly slower, approach is to perform a lightweight HEAD request to each URL and check the Content-Type header in the response. A HEAD request is just like a GET request except it asks the server to return only the response headers and not the actual content.

The Content-Type header indicates the media type of the requested resource. For HTML pages, the Content-Type will typically be "text/html". By checking this header value, you can more reliably determine if a URL points to an HTML page or some other type of file.
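In practice the header often carries parameters, as in "text/html; charset=utf-8", so comparing the raw header string to "text/html" can fail. The sketch below shows one way to parse the value with `MediaTypeHeaderValue` from System.Net.Http.Headers. The `ContentTypeCheck` class and `IsHtml` method are hypothetical names, and also accepting "application/xhtml+xml" is an assumption you may not want in your crawler.

```csharp
using System;
using System.Net.Http.Headers;

public static class ContentTypeCheck
{
    public static bool IsHtml(string contentTypeHeader)
    {
        // A missing or malformed header is treated as "not HTML" in this sketch.
        if (!MediaTypeHeaderValue.TryParse(contentTypeHeader, out var parsed))
        {
            return false;
        }

        // MediaType holds only the type/subtype, without parameters like charset.
        string mediaType = parsed.MediaType;

        // Media type names are case-insensitive.
        return string.Equals(mediaType, "text/html", StringComparison.OrdinalIgnoreCase)
            || string.Equals(mediaType, "application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }
}
```

When you read the header through HttpClient, `response.Content.Headers.ContentType?.MediaType` gives you the same parameter-free value, so this helper is mainly useful when you only have the raw header string.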

Here's how to implement this in C#:

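using System;
using System.Net.Http;

// Reuse a single HttpClient for the whole crawl instead of creating one per request.
HttpClient httpClient = new HttpClient();

Uri uri = new Uri("http://example.com/page");
using HttpRequestMessage headRequest = new HttpRequestMessage(HttpMethod.Head, uri);
using HttpResponseMessage response = await httpClient.SendAsync(headRequest);

// MediaType omits parameters such as "; charset=utf-8" and is null if the header is absent.
string contentType = response.Content.Headers.ContentType?.MediaType;

// Media type names are case-insensitive, so compare accordingly.
if (string.Equals(contentType, "text/html", StringComparison.OrdinalIgnoreCase))
{
   Console.WriteLine($"Fetching HTML page: {uri}");
   // Crawl the HTML page...
}
else
{
   Console.WriteLine($"Skipping non-HTML URL with Content-Type '{contentType}': {uri}");
}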

This sends an asynchronous HEAD request using HttpClient and retrieves the Content-Type header from the response. If the media type is "text/html", the URL is fetched and parsed as an HTML page. Otherwise it is skipped and the non-HTML Content-Type is logged.

The advantage of HEAD requests is that they give you a more definitive answer of the content type straight from the server. The potential downside is the extra latency of the round trip to the server, which can add up for a large number of URLs. However, HEAD requests are typically quite fast since no content is returned.

Combining the Two Methods for Maximum Effectiveness

For the best coverage and efficiency, I recommend using both of the above methods together in your C# web crawler. By first checking the URL extension against a denylist, you can quickly filter out obvious non-HTML URLs without any requests. Then for URLs that make it past that check, you can perform a HEAD request to verify it's actually an HTML page before fully fetching it.

Here's what the core crawling logic might look like putting it all together:

public async Task Crawl(Uri uri)
{
  string extension = Path.GetExtension(uri.AbsolutePath).ToLowerInvariant().TrimStart('.');

  if (UnwantedExtensions.Contains(extension))
  {
    Console.WriteLine($"Skipping URL with unwanted extension '{extension}': {uri}");
    return;
  }

  if (VisitedUrls.Contains(uri))
  {
    Console.WriteLine($"Skipping already visited URL: {uri}");
    return;
  }

  // HEAD request: headers only, no body
  using HttpRequestMessage headRequest = new HttpRequestMessage(HttpMethod.Head, uri);
  using HttpResponseMessage headResponse = await _httpClient.SendAsync(headRequest);

  string contentType = headResponse.Content.Headers.ContentType?.MediaType;

  // Media type names are case-insensitive
  if (!string.Equals(contentType, "text/html", StringComparison.OrdinalIgnoreCase))
  {
    Console.WriteLine($"Skipping non-HTML URL with Content-Type '{contentType}': {uri}");
    return;
  }

  VisitedUrls.Add(uri);

  using HttpResponseMessage getResponse = await _httpClient.GetAsync(uri);
  string content = await getResponse.Content.ReadAsStringAsync();

  Console.WriteLine($"Crawled URL: {uri}");

  // Parse links from HTML and queue them for crawling.
  // Assumes ExtractLinks returns a sequence of Uri and CrawlQueue is a Queue<Uri>.
  foreach (Uri link in ExtractLinks(uri, content))
  {
    CrawlQueue.Enqueue(link);
  }
}

This Crawl method first checks the URL's file extension and skips it if it's in the unwanted extensions hashset. It then checks if the URL has already been visited and if so, skips it. Next it performs a HEAD request and verifies the Content-Type is HTML before adding the URL to the VisitedUrls hashset, fully fetching the content, and parsing out links to queue up for further crawling.

Used together like this, extension filtering and HEAD requests provide an effective one-two punch for ignoring non-HTML content while crawling. The extension check is a fast preliminary filter, while the HEAD request is the final confirmation before committing to the full page fetch.
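Something still has to drain the queue that the crawl logic fills. Below is a minimal sketch of such a driver loop, written so that it stands alone. `CrawlDriver`, `RunAsync`, the `fetchLinks` delegate and the `maxPages` limit are all hypothetical names introduced for illustration. Unlike the Crawl method above, this variant keeps the visited set and the queue inside the loop and expects the page-level logic to return the links it found.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CrawlDriver
{
    // fetchLinks stands in for the per-page logic: it returns the links found
    // on a page, or an empty sequence when the URL was filtered out.
    public static async Task<HashSet<Uri>> RunAsync(
        Uri seed,
        Func<Uri, Task<IEnumerable<Uri>>> fetchLinks,
        int maxPages)
    {
        var visited = new HashSet<Uri>();
        var queue = new Queue<Uri>();
        queue.Enqueue(seed);

        // Stop when the queue is empty or the page budget is used up.
        while (queue.Count > 0 && visited.Count < maxPages)
        {
            Uri current = queue.Dequeue();

            // Add returns false if the URL was already processed.
            if (!visited.Add(current))
            {
                continue;
            }

            foreach (Uri link in await fetchLinks(current))
            {
                if (!visited.Contains(link))
                {
                    queue.Enqueue(link);
                }
            }
        }

        return visited;
    }
}
```

Passing the page logic in as a delegate keeps the loop easy to test with a fake link graph. A real crawler would plug in the extension check, the HEAD request and the GET at that point.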

Other Considerations and Best Practices

In addition to filtering non-HTML URLs, there are some other important things to keep in mind when building a well-behaved and efficient web crawler:

Always respect robots.txt rules and don't crawl pages that are disallowed. You can parse robots.txt files and check URLs against them before crawling.
Limit your crawl rate and number of concurrent requests to avoid overloading servers. A good guideline is no more than one request per second per domain.
Handle HTTP status codes, timeouts and errors gracefully. Retry failures with exponential backoff and know when to give up.
Keep track of already visited URLs using a hashset, database or other efficient data structure to avoid re-crawling duplicate content. A Bloom filter is a nice space-efficient option if you can tolerate occasional false positives.
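The retry advice above can be sketched as follows. `RetryHelper`, `BackoffDelay` and `GetWithRetryAsync` are hypothetical names, and the one-second base delay, 30-second cap and retry-on-5xx policy are assumptions chosen for illustration rather than recommended values.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RetryHelper
{
    // Delay before the retry that follows attempt number `attempt` (0-based):
    // baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    public static async Task<HttpResponseMessage> GetWithRetryAsync(
        HttpClient client, Uri uri, int maxAttempts)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 0; ; attempt++)
        {
            bool isLastAttempt = attempt == maxAttempts - 1;
            try
            {
                HttpResponseMessage response = await client.GetAsync(uri);

                // Retry only on 5xx responses; hand anything else to the caller.
                if ((int)response.StatusCode < 500 || isLastAttempt)
                {
                    return response;
                }
                response.Dispose();
            }
            catch (HttpRequestException) when (!isLastAttempt)
            {
                // Network-level failure: wait and try again.
                // On the last attempt the exception propagates to the caller.
            }

            await Task.Delay(BackoffDelay(
                attempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30)));
        }
    }
}
```

Note that HttpClient reports its own timeout as a TaskCanceledException, which this sketch does not catch, so timeouts are not retried here. Adding random jitter to the delay is a common refinement when many requests fail at once.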

By following these practices and implementing HTML-only filtering, you'll be well on your way to building a polite and focused web crawler using C#. The techniques explained in this article should give you what you need to avoid irrelevant content and home in on the HTML pages you care about.

Conclusion

Web crawlers are a powerful tool for exploring and analyzing content on websites, but to be efficient and polite, it's critical that they focus on actual HTML pages and skip over other types of URLs. In this article, we learned two complementary techniques for doing that in a C# crawler:

Filtering out unwanted URL extensions
Checking the Content-Type header with a HEAD request

We walked through how to implement both of those methods and then saw how to combine them for the most complete coverage. We also touched on some other crawler best practices like respecting robots.txt and limiting request rate.

Hopefully you now have a solid understanding of how to efficiently ignore non-HTML URLs in your web crawlers. Applying these techniques will help keep your crawler fast and focused while avoiding unnecessary load on servers.

Happy (selective) crawling!
