Parsing HTML in C#: Choosing the Right Library for Web Scraping

If you're looking to extract data from websites using C#, one of the most important decisions you'll make is choosing an HTML parsing library. Parsing, or extracting relevant data from the HTML code, is a critical step in web scraping. But with so many options available, how do you know which one is right for your project?

In this article, we'll take an in-depth look at five of the most popular open-source C# libraries for parsing and scraping HTML:

Html Agility Pack
AngleSharp
Fizzler
CefSharp
Selenium WebDriver

For each library, we'll cover its key features, strengths and weaknesses, and best use cases. By the end, you'll have a clear idea of which one is the optimal choice for your needs.

But first, let's briefly review what to look for in an HTML parsing library. A good one should be:

Open-source and free to use
Well-documented with examples and tutorials
Actively developed and maintained
Fast and memory-efficient
Queryable with common methods like XPath and CSS selectors
Faithful to the source, returning clean, unaltered results

With those criteria in mind, let's dive into our first library.

Html Agility Pack

The Html Agility Pack (HAP) is one of the most popular and mature HTML parsers for C#. It allows you to not only parse local HTML files, but also scrape and parse web pages directly.

Key Features

Can make HTTP requests to download web pages
Tolerates malformed HTML and can return a node's plain text with inline tags stripped (InnerText)
Supports XPath queries and selection by HTML tag
Extensible with add-ons like Fizzler for CSS selectors
Available for .NET Framework, .NET Core, and .NET 5 and later

Performance

In benchmarks, HAP is consistently one of the fastest C# HTML parsers, even when including the time to download the page. This makes it an excellent choice for projects where speed is critical.

Ease of Use

HAP has a relatively simple API that is easy for developers to pick up quickly. Its InnerText property returns a node's text with inline HTML tags stripped, which saves you some post-processing of the parsed results.

Limitations

One downside of HAP is that it does not support CSS selectors out of the box. You'll need to use an extension like Fizzler to add that capability.

Also, if you actually need the markup inside inline HTML tags, the plain-text view drops it. In that case, read InnerHtml or query the child nodes instead of relying on InnerText.
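
As a rough illustration of the XPath workflow described above, the sketch below parses an in-memory HTML string with HAP and collects link targets. It assumes the HtmlAgilityPack NuGet package is installed. The class name HapExample, the method ExtractLinks, and the sample markup are hypothetical names chosen for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapExample
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        // Parse from a string. new HtmlWeb().Load(url) would download a page instead.
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        const string html =
            "<ul><li><a href='/docs'>Docs</a></li><li><a href='/blog'>Blog</a></li></ul>";

        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException when a page's layout changes and the XPath stops matching.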

AngleSharp

AngleSharp is another powerful HTML parsing library that aims to provide browser-level compliance. It can parse not just HTML, but also CSS, XML, and MathML.

Key Features

Parsing results mirror actual browser output
Supports many standards beyond just HTML
Supports CSS selectors (QuerySelector, QuerySelectorAll) out of the box and is highly extensible with add-ons for capabilities like XPath and scripting
Active development since 2013
Targets .NET Standard so it runs on .NET Framework, .NET Core, etc.

Performance

In benchmarks, AngleSharp is a close second to HAP in terms of speed. One difference is that the AngleSharp figures cover parsing only, so download time is not factored in.

Ease of Use

AngleSharp's API surface is larger than HAP's, which gives you more control but also more to learn. It returns the unaltered HTML source, so you'll have to handle cleaning and navigating the DOM tree yourself.

Limitations

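Out of the box, AngleSharp selects elements with CSS selectors and standard DOM methods, but not with XPath. To query with XPath, you need to install an add-on package such as AngleSharp.XPath.

Also, downloading a page takes a little setup: you either configure a browsing context with the default loader enabled or fetch the HTML yourself, for example with HttpClient. Either way, you'll have to write a few extra lines of code to get the HTML.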
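
To make the DOM-style API more concrete, here is a minimal sketch that parses an HTML string with AngleSharp and reads text through a CSS selector. It assumes the AngleSharp NuGet package (0.10 or later, where HtmlParser lives in AngleSharp.Html.Parser). The class AngleSharpExample, the method ExtractPrices, and the product markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class AngleSharpExample
{
    // Returns the trimmed text of every .price element inside a li.product.
    public static List<string> ExtractPrices(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // QuerySelectorAll mirrors the browser DOM method of the same name.
        return document
            .QuerySelectorAll("li.product .price")
            .Select(element => element.TextContent.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        const string html =
            "<ul>" +
            "<li class='product'>Pen <span class='price'> 1.50 </span></li>" +
            "<li class='product'>Ink <span class='price'>4.00</span></li>" +
            "</ul>";

        foreach (string price in ExtractPrices(html))
        {
            Console.WriteLine(price);
        }
    }
}
```

Because the parser follows the HTML5 algorithm, the resulting tree should match what a browser would build from the same markup, which is useful when a selector was first worked out in browser developer tools.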

Fizzler

Fizzler is a CSS selector library built on top of HAP. You can think of it as an add-on that augments HAP's capabilities.

Key Features

Extends HAP with support for CSS selectors
Selectors patterned after jQuery/JavaScript (QuerySelector, QuerySelectorAll)
Leverages HAP's speed and HTTP client

Limitations

Fizzler isn't under quite as active development as HAP or AngleSharp. It's also not as extensively documented, with only a minimal README.
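
Since the README is short, a small sketch may help show how Fizzler plugs into HAP. It assumes both the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages are installed. The class FizzlerExample, the method ExtractTitles, and the blog-style markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelector/QuerySelectorAll to HtmlNode
using HtmlAgilityPack;

public static class FizzlerExample
{
    // Returns the text of every h2.title that is a direct child of div.post.
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll is an extension method supplied by Fizzler;
        // the document itself is still a plain HAP HtmlDocument.
        return doc.DocumentNode
            .QuerySelectorAll("div.post > h2.title")
            .Select(node => node.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        const string html =
            "<div class='post'><h2 class='title'>First post</h2></div>" +
            "<div class='post'><h2 class='title'>Second post</h2></div>";

        foreach (string title in ExtractTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

Because the nodes returned are ordinary HtmlNode objects, CSS selectors and XPath queries can be mixed freely on the same document.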

CefSharp

CefSharp is a .NET binding for the Chromium Embedded Framework (CEF). It essentially allows you to embed and drive a Chromium browser, optionally headless, from C# code.

Key Features

Renders pages using a real browser engine
Can automate interaction with pages and scrape dynamic content
Supports both offscreen (headless) and embedded (WPF/WinForms) modes
Handles Chromium setup so you don't need a separate driver

Limitations

CefSharp is not actually a parser, just a browser automation tool. You'll need to combine it with a parser like HAP or AngleSharp to extract data from the returned HTML.

Currently, CefSharp only supports Windows. There are also some extra steps required to configure it for offscreen use in a server environment.

Selenium WebDriver

Selenium WebDriver is the most widely used browser automation framework. While it's primarily designed for testing web apps, it can also be used for scraping.

Key Features

Automates real browsers (Chrome, Firefox, etc.)
Cross-platform and cross-browser
Can interact with pages, fill forms, click buttons, etc.
Has a built-in method to get page source which can be fed to a parser

Performance

Selenium is inherently slower than the pure parsers since it launches a real browser. It's better suited for scraping complex sites that require interaction, not rapidly parsing many pages.

Ease of Use

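Selenium drives each browser through a separate driver executable, so setup is more involved, although since version 4.6 the bundled Selenium Manager can locate or download a matching driver for you. Its API focuses more on browser interaction than parsing, so you'll need to learn two libraries to get the data you want.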
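
The two-library workflow can be sketched roughly as follows: Selenium renders the page and hands back its source, and HAP does the actual extraction. This assumes the Selenium.WebDriver and HtmlAgilityPack NuGet packages, a locally installed Chrome, and a usable ChromeDriver. The class SeleniumExample, its two methods, and the target URL are placeholders for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

public static class SeleniumExample
{
    // Loads the page in headless Chrome and returns the rendered HTML.
    public static string FetchRenderedHtml(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless=new"); // run Chrome without a visible window

        // Disposing the driver closes the browser and the driver process.
        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            return driver.PageSource;
        }
    }

    // Pure parsing step: returns the text of every h1 and h2 in the HTML.
    public static List<string> ExtractHeadings(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h1 | //h2");
        if (headings == null)
        {
            return new List<string>();
        }

        return headings.Select(h => h.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // Placeholder URL; substitute the page you want to scrape.
        string html = FetchRenderedHtml("https://example.com/");
        foreach (string heading in ExtractHeadings(html))
        {
            Console.WriteLine(heading);
        }
    }
}
```

Keeping the fetch and parse steps in separate methods means the parsing logic can be tested against saved HTML without launching a browser. For pages that fill in content after the initial load, an explicit wait before reading PageSource is usually needed as well.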

Summary

For most C# HTML parsing needs, you'll likely want to choose between Html Agility Pack and AngleSharp. They are both fast, flexible, and well-supported.

HAP is a little simpler to use, especially if you just need XPath support, and Fizzler adds CSS selectors to it. AngleSharp gives you more control over the parsing process and handles CSS selectors natively, with XPath available via an add-on.

If you need to parse pages that require JavaScript rendering or user interaction, CefSharp or Selenium are good options. CefSharp is Windows-only but easier to set up. Selenium supports more platforms and browsers but has more moving parts.

Ultimately, the "right" HTML parsing library depends on your specific use case. But with the information in this article, you're well-equipped to make the choice. Happy parsing!
