If you're looking to extract data from websites using C#, one of the most important decisions you'll make is choosing an HTML parsing library. Parsing, or extracting relevant data from the HTML code, is a critical step in web scraping. But with so many options available, how do you know which one is right for your project?
In this article, we'll take an in-depth look at five of the most popular open-source C# HTML parsing libraries:
- Html Agility Pack
- AngleSharp
- Fizzler
- CefSharp
- Selenium WebDriver
For each library, we'll cover its key features, strengths and weaknesses, and best use cases. By the end, you'll have a clear idea of which one is the optimal choice for your needs.
But first, let's briefly review what to look for in an HTML parsing library. A good one should be:
- Open-source and free to use
- Well-documented with examples and tutorials
- Actively developed and maintained
- Fast and memory-efficient
- Compatible with common querying methods like XPath and CSS selectors
- Faithful in its output, returning clean, unaltered results
With those criteria in mind, let's dive into our first library.
Html Agility Pack
The Html Agility Pack (HAP) is one of the most popular and mature HTML parsers for C#. It allows you to not only parse local HTML files, but also scrape and parse web pages directly.
Key Features
- Can make HTTP requests to download web pages
- Can return a node's content as plain text with inline HTML tags stripped (the InnerText property)
- Supports XPath queries and selection by HTML tag
- Extensible with add-ons like Fizzler for CSS selectors
- Available for .NET Framework, .NET Core, and .NET 5
Performance
In benchmarks, HAP is consistently one of the fastest C# HTML parsers, even when including the time to download the page. This makes it an excellent choice for projects where speed is critical.
Ease of Use
HAP has a relatively simple API that is easy for developers to pick up quickly. Its automatic cleaning of inline HTML tags also saves you the trouble of post-processing the parsed results.
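To give a feel for the API, here is a minimal sketch of an XPath query with HAP. It assumes the HtmlAgilityPack NuGet package is installed and a C# version with top-level statements; the sample markup and the variable names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical sample markup, used so the sketch runs without network access.
const string html = @"<html><body>
  <a href=""/docs"">Docs</a>
  <a href=""/blog"">Blog</a>
  <a>No link</a>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);
// To fetch a live page instead, HAP's HtmlWeb class can download it:
//   var doc = new HtmlWeb().Load("https://example.com");

// XPath: every <a> element that has an href attribute.
// SelectNodes returns null (not an empty collection) when nothing matches.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");

var links = new List<(string Text, string Href)>();
if (anchors != null)
{
    foreach (var a in anchors)
    {
        links.Add((a.InnerText.Trim(), a.GetAttributeValue("href", "")));
    }
}

foreach (var (text, href) in links)
{
    Console.WriteLine($"{text} -> {href}");
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException when a page's layout changes and the query stops matching.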
Limitations
One downside of HAP is that it does not support CSS selectors out of the box. You'll need to use an extension like Fizzler to add that capability.
Also, if you actually need the markup inside inline HTML tags, the plain-text output drops it. In that case, read the node's InnerHtml rather than its InnerText.
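The difference between the plain-text view and the raw markup can be sketched as follows. This is an illustration only, assuming the HtmlAgilityPack NuGet package and a made-up HTML fragment.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

var doc = new HtmlDocument();
// Hypothetical fragment with two inline tags inside a paragraph.
doc.LoadHtml("<p>Price: <b>42</b> <i>USD</i></p>");

var p = doc.DocumentNode.SelectSingleNode("//p");

string plain = p.InnerText;  // text only, the <b> and <i> tags are dropped
string markup = p.InnerHtml; // child markup is preserved

Console.WriteLine(plain);
Console.WriteLine(markup);
```

If the inline tags carry meaning (for example, a bold price), query them directly or work from InnerHtml instead of the flattened text.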
AngleSharp
AngleSharp is another powerful HTML parsing library that aims to provide browser-level compliance. It can parse not just HTML, but also CSS, XML, and MathML.
Key Features
- Parsing results mirror actual browser output
- Supports many standards beyond just HTML
- Highly extensible with add-on packages for capabilities like XPath (AngleSharp.XPath) and CSS styling (AngleSharp.Css)
- Active development since 2013
- Targets .NET Standard so it runs on .NET Framework, .NET Core, etc.
Performance
In benchmarks, AngleSharp is a close second to HAP in terms of speed. One difference is that the AngleSharp measurements cover parsing only: the HTTP request is made separately, so download time is not factored in.
Ease of Use
AngleSharp's API surface is larger than HAP's, which gives you more control but also more to learn. It returns the unaltered HTML source, so you'll have to handle cleaning and navigating the DOM tree yourself.
Limitations
Out of the box, AngleSharp selects elements with CSS selectors (QuerySelector, QuerySelectorAll) but not with XPath. To query with XPath, you need to install an add-on package such as AngleSharp.XPath.
Also, the default configuration does not load remote documents. You'll have to write a few extra lines of code, either enabling AngleSharp's loader or downloading the HTML yourself with an HTTP client.
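Here is a minimal sketch of parsing an HTML string with AngleSharp and querying it through QuerySelectorAll, which the core package exposes on parsed documents. It assumes the AngleSharp NuGet package (0.10 or later, where HtmlParser lives in AngleSharp.Html.Parser); the markup and variable names are invented for the example.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser; // NuGet package: AngleSharp

// Hypothetical sample markup, so the sketch runs without network access.
const string html = @"<ul class=""items"">
  <li class=""item"">First</li>
  <li class=""item featured"">Second</li>
  <li>Third</li>
</ul>";

var parser = new HtmlParser();
var document = parser.ParseDocument(html);

// All <li class="item"> elements that are direct children of <ul class="items">.
var titles = document
    .QuerySelectorAll("ul.items > li.item")
    .Select(li => li.TextContent.Trim())
    .ToList();

// QuerySelector returns the first match, or null if there is none.
var featured = document.QuerySelector("li.featured");

Console.WriteLine(string.Join(", ", titles));
Console.WriteLine(featured?.TextContent);

// To load a remote page instead, one option is AngleSharp's own loader:
//   var context = AngleSharp.BrowsingContext.New(
//       AngleSharp.Configuration.Default.WithDefaultLoader());
//   var page = await context.OpenAsync("https://example.com");
```

Parsing from a string keeps the download step separate, which makes it easy to swap in HttpClient or cached files when scraping many pages.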
Fizzler
Fizzler is a CSS selector library built on top of HAP. You can think of it as an add-on that augments HAP's capabilities.
Key Features
- Extends HAP with support for CSS selectors
- Selectors patterned after jQuery/JavaScript (QuerySelector, QuerySelectorAll)
- Leverages HAP's speed and HTTP client
Limitations
Fizzler isn't under quite as active development as HAP or AngleSharp. It's also not as extensively documented, with only a minimal README.
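Because the README is sparse, a short sketch may help. It assumes the Fizzler.Systems.HtmlAgilityPack NuGet package, which adds QuerySelector and QuerySelectorAll as extension methods on HAP's HtmlNode; the markup below is hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;                 // NuGet package: HtmlAgilityPack
using Fizzler.Systems.HtmlAgilityPack; // NuGet package: Fizzler.Systems.HtmlAgilityPack

var doc = new HtmlDocument();
// Hypothetical sample markup, so the sketch runs without network access.
doc.LoadHtml(@"<div id=""menu"">
  <a class=""nav"" href=""/home"">Home</a>
  <a class=""nav"" href=""/about"">About</a>
  <a href=""/hidden"">Other</a>
</div>");

// CSS selector instead of XPath: links with class "nav" inside #menu.
var hrefs = doc.DocumentNode
    .QuerySelectorAll("#menu a.nav")
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();

// QuerySelector returns only the first matching node.
var first = doc.DocumentNode.QuerySelector("a.nav");

Console.WriteLine(string.Join(", ", hrefs));
Console.WriteLine(first?.InnerText);
```

Since the results are ordinary HAP nodes, you can mix CSS selectors and XPath on the same document.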
CefSharp
CefSharp is a .NET binding for the Chromium Embedded Framework (CEF). It essentially lets you drive an embedded or offscreen Chromium browser from C# code.
Key Features
- Renders pages using a real browser engine
- Can automate interaction with pages and scrape dynamic content
- Supports both offscreen (headless) and embedded (WPF/WinForms) modes
- Handles Chromium setup so you don't need a separate driver
Limitations
CefSharp is not actually a parser, just a browser automation tool. You'll need to combine it with a parser like HAP or AngleSharp to extract data from the returned HTML.
Currently, CefSharp only supports Windows. There are also some extra steps required to configure it for offscreen use in a server environment.
Selenium WebDriver
Selenium WebDriver is the most widely used browser automation framework. While it's primarily designed for testing web apps, it can also be used for scraping.
Key Features
- Automates real browsers (Chrome, Firefox, etc.)
- Cross-platform and cross-browser
- Can interact with pages, fill forms, click buttons, etc.
- Has a built-in method to get page source which can be fed to a parser
Performance
Selenium is inherently slower than the other libraries since it launches a real browser. It‘s better suited for scraping complex sites that require interaction, not rapidly parsing many pages.
Ease of Use
Selenium drives each browser through a separate driver executable, so setup is more involved (Selenium 4.6 and later can locate or download a matching driver automatically). Its API focuses more on browser interaction than parsing, so you'll need to learn two libraries to get the data you want.
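The two-library workflow can be sketched roughly like this: Selenium renders the page, and HAP parses the resulting source. The sketch assumes the Selenium.WebDriver and HtmlAgilityPack NuGet packages, a local Chrome installation, and a C# version with top-level statements. RenderPage and FirstHeading are hypothetical helper names, not part of either library.

```csharp
using System;
using HtmlAgilityPack;        // NuGet package: HtmlAgilityPack
using OpenQA.Selenium.Chrome; // NuGet package: Selenium.WebDriver

// Hypothetical helper: renders a page in headless Chrome and returns its HTML.
// Assumes Chrome is installed and a matching driver can be found.
static string RenderPage(string url)
{
    var options = new ChromeOptions();
    options.AddArgument("--headless=new");

    // Disposing the driver closes the browser.
    using var driver = new ChromeDriver(options);
    driver.Navigate().GoToUrl(url);
    return driver.PageSource; // HTML as the browser currently sees it
}

// Hypothetical helper: pulls the first <h1> text out of an HTML string with HAP.
static string FirstHeading(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var h1 = doc.DocumentNode.SelectSingleNode("//h1");
    return h1 == null ? "" : h1.InnerText.Trim();
}

// Only launches a browser when a URL is passed on the command line.
if (args.Length > 0)
{
    Console.WriteLine(FirstHeading(RenderPage(args[0])));
}
```

Keeping the rendering and parsing steps in separate functions means the parsing logic can be tested against saved HTML without starting a browser.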
Summary
For most C# HTML parsing needs, you'll likely want to choose between Html Agility Pack and AngleSharp. They are both fast, flexible, and well-supported.
HAP is a little simpler to use, especially if you just need XPath support. AngleSharp gives you more control over the parsing process and handles CSS selectors itself, with XPath available through an add-on. If you prefer HAP but want CSS selectors, Fizzler adds them.
If you need to parse pages that require JavaScript rendering or user interaction, CefSharp or Selenium are good options. CefSharp is Windows-only but easier to set up. Selenium supports more platforms and browsers but has more moving parts.
Ultimately, the "right" HTML parsing library depends on your specific use case. But with the information in this article, you're well-equipped to make the choice. Happy parsing!