Web scraping is an incredibly useful technique for extracting data from websites. Whether you need to pull product information, monitor competitors, or aggregate news and content, being able to programmatically extract structured data from web pages opens up a world of possibilities.
When it comes to web scraping with C#, one of the most popular and powerful tools is the HTML Agility Pack. This open source library makes it easy to parse real-world, "out of the web" HTML using familiar paradigms like LINQ and XPath. It essentially lets you treat messy HTML as if it were well-formed, queryable XML.
In this guide, we'll walk through everything you need to know to start web scraping with the HTML Agility Pack in C#. We'll cover the basics of installation and setup, the fundamentals of pulling and parsing HTML, and advanced techniques like pairing the Agility Pack with a Selenium-driven browser to scrape dynamic pages. We'll cap it off with a real-world example: scraping the top stories from Hacker News.
What is the HTML Agility Pack?
The HTML Agility Pack is an open source .NET code library that allows you to parse "out of the web" HTML files. It supports .NET Framework 4.0+ as well as .NET Core. With the HTML Agility Pack, you can load HTML and traverse the DOM using XPath or LINQ queries, and easily manipulate the parsed results.
Some key features of the HTML Agility Pack include:
- Robust HTML parsing, fixing malformed HTML on the fly
- XPath 1.0 support for traversing and querying the DOM
- LINQ-to-Objects style DOM traversal and querying
- Easy HTML manipulation via the parsed DOM
- An API modeled on System.Xml.XmlDocument for loading and saving HTML
- High performance, low memory footprint, no external dependencies
Essentially, the HTML Agility Pack aims to make working with real-world HTML as easy as with valid XML, using familiar C# paradigms. This makes it a very useful tool for web scraping, where you often need to extract structured data from messy HTML pages.
Setting Up the HTML Agility Pack
To get started using the HTML Agility Pack in your C# project, you first need to install it from NuGet. You can do this either from the Visual Studio GUI by searching for "HtmlAgilityPack" in the NuGet package manager, or via the Package Manager Console with:
Install-Package HtmlAgilityPack
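Or, if you prefer the .NET CLI, the equivalent command is:
dotnet add package HtmlAgilityPack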
Once the package is installed and referenced in your project, you'll want to add the following using directive to any C# file where you plan to use the HTML Agility Pack:
using HtmlAgilityPack;
And with that, you're ready to start parsing some HTML!
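Just to see the parser's error-tolerance in action, here's a minimal sketch: we feed it a deliberately malformed fragment (the unclosed tags are contrived for illustration) and query it as if the markup were clean:
// The <p> and <b> tags are never closed; the parser repairs them on load.
var doc = new HtmlDocument();
doc.LoadHtml("<p>Hello <b>world");
// Query the fixed-up DOM as if the markup had been valid all along.
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//b").InnerText); // prints: world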
Pulling HTML from a Web Page
Before we can parse HTML with the HTML Agility Pack, we first need to retrieve the HTML from a web page. There are a couple of ways to do this.
The simplest is to use the built-in HttpClient class to make an HTTP request and read the response body as a string (the older WebClient also works, but it is marked obsolete in modern .NET):
// Requires: using System.Net.Http; (call from an async method so you can await)
string url = "https://www.example.com";
using HttpClient client = new HttpClient();
string html = await client.GetStringAsync(url);
However, this approach has a notable limitation: it only gets the initial HTML returned by the server. If the page renders content dynamically via JavaScript, that content won't be included.
For pages with dynamic content, a better approach is to use a library like Selenium to load the page in a real browser and then extract the final, rendered HTML. We'll cover this later in the guide.
Parsing HTML with LINQ and XPath
Once we have our HTML, we can load it into the HTML Agility Pack's HtmlDocument object and start parsing:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
The HtmlDocument gives us a parsed DOM representation of the HTML that we can query and traverse using LINQ methods or XPath expressions.
For example, let's say we wanted to select all the links on the page. We could do that with LINQ like this:
var links = document.DocumentNode.Descendants("a")
    .Select(node => node.GetAttributeValue("href", null))
    .Where(url => !String.IsNullOrEmpty(url));
This queries the document for all "a" elements, selects the "href" attribute value, and filters out any empty URLs.
We could also do the same thing using an XPath expression. One caveat: the Agility Pack's SelectNodes returns element nodes rather than attribute nodes, so instead of selecting @href directly we match anchors that have an href and then read the attribute:
var links = document.DocumentNode.SelectNodes("//a[@href]")
    .Select(node => node.GetAttributeValue("href", null))
    .Where(url => !String.IsNullOrEmpty(url));
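One gotcha worth knowing: SelectNodes returns null, not an empty collection, when the expression matches nothing, so chaining LINQ straight onto it can throw. A simple guard (a sketch using Enumerable.Empty) avoids that:
// SelectNodes yields null on no match, so fall back to an empty sequence.
var anchors = (IEnumerable<HtmlNode>)document.DocumentNode.SelectNodes("//a[@href]")
    ?? Enumerable.Empty<HtmlNode>();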
XPath tends to be more concise for complex queries, while LINQ is more idiomatic for C# developers. Which one you use is largely a matter of preference.
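To make the trade-off concrete, here's the same query both ways: pulling every link out of a navigation list with id="menu" (a made-up id for illustration):
// XPath: one expression covers the whole traversal.
var xpathLinks = document.DocumentNode.SelectNodes("//ul[@id='menu']/li/a");
// LINQ: the same traversal, spelled out step by step.
var linqLinks = document.DocumentNode.Descendants("ul")
    .Where(ul => ul.Id == "menu")
    .SelectMany(ul => ul.Elements("li"))
    .SelectMany(li => li.Elements("a"));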
Scraping Hacker News
Let's walk through a complete example of using the HTML Agility Pack to scrape a real website. We'll build a simple console app that pulls the current top stories from Hacker News and outputs them as JSON.
First, let's make a new console app and install the HTML Agility Pack from NuGet. We'll also add Newtonsoft.Json to serialize our output.
Next, we'll define a simple Item class to represent each story:
public class Item
{
    public string Title { get; set; }
    public string Url { get; set; }
    public int Score { get; set; }
}
Then we'll write a function to pull the HTML from the Hacker News front page. As in the earlier snippet, we use a shared HttpClient instance rather than the obsolete WebClient:
// Requires: using System.Net.Http; and using System.Threading.Tasks;
static readonly HttpClient client = new HttpClient();

static Task<string> GetHtmlAsync(string url)
{
    return client.GetStringAsync(url);
}
And a function to parse that HTML and return a collection of Items:
static List<Item> ParseHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var items = doc.DocumentNode.SelectNodes("//tr[@class='athing']")
        .Select(node => new Item
        {
            Title = node.SelectSingleNode(".//a[@class='titlelink']").InnerText,
            Url = node.SelectSingleNode(".//a[@class='titlelink']").Attributes["href"].Value,
            // The score lives in the row *after* the athing row; job posts have no score, so default to 0.
            Score = int.Parse(node.SelectSingleNode("following-sibling::tr[1]//span[@class='score']")
                ?.InnerText.Split(' ')[0] ?? "0")
        })
        .ToList();

    return items;
}
Here we're using XPath to select all the "tr" elements with the class "athing" (each one represents a story). For each story we grab the title text and URL from the title link, pull the score from the following subtext row, and instantiate a new Item. Note that Hacker News tweaks its markup from time to time, so if the selectors come back empty, inspect the live page's HTML and adjust the class names.
Finally, in our Main method we'll call these functions, serialize the result to JSON, and print it out. Note the async Task Main (supported since C# 7.1), which lets us await the download:
// Requires: using Newtonsoft.Json;
static async Task Main(string[] args)
{
    string html = await GetHtmlAsync("https://news.ycombinator.com");
    var items = ParseHtml(html);
    string json = JsonConvert.SerializeObject(items, Formatting.Indented);
    Console.WriteLine(json);
}
And there we have it! A simple but complete example of web scraping with the HTML Agility Pack. When we run this, we should see JSON output with the current top stories from Hacker News.
Scraping Dynamic Pages with Selenium
As mentioned earlier, sometimes you need to scrape pages that have dynamic content rendered by JavaScript. In these cases, just making an HTTP request and parsing the response HTML isn't enough, because the JavaScript won't be executed.
The solution is to use a tool like Selenium to load the page in a real browser, wait for the dynamic content to render, and then extract the final HTML.
First, install the Selenium.WebDriver package from NuGet (recent versions bundle Selenium Manager, which downloads a matching chromedriver automatically; older versions also need a chromedriver binary on your PATH). Then, assuming you have Chrome installed, you can launch it and navigate to a URL like this:
// Requires: using OpenQA.Selenium.Chrome;
var driver = new ChromeDriver();
driver.Navigate().GoToUrl("https://www.example.com");
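Note that GoToUrl returns once the initial document loads, which can be before the page's JavaScript has finished rendering. A common pattern, sketched here, is to wait explicitly for an element the script is known to create; the #content selector below is a placeholder for whatever your target page renders. WebDriverWait lives in the OpenQA.Selenium.Support.UI namespace (bundled with recent Selenium.WebDriver releases; older versions need the Selenium.Support package):
// Requires: using OpenQA.Selenium; using OpenQA.Selenium.Support.UI;
// Wait up to 10 seconds for a JavaScript-rendered element to appear.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => d.FindElement(By.CssSelector("#content")).Displayed);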
Once the page is loaded, you can extract the HTML from the browser like this:
string html = driver.PageSource;
Then you can pass that HTML to the HTML Agility Pack and parse it as normal:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
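One housekeeping note: the browser stays open until you close it, so shut the session down when you're done (ChromeDriver implements IDisposable, so a using block works as well):
driver.Quit(); // closes the browser and ends the chromedriver process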
This technique allows you to scrape even highly dynamic, JavaScript-heavy single page applications by rendering them in a real browser first.
Conclusion
The HTML Agility Pack is a powerful tool for web scraping with C#. Its combination of robustness, performance, and familiar paradigms like LINQ and XPath makes it a go-to choice for many developers.
In this guide, we covered the fundamentals of using the HTML Agility Pack, from installation to parsing HTML to querying with LINQ and XPath. We walked through a real-world example, scraping top stories from Hacker News. And we saw how to integrate the HTML Agility Pack with Selenium to scrape dynamic pages.
While the HTML Agility Pack is great, it's not the only tool in the C# web scraping toolbox. Other useful libraries to check out include:
- ScrapySharp: A Scrapy-inspired scraping framework for .NET, built on top of the HTML Agility Pack, adding CSS selectors and a web client that simulates a real browser
- AngleSharp: A standards-compliant HTML5 parser with CSS selector support (JavaScript execution is available through an add-on)
- Fizzler: A CSS selector engine for .NET, which can be used with the HTML Agility Pack
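To give a flavor of that last option, here's a minimal sketch assuming the Fizzler.Systems.HtmlAgilityPack NuGet package; the div.story selector is invented for illustration:
// Requires: using Fizzler.Systems.HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Fizzler adds QuerySelectorAll/QuerySelector extension methods to HtmlNode,
// so you can use CSS selectors instead of XPath.
foreach (var link in doc.DocumentNode.QuerySelectorAll("div.story > a"))
    Console.WriteLine(link.GetAttributeValue("href", ""));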
Ultimately, the best tool for the job will depend on the specific requirements of your project. But for most C# web scraping tasks, the HTML Agility Pack is a solid foundation to build on. Hopefully this guide has given you a good starting point for your own web scraping projects. Happy scraping!