As a Ruby developer looking to extract data from websites, you‘ll inevitably need to parse HTML and XML documents. Fortunately, the Ruby ecosystem provides a wealth of open source libraries to make this task easier. But with so many options to choose from, which HTML or XML parser should you use for your web scraping project?
In this article, we‘ll take an in-depth look at the most popular Ruby HTML and XML parsing libraries. For each library, we‘ll cover its key features, strengths and limitations, and provide code examples so you can see how it works. By the end, you‘ll have all the information you need to choose the right tool for the job.
But first, let‘s start with a quick primer on HTML/XML parsers and why you might need one as a Ruby developer.
What is an HTML/XML Parser?
An HTML or XML parser is a piece of software that reads HTML or XML documents and builds a data structure that represents the document‘s content. Parsers allow you to extract data from specific parts of a document, as well as manipulate the document‘s structure.
For web scraping, an HTML parser takes in the raw HTML of a web page and turns it into a data structure you can programmatically access. This lets you pick out the parts of the page you‘re interested in and extract structured data.
Some web pages also transmit data in XML format. XML is a markup language similar to HTML, but is more general-purpose and is used to encode documents in a standard format. An XML parser allows you to read and extract data from XML.
Most HTML and XML parsers also provide ways of searching documents using XPath expressions or CSS selectors. XPath is a query language that allows you to navigate through an HTML/XML document and select nodes by various criteria. CSS selectors are patterns used to select HTML elements by tag, ID, class, or attributes. Using XPath and CSS selectors makes it easy to extract content from specific parts of the document.
Why Do Ruby Developers Need an HTML/XML Parsing Library?
So why do you need a separate library to parse HTML and XML in Ruby? Can‘t you just use regular expressions to extract what you need?
While it‘s technically possible to use regular expressions for simple scraping tasks, it quickly becomes unwieldy. HTML and XML documents often have a deep, nested structure that is difficult to handle with regular expressions alone. There are also many edge cases, like unclosed tags and HTML entity codes, that can trip up a regular expression-based parser.
In contrast, a proper HTML/XML parsing library can handle all the special cases and reliably build a complete document tree from messy real-world markup. The library also provides a nice API to let you access parts of the document by tag name, attribute, or hierarchy. This makes scraping data from web pages much easier.
Ruby ships with a basic XML parser called REXML (part of the standard library through Ruby 2.7, and a bundled gem since Ruby 3.0). However, for more advanced use cases, you'll likely want a third-party library that is faster, more memory efficient, and has extra features like CSS selector support. And for HTML parsing, the built-in tools are quite limited.
That‘s where the Ruby community has stepped up to create a variety of excellent open source HTML and XML parsing libraries. Let‘s dive in and look at the most popular options.
Nokogiri
Nokogiri is the heavyweight champ of Ruby HTML/XML parsers. It's by far the most popular library, with over 117 million downloads at the time of writing. Nokogiri is powered under the hood by native C libraries like libxml2 and libxslt, which makes it extremely fast.
Nokogiri provides a comprehensive API to parse, search, and manipulate documents using either XPath or CSS selectors. It can handle even invalid or poorly formatted HTML and "just works" most of the time. This makes it a great default choice for most scraping tasks.
Here‘s a quick example of how you might use Nokogiri to scrape data from a web page:
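require 'nokogiri'
require 'open-uri'
# Fetch the page and parse it into a document tree
doc = Nokogiri::HTML(URI.open('https://en.wikipedia.org/wiki/Web_scraping'))
# Print the main heading
puts doc.at_css('h1').text
# Print the text of every paragraph
doc.css('p').each do |p|
  puts p.text
end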
This code fetches the Wikipedia article on web scraping, prints out the main heading, and then prints the text of each paragraph on the page.
Nokogiri isn‘t perfect, though. Its main downside is that it can use a lot of memory, since it eagerly parses the entire document upfront. For exceptionally large documents, you may run into out-of-memory errors.
Overall, Nokogiri is an excellent all-around choice suitable for most HTML/XML parsing needs. It has great documentation and lots of community resources available. If you aren‘t sure which parsing library to use, go with Nokogiri.
Oga
Oga is a lesser-known HTML/XML parser that focuses on speed, portability, and a small memory footprint. It's written mostly in Ruby, with a small native extension for the lexer (C on MRI, Java on JRuby), and it doesn't depend on system libraries such as libxml2, so it's easy to install.
Compared to Nokogiri, Oga is typically a bit slower and supports fewer features (e.g. no support for HTML5 parsing). However, it uses significantly less memory, so it can be a good choice for memory-constrained environments or very large documents.
Like Nokogiri, Oga supports parsing, searching via XPath and CSS selectors, and modifying documents. Here‘s an example of parsing a simple document with Oga:
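require 'oga'
# Parse an HTML fragment containing a two-item list
doc = Oga.parse_html('
<ul>
  <li>Foo</li>
  <li>Bar</li>
</ul>
')
# Print the text of each list item
puts doc.css('li').map(&:text)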
Oga also has an interesting feature called Oga::XML::PullParser that allows you to parse a document incrementally. Instead of building the entire document tree in memory at once, you handle nodes one at a time as the parser reaches them. This can help when dealing with huge documents that may not fit in memory.
Ox
Ox is a fast XML parser written as a native extension in C. As the name "optimized XML" suggests, Ox is designed specifically for XML and sacrifices some flexibility for raw speed.
In some benchmarks, Ox can be up to 10x faster than Nokogiri at XML parsing, although the gap depends on the size and shape of your documents. So if pure performance is your main priority, Ox is hard to beat.
However, Ox's API is more limited compared to other parsers. It doesn't support full XPath or CSS selectors. A parsed document is a tree of Ox::Document and Ox::Element objects, which you either traverse manually or query with the simpler, XPath-like path syntax of its locate method. This means it can be a bit more cumbersome to use for scraping tasks compared to other libraries.
Here‘s a simple example of parsing an XML document with Ox:
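require 'ox'
# The XML declaration makes Ox.parse return an Ox::Document that wraps <root>
xml = '<?xml version="1.0"?><root><element>Hello</element></root>'
doc = Ox.parse(xml)
# locate returns an array of matching nodes; text returns the element's string content
puts doc.locate('root/element').first.text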
Overall, Ox is an excellent choice if you know you‘ll just be dealing with well-formed XML documents and want maximum throughput. But for more complex scraping jobs involving messy HTML, you may be better off with a more full-featured tool like Nokogiri.
REXML
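REXML is a pure-Ruby XML parser that ships with Ruby. It was part of the standard library through Ruby 2.7 and has been a bundled gem since Ruby 3.0, so it is still installed alongside Ruby, but you may need to list it in your Gemfile when using Bundler. Because it's written in pure Ruby, REXML is quite portable. You don't need any extra C libraries.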
The downside is that REXML is significantly slower than libxml-based parsers like Nokogiri. It may be good enough for small XML documents, but it will struggle with large files or HTML.
Here‘s how you might use REXML to extract some data from an XML document:
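require 'rexml/document'
xml = <<EOF
<root>
  <element>Hello</element>
  <element>World</element>
</root>
EOF
doc = REXML::Document.new(xml)
# Visit every <element> child of <root> and print its text
doc.elements.each('root/element') do |ele|
  puts ele.text
end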
REXML also supports XPath for searching documents, although its XPath support isn‘t as complete as Nokogiri‘s.
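As a small illustration of that XPath support, the sketch below uses REXML::XPath to fetch all matching nodes and then a single node selected by an attribute predicate. The lang attributes are added here purely for the example.

```ruby
require 'rexml/document'

xml = <<~XML
  <root>
    <element lang="en">Hello</element>
    <element lang="fr">Bonjour</element>
  </root>
XML

doc = REXML::Document.new(xml)

# All matching nodes, returned as an array.
all_texts = REXML::XPath.match(doc, '//element').map(&:text)

# The first node that satisfies an attribute predicate.
french = REXML::XPath.first(doc, '//element[@lang="fr"]')

p all_texts
puts french.text
```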
If you can‘t install any gems and just need a basic XML parser in a pinch, REXML will get the job done. But for HTML parsing or better performance, you‘ll want to look elsewhere.
LibXML
LibXML (the libxml-ruby gem) is a Ruby wrapper around the GNOME libxml2 library. It gives you access to most of the features of this powerful native library within Ruby. This includes parsing HTML and XML, searching with XPath, and validating against a schema. XSL transformations are available through the companion libxslt-ruby gem. Unlike Nokogiri, LibXML does not provide CSS selectors.
LibXML has bindings that map libxml2‘s data structures to Ruby objects. These are generally designed to match libxml2‘s API as closely as possible. This means the API is quite comprehensive, but not always the most Ruby-like.
Here‘s an example of parsing an HTML document and extracting the text of the title tag with LibXML:
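require 'libxml'
html = <<-EOS
<html>
  <head><title>Hello, world!</title></head>
  <body><p>Hi there!</p></body>
</html>
EOS
# Parse with the HTML parser, which tolerates markup that is not well-formed XML
doc = LibXML::XML::HTMLParser.string(html).parse
# find_first takes an XPath expression; content returns the node's text
puts doc.find_first('//title').content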
LibXML's main advantage is that it exposes the power of libxml2 very directly. This includes some more advanced features like XSD validation, plus XSL transformation through libxslt-ruby, that other Ruby libraries don't always support. It's quite fast, too.
On the downside, LibXML has a lot of dependencies. Not only do you need to install the C libraries, but the Ruby gem itself can be finicky to install, especially on Windows. LibXML also has a much smaller user base than Nokogiri, so check the gem's recent release activity before depending on it.
If you‘re already familiar with the libxml2 APIs, or you need some of its more advanced features, LibXML is worth a look. Otherwise, for HTML parsing needs you‘ll probably be better served by Nokogiri.
Choosing the Right Parser for Your Needs
With all these options to choose from, which Ruby HTML/XML parsing library should you use? Here are some general guidelines:
- If you're looking for an all-around HTML and XML parser with a ton of features and great docs, go with Nokogiri. It's a safe default choice.
- If you need lower memory usage and are willing to sacrifice a bit of performance and features, try Oga. It's a good alternative to Nokogiri.
- For maximum parsing speed on large well-formed XML documents, Ox is the way to go. But beware its limited API.
- If you can't install gems at all, the REXML parser that ships with Ruby can work for basic XML parsing needs. But it will be quite slow.
- If you need direct access to the feature set of libxml2, including XSD validation, consider LibXML (with libxslt-ruby for XSLT). But installing it can be tricky.
Ultimately, the choice comes down to your specific requirements. Do you favor speed, memory efficiency, or a rich feature set? Are you mostly parsing messy HTML, or well-formed XML? There‘s no one-size-fits-all answer.
My advice is to start with Nokogiri, and only reach for one of the other libraries if you notice issues with performance or memory usage. Nokogiri is a great general-purpose tool and will be more than adequate for most scraping tasks.
Handling Edge Cases and Common Gotchas
Whichever HTML/XML parser you choose, you‘re bound to run into some challenges. Here are a few tips to keep in mind:
- Many modern websites render content dynamically with JavaScript. If the data you need isn't in the initial HTML payload, you may need a headless browser, driven through a tool like Capybara, to scrape the rendered content.
- Be aware of rate limits and bot detection. Many sites will block you if you hit them with too many requests too quickly. Slow down your scraper and consider setting a custom user agent header.
- Websites change all the time. The XPath expressions and CSS selectors you use to extract content can easily break. Try to write scrapers that are as resilient as possible to minor changes in the document structure.
- Not all parsers handle invalid HTML in the same way. If a page looks like it should work but your parser is choking on it, try validating the HTML or switching to a different parser.
Wrapping Up
HTML and XML parsing are key skills to master for any Rubyist looking to scrape data from the web. With the excellent parsing libraries available in the Ruby ecosystem, you‘re well-equipped to take on the challenge.
We‘ve given you the lay of the land in terms of your options for Ruby HTML/XML parsers. You should now have a clear idea of the relative strengths and weaknesses of Nokogiri, Oga, Ox, REXML, and LibXML.
Armed with this knowledge, you‘re ready to start extracting data from websites. Remember: start with Nokogiri and only venture further if you need to. Before long, you‘ll be scraping like a pro!
Happy parsing!