
How to Parse HTML in Ruby with Nokogiri: The Ultimate Guide

If you've ever needed to extract data from websites that don't provide a convenient API, you know that web scraping can be a powerful technique to have in your developer toolkit. Ruby is a great language for web scraping thanks to excellent parsing libraries like Nokogiri that make it easy to select and extract information from HTML documents.

In this ultimate guide, we'll take an in-depth look at how to parse HTML in Ruby using the popular Nokogiri gem. You'll learn everything you need to know to efficiently scrape websites, from the basics of installing and using Nokogiri, to more advanced topics like traversing complex document trees and handling pagination. Let's dive in!

What is Nokogiri?

Nokogiri is an open source Ruby library for parsing, searching, and editing HTML and XML documents. Its name comes from the Japanese word for saw (鋸), referring to its ability to slice through markup documents.

Some key features of Nokogiri include:

  • Ability to parse messy or malformed HTML in a permissive way, just like web browsers (see the sketch after this list)
  • Support for using CSS selector or XPath expressions to pinpoint elements you want
  • Built-in methods to fetch documents from URLs or files
  • Fast and memory-efficient for parsing large documents
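
To see that permissive parsing in action, here's a minimal sketch (the malformed fragment is invented for illustration). Nokogiri quietly closes the unclosed tags and builds a well-formed tree, much as a browser would:

require 'nokogiri'

# Deliberately malformed HTML: unclosed <p> and <b> tags
messy = '<html><body><p>First paragraph<p>Second <b>bold text</body></html>'

doc = Nokogiri::HTML(messy)

# Nokogiri has repaired the tree: both <p> elements are properly closed
puts doc.css('p').length #=> 2
puts doc.at('b').text    #=> "bold text"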

Nokogiri is the go-to library for parsing HTML in Ruby. It provides a convenient and intuitive way to extract structured data from websites, which you can then use for a variety of applications like data mining, research, automated testing, or building data products.

Installing Nokogiri

Before you can start parsing HTML with Nokogiri, you'll need to install the gem. You can do this by simply running:

gem install nokogiri

If you're using Bundler to manage dependencies for a Ruby project, add this line to your Gemfile:

gem 'nokogiri'

And then run:

bundle install

That's it! You're now ready to start using Nokogiri in your Ruby code.

Parsing an HTML Document

Nokogiri makes it easy to parse HTML from a string, file, or URL. The Nokogiri::HTML method parses markup and returns a document object that you can then search and manipulate.

Here's how you can load HTML from a URL:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://example.com'))

The URI.open method from the open-uri library fetches the HTML content from the URL. This gets passed to Nokogiri::HTML, which parses it and returns a Nokogiri::HTML::Document object.

You can also parse HTML from a string:

html_string = "<html><body></body></html>"
doc = Nokogiri::HTML(html_string)

Or from a file:

doc = Nokogiri::HTML(File.read('path/to/file.html'))

Searching a Parsed HTML Document

Once you have a parsed Nokogiri document, you can search it using CSS selectors. CSS selectors provide a powerful way to pinpoint specific elements within an HTML document based on tag names, attributes, classes, IDs, and more.

The two main methods for searching a document are:

  • at: Returns the first node matching a CSS or XPath selector
  • css: Returns a NodeSet of all elements matching a CSS selector

For example, to find the first <h1> element in a document:

doc.at('h1') #=> #<Nokogiri::XML::Element:0x3fc0d1b1f0b0 name="h1" children=[#<Nokogiri::XML::Text:0x3fc0d1b1ef8c "Hello world!">]>

To find all <p> elements with the class "highlight":

doc.css('p.highlight') #=> [#<Nokogiri::XML::Element:0x3ff3c9a2e244 name="p" attributes=[#<Nokogiri::XML::Attr:0x3ff3c9a2e0a4 name="class" value="highlight">] children=[#<Nokogiri::XML::Text:0x3ff3c9a2dd58 "This is highlighted">]>, ...]

You can chain CSS selectors to traverse nested elements:

doc.css('div#main p') # Find all <p> elements inside <div id="main">

Nokogiri also supports using XPath expressions for searching documents, but CSS selectors are generally more concise and easier to read and maintain.
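
For comparison, here's a quick sketch of the same kind of query written both ways (the sample markup is invented for illustration):

require 'nokogiri'

doc = Nokogiri::HTML('<div id="main"><p>Hello</p></div><a href="/2">Next</a>')

# The CSS selector and its XPath equivalent match the same elements
doc.css('div#main p').length             #=> 1
doc.xpath('//div[@id="main"]//p').length #=> 1

# XPath can also match on text content, which plain CSS selectors cannot
doc.xpath('//a[contains(text(), "Next")]').first['href'] #=> "/2"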

Extracting Data from Elements

Once you've found the elements you want using Nokogiri's search methods, you can extract the desired data from them. Nokogiri provides methods to access an element's name, attributes, text content, HTML content, and more.

Common methods for data extraction include:

  • text: Returns the inner text of an element, stripping out HTML tags
  • []: Allows access to element attributes by name
  • children: Returns an element's child nodes as a NodeSet

For example:

link = doc.at('a')
link.text #=> "Click here"
link['href'] #=> "https://example.com"
link.children #=> #<Nokogiri::XML::NodeSet:0x3fd0d2b1a204 [#<Nokogiri::XML::Text:0x3fd0d2b19f30 "Click here">]>
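
Putting searching and extraction together, a common pattern is to map a NodeSet into an array of hashes, one per record. Here's a minimal sketch (the markup and field names are invented for illustration):

require 'nokogiri'

html = <<~HTML
  <ul>
    <li><a href="/posts/1">First post</a></li>
    <li><a href="/posts/2">Second post</a></li>
  </ul>
HTML

doc = Nokogiri::HTML(html)

# Build one hash per link, keeping just the fields we care about
posts = doc.css('ul li a').map do |link|
  { title: link.text, url: link['href'] }
end

posts #=> [{:title=>"First post", :url=>"/posts/1"}, {:title=>"Second post", :url=>"/posts/2"}]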

Handling Pagination

It's common for websites to split content across multiple pages. When scraping such paginated data, you'll need a way to discover and iterate through all the relevant pages.

The exact implementation will depend on how the site handles pagination, but a common approach is to find the "Next" link and keep following it until there are no more pages left.

Here's an example of how you could handle pagination:

require 'nokogiri'
require 'open-uri'

def scrape_pages(url)
  doc = Nokogiri::HTML(URI.open(url))

  # Scrape data from page
  puts doc.at('h1').text
  doc.css('p').each do |p|
    puts p.text
  end

  # Find next page link
  next_link = doc.at('.pagination a[rel="next"]')

  # Follow it if present, resolving a possibly relative href
  # against the current page's URL
  if next_link
    next_url = URI.join(url, next_link['href']).to_s
    scrape_pages(next_url)
  end
end

scrape_pages('https://example.com/articles?page=1')

This recursive scrape_pages method fetches each page, scrapes the desired data from it, checks for a "Next" pagination link, and if one is found, calls itself with the next page URL. Because the href may be relative, URI.join resolves it against the current page's URL first. This continues until no more "Next" links are found.
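
On sites with many pages, deep recursion can eventually exhaust the stack. An equivalent iterative loop (a sketch, assuming the same page structure as above) avoids that and makes it easy to pause between requests:

require 'nokogiri'
require 'open-uri'

def scrape_pages(url)
  while url
    doc = Nokogiri::HTML(URI.open(url))

    puts doc.at('h1').text
    doc.css('p').each { |p| puts p.text }

    # Resolve the next link (if any) against the current URL, then loop
    next_link = doc.at('.pagination a[rel="next"]')
    url = next_link && URI.join(url, next_link['href']).to_s

    sleep 1 if url # be polite: pause before the next request
  end
end

scrape_pages('https://example.com/articles?page=1')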

Best Practices for Web Scraping with Nokogiri

When scraping websites using Nokogiri, there are some best practices to keep in mind:

  • Respect website terms of service and robots.txt. Don't scrape websites that prohibit it.
  • Limit the rate at which you send requests to avoid overloading servers. Add a delay between requests.
  • Handle errors and edge cases gracefully. Use begin/rescue blocks to catch exceptions (see the sketch after this list).
  • Verify you are extracting the correct data. Websites may change layouts, so periodically check your scraper output.
  • Cache scraped data to avoid unnecessary requests. You can dump scraped data to files or a database for reuse.
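
Here's a minimal sketch combining two of these practices, a delay and begin/rescue-style error handling, in a hypothetical fetch_doc helper (the helper name, retry count, and delay are arbitrary choices for illustration):

require 'nokogiri'
require 'open-uri'

# Rate-limited, fault-tolerant fetch: sleeps before every request and
# retries a few times on network errors before giving up
def fetch_doc(url, retries: 3, delay: 1)
  sleep delay
  Nokogiri::HTML(URI.open(url))
rescue OpenURI::HTTPError, SocketError => e
  retries -= 1
  retry if retries > 0
  warn "Giving up on #{url}: #{e.message}"
  nil
end

doc = fetch_doc('https://example.com')
puts doc.at('h1').text if doc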

Limitations of Nokogiri

While Nokogiri is great for parsing and extracting data from HTML, it does have some limitations:

  • It cannot execute JavaScript, so content loaded dynamically via JS will not be available to Nokogiri
  • It does not handle cookie or session management, so scraping websites that require login can be difficult
  • It provides no way to emulate user actions like form submissions or button clicks

For scraping more dynamic or interactive websites, headless browsers like Puppeteer or Selenium may be more suitable tools. But for parsing raw HTML, Nokogiri is hard to beat!

Conclusion

Nokogiri is a powerful and flexible tool for parsing HTML in Ruby. With its support for CSS selectors and small but convenient API for traversing and manipulating elements, you can extract structured data from even the messiest of websites.

In this guide, we've covered all the essentials of parsing HTML with Nokogiri, including:

  • Installing and requiring Nokogiri
  • Fetching and parsing HTML documents
  • Searching documents using CSS selectors
  • Extracting data from matched elements
  • Discovering and following pagination links
  • Best practices and limitations of Nokogiri

You should now have a solid foundation for scraping websites using Ruby and Nokogiri. Remember to always be respectful when scraping and consider the load you are placing on servers. Happy parsing!
