How to Extract a Table's Content in Ruby: The Ultimate Guide

Extracting data from HTML tables on websites is a common task in web scraping. Tables often contain valuable, structured information like product catalogs, sports statistics, financial data, and more. Fortunately, Ruby has great built-in and external libraries that make it easy to extract table data through HTTP requests and HTML parsing.

In this guide, we'll walk through how to scrape data from HTML tables using Ruby. Whether you're a beginner or an experienced Rubyist, you'll learn the concepts and code needed to master table extraction. Let's dive in!

Identifying and Selecting HTML Tables

The first step is to find the right table to extract on a web page. Tables in HTML use the <table> tag and contain rows (<tr>) and cells (<td> or <th>). Here's a simple example:

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Row 1, Cell 1</td>
    <td>Row 1, Cell 2</td>
  </tr>
  <tr>
    <td>Row 2, Cell 1</td> 
    <td>Row 2, Cell 2</td>
  </tr>
</table>

To select a specific table, we need a CSS selector that uniquely matches it. We could target it by position, e.g. table:nth-of-type(2) for the 2nd table on the page. But it's more robust to use a selector based on a class name, ID, or other unique attribute on the <table> tag.

For example, if the table had a class "prices", we could select it in CSS with .prices. We'll see below how to use CSS selectors in Ruby to find the table in the page's HTML.
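As a quick preview, here's how both selection strategies look with Nokogiri (a minimal sketch; the HTML fragment and the "prices" class are purely illustrative):

require 'nokogiri'

html = <<~HTML
  <table class="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$9.99</td></tr>
  </table>
HTML

doc = Nokogiri::HTML(html)

# Select by a unique class name (preferred when one exists)
table = doc.at_css('table.prices')

# Or select by position, e.g. the first table in the document
table = doc.at_css('table:nth-of-type(1)')

puts table.css('td').map(&:text).inspect  # => ["Widget", "$9.99"]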

Ruby Libraries for HTTP Requests and HTML Parsing

Ruby has several great libraries for fetching web pages and extracting data from the HTML:

  • OpenURI – This is part of Ruby's standard library, so no installation is needed. It provides a simple way to make HTTP GET requests.
  • Net::HTTP – Also part of the standard library, this allows more control over HTTP requests, like setting headers and cookies.
  • HTTParty – A popular gem for making HTTP requests with a friendly API. It has built-in parsing and other convenient features.
  • Nokogiri – The go-to Ruby library for parsing HTML and XML documents. It uses CSS selectors or XPath to find elements in the parsed page.
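To make the first two concrete, here's the same GET request with OpenURI and with Net::HTTP (a minimal sketch; example.com and the User-Agent string are placeholders):

require 'open-uri'
require 'net/http'

url = 'https://example.com/'

# OpenURI: one line, returns an IO-like object
html = URI.open(url).read

# Net::HTTP: more verbose, but gives control over headers, timeouts, etc.
uri = URI(url)
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = 'MyScraper/1.0'
  http.request(request)
end
html = response.body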

We'll use OpenURI to fetch pages and Nokogiri to parse the HTML and extract tables. First, make sure you have Nokogiri installed:

gem install nokogiri

Extracting a Table from a Web Page

Now let's put it all together and extract an example table from a web page using Ruby. We'll scrape Wikipedia's list of the highest mountains on Earth.

First, we require the needed libraries and fetch the page HTML with OpenURI:

require 'open-uri'
require 'nokogiri'

url = 'https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth'
html = URI.open(url).read

Next, we parse the HTML with Nokogiri and use CSS selectors to find the table rows:

doc = Nokogiri::HTML(html)

table_rows = doc.css('table.wikitable tr')

This selects all <tr> elements inside any table with the class "wikitable". If the page contains several such tables, you can scope the selector to the first match with doc.at_css('table.wikitable').css('tr'). We can now iterate through the rows and extract the cell text:

mountains = []

table_rows[1..-1].each do |row|
  cells = row.css('td')
  next if cells.empty? # skip any extra header rows that contain only <th> cells

  mountains << {
    rank: cells[0].text.strip,
    name: cells[1].text.strip,
    height_metres: cells[3].text.strip,
    height_feet: cells[4].text.strip
  }
end

The first row is skipped since it contains header cells (<th>), and the guard skips any other header-only rows so we never call .text on a missing cell. For each remaining row, we select all <td> cells, extract their text, and store the data in a hash.
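If you'd rather not hard-code the column indices, you could derive the hash keys from the table's own header row. Here's a hedged sketch; note that the header text on the live page may differ, and rowspan/colspan cells can shift the alignment:

# Build hash keys from the header row instead of hard-coding indices
headers = table_rows.first.css('th').map { |th| th.text.strip.downcase.gsub(/\s+/, '_') }

mountains = table_rows[1..-1].filter_map do |row|
  cells = row.css('td')
  next if cells.empty?

  headers.zip(cells.map { |td| td.text.strip }).to_h
end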

Finally, we can output the extracted data as JSON:

require 'json'

puts JSON.pretty_generate(mountains)

Here's the full script:

require 'open-uri'
require 'nokogiri'
require 'json'

url = 'https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth'
html = URI.open(url).read

doc = Nokogiri::HTML(html)

table_rows = doc.css('table.wikitable tr')

mountains = []

table_rows[1..-1].each do |row|
  cells = row.css('td')
  next if cells.empty? # skip header-only rows

  mountains << {
    rank: cells[0].text.strip,
    name: cells[1].text.strip,
    height_metres: cells[3].text.strip,
    height_feet: cells[4].text.strip
  }
end

puts JSON.pretty_generate(mountains)

Running this prints the extracted table data in nicely formatted JSON:

[
  {
    "rank": "1",
    "name": "Mount Everest",
    "height_metres": "8,848",
    "height_feet": "29,031.7"
  },
  {
    "rank": "2",  
    "name": "K2",
    "height_metres": "8,611",
    "height_feet": "28,251.3"
  },
  ...
]
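JSON isn't the only output option. Ruby's standard csv library can write the same array of hashes to a spreadsheet-friendly file (a minimal sketch, assuming the mountains array built above is non-empty):

require 'csv'

# Write the extracted rows to mountains.csv, using the hash keys as the header
CSV.open('mountains.csv', 'w') do |csv|
  csv << mountains.first.keys
  mountains.each { |m| csv << m.values }
end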

Handling Pagination and Multi-Page Tables

Some tables span multiple pages, requiring you to follow "Next" links or make paginated requests to get all the data. To handle this:

  1. Extract the "Next" link from each page
  2. Follow the link to fetch the next page
  3. Repeat until there are no more pages

Here's a simple way to implement pagination:

require 'open-uri'
require 'nokogiri'

def scrape_table(url)
  html = URI.open(url).read
  doc = Nokogiri::HTML(html)

  table_rows = doc.css('table tr')

  data = []
  table_rows[1..-1].each do |row|
    # Extract row cells and store the data (adapt this to your table's columns)
    data << row.css('td').map { |td| td.text.strip }
  end

  next_link = doc.at_css('a.next')
  if next_link
    # Resolve relative hrefs against the current page's URL
    next_url = URI.join(url, next_link['href']).to_s
    data += scrape_table(next_url)
  end

  data
end

url = 'https://example.com/table?page=1'
all_data = scrape_table(url)

url = ‘https://example.com/table?page=1‘
all_data = scrape_table(url)

The scrape_table method recursively follows "Next" links, scraping table rows from each page until no more pages remain. Note that href attributes are often relative URLs, which is why they are resolved against the current page's URL with URI.join before the recursive call.

Tips and Best Practices

Here are some tips to keep in mind when scraping tables with Ruby:

  • Be respectful of website owners and don't hammer servers with requests. Add delays between requests if scraping many pages.
  • Use caching to avoid re-fetching unchanged pages and minimize requests.
  • Handle errors gracefully. Use begin/rescue blocks to catch network timeouts, 404 pages, etc. (see the sketch after this list).
  • Some tables are malformed or have missing cells. Check for this and skip or fix invalid rows to avoid exceptions.
  • Scraping too aggressively can get your IP blocked. Use a proxy service like ScrapingBee to avoid blocks.
  • Respect robots.txt and terms of service. Don't scrape websites that prohibit it.
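Putting the first and third tips together, here's a hedged sketch of a polite fetch helper with a fixed delay, a simple retry, and basic error handling (the delay, retry count, and User-Agent string are arbitrary choices):

require 'open-uri'

# Fetch a URL politely: wait between requests, retry transient failures,
# and give up gracefully on permanent errors like 404s.
def polite_fetch(url, delay: 2, retries: 3)
  sleep delay # be kind to the server between requests
  URI.open(url, 'User-Agent' => 'MyScraper/1.0').read
rescue OpenURI::HTTPError => e
  warn "HTTP error for #{url}: #{e.message}" # e.g. a 404; skip this page
  nil
rescue SocketError, Net::OpenTimeout => e
  retries -= 1
  retry if retries.positive?
  warn "Giving up on #{url}: #{e.message}"
  nil
end

html = polite_fetch('https://example.com/table?page=1')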

Conclusion

Extracting data from HTML tables with Ruby is a powerful technique for harvesting structured information from websites. Whether you're scraping a single table or one that spans multiple pages, Ruby libraries like OpenURI and Nokogiri make it simple.

The general steps are:

  1. Fetch the page HTML
  2. Parse the HTML with Nokogiri
  3. Select the table and iterate through rows
  4. Extract cell data from each row and store it
  5. Handle pagination by following "Next" links
  6. Output the data in your desired format

By following the code samples and concepts explained here, you're ready to scrape data from almost any table on the web using Ruby. Just remember to always scrape ethically and respect website owners.

Happy scraping!
