Extracting data from HTML tables on websites is a common task in web scraping. Tables often contain valuable, structured information like product catalogs, sports statistics, financial data, and more. Fortunately, Ruby has great built-in and external libraries that make it easy to extract table data through HTTP requests and HTML parsing.
In this guide, we'll walk through how to scrape data from HTML tables using Ruby. Whether you're a beginner or experienced Rubyist, you'll learn the concepts and code needed to master table extraction. Let's dive in!
Identifying and Selecting HTML Tables
The first step is to find the right tables to extract on a web page. Tables in HTML use the <table> tag and contain rows (<tr>) and cells (<td> or <th>). Here's a simple example:
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Row 1, Cell 1</td>
    <td>Row 1, Cell 2</td>
  </tr>
  <tr>
    <td>Row 2, Cell 1</td>
    <td>Row 2, Cell 2</td>
  </tr>
</table>
To select a specific table, we need a CSS selector that uniquely identifies it. We could target it by table index, e.g. table:nth-of-type(2) for the 2nd table on the page, but it's better to use a selector based on a class name, ID, or other unique attribute on the <table> tag. For example, if the table had the class "prices", we could select it in CSS with .prices. We'll see later how to use CSS selectors in Ruby to find the table in the page's HTML.
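To make this concrete, here's a small preview of how those selectors look in Nokogiri (parsing is covered in the next section). The .prices class and the HTML snippet are made up for illustration:

require 'nokogiri'

# A made-up page with two tables, the second carrying our class
html = <<~HTML
  <table><tr><td>Other data</td></tr></table>
  <table class="prices"><tr><td>Widget</td><td>$9.99</td></tr></table>
HTML

doc = Nokogiri::HTML(html)

doc.at_css('table.prices')         # select by class (robust)
doc.at_css('table:nth-of-type(2)') # select by position (fragile)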
Ruby Libraries for HTTP Requests and HTML Parsing
Ruby has several great libraries for fetching web pages and extracting data from the HTML:
- OpenURI – This is part of Ruby's standard library, so no installation is needed. It provides a simple way to make HTTP GET requests.
- Net::HTTP – Also part of the standard library, this allows more control over HTTP requests, like setting headers and cookies.
- HTTParty – A popular gem for making HTTP requests with a friendly API. It has built-in parsing and other convenient features.
- Nokogiri – The go-to Ruby library for parsing HTML and XML documents. It uses CSS selectors or XPath to find elements in the parsed page.
We'll use OpenURI to fetch pages and Nokogiri to parse the HTML and extract tables. First, make sure you have Nokogiri installed:
gem install nokogiri
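To give a feel for the alternatives, here's the same page fetch with each HTTP client. Treat it as a sketch: example.com is a placeholder, and HTTParty needs gem install httparty first.

require 'open-uri'
require 'net/http'

url = 'https://example.com'

# OpenURI: the simplest one-liner
html = URI.open(url).read

# Net::HTTP: more verbose, but gives control over headers, timeouts, etc.
html = Net::HTTP.get(URI(url))

# HTTParty (external gem): friendly API with built-in response helpers
# require 'httparty'
# html = HTTParty.get(url).body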
Extracting a Table from a Web Page
Now let's put it all together and extract an example table from a web page using Ruby. We'll scrape Wikipedia's List of highest mountains on Earth page.
First, we require the needed libraries and fetch the page HTML with OpenURI:
require 'open-uri'
require 'nokogiri'

url = 'https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth'
html = URI.open(url).read
Next, we parse the HTML with Nokogiri and use CSS selectors to find the table rows:
doc = Nokogiri::HTML(html)
table_rows = doc.css('table.wikitable tr')
This selects all <tr> elements inside any table with the class "wikitable". We can now iterate through the rows and extract the cell text:
mountains = []

table_rows[1..-1].each do |row|
  cells = row.css('td')
  next if cells.empty? # skip extra header rows that contain only <th> cells

  mountains << {
    rank: cells[0].text.strip,
    name: cells[1].text.strip,
    height_metres: cells[3].text.strip,
    height_feet: cells[4].text.strip
  }
end
The first row is skipped since it contains header cells (<th>), and the next if cells.empty? guard skips any other header-only rows, which would otherwise raise a NoMethodError on cells[0].text. For each data row, we select all <td> cells, extract their text, and store the data in a hash.
Finally, we can output the extracted data as JSON:
require 'json'

puts JSON.pretty_generate(mountains)

Here's the full script:
require 'open-uri'
require 'nokogiri'
require 'json'

url = 'https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth'
html = URI.open(url).read

doc = Nokogiri::HTML(html)
table_rows = doc.css('table.wikitable tr')

mountains = []

table_rows[1..-1].each do |row|
  cells = row.css('td')
  next if cells.empty? # skip header-only rows

  mountains << {
    rank: cells[0].text.strip,
    name: cells[1].text.strip,
    height_metres: cells[3].text.strip,
    height_feet: cells[4].text.strip
  }
end

puts JSON.pretty_generate(mountains)
Running this prints the extracted table data in nicely formatted JSON:
[
  {
    "rank": "1",
    "name": "Mount Everest",
    "height_metres": "8,848",
    "height_feet": "29,031.7"
  },
  {
    "rank": "2",
    "name": "K2",
    "height_metres": "8,611",
    "height_feet": "28,251.3"
  },
  ...
]
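JSON isn't the only option. Since the data is already tabular, the same mountains array can be written straight to a spreadsheet-friendly file with Ruby's standard csv library. A minimal sketch (mountains.csv is an arbitrary filename):

require 'csv'

CSV.open('mountains.csv', 'w') do |csv|
  csv << ['Rank', 'Name', 'Height (m)', 'Height (ft)'] # header row
  mountains.each do |m|
    csv << [m[:rank], m[:name], m[:height_metres], m[:height_feet]]
  end
end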
Handling Pagination and Multi-Page Tables
Some tables span multiple pages, requiring you to follow "Next" links or make paginated requests to get all the data. To handle this:
- Extract the "Next" link from each page
- Follow the link to fetch the next page
- Repeat until there are no more pages
Here's a simple way to implement pagination:

require 'open-uri'
require 'nokogiri'

def scrape_table(url)
  html = URI.open(url).read
  doc = Nokogiri::HTML(html)
  table_rows = doc.css('table tr')

  data = []
  table_rows[1..-1].each do |row|
    # Extract the row's cells and store their text
    data << row.css('td').map { |cell| cell.text.strip }
  end

  next_link = doc.at_css('a.next')
  if next_link
    # Resolve the href against the current URL in case it's relative
    next_url = URI.join(url, next_link['href']).to_s
    data += scrape_table(next_url)
  end

  data
end

url = 'https://example.com/table?page=1'
all_data = scrape_table(url)
The scrape_table method recursively follows "Next" links, scraping table rows from each page until there are no more. In production you'd want to cap how many pages it follows so a circular "Next" link can't loop forever.
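Recursion keeps the code short, but an iterative loop makes that cap trivial and avoids growing the call stack on long traversals. Here's an equivalent loop-based sketch (a.next remains a hypothetical, site-specific selector):

require 'open-uri'
require 'nokogiri'

def scrape_all_pages(start_url, max_pages: 50)
  data = []
  url = start_url

  max_pages.times do
    doc = Nokogiri::HTML(URI.open(url).read)

    doc.css('table tr')[1..-1].each do |row|
      data << row.css('td').map { |cell| cell.text.strip }
    end

    next_link = doc.at_css('a.next')
    break unless next_link
    url = URI.join(url, next_link['href']).to_s # resolve relative links
  end

  data
end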
Tips and Best Practices
Here are some tips to keep in mind when scraping tables with Ruby:
- Be respectful of website owners and don't hammer servers with requests. Add delays between requests if scraping many pages (see the helper sketched after this list).
- Use caching to avoid re-fetching unchanged pages and minimize requests.
- Handle errors gracefully. Use begin/rescue blocks to catch network timeouts, 404 pages, etc.
- Some tables are malformed or have missing cells. Check for this and skip or fix invalid rows to avoid exceptions.
- Scraping too aggressively can get your IP blocked. Use a proxy service like ScrapingBee to avoid blocks.
- Respect robots.txt and terms of service. Don't scrape websites that prohibit it.
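Several of these tips can be rolled into one small fetch helper. Here's a sketch with a request delay, a custom User-Agent, and a basic retry on errors; the delay, retry count, and User-Agent string are just illustrative defaults:

require 'open-uri'

def polite_fetch(url, delay: 2, retries: 3)
  sleep(delay) # pause between requests so we don't hammer the server
  URI.open(url, 'User-Agent' => 'MyScraper/1.0 (me@example.com)').read
rescue OpenURI::HTTPError, SocketError, Net::OpenTimeout => e
  retries -= 1
  retry if retries > 0
  warn "Giving up on #{url}: #{e.message}"
  nil
end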
Conclusion
Extracting data from HTML tables with Ruby is a powerful way to harvest structured information from websites. Whether you're scraping a single table or following one across multiple paginated pages, Ruby libraries like OpenURI and Nokogiri make it simple.
The general steps are:
- Fetch the page HTML
- Parse the HTML with Nokogiri
- Select the table and iterate through rows
- Extract cell data from each row and store it
- Handle pagination by following "Next" links
- Output the data in your desired format
By following the code samples and concepts explained here, you're ready to scrape data from almost any table on the web using Ruby. Just remember to always scrape ethically and respect website owners.
Happy scraping!