Web scraping is the process of programmatically extracting data from websites. It's an incredibly useful technique that allows you to gather information at scale from across the internet. Some common use cases include:
- Price monitoring for e-commerce
- Aggregating job listings, real estate, or other classifieds
- Building machine learning datasets
- Archiving websites for historical research
- Automating business processes that rely on web data
The Ruby programming language provides a fantastic ecosystem for building web scrapers. In this guide, we'll take an in-depth look at the tools and techniques for web scraping with Ruby in 2024. Let's dive in!
Overview of Ruby Web Scraping Libraries
Here are the key libraries you should know about for web scraping in Ruby:
- Nokogiri – Nokogiri is a gem that allows you to parse HTML and XML documents. It provides a convenient way to extract data using CSS selectors and XPath expressions.
- HTTParty – HTTParty makes it easy to make HTTP requests from Ruby. You can use it to fetch the HTML of web pages you want to scrape.
- Mechanize – Mechanize is a gem that automates interaction with websites. It allows you to programmatically navigate pages, fill out and submit forms, and follow links.
- Watir – Watir is a gem that allows you to automate and test web browsers. It's useful for scraping JavaScript-heavy websites that require interaction to load content.
- Selenium – Selenium is a suite of tools for automating web browsers. The Ruby bindings allow you to write scrapers that work with real browsers like Chrome.
- Kimurai – Kimurai is a framework for building robust, large-scale web scrapers in Ruby. It handles running scrapers in parallel, retrying failed requests, and more.

Now let's see how to use these libraries to scrape data from the web.
Scraping a Static Website with Nokogiri
Many websites render most of their content in plain HTML on the server. You can scrape these types of static sites using Nokogiri and HTTParty. Here's an example that extracts data from a simple e-commerce product page:
```ruby
require 'nokogiri'
require 'httparty'

def scrape_product(url)
  # Fetch the HTML of the page using HTTParty
  unparsed_page = HTTParty.get(url)

  # Parse the HTML with Nokogiri
  parsed_page = Nokogiri::HTML(unparsed_page.body)

  # Extract the data using CSS selectors
  product = {
    name: parsed_page.css('h1.product-name').text,
    price: parsed_page.css('span.price').text,
    description: parsed_page.css('div.product-description').text
  }

  # Return the extracted data
  product
end

puts scrape_product('https://example.com/products/1')
```
This code does the following:
- We define a `scrape_product` method that takes a URL as an argument.
- Inside the method, we use HTTParty's `get` method to fetch the HTML content of the URL.
- We parse the raw HTML using Nokogiri's `HTML` method. This returns a `Nokogiri::HTML::Document` object that we can query to extract data.
- To extract the relevant bits of data from the page, we use Nokogiri's `css` method and pass in CSS selectors. Here we're assuming the product name is in an `h1` tag with a class of `product-name`, the price is in a `span` with a class of `price`, and so on.
- We store the extracted data in a Ruby hash and return it at the end of the method.
When we call `scrape_product('https://example.com/products/1')`, it will fetch the HTML of that URL, parse it, extract the product data, and print it out.
Of course, real-world websites are usually more complex than this simple example. You may need to traverse and query the HTML further to get the data you want. Nokogiri also supports more complex CSS selectors and XPath expressions to handle almost any HTML structure.
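For instance, here's a small self-contained sketch that mixes a CSS pseudo-class selector with an XPath query. The markup and selectors are made up for illustration, not taken from a real page:

```ruby
require 'nokogiri'

html = <<~HTML
  <ul class="specs">
    <li><span class="label">Weight</span><span class="value">1.2 kg</span></li>
    <li><span class="label">Color</span><span class="value">Black</span></li>
  </ul>
HTML

doc = Nokogiri::HTML(html)

# CSS: grab the value inside the first list item
weight = doc.at_css('li:first-child span.value')&.text

# XPath: find the value whose sibling label reads "Color"
color = doc.at_xpath('//li[span[@class="label" and text()="Color"]]/span[@class="value"]')&.text

puts weight # => "1.2 kg"
puts color  # => "Black"
```

Note that `at_css` and `at_xpath` return the first matching node (or `nil`), which is often more convenient than `css`/`xpath` when you expect exactly one element.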
Scraping Dynamic Websites with Watir
Some websites make heavy use of JavaScript to load data asynchronously or render content on the client side. This can make scraping with Nokogiri alone difficult or impossible.
In these cases, you can automate a real web browser like Chrome to fully load the dynamic content before extracting it. Here's an example of how to do that using Watir:
```ruby
require 'watir'

def scrape_dynamic_page(url)
  # Open a new browser window
  browser = Watir::Browser.new

  # Navigate to the URL
  browser.goto url

  # Wait for the required element to load
  browser.div(id: 'data-div').wait_until(&:present?)

  # Extract the data from the page
  data = browser.div(id: 'data-div').text

  # Close the browser
  browser.close

  # Return the extracted data
  data
end

puts scrape_dynamic_page('https://example.com/dynamic-data')
```
In this example:
- We open a new instance of a web browser using Watir.
- We instruct the browser to navigate to the URL of the page we want to scrape.
- Since the data we want might not be immediately available, we tell Watir to wait until a specific element (in this case a `div` with an ID of `data-div`) is present on the page. This ensures the data has loaded before we try to extract it.
- We extract the text content of the `data-div` element.
- We close the browser to free up system resources.
- Finally, we return the extracted data.
Using a real browser like this allows the JavaScript on the page to fully execute before we attempt to scrape anything. The downside is that it's slower and more resource-intensive than using Nokogiri on server-rendered HTML.
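To trim that overhead, you can run the browser headlessly. Here's a minimal sketch, assuming a reasonably recent Watir and Chrome and the same hypothetical `data-div` element as above:

```ruby
require 'watir'

# Run Chrome without a visible window to save resources
# (the headless option requires a recent Watir and Chrome)
browser = Watir::Browser.new(:chrome, headless: true)

begin
  browser.goto 'https://example.com/dynamic-data'
  # Wait up to 30 seconds instead of the default timeout
  browser.div(id: 'data-div').wait_until(timeout: 30, &:present?)
  puts browser.div(id: 'data-div').text
ensure
  browser.close # always release the browser, even if scraping fails
end
```

Wrapping the work in `begin`/`ensure` guarantees the browser process is shut down even if the scrape raises an error.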
Navigating and Submitting Forms with Mechanize
Sometimes the data you need is behind a login form or spread across multiple pages that require interaction to access. The Mechanize gem makes these types of scenarios easy to handle.
Here's an example of using Mechanize to submit a search form and scrape the results:
```ruby
require 'mechanize'

def scrape_search_results(search_term)
  # Initialize a new Mechanize agent
  agent = Mechanize.new

  # Navigate to the search page
  page = agent.get('https://example.com/search')

  # Fill out and submit the search form
  search_form = page.form_with(id: 'search-form')
  search_form.field_with(name: 'query').value = search_term
  results_page = agent.submit(search_form)

  # Extract the search result data
  results = results_page.search('div.result').map do |result|
    {
      title: result.search('h3').text,
      url: result.at_css('a')['href']
    }
  end

  # Return the results
  results
end

puts scrape_search_results('example search')
```
This code does the following:
- We create a new instance of a Mechanize agent. This will allow us to interact with web pages programmatically.
- We tell the agent to navigate to the search page URL.
- We find the search form on the page by its ID and fill out the query field with our search term.
- We submit the form, which returns a new page object containing the search results.
- We find all the search result elements on the results page (here we assume they are in `div` elements with a class of `result`).
- We extract the title and URL of each result and store them in a hash.
- We return the array of result hashes.
Mechanize also has built-in support for handling cookies, redirects, and other complications that arise when scraping websites that require multiple requests and stateful interaction.
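For example, here's a minimal login sketch. The URL, form ID, and field names (`username`, `password`) are placeholders rather than a real site's markup, so inspect the actual login page to find the right values:

```ruby
require 'mechanize'

agent = Mechanize.new

# Fetch the (hypothetical) login page and fill in the form
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(id: 'login-form')
login_form.field_with(name: 'username').value = 'my_user'
login_form.field_with(name: 'password').value = 'my_password'
agent.submit(login_form)

# The agent keeps the session cookies, so later requests are authenticated
dashboard = agent.get('https://example.com/dashboard')
puts dashboard.search('h1').text
```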
Scaling Up with Kimurai
For large-scale web scraping projects, you'll want to use a framework that can handle running multiple scrapers concurrently, retrying failed requests, and storing results reliably.
The Kimurai framework provides a robust solution for these needs. Here's an example of a Kimurai scraper that extracts data from a list of URLs:
```ruby
require 'kimurai'

class ExampleScraper < Kimurai::Base
  @name = 'example_scraper'
  @engine = :mechanize
  @start_urls = ['https://example.com/page1', 'https://example.com/page2']

  def parse(response, url:, data: {})
    # Extract data from the page
    data = {}
    data[:title] = response.css('h1').text
    data[:content] = response.css('div.content').text

    # Store the extracted data
    save_to "output.json", data, format: :json
  end
end

ExampleScraper.crawl!
```
Here's what this code does:
- We define a new `ExampleScraper` class that inherits from `Kimurai::Base`.
- We specify the name of our scraper, the browser engine to use (Mechanize in this case), and the list of start URLs.
- We define a `parse` method that will be called on each scraped page. It takes the HTTP response, the URL of the page, and any associated data.
- Inside `parse`, we use Nokogiri-style methods to extract the title and content from the page and store them in the `data` hash.
- We use Kimurai's `save_to` method to write the scraped data to a JSON file. Kimurai will automatically handle appending to the file for each scraped page.
- Finally, we tell Kimurai to start the crawl with `ExampleScraper.crawl!`.
Kimurai handles the details of scheduling the requests, handling errors, and managing concurrency. It also provides configuration options for politeness (respecting `robots.txt`, throttling the request rate), browser spoofing, and proxy rotation.
Best Practices for Web Scraping
When building web scrapers, it's important to keep in mind some best practices to avoid getting blocked and to be respectful of the websites you're scraping.
Respect robots.txt
Most websites have a `robots.txt` file that specifies which pages bots are allowed to crawl. You should always check this file and avoid scraping any disallowed pages.
The robotex gem makes this easy in Ruby:
```ruby
require 'robotex'

robotex = Robotex.new
p robotex.allowed?("https://scrapingbee.com/blog/")
# => true
```
Set a Reasonable Request Rate
Sending requests too rapidly can overload a website's servers and may get your IP address banned. It's best practice to include delays between your requests and avoid hitting a single domain too frequently.
You can use Ruby's `sleep` method to pause execution between requests:
```ruby
require 'httparty'

urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  puts HTTParty.get(url)
  sleep(5) # Sleep for 5 seconds between requests
end
```
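A fixed delay is easy for rate limiters to recognize, so a common refinement is to randomize the pause. A small variation on the loop above, using the same example URLs:

```ruby
require 'httparty'

urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  puts HTTParty.get(url)
  sleep(rand(3.0..7.0)) # pause a random 3-7 seconds between requests
end
```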
Use Caching
If you're scraping the same pages multiple times, it's a good idea to cache the results locally to avoid repeated requests to the website. The VCR gem is a great tool for this:
```ruby
require 'nokogiri'
require 'httparty'
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = "fixtures/vcr_cassettes"
  config.hook_into :webmock
end

VCR.use_cassette('example_cassette') do
  unparsed_page = HTTParty.get('https://example.com')
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  puts parsed_page.css('h1').text
end
```
The first time this code runs, VCR will make the actual HTTP request and record the response to a "cassette" file. On subsequent runs, VCR will intercept the request and replay the cached response instead of hitting the website again.
Handle Errors and Edge Cases
Web scraping is inherently brittle since you're depending on the structure of the website not changing. It's important to write defensive code that can handle errors and unexpected cases gracefully.
Some things to keep in mind:
- Check that the elements you want to extract actually exist before trying to access them.
- Handle network errors and timeouts when making requests.
- Validate that the data you've extracted is in the format you expect.
- Log errors so you can debug issues with your scrapers.
Here's an example of defensive scraping with error handling:
```ruby
require 'nokogiri'
require 'httparty'

def scrape_product(url)
  begin
    unparsed_page = HTTParty.get(url)
  rescue HTTParty::Error, SocketError, Timeout::Error => e
    puts "HTTP error fetching #{url}: #{e}"
    return nil
  end

  product = {}
  parsed_page = Nokogiri::HTML(unparsed_page.body)

  # Check that elements exist before calling methods on them
  product_name = parsed_page.at_css('h1.product-name')
  product[:name] = product_name ? product_name.text.strip : nil

  product_price = parsed_page.at_css('span.price')
  product[:price] = product_price ? product_price.text.strip : nil

  # similar defensive code for other attributes

  unless product[:name] && product[:price]
    puts "Incomplete product data for #{url}"
    return nil
  end

  product
end
```
Use Proxies and Rotate User Agents
To avoid getting blocked when scraping heavily, it's often necessary to use proxies to distribute your requests across multiple IP addresses. You should also rotate your User-Agent header to mimic different browsers.
The `mechanize` gem supports setting a proxy and user agent on the agent:
```ruby
require 'mechanize'

agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8080) # substitute your proxy's host and port
agent.user_agent_alias = 'Mac Safari'
```
For more advanced proxy configuration and automatic IP rotation, check out the `proxifier` gem.
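If you're fetching pages with HTTParty, you can also rotate user agents and proxies per request yourself. A minimal sketch; the proxy hosts and user-agent strings below are placeholders, so substitute values from your own proxy provider:

```ruby
require 'httparty'

# Pools to rotate through (placeholder values)
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15'
]
PROXIES = [
  { host: 'proxy1.example.com', port: 8080 },
  { host: 'proxy2.example.com', port: 8080 }
]

def fetch_with_rotation(url)
  proxy = PROXIES.sample
  HTTParty.get(
    url,
    headers: { 'User-Agent' => USER_AGENTS.sample },
    http_proxyaddr: proxy[:host],
    http_proxyport: proxy[:port]
  )
end

puts fetch_with_rotation('https://example.com')
```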
Conclusion
Web scraping is a powerful tool for extracting data from websites, and Ruby provides a great ecosystem for building scrapers. In this guide, we've covered the key libraries and techniques you need to know, including:
- Using Nokogiri to parse HTML and extract data with CSS selectors and XPath
- Scraping dynamic JavaScript-rendered pages with Watir and Selenium
- Navigating and submitting forms with Mechanize
- Scaling your scrapers with the Kimurai framework
- Best practices for respectful and reliable scraping
While we've focused on scraping with Ruby, many of the same principles apply to other languages as well. With the tools and knowledge you've gained from this guide, you should be well-equipped to build robust, efficient web scrapers in Ruby.
Happy scraping!