
The Ultimate Guide to Web Scraping with Ruby in 2024

Web scraping is the process of programmatically extracting data from websites. It's an incredibly useful technique that allows you to gather information at scale from across the internet. Some common use cases include:

  • Price monitoring for e-commerce
  • Aggregating job listings, real estate, or other classifieds
  • Building machine learning datasets
  • Archiving websites for historical research
  • Automating business processes that rely on web data

The Ruby programming language provides a fantastic ecosystem for building web scrapers. In this guide, we'll take an in-depth look at the tools and techniques for web scraping with Ruby in 2024. Let's dive in!

Overview of Ruby Web Scraping Libraries

Here are the key libraries you should know about for web scraping in Ruby:

  • Nokogiri – Nokogiri is a gem that allows you to parse HTML and XML documents. It provides a convenient way to extract data using CSS selectors and XPath expressions.

  • HTTParty – HTTParty makes it easy to make HTTP requests from Ruby. You can use it to fetch the HTML of web pages you want to scrape.

  • Mechanize – Mechanize is a gem that automates interaction with websites. It allows you to programmatically navigate pages, fill out and submit forms, and follow links.

  • Watir – Watir is a gem that allows you to automate and test web browsers. It's useful for scraping JavaScript-heavy websites that require interaction to load content.

  • Selenium – Selenium is a suite of tools for automating web browsers. The Ruby bindings allow you to write scrapers that work with real browsers like Chrome.

  • Kimurai – Kimurai is a framework for building robust, large-scale web scrapers in Ruby. It handles running scrapers in parallel, retrying failed requests, and more.

Now let's see how to use these libraries to scrape data from the web.
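
All of these libraries ship as gems, so they can be pulled in with Bundler. A minimal Gemfile for following along might look like this (add only the gems you actually need):

# Gemfile – install with `bundle install`
source 'https://rubygems.org'

gem 'nokogiri'
gem 'httparty'
gem 'mechanize'
gem 'watir'
gem 'selenium-webdriver'
gem 'kimurai'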

Scraping a Static Website with Nokogiri

Many websites render most of their content in plain HTML on the server. You can scrape these types of static sites using Nokogiri and HTTParty. Here's an example that extracts data from a simple e-commerce product page:

require 'nokogiri'
require 'httparty'

def scrape_product(url)
  # Fetch HTML of the page using HTTParty
  unparsed_page = HTTParty.get(url)

  # Parse the HTML body with Nokogiri
  parsed_page = Nokogiri::HTML(unparsed_page.body)

  # Extract the data using CSS selectors
  product = {
    name: parsed_page.css('h1.product-name').text,
    price: parsed_page.css('span.price').text,
    description: parsed_page.css('div.product-description').text
  }

  # Return the extracted data
  product
end

puts scrape_product('https://example.com/products/1')

This code does the following:

  1. We define a scrape_product method that takes a URL as an argument.

  2. Inside the method, we use HTTParty's get method to fetch the HTML content of the URL.

  3. We parse the response body using Nokogiri's HTML method. This returns a Nokogiri::HTML::Document object that we can query to extract data.

  4. To extract the relevant bits of data from the page, we use Nokogiri's css method and pass in CSS selectors. Here we're assuming the product name is in an h1 tag with a class of product-name, the price is in a span with a class of price, and so on.

  5. We store the extracted data in a Ruby hash and return it at the end of the method.

When we call scrape_product('https://example.com/products/1'), it will fetch the HTML of that URL, parse it, extract the product data, and print it out.

Of course, real-world websites are usually more complex than this simple example. You may need to traverse and query the HTML further to get the data you want. Nokogiri also supports more complex CSS selectors and XPath expressions to handle almost any HTML structure.
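
For example, here is a small sketch of both approaches run against the parsed_page from the example above (the ul.specs markup is hypothetical):

# CSS: collect the text of every item in a (hypothetical) specs list
specs = parsed_page.css('ul.specs > li').map { |li| li.text.strip }

# XPath: select the price span by matching on its class attribute
price = parsed_page.at_xpath('//span[@class="price"]')&.text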

Scraping Dynamic Websites with Watir

Some websites make heavy use of JavaScript to load data asynchronously or render content on the client-side. This can make scraping with Nokogiri alone difficult or impossible.

In these cases, you can automate a real web browser like Chrome to fully load the dynamic content before extracting it. Here's an example of how to do that using Watir:

require 'watir'

def scrape_dynamic_page(url)
  # Open a new browser window
  browser = Watir::Browser.new

  # Navigate to the URL
  browser.goto url

  # Wait for the required element to load
  browser.div(id: 'data-div').wait_until(&:present?)

  # Extract the data from the page
  data = browser.div(id: 'data-div').text

  # Close the browser
  browser.close

  # Return the extracted data
  data
end

puts scrape_dynamic_page('https://example.com/dynamic-data')

In this example:

  1. We open a new instance of a web browser using Watir.

  2. We instruct the browser to navigate to the URL of the page we want to scrape.

  3. Since the data we want might not be immediately available, we tell Watir to wait until a specific element (in this case a div with an ID of data-div) is present on the page. This ensures the data has loaded before we try to extract it.

  4. We extract the text content of the data-div element.

  5. We close the browser to free up system resources.

  6. Finally, we return the extracted data.

Using a real browser like this allows the JavaScript on the page to fully execute before we attempt to scrape anything. The downside is that it's slower and more resource-intensive than using Nokogiri on server-rendered HTML.
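
One way to cut down on the overhead is to run the browser headlessly, so no visible window is opened. A minimal sketch with Watir and Chrome (assumes chromedriver is installed; newer Watir/Selenium versions may expect the headless flag to be passed as a Chrome argument instead):

require 'watir'

# Launch Chrome without a visible window
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://example.com/dynamic-data'
puts browser.title
browser.close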

Navigating and Submitting Forms with Mechanize

Sometimes the data you need is behind a login form or spread across multiple pages that require interaction to access. The Mechanize gem makes these types of scenarios easy to handle.

Here's an example of using Mechanize to submit a search form and scrape the results:

require 'mechanize'

def scrape_search_results(search_term)
  # Initialize a new Mechanize agent
  agent = Mechanize.new

  # Navigate to the search page
  page = agent.get('https://example.com/search')

  # Fill out and submit the search form
  search_form = page.form_with(id: 'search-form')
  search_form.field_with(name: 'query').value = search_term
  results_page = agent.submit(search_form)

  # Extract the search result data
  results = results_page.search('div.result').map do |result|
    {
      title: result.search('h3').text,
      url: result.at_css('a')['href']
    }
  end

  # Return the results
  results
end

puts scrape_search_results('example search')

This code does the following:

  1. We create a new instance of a Mechanize agent. This will allow us to interact with web pages programmatically.

  2. We tell the agent to navigate to the search page URL.

  3. We find the search form on the page by its ID and fill out the query field with our search term.

  4. We submit the form, which returns a new page object containing the search results.

  5. We find all the search result elements on the results page (here we assume they are in div elements with a class of result).

  6. We extract the title and URL of each result and store them in a hash.

  7. We return the array of result hashes.

Mechanize also has built-in support for handling cookies, redirects, and other complications that arise when scraping websites that require multiple requests and stateful interaction.
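
For instance, here's a sketch of logging in before scraping a protected page (the URL, form id, and field names are hypothetical). Mechanize keeps the session cookie in its cookie jar, so subsequent requests stay authenticated:

require 'mechanize'

agent = Mechanize.new

# Fetch the login page and fill in the credentials
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(id: 'login-form')
login_form.field_with(name: 'username').value = 'my_user'
login_form.field_with(name: 'password').value = 'my_password'
agent.submit(login_form)

# The session cookie lives in agent.cookie_jar, so this request is authenticated
dashboard = agent.get('https://example.com/dashboard')
puts dashboard.title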

Scaling Up with Kimurai

For large-scale web scraping projects, you'll want to use a framework that can handle running multiple scrapers concurrently, retrying failed requests, and storing results reliably.

The Kimurai framework provides a robust solution for these needs. Here's an example of a Kimurai scraper that extracts data from a list of URLs:

require 'kimurai'

class ExampleScraper < Kimurai::Base
  @name = 'example_scraper'
  @engine = :mechanize
  @start_urls = ['https://example.com/page1', 'https://example.com/page2']

  def parse(response, url:, data: {})
    # Extract data from the page into a fresh hash
    item = {}
    item[:title] = response.css('h1').text
    item[:content] = response.css('div.content').text

    # Store the extracted data
    save_to "output.json", item, format: :json
  end
end

ExampleScraper.crawl!

Here's what this code does:

  1. We define a new ExampleScraper class that inherits from Kimurai::Base.

  2. We specify the name of our scraper, the browser engine to use (Mechanize in this case), and the list of start URLs.

  3. We define a parse method that will be called on each scraped page. It takes the HTTP response, the URL of the page, and any associated data.

  4. Inside parse, we use Nokogiri-style methods to extract the title and content from the page and store them in a hash (called item here, to avoid shadowing the data keyword argument).

  5. We use Kimurai's save_to method to write the scraped data to a JSON file. Kimurai will automatically handle appending to the file for each scraped page.

  6. Finally, we tell Kimurai to start the crawl with ExampleScraper.crawl!.

Kimurai handles the details of scheduling the requests, handling errors, and managing concurrency. It also provides configuration options for politeness (respecting robots.txt, throttling request rate), browser spoofing, and proxy rotation.
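
Those options are set through a class-level @config hash. Here's a sketch of what that might look like (the values are illustrative and the key names are taken from Kimurai's configuration options; check the docs for your version):

class PoliteScraper < Kimurai::Base
  @name = 'polite_scraper'
  @engine = :mechanize
  @start_urls = ['https://example.com']
  @config = {
    # Identify as a desktop browser instead of the library's default User-Agent
    user_agent: 'Mozilla/5.0 (X11; Linux x86_64)',
    # Pause a random 2-5 seconds before each request to throttle the crawl
    before_request: { delay: 2..5 },
    # Don't re-fetch URLs that have already been processed
    skip_duplicate_requests: true
  }

  def parse(response, url:, data: {})
    # ...
  end
end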

Best Practices for Web Scraping

When building web scrapers, it's important to keep in mind some best practices to avoid getting blocked and to be respectful of the websites you're scraping.

Respect robots.txt

Most websites have a robots.txt file that specifies which pages are allowed to be scraped by bots. You should always check this file and avoid scraping any disallowed pages.

The robotex gem makes this easy in Ruby:

require 'robotex'

robotex = Robotex.new
p robotex.allowed?("https://scrapingbee.com/blog/")
# => true

Set a Reasonable Request Rate

Sending requests too rapidly can overload a website's servers and may get your IP address banned. It's best practice to include delays between your requests and avoid hitting a single domain too frequently.

You can use Ruby's sleep method to pause execution between requests:

require 'httparty'

urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  puts HTTParty.get(url)
  sleep(5)  # Pause for 5 seconds between requests
end

Use Caching

If you're scraping the same pages multiple times, it's a good idea to cache the results locally to avoid repeated requests to the website. The VCR gem is a great tool for this:

require 'nokogiri'
require 'httparty'
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = "fixtures/vcr_cassettes"
  config.hook_into :webmock
end

VCR.use_cassette('example_cassette') do
  unparsed_page = HTTParty.get('https://example.com')
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  puts parsed_page.css('h1').text
end

The first time this code runs, VCR will make the actual HTTP request and record the response to a "cassette" file. On subsequent runs, VCR will intercept the request and replay the cached response instead of hitting the website again.

Handle Errors and Edge Cases

Web scraping is inherently brittle since you're depending on the structure of the website not changing. It's important to write defensive code that can handle errors and unexpected cases gracefully.

Some things to keep in mind:

  • Check that the elements you want to extract actually exist before trying to access them.
  • Handle network errors and timeouts when making requests.
  • Validate that the data you've extracted is in the format you expect.
  • Log errors so you can debug issues with your scrapers.

Here's an example of defensive scraping with error handling:

require 'nokogiri'
require 'httparty'

def scrape_product(url)
  begin
    unparsed_page = HTTParty.get(url)
  rescue HTTParty::Error, SocketError, Timeout::Error => e
    puts "HTTP error fetching #{url}: #{e}"
    return nil
  end

  product = {}
  parsed_page = Nokogiri::HTML(unparsed_page.body)

  # Check that each element exists before reading its text
  product_name = parsed_page.at_css('h1.product-name')
  product[:name] = product_name ? product_name.text.strip : nil

  product_price = parsed_page.at_css('span.price')
  product[:price] = product_price ? product_price.text.strip : nil

  # similar defensive code for other attributes

  unless product[:name] && product[:price]
    puts "Incomplete product data for #{url}"
    return nil
  end

  product
end

Use Proxies and Rotate User Agents

To avoid getting blocked when scraping heavily, it's often necessary to use proxies to distribute your requests across multiple IP addresses. You should also rotate your User-Agent header to mimic different browsers.

The Mechanize gem supports setting a proxy and a User-Agent on the agent:

require 'mechanize'

agent = Mechanize.new
# Replace with your proxy's host and (numeric) port
agent.set_proxy('proxy.example.com', 8080)
agent.user_agent_alias = 'Mac Safari'

For routing Ruby's network connections through HTTP or SOCKS proxies at a lower level, check out the proxifier gem.
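
If you're rotating User-Agent strings outside of Mechanize, the same idea works with plain HTTParty by sampling from a small pool on each request. A rough sketch (the header strings below are just examples):

require 'httparty'

USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
  'Mozilla/5.0 (X11; Linux x86_64)'
]

def fetch_with_random_agent(url)
  # Pick a different User-Agent for each request
  HTTParty.get(url, headers: { 'User-Agent' => USER_AGENTS.sample })
end

puts fetch_with_random_agent('https://example.com').code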

Conclusion

Web scraping is a powerful tool for extracting data from websites, and Ruby provides a great ecosystem for building scrapers. In this guide, we've covered the key libraries and techniques you need to know, including:

  • Using Nokogiri to parse HTML and extract data with CSS selectors and XPath
  • Scraping dynamic JavaScript-rendered pages with Watir and Selenium
  • Navigating and submitting forms with Mechanize
  • Scaling your scrapers with the Kimurai framework
  • Best practices for respectful and reliable scraping

While we've focused on scraping with Ruby, many of the same principles apply to other languages as well. With the tools and knowledge you've gained from this guide, you should be well-equipped to build robust, efficient web scrapers in Ruby.

Happy scraping!
