Web scraping is a powerful technique for extracting data from websites. However, making HTTP requests to fetch web pages is often the bottleneck that limits scraping performance. By making requests concurrently, you can dramatically speed up web scraping in Ruby.
In this guide, we'll cover everything you need to know to make concurrent requests in Ruby for high-performance web scraping. Whether you're new to web scraping or an experienced Rubyist, you'll learn valuable tips and best practices to supercharge your scraping pipeline. Let's dive in!
The Power of Concurrency
Concurrency allows multiple tasks to be executed simultaneously. For web scraping, this means sending multiple HTTP requests at the same time instead of sequentially.
Here's an example to illustrate the power of concurrency. Let's say you need to scrape data from 100 pages and each request takes 1 second. Scraping the pages one by one would take 100 seconds. But if you make 10 concurrent requests, the total time drops to only 10 seconds – a 10x speed improvement!
Ruby makes it easy to achieve concurrency using threads. A Ruby script can create multiple threads, each responsible for making a request. The threads run concurrently, overlapping the time spent waiting on the network, which allows the script to scrape much faster.
Making HTTP Requests in Ruby
Before we dive into concurrent requests, let's review the basics of making HTTP requests in Ruby. The most straightforward approach is the Net::HTTP class from the standard library:
require 'net/http'
require 'uri'
uri = URI('https://example.com')
response = Net::HTTP.get_response(uri)
puts response.body
This simple script sends a GET request to https://example.com and prints the response body. The Net::HTTP library supports all the standard HTTP methods: GET, POST, PUT, DELETE, and so on.
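As a quick illustration of another verb, here is a sketch of building a form-encoded POST request. The endpoint path and parameters are hypothetical, and the request is only constructed here, not sent:

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint used only for illustration.
uri = URI('https://example.com/search')

# Build a form-encoded POST request; set_form_data fills in the
# body and the Content-Type header for us.
request = Net::HTTP::Post.new(uri)
request.set_form_data('q' => 'ruby', 'page' => '2')

puts request.body             # => "q=ruby&page=2"
puts request['Content-Type']  # => "application/x-www-form-urlencoded"

# Sending it would look like:
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
#   http.request(request)
# end
```

Building the request object separately from sending it, as above, is also what makes the SSL example below possible.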
For requests that require SSL/TLS encryption, some additional configuration is needed:
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
request = Net::HTTP::Get.new(uri)
response = http.request(request)
Setting use_ssl to true enables SSL/TLS, and verify_mode configures how strictly the server's certificate is verified.
Concurrent Requests with Threads
Now let's see how to use threads to make concurrent requests. We'll extend the previous SSL example to scrape multiple URLs in parallel:
require 'net/http'
require 'uri'
urls = [
  'https://example.com',
  'https://example.org',
  'https://example.net',
]
threads = urls.map do |url|
  Thread.new do
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
    request = Net::HTTP::Get.new(uri)
    response = http.request(request)
    puts "#{url} - #{response.code}"
  end
end
threads.each(&:join)
This script does the following:
- Defines an array of URLs to scrape
- Maps each URL to a new thread that makes the request
- Each thread prints the URL and response status code
- Waits for all threads to complete using join
The threads execute concurrently, allowing the requests to be made in parallel. This is much faster than making the requests sequentially in a loop.
It's important to be aware of thread safety when using concurrency. Multiple threads modifying the same data can lead to race conditions and unexpected behavior. In this example no shared data structures are modified, so we don't need any synchronization.
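If the threads did need to write to a shared structure, say collecting results into an array, a Mutex keeps the updates safe. Here is a minimal sketch; the fetch is simulated so the example runs without network access:

```ruby
urls = ['https://example.com', 'https://example.org', 'https://example.net']

results = []
mutex = Mutex.new

threads = urls.map do |url|
  Thread.new do
    # In a real scraper this would be the HTTP request from the example above.
    body = "simulated body for #{url}"
    # Synchronize access so concurrent appends cannot interleave unsafely.
    mutex.synchronize { results << [url, body.length] }
  end
end

threads.each(&:join)
puts results.length # => 3
```

The mutex.synchronize block is the only place the shared array is touched, which is the pattern to aim for: keep the critical section as small as possible.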
While threads enable concurrency, Ruby's implementation has limitations. Due to the Global VM Lock (GVL, often called the GIL), the standard Ruby interpreter executes only one thread at a time, even on multi-core CPUs. For I/O-bound tasks like web requests this isn't an issue, because the lock is released while a thread waits on the network, but it limits parallelism for CPU-intensive tasks. Keep this in mind for complex scraping pipelines that include parsing and processing data.
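You can see this for yourself with a small benchmark. The sketch below stands in for real requests with sleep, which, like network I/O, releases the lock while waiting, so the threaded version finishes in roughly the time of a single request:

```ruby
require 'benchmark'

# Stand-in for an HTTP request: sleeping releases the GVL,
# just as waiting on a network socket does.
def fake_fetch(url)
  sleep 0.05
  "body of #{url}"
end

urls = Array.new(10) { |i| "https://example.com/page/#{i}" }

sequential = Benchmark.realtime { urls.each { |u| fake_fetch(u) } }
concurrent = Benchmark.realtime do
  urls.map { |u| Thread.new { fake_fetch(u) } }.each(&:join)
end

puts format('sequential: %.2fs, concurrent: %.2fs', sequential, concurrent)
# Roughly 0.5s sequential vs 0.05s concurrent on a typical run.
```

If fake_fetch did heavy computation instead of sleeping, the two timings would be nearly identical, which is exactly the GVL limitation described above.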
Using a Concurrent HTTP Client Library
For more advanced scraping needs, it's often beneficial to use a concurrent HTTP client library instead of managing threads yourself. These libraries handle low-level concerns such as connection pooling, retries, and timeouts. Some popular options in Ruby are:
- Typhoeus – Runs HTTP requests in parallel using libcurl
- HTTP.rb – Fast HTTP client with a clean, chainable API
- Faraday – Extensible HTTP client library with support for parallel adapters
Here's how to make concurrent requests with Typhoeus:
require 'typhoeus'
urls = [
  'https://example.com',
  'https://example.org',
  'https://example.net',
]
hydra = Typhoeus::Hydra.new(max_concurrency: 10)
requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end
hydra.run
requests.each do |request|
  puts "#{request.base_url} - #{request.response.code}"
end
Typhoeus makes it easy to configure the maximum concurrency, follow redirects, set timeouts, retry failed requests, cache responses, and more. It's well suited for high-performance scraping pipelines.
Scraping Best Practices
When scraping websites, it's important to be respectful and follow best practices:
- Respect robots.txt – Check if the site allows scraping and honor any restrictions
- Set a reasonable request rate – Limit concurrent requests and add delays between them to avoid overloading servers
- Handle errors gracefully – Expect failures, log them, and implement retries with exponential backoff
- Cache responses – Avoid repeated requests for unchanged data by caching responses in memory or on disk
- Use a queuing system – For large scraping jobs, enqueue URLs and scrape them asynchronously with multiple workers
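One simple way to cap the request rate is a fixed-size worker pool fed from a thread-safe queue: only as many requests as there are workers can ever be in flight. The sketch below simulates the fetch so it runs standalone; the URLs, worker count, and delay are illustrative:

```ruby
urls = (1..20).map { |i| "https://example.com/page/#{i}" } # hypothetical URLs

queue = Queue.new
urls.each { |url| queue << url }

results = Queue.new
worker_count = 5 # caps how many requests are in flight at once

workers = worker_count.times.map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true) # non-blocking pop; raises ThreadError when drained
      rescue ThreadError
        break
      end
      sleep 0.01 # stand-in for the HTTP request; also acts as a polite delay
      results << url
    end
  end
end

workers.each(&:join)
puts results.size # => 20
```

Tuning worker_count trades throughput against load on the target site, the same knob Typhoeus exposes as max_concurrency.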
Following these practices will help keep your scraping ethical and efficient. It's also a good idea to monitor your scraping pipeline and set up alerts for issues like increased failure rates or slower response times.
Scaling Up with Distributed Scraping
For very large scraping tasks, the concurrency of a single machine may not be enough. In this case, you can scale out by distributing the scraping across multiple machines.
A common distributed scraping architecture consists of:
- A message queue for sending URLs to scrape and receiving results
- Multiple scraper worker instances that pull from the queue, scrape the URLs, and push results back to the queue
- A database for storing structured results data
- A controller / monitoring system to manage the workers and data flow
Some message queues commonly used for scraping are Apache Kafka, RabbitMQ, and AWS SQS. By scaling the number of worker instances, you can scrape extremely large datasets efficiently.
For easier development and deployment of distributed scraping pipelines, you can use platforms like Scrapy Cloud and Zyte. These handle the infrastructure and orchestration, allowing you to focus on the scraping logic.
Conclusion
We've covered a lot about making concurrent requests in Ruby for web scraping. To recap the key points:
- Concurrency allows scraping many pages in parallel, which is much faster than sequential requests
- Ruby's threads make it easy to implement concurrent requests
- Concurrent HTTP client libraries provide advanced functionality and performance
- Follow best practices to keep your scraping ethical and efficient
- Distributed scraping with message queues allows scaling to very large scraping tasks
As the web continues to grow in size and complexity, scraping technology will keep evolving to extract data efficiently. Techniques like headless browsers, proxies, and machine learning for avoiding detection will become increasingly important.
To dive deeper into web scraping with Ruby, I recommend the following resources:
- Bastards Book of Ruby – Web Scraping
- Scaling Ruby Web Scraping – Christoph Engelhardt
- The Ultimate Guide to Web Scraping with Ruby – Dan Nguyen
- Scrapinghub Blog – Ruby category
Happy scraping!