Web scraping is a powerful technique for extracting data from websites. However, making HTTP requests to fetch web pages is often the bottleneck that limits scraping performance. By making requests concurrently, you can dramatically speed up web scraping in Ruby.
In this guide, we'll cover everything you need to know to make concurrent requests in Ruby for high-performance web scraping. Whether you're new to web scraping or an experienced Rubyist, you'll learn valuable tips and best practices to supercharge your scraping pipeline. Let's dive in!
The Power of Concurrency
Concurrency allows multiple tasks to be executed simultaneously. For web scraping, this means sending multiple HTTP requests at the same time instead of sequentially.
Here's an example to illustrate the power of concurrency. Let's say you need to scrape data from 100 pages and each request takes 1 second. Scraping the pages one by one would take 100 seconds. But if you make 10 concurrent requests, the total time drops to only 10 seconds – a 10x speed improvement!
Ruby makes it easy to achieve concurrency using threads. A Ruby script can create multiple threads, each responsible for making a request. The threads run concurrently, overlapping the time spent waiting on the network, which allows the script to scrape much faster.
Making HTTP Requests in Ruby
Before we dive into concurrent requests, let's review the basics of making HTTP requests in Ruby. The most straightforward approach is the Net::HTTP class from the standard library:
require 'net/http'
require 'uri'
uri = URI('https://example.com')
response = Net::HTTP.get_response(uri)
puts response.body
This simple script sends a GET request to https://example.com and prints the response body. The Net::HTTP library supports all the standard HTTP methods: GET, POST, PUT, DELETE, and so on.
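As a quick illustration of another verb, here is a sketch of building a form-encoded POST request. The endpoint path and parameters are hypothetical, and the request is only constructed here, not sent:

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint used only for illustration.
uri = URI('https://example.com/search')

# Build a form-encoded POST request; set_form_data fills in the
# body and the Content-Type header for us.
request = Net::HTTP::Post.new(uri)
request.set_form_data('q' => 'ruby', 'page' => '2')

puts request.body             # => "q=ruby&page=2"
puts request['Content-Type']  # => "application/x-www-form-urlencoded"

# Sending it would look like:
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
#   http.request(request)
# end
```

Building the request object separately from sending it, as above, is also what makes the SSL example below possible.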
For requests that require SSL/TLS encryption, some additional configuration is needed:
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
request = Net::HTTP::Get.new(uri)
response = http.request(request)
Setting use_ssl to true enables SSL/TLS, and verify_mode configures how strictly the server's certificate is verified.
Concurrent Requests with Threads
Now let's see how to use threads to make concurrent requests. We'll extend the previous SSL example to scrape multiple URLs in parallel:
require 'net/http'
require 'uri'
urls = [
  'https://example.com',
  'https://example.org',
  'https://example.net',
]
threads = urls.map do |url|
  Thread.new do
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
    request = Net::HTTP::Get.new(uri)
    response = http.request(request)
    puts "#{url} - #{response.code}"
  end
end
threads.each(&:join)
This script does the following:
- Defines an array of URLs to scrape
- Maps each URL to a new thread that makes the request
- Each thread prints the URL and response status code
- Waits for all threads to complete using join
The threads execute concurrently, allowing the requests to be made in parallel. This is much faster than making the requests sequentially in a loop.
It's important to be aware of thread safety when using concurrency. Multiple threads modifying the same data can lead to race conditions and unexpected behavior. In this example no shared data structures are modified, so we don't need any synchronization.
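If the threads did need to write to a shared structure, say collecting results into an array, a Mutex keeps the updates safe. Here is a minimal sketch; the fetch is simulated so the example runs without network access:

```ruby
urls = ['https://example.com', 'https://example.org', 'https://example.net']

results = []
mutex = Mutex.new

threads = urls.map do |url|
  Thread.new do
    # In a real scraper this would be the HTTP request from the example above.
    body = "simulated body for #{url}"
    # Synchronize access so concurrent appends cannot interleave unsafely.
    mutex.synchronize { results << [url, body.length] }
  end
end

threads.each(&:join)
puts results.length # => 3
```

The mutex.synchronize block is the only place the shared array is touched, which is the pattern to aim for: keep the critical section as small as possible.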
While threads enable concurrency, Ruby's implementation has limitations. Due to the Global VM Lock (GVL, often called the GIL), the standard Ruby interpreter executes only one thread at a time, even on multi-core CPUs. For I/O-bound tasks like web requests this isn't an issue, because the lock is released while a thread waits on the network, but it limits parallelism for CPU-intensive tasks. Keep this in mind for complex scraping pipelines that include parsing and processing data.
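You can see this for yourself with a small benchmark. The sketch below stands in for real requests with sleep, which, like network I/O, releases the lock while waiting, so the threaded version finishes in roughly the time of a single request:

```ruby
require 'benchmark'

# Stand-in for an HTTP request: sleeping releases the GVL,
# just as waiting on a network socket does.
def fake_fetch(url)
  sleep 0.05
  "body of #{url}"
end

urls = Array.new(10) { |i| "https://example.com/page/#{i}" }

sequential = Benchmark.realtime { urls.each { |u| fake_fetch(u) } }
concurrent = Benchmark.realtime do
  urls.map { |u| Thread.new { fake_fetch(u) } }.each(&:join)
end

puts format('sequential: %.2fs, concurrent: %.2fs', sequential, concurrent)
# Roughly 0.5s sequential vs 0.05s concurrent on a typical run.
```

If fake_fetch did heavy computation instead of sleeping, the two timings would be nearly identical, which is exactly the GVL limitation described above.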
Using a Concurrent HTTP Client Library
For more advanced scraping needs, it's often beneficial to use a concurrent HTTP client library instead of managing threads yourself. These libraries handle low-level concerns such as connection pooling, retries, and timeouts. Some popular options in Ruby are:
- Typhoeus – Runs HTTP requests in parallel using libcurl
- HTTP.rb – Fast HTTP client with a clean, chainable API
- Faraday – Extensible HTTP client library with support for parallel adapters
Here's how to make concurrent requests with Typhoeus:
require 'typhoeus'
urls = [
  'https://example.com',
  'https://example.org',
  'https://example.net',
]
hydra = Typhoeus::Hydra.new(max_concurrency: 10)
requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end
hydra.run
requests.each do |request|
  puts "#{request.base_url} - #{request.response.code}"
end
Typhoeus makes it easy to configure the maximum concurrency, follow redirects, set timeouts, retry failed requests, cache responses, and more. It's well suited for high-performance scraping pipelines.
Scraping Best Practices
When scraping websites, it's important to be respectful and follow best practices:
- Respect robots.txt – Check if the site allows scraping and honor any restrictions
- Set a reasonable request rate – Limit concurrent requests and add delays between them to avoid overloading servers
- Handle errors gracefully – Expect failures, log them, and implement retries with exponential backoff
- Cache responses – Avoid repeated requests for unchanged data by caching responses in memory or on disk
- Use a queuing system – For large scraping jobs, enqueue URLs and scrape them asynchronously with multiple workers
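One simple way to cap the request rate is a fixed-size worker pool fed from a thread-safe queue: only as many requests as there are workers can ever be in flight. The sketch below simulates the fetch so it runs standalone; the URLs, worker count, and delay are illustrative:

```ruby
urls = (1..20).map { |i| "https://example.com/page/#{i}" } # hypothetical URLs

queue = Queue.new
urls.each { |url| queue << url }

results = Queue.new
worker_count = 5 # caps how many requests are in flight at once

workers = worker_count.times.map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true) # non-blocking pop; raises ThreadError when drained
      rescue ThreadError
        break
      end
      sleep 0.01 # stand-in for the HTTP request; also acts as a polite delay
      results << url
    end
  end
end

workers.each(&:join)
puts results.size # => 20
```

Tuning worker_count trades throughput against load on the target site, the same knob Typhoeus exposes as max_concurrency.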
Following these practices will help keep your scraping ethical and efficient. It's also a good idea to monitor your scraping pipeline and set up alerts for issues like increased failure rates or slower response times.
Scaling Up with Distributed Scraping
For very large scraping tasks, the concurrency of a single machine may not be enough. In this case, you can scale out by distributing the scraping across multiple machines.
A common distributed scraping architecture consists of:
- A message queue for sending URLs to scrape and receiving results
- Multiple scraper worker instances that pull from the queue, scrape the URLs, and push results back to the queue
- A database for storing structured results data
- A controller / monitoring system to manage the workers and data flow
Some message queues commonly used for scraping are Apache Kafka, RabbitMQ, and AWS SQS. By scaling the number of worker instances, you can scrape extremely large datasets efficiently.
For easier development and deployment of distributed scraping pipelines, you can use platforms like Scrapy Cloud and Zyte. These handle the infrastructure and orchestration, allowing you to focus on the scraping logic.
Conclusion
We've covered a lot about making concurrent requests in Ruby for web scraping. To recap the key points:
- Concurrency allows scraping many pages in parallel, which is much faster than sequential requests
- Ruby's threads make it easy to implement concurrent requests
- Concurrent HTTP client libraries provide advanced functionality and performance
- Follow best practices to keep your scraping ethical and efficient
- Distributed scraping with message queues allows scaling to very large scraping tasks
As the web continues to grow in size and complexity, scraping technology will keep evolving to extract data efficiently. Techniques like headless browsers, proxies, and machine learning for avoiding detection will become increasingly important.
To dive deeper into web scraping with Ruby, I recommend the following resources:
- Bastards Book of Ruby – Web Scraping
- Scaling Ruby Web Scraping – Christoph Engelhardt
- The Ultimate Guide to Web Scraping with Ruby – Dan Nguyen
- Scrapinghub Blog – Ruby category
Happy scraping!