When scraping data from websites, you'll often need to make requests to many different pages to get all the information you need. Making these requests sequentially can be very slow, especially if you're scraping hundreds or thousands of pages. This is where concurrent requests come in.
Concurrent requests allow you to send out multiple HTTP requests at the same time instead of waiting for each one to complete before starting the next. By executing requests in parallel, you can dramatically speed up your web scraping pipeline and collect data much more efficiently.
In this guide, we'll explore several methods for making concurrent requests in PHP, along with best practices to help you scrape websites reliably at scale. Whether you're new to web scraping or an experienced developer, you'll learn valuable techniques for supercharging your PHP scraping projects.
Using cURL Multi Handles for Parallel Requests
One of the easiest ways to make concurrent requests in PHP is by using cURL multi handles. The cURL extension allows you to create multiple cURL handles, each representing a separate HTTP request, and execute them in parallel.
Here's a basic example of how to use cURL multi handles in PHP:
$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

$multiHandle = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($multiHandle, $handle);
    $handles[] = $handle;
}

// Run all transfers in parallel, waiting for network activity instead of busy-looping
$running = null;
do {
    curl_multi_exec($multiHandle, $running);
    if ($running) {
        curl_multi_select($multiHandle);
    }
} while ($running);

foreach ($handles as $handle) {
    $html = curl_multi_getcontent($handle);
    // Process the response
    curl_multi_remove_handle($multiHandle, $handle);
    curl_close($handle);
}

curl_multi_close($multiHandle);
In this code, we first create an array of URLs we want to scrape and initialize a cURL multi handle with curl_multi_init().
Next, we loop through the URLs and create a new cURL handle for each one using curl_init(). We set the CURLOPT_RETURNTRANSFER option to true so that the response is returned as a string instead of being output directly, then add each handle to the multi handle with curl_multi_add_handle().
To execute the requests concurrently, we call curl_multi_exec() in a loop until all the requests have completed, using curl_multi_select() to wait for network activity instead of spinning the CPU. This runs the transfers in parallel and stores each response in its individual cURL handle.
Finally, we loop through the handles again to retrieve each response with curl_multi_getcontent(), process the data as needed, and clean up the individual handles and the multi handle.
Using cURL multi handles is a simple and effective way to make concurrent requests in PHP. However, it does have some limitations. Every handle stays open for the duration of the batch, so memory usage and open connections grow with the number of simultaneous requests, and the multi interface provides no built-in way to cap concurrency or rate-limit requests. You have to add that yourself, for example by processing the URL list in batches, as in the sketch below.
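Here is a minimal sketch of one way to do that by processing URLs in fixed-size batches; the fetchBatch() helper, the batch size of 10, and the $allUrls variable are illustrative choices, not part of any library:

// Illustrative helper: fetch one batch of URLs with a cURL multi handle
// and return an array of URL => HTML (or null on failure).
function fetchBatch(array $urls): array
{
    $multiHandle = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multiHandle, $handle);
        $handles[$url] = $handle;
    }

    $running = null;
    do {
        curl_multi_exec($multiHandle, $running);
        if ($running) {
            curl_multi_select($multiHandle);
        }
    } while ($running);

    $results = [];
    foreach ($handles as $url => $handle) {
        $results[$url] = curl_errno($handle) === 0 ? curl_multi_getcontent($handle) : null;
        curl_multi_remove_handle($multiHandle, $handle);
        curl_close($handle);
    }
    curl_multi_close($multiHandle);

    return $results;
}

// Never run more than 10 requests at once, and pause briefly between batches
foreach (array_chunk($allUrls, 10) as $batch) {
    $results = fetchBatch($batch);
    // Process $results ...
    usleep(250000); // 250 ms of breathing room doubles as a crude rate limit
}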
Parallel Processing with pthreads or parallel
For more advanced concurrency needs, you can use a threading extension to run scraping tasks in separate threads. The older pthreads extension is no longer maintained, and its successor, parallel, is the usual choice today; both require a thread-safe (ZTS) build of PHP. These extensions let you create and manage threads in PHP, enabling you to execute multiple functions simultaneously.
Here's an example of how you might use the parallel extension to scrape websites concurrently:
use parallel\Runtime;

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

$futures = [];

foreach ($urls as $url) {
    // Each Runtime is a separate worker thread
    $runtime = new Runtime();

    // run() schedules the task on that thread and immediately returns a Future
    $futures[] = $runtime->run(function ($url) {
        $html = file_get_contents($url);
        // Process the response and return whatever the parent thread needs
        return strlen($html);
    }, [$url]);
}

// Block until every task has finished and collect the results
foreach ($futures as $future) {
    $result = $future->value();
}
In this code, we create a separate parallel\Runtime for each URL. Every Runtime owns its own worker thread, so the tasks genuinely run in parallel rather than queuing up on a single thread. The run() method schedules an anonymous function on that thread and immediately returns a parallel\Future.
Inside the task, we use file_get_contents() to fetch the HTML content of the URL and process it as needed. The task's arguments are passed through run()'s second parameter, which copies the values into the task's own thread. Finally, calling value() on each Future blocks until that task has finished and hands its return value back to the main thread.
Using parallel processing extensions gives you more control over concurrency and allows you to run complex scraping tasks in parallel. However, it does require a deeper understanding of threading and synchronization to avoid issues like race conditions and deadlocks.
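One way to sidestep shared state entirely is to have each worker hand its results back over a parallel\Channel. The following is a sketch of that pattern, assuming the parallel extension is available; the payload shape (URL plus body length) is purely illustrative:

use parallel\Runtime;
use parallel\Channel;

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

// A buffered channel lets workers hand results back without blocking
$channel = new Channel(count($urls));

$runtimes = [];
foreach ($urls as $url) {
    $runtime = new Runtime();
    $runtime->run(function ($url, $channel) {
        $html = file_get_contents($url);
        // Send the result over the channel instead of touching shared state
        $channel->send(['url' => $url, 'length' => strlen($html)]);
    }, [$url, $channel]);
    $runtimes[] = $runtime;
}

// Collect exactly one result per URL; recv() blocks until a worker sends
foreach ($urls as $ignored) {
    $result = $channel->recv();
    echo "{$result['url']}: {$result['length']} bytes\n";
}

$channel->close();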
Asynchronous I/O with Amp or ReactPHP
Another approach to concurrency in PHP is using asynchronous I/O libraries like Amp or ReactPHP. These libraries allow you to perform non-blocking I/O operations, such as making HTTP requests, without blocking the execution of other code.
Here's an example using Amp v2 and its Artax HTTP client (since superseded by amphp/http-client) to make concurrent requests:
use Amp\Artax\DefaultClient;
use Amp\Promise;
use function Amp\call;

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

$client = new DefaultClient();

$promises = [];
foreach ($urls as $url) {
    // call() runs the coroutine and immediately returns a promise for its result
    $promises[$url] = call(function () use ($client, $url) {
        $response = yield $client->request($url);

        // Yielding the body buffers the complete payload
        return yield $response->getBody();
    });
}

// Run the event loop until every request has completed
$responses = Promise\wait(Promise\all($promises));

foreach ($responses as $url => $html) {
    // Process the response
}
In this code, we create a DefaultClient, Artax's asynchronous HTTP client. For each URL we call Amp\call(), which runs a coroutine and immediately returns a promise representing its eventual result.
Inside the coroutine, yielding $client->request($url) suspends the coroutine until the response arrives, and yielding $response->getBody() buffers the full response body. Because none of these operations block, all of the requests are in flight at the same time.
We collect the promises in an array keyed by URL and use Promise\all() together with Promise\wait() to run the event loop until every request has finished. The resulting $responses array maps each URL to its HTML body, which we can then process as needed.
Using asynchronous I/O libraries allows you to perform concurrent operations without the overhead of managing threads or processes directly. However, it does require a different programming model and can be more complex to reason about than traditional sequential code.
Best Practices for PHP Concurrency
When making concurrent requests in PHP, there are several best practices to keep in mind to ensure your scraping pipeline is reliable and efficient:
- Manage shared resources carefully: If your concurrent tasks access shared resources, such as files or databases, use appropriate synchronization mechanisms (or message passing, as in the channel example above) to avoid race conditions and data corruption.
- Handle errors and timeouts gracefully: Concurrent requests can fail for many reasons, such as network issues or server errors. Implement proper error handling and per-request timeouts so that hanging requests don't stall or crash your scraper; see the sketch after this list.
- Throttle requests to avoid overloading servers: Sending too many requests too quickly can overwhelm servers and get your IP banned. Implement rate limiting and throttling, for example by processing URLs in batches as shown earlier, so that you're scraping responsibly and within the target website's acceptable limits.
- Use queues for scalability: If you need to scrape a large number of pages, consider using a queue system to manage the tasks. This decouples task generation from the scraping itself and lets you scale your scraper horizontally by adding workers.
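To make the second point concrete, here is a sketch of how the earlier cURL multi example could be extended with timeouts and per-request error checks; the specific limits of 5 and 15 seconds are arbitrary placeholders:

foreach ($urls as $url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 5); // Give up quickly on unreachable hosts
    curl_setopt($handle, CURLOPT_TIMEOUT, 15);       // Hard cap on the whole transfer
    curl_multi_add_handle($multiHandle, $handle);
    $handles[$url] = $handle;
}

// ... run curl_multi_exec() / curl_multi_select() as shown earlier ...

foreach ($handles as $url => $handle) {
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if (curl_errno($handle) !== 0) {
        // A timeout or network error: log it and keep going
        error_log("Request to $url failed: " . curl_error($handle));
    } elseif ($status >= 400) {
        error_log("Request to $url returned HTTP $status");
    } else {
        $html = curl_multi_getcontent($handle);
        // Process the response
    }

    curl_multi_remove_handle($multiHandle, $handle);
    curl_close($handle);
}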
Concurrency with ScrapingBee
If you're looking for a hassle-free way to make concurrent requests for web scraping, consider using the ScrapingBee API. ScrapingBee is a web scraping API that handles the complexity of concurrent requests, proxy rotation, and CAPTCHAs for you.
The ScrapingBee API is designed for multiple concurrent scraping operations, enabling you to scrape hundreds, thousands, or even millions of pages per day, depending on your plan. The higher your plan's concurrency limit, the more calls you can have active in parallel, and the faster you can scrape.
Here's an example of how to make concurrent requests to ScrapingBee in PHP using Guzzle's request pool:
$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

$apiKey = 'YOUR_API_KEY';
$concurrency = 3;

$client = new GuzzleHttp\Client([
    'base_uri' => 'https://app.scrapingbee.com/api/v1/',
    'timeout'  => 30.0,
]);

// Generator that lazily yields one request factory per URL
$requests = function ($urls) use ($client, $apiKey) {
    foreach ($urls as $url) {
        yield function () use ($client, $url, $apiKey) {
            return $client->getAsync('', [
                'query' => [
                    'api_key'   => $apiKey,
                    'url'       => $url,
                    'render_js' => 'true',
                ],
            ]);
        };
    }
};

$pool = new GuzzleHttp\Pool($client, $requests($urls), [
    'concurrency' => $concurrency,
    'fulfilled' => function ($response, $index) {
        $html = $response->getBody()->getContents();
        // Process the response
    },
    'rejected' => function ($reason, $index) {
        // Handle the error (e.g. log $reason and retry later)
    },
]);

$promise = $pool->promise();
$promise->wait();
In this code, we first create an array of URLs we want to scrape and set our ScrapingBee API key and desired concurrency level.
We then create a new Guzzle HTTP client with the base URI set to the ScrapingBee API endpoint and a timeout of 30 seconds.
Next, we define a generator function that yields a new anonymous function for each URL. The anonymous function returns a promise that resolves to the response of a GET request to the ScrapingBee API, passing the API key, URL, and any additional options (such as JavaScript rendering) as query parameters.
We create a new GuzzleHttp\Pool object, passing the client, the requests generator, and an array of options. The concurrency option sets the maximum number of requests that can be executed concurrently. The fulfilled and rejected options define callbacks that handle the response or the error for each request.
Finally, we obtain the promise representing the completion of all requests from the pool with promise() and block until it resolves with wait().
Using ScrapingBee for concurrent requests simplifies the scraping process and allows you to focus on extracting and processing the data you need without worrying about the underlying infrastructure.
Conclusion
Concurrent requests are a powerful technique for speeding up web scraping in PHP. By executing multiple requests in parallel, you can collect data from websites much more efficiently than making requests sequentially.
In this guide, we explored several methods for making concurrent requests in PHP, including using cURL multi handles, parallel processing extensions, and asynchronous I/O libraries. We also discussed best practices for managing concurrency, such as handling errors, throttling requests, and using queues for scalability.
Additionally, we looked at how the ScrapingBee API simplifies concurrent scraping by handling the infrastructure and allowing you to make concurrent requests with ease.
By mastering concurrent requests in PHP, you can take your web scraping projects to the next level and gather data at scale. Whether you choose to implement concurrency yourself or use a service like ScrapingBee, the techniques and best practices covered in this guide will help you scrape websites efficiently and reliably.
Happy scraping!