Getting Started with Web Scraping Using ScrapingBee and PHP

Web scraping is a powerful technique that allows you to extract data from websites automatically. Whether you need to collect product information, monitor prices, generate leads, or gather data for research, web scraping can save you a tremendous amount of time and effort.

However, web scraping isn't always straightforward. Websites are becoming increasingly sophisticated with anti-bot measures like CAPTCHAs, IP blocking, and dynamic content loading. Fortunately, tools like ScrapingBee make it easier than ever to scrape websites reliably at scale.

In this guide, we'll walk you through how to get started with web scraping using the ScrapingBee API and PHP. You'll learn how to make your first API call, extract the data you need, and handle some common challenges. Let's dive in!

What is ScrapingBee?

ScrapingBee is a web scraping API that handles all the complexities of scraping for you. It takes care of proxy rotation, CAPTCHAs, browser fingerprinting, and more, so you can focus on the data you want to extract.

Some key benefits of ScrapingBee include:

Easy integration with any programming language that supports HTTP requests
Automatic JavaScript rendering
Geotargeting (specify country and city)
Large proxy pool with residential IPs for reduced blocking
Customizable HTTP headers and user agent
Handles CAPTCHAs for you

ScrapingBee lets you start for free with 1,000 API calls. At the time of writing, paid plans start at $29 per month for 100,000 credits; plan limits and prices change, so check the pricing page for current figures.

Prerequisites

Before we get started with the tutorial, make sure you have the following:

A ScrapingBee account (sign up for free at https://www.scrapingbee.com/register)
PHP installed on your development machine
Basic knowledge of PHP and HTML

Your First API Request

ScrapingBee has a simple HTTP API. Every request in this guide is a GET request to a single endpoint, with your API key and the target URL passed as query parameters.

Here's the basic structure of a ScrapingBee API call in PHP using cURL:

// Get cURL resource
$curl = curl_init();

// Set API endpoint and query parameters 
curl_setopt_array($curl, [
    CURLOPT_URL => "https://app.scrapingbee.com/api/v1?" . http_build_query([
      'api_key' => 'YOUR_API_KEY',
      'url' => 'https://example.com',
    ]),
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_RETURNTRANSFER => true,
]);

// Send request
$response = curl_exec($curl);

// Check for errors
if ($response === false) {
    $error = curl_error($curl);
    curl_close($curl);
    die($error);
}

// Close cURL resource  
curl_close($curl);

// Work with response data
echo $response;

Let's break this down step by step:

First, we initialize a new cURL resource with curl_init().
Next, we use curl_setopt_array() to configure the cURL request. We set the URL to the ScrapingBee API endpoint and pass in our parameters as a query string using http_build_query(). Be sure to replace YOUR_API_KEY with your actual ScrapingBee API key. The url parameter is the web page we want to scrape.
We set the request method to GET and tell cURL to return the response as a string by setting CURLOPT_RETURNTRANSFER to true.
We send the API request with curl_exec($curl) and store the response in the $response variable.
We check for any errors and close the cURL resource with curl_close($curl).
Finally, we can work with the response data. Here we simply echo it out.
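
The steps above can be wrapped in a small reusable helper. The following is a minimal sketch, not an official client: the function names buildScrapingBeeUrl and fetchWithScrapingBee are hypothetical, the 60-second timeout is an assumed value, and treating every non-2xx HTTP status as a failure is a design choice of this example.

```php
<?php
// Hypothetical helpers for illustration; they are not part of any ScrapingBee SDK.

/**
 * Build the API request URL from the key, the target page, and any extra parameters.
 */
function buildScrapingBeeUrl(string $apiKey, string $targetUrl, array $extraParams = []): string
{
    $params = array_merge(['api_key' => $apiKey, 'url' => $targetUrl], $extraParams);

    // http_build_query() URL-encodes every value, including the target URL.
    return 'https://app.scrapingbee.com/api/v1?' . http_build_query($params);
}

/**
 * Fetch a page through the API and return its body, or throw on failure.
 */
function fetchWithScrapingBee(string $apiKey, string $targetUrl, array $extraParams = []): string
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => buildScrapingBeeUrl($apiKey, $targetUrl, $extraParams),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 60, // assumed upper bound; tune it for your pages
    ]);

    $body   = curl_exec($curl);
    $error  = curl_error($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);

    if ($body === false) {
        throw new RuntimeException("cURL error: $error");
    }
    if ($status < 200 || $status >= 300) {
        throw new RuntimeException("Unexpected HTTP status $status");
    }

    return $body;
}
```

With these in place, a call such as fetchWithScrapingBee('YOUR_API_KEY', 'https://example.com') returns the page body as a string, and a failed request raises an exception instead of passing unnoticed.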

With this basic template in place, let's try a real example. Say we want to scrape the latest blog posts from the ScrapingBee blog. Here's how we'd modify the code:

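// Load PHP Simple HTML DOM Parser (adjust the path to where you saved it)
require_once 'simple_html_dom.php';

$curl = curl_init();

curl_setopt_array($curl, [
    CURLOPT_URL => "https://app.scrapingbee.com/api/v1?" . http_build_query([
      'api_key' => 'YOUR_API_KEY',
      'url' => 'https://www.scrapingbee.com/blog/',
    ]),
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_RETURNTRANSFER => true,
]);

$response = curl_exec($curl);
curl_close($curl);

// Parse HTML (str_get_html returns false if the response is empty or too large)
$html = $response === false ? false : str_get_html($response);
if ($html === false) {
    die('Could not fetch or parse the page');
}

// Find all article titles and links
foreach ($html->find('h2.title a') as $article) {
    echo $article->plaintext . "\n";
    echo $article->href . "\n";
    echo "\n";
}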

Here's what we changed:

We updated the url parameter to the ScrapingBee blog URL.
After getting the response, we parse the HTML using the str_get_html function from the PHP Simple HTML DOM Parser library. This allows us to query the DOM using CSS selectors.
We find all the article titles and links by searching for h2 elements with the class title that contain an a tag.
We loop through the results and echo out the link text and URL.
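
If you would rather not add a third-party parser, the same extraction can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is an alternative to the Simple HTML DOM approach above, not a replacement the original code depends on. The function name extractArticles and the sample markup are made up for illustration, and the h2.title structure is assumed rather than taken from the live page.

```php
<?php
// Sketch: extract titles and links with PHP's bundled DOM extension
// instead of Simple HTML DOM. The "h2.title a" structure is an assumption.

/**
 * Return [['title' => ..., 'href' => ...], ...] for every link inside <h2 class="title">.
 */
function extractArticles(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // silence warnings on real-world markup
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    // XPath equivalent of the CSS selector "h2.title a".
    $query = "//h2[contains(concat(' ', normalize-space(@class), ' '), ' title ')]//a";

    $articles = [];
    foreach ($xpath->query($query) as $link) {
        $articles[] = [
            'title' => trim($link->textContent),
            'href'  => $link->getAttribute('href'),
        ];
    }

    return $articles;
}

// Sample markup standing in for an API response.
$sample = '<html><body>
  <h2 class="title"><a href="/blog/first-post/">First post</a></h2>
  <h2 class="post title"><a href="/blog/second-post/">Second post</a></h2>
  <h2 class="other"><a href="/ignored/">Ignored</a></h2>
</body></html>';

foreach (extractArticles($sample) as $article) {
    echo $article['title'] . "\n" . $article['href'] . "\n\n";
}
```

The XPath expression is longer than the CSS selector because XPath 1.0 has no class shorthand; the concat/normalize-space idiom matches title as a whole word within the class attribute.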

This is just a taste of what you can do with ScrapingBee and PHP. With a bit of creativity, the possibilities are nearly endless.

Handling Pagination

Many websites split up large amounts of content across multiple pages. To scrape all the data, you'll need to navigate through the pagination links.

While you could hard-code the URLs, a smarter approach is to program your scraper to automatically detect and follow the "next" links until it reaches the last page.

Here's an example of scraping a paginated product listing:

function scrapeProductsPage($url) {
    $curl = curl_init();

    curl_setopt_array($curl, [
        CURLOPT_URL => "https://app.scrapingbee.com/api/v1?" . http_build_query([
          'api_key' => 'YOUR_API_KEY',
          'url' => $url,
        ]),
        CURLOPT_CUSTOMREQUEST => 'GET',
        CURLOPT_RETURNTRANSFER => true,
    ]);

    $response = curl_exec($curl);
    curl_close($curl);

    // Stop if the request failed or the HTML could not be parsed
    $html = $response === false ? false : str_get_html($response);
    if ($html === false) {
        return;
    }

    // Scrape products here
    // ...

    // Check for next page link
    $nextLink = $html->find('a.next', 0);

    // Build the full URL (assumes the href is root-relative, e.g. /products?page=2)
    $nextUrl = $nextLink ? 'https://example.com' . $nextLink->href : null;

    // Free the parsed DOM before moving on to the next page
    $html->clear();

    if ($nextUrl) {
        scrapeProductsPage($nextUrl); // Recurse
    }
}

scrapeProductsPage('https://example.com/products');

The key points:

We define a scrapeProductsPage function that accepts a URL. This allows us to call it recursively for each page.
After scraping the products on the current page, we look for a "next page" link, specified by the CSS selector a.next.
If we find a next link, we construct the full URL and recursively call scrapeProductsPage with the next URL.
If there's no next link, the function returns and the script terminates.

Using this recursive approach, we can scrape an entire paginated listing with a single function call.
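
For very long listings, a loop with a page cap can be easier to control than recursion. The sketch below is one possible variant, not the method used above. It relies on PHP's bundled DOM extension instead of Simple HTML DOM, and it takes the page fetcher as a callable so the loop can be tried without network access. The names resolveUrl, findNextHref, and crawlPages, the stub pages, and the default cap of 50 pages are all hypothetical, and resolveUrl is deliberately simplified.

```php
<?php
// Sketch of an iterative alternative to recursion. All names are illustrative.

/**
 * Resolve an href against the page it came from.
 * Simplified: absolute, protocol-relative, root-relative, query-only,
 * and same-directory links only (no "../" handling).
 */
function resolveUrl(string $pageUrl, string $href): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    $path = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href; // protocol-relative
    }
    if (strpos($href, '/') === 0) {
        return $origin . $href; // root-relative
    }
    if (strpos($href, '?') === 0) {
        return $origin . $path . $href; // same path, new query string
    }
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $href;
}

/** Return the href of the first <a class="next"> link, or null if there is none. */
function findNextHref(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@class), ' '), ' next ')]");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $href = $nodes->item(0)->getAttribute('href');
    return $href !== '' ? $href : null;
}

/** Follow "next" links from $startUrl; stop at the cap or on a repeated URL. */
function crawlPages(string $startUrl, callable $fetchHtml, int $maxPages = 50): array
{
    $visited = [];
    $url = $startUrl;
    while ($url !== null && count($visited) < $maxPages && !in_array($url, $visited, true)) {
        $visited[] = $url;
        $html = $fetchHtml($url);
        // Scrape products from $html here.
        $href = findNextHref($html);
        $url  = $href === null ? null : resolveUrl($url, $href);
    }
    return $visited;
}

// Stub pages standing in for real API responses.
$pages = [
    'https://example.com/products'        => '<a class="next" href="/products?page=2">Next</a>',
    'https://example.com/products?page=2' => '<a class="btn next" href="?page=3">Next</a>',
    'https://example.com/products?page=3' => '<p>Last page</p>',
];
$visited = crawlPages('https://example.com/products', function ($url) use ($pages) {
    return $pages[$url];
});
print_r($visited);
```

In a real scraper, the callable would wrap the ScrapingBee request. The page cap and the check for repeated URLs guard against listings whose "next" link loops back on itself.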

Avoiding Detection and Blocks

When scraping, it's important to be mindful of the website's terms of service and robots.txt file. Some sites may disallow scraping entirely.

Even if scraping is permitted, you need to be careful not to overwhelm the server with too many requests. Make sure to insert delays between requests and avoid hitting the same page too frequently.

ScrapingBee makes it easy to fly under the radar by:

Rotating IP addresses with each request
Letting you wait for a page to finish rendering before its HTML is returned
Setting custom user agent strings and HTTP headers

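Here's how you can set a rendering wait and a custom user agent in a ScrapingBee request:

$curl = curl_init();

curl_setopt_array($curl, [
    CURLOPT_URL => "https://app.scrapingbee.com/api/v1?" . http_build_query([
      'api_key' => 'YOUR_API_KEY',
      'url' => 'https://example.com',
      'render_js' => 'true', // wait only applies when JavaScript rendering is on
      'wait' => '2000', // Wait 2 seconds after the page loads
      'forward_headers' => 'true', // Forward Spb- prefixed headers to the target site
    ]),
    CURLOPT_RETURNTRANSFER => true,
    // Setting a custom user agent for the target site
    CURLOPT_HTTPHEADER => ['Spb-User-Agent: MyCustomUserAgent/1.0'],
]);

$response = curl_exec($curl);
curl_close($curl);

Here we've set the wait parameter to 2000 ms (2 seconds), which tells ScrapingBee's headless browser to wait that long after loading the page before returning the HTML. It only has an effect when JavaScript rendering is enabled, and it does not space out your own requests; for that, pause between calls in your code (for example with sleep()). We've also set a custom user agent string. Setting CURLOPT_USERAGENT would only change the user agent of the call to ScrapingBee itself, so, per ScrapingBee's documentation, the header is sent with an Spb- prefix and forward_headers is enabled so that it reaches the target site.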

It's also a good idea to implement exponential backoff in your own code: if a request fails, wait a bit and try again, increasing the delay each time. ScrapingBee will automatically retry failed requests on its side, so client-side backoff is for the errors that still reach your code.
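
As a rough illustration of client-side exponential backoff, here is a minimal sketch. The function name retryWithBackoff, the default of four attempts, and the 500 ms base delay are assumptions made for this example, not ScrapingBee recommendations. The sleep function is injectable so the retry logic can be exercised without actually waiting.

```php
<?php
// Sketch of client-side exponential backoff. Names and defaults are illustrative.

/**
 * Call $operation until it succeeds or $maxAttempts is reached.
 * The delay doubles after each failure: base, 2 * base, 4 * base, ...
 * $sleep receives the delay in milliseconds and defaults to a real pause.
 */
function retryWithBackoff(callable $operation, int $maxAttempts = 4, int $baseDelayMs = 500, ?callable $sleep = null)
{
    $sleep = $sleep ?? function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: surface the last error
            }
            $sleep($baseDelayMs * (2 ** ($attempt - 1)));
        }
    }
}

// Demo: an operation that fails twice, then succeeds. Delays are recorded, not slept.
$calls  = 0;
$delays = [];
$result = retryWithBackoff(
    function () use (&$calls) {
        $calls++;
        if ($calls < 3) {
            throw new RuntimeException('temporary failure');
        }
        return 'page body';
    },
    4,
    500,
    function (int $ms) use (&$delays): void {
        $delays[] = $ms;
    }
);

echo "$result after $calls attempts\n";
```

In practice, the operation would be the cURL request to ScrapingBee, throwing an exception when curl_exec() returns false or the HTTP status signals an error. Adding a small random jitter to each delay is a common refinement when many workers retry at once.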

Putting it All Together

As you can see, PHP and ScrapingBee make it easy to scrape websites. To inspire you, here are a few real-world projects you could build:

Price monitoring tool: Scrape competitor prices and get notified when they change.
Lead generation: Scrape business directories to find contact information.
Search engine: Build your own niche search engine by scraping and indexing relevant sites.
Market research: Gather data on products, reviews, and customer sentiment.

I hope this guide has given you a good foundation for scraping with PHP and ScrapingBee. The best way to improve your scraping skills is to practice on real projects.

Remember, always scrape responsibly and respect the website owner's wishes. Happy scraping!
