
Getting Started with ScrapingBee's Python SDK

Web scraping is an essential skill for any developer or data professional to have in their toolbox. Whether you need to extract data from websites for analysis, monitor prices or inventory on e-commerce sites, or aggregate content from multiple sources online, knowing how to automate the collection of public web data is extremely valuable.

However, web scraping isn't always easy, especially at scale. Many websites employ anti-bot measures that make scraping difficult. You may get blocked or served misleading data. And writing code yourself to handle JavaScript rendering, solve CAPTCHAs, manage proxies, and so on can be time-consuming.

This is where ScrapingBee comes in. ScrapingBee is a web scraping API that handles all the complexity of scraping for you. It allows you to scrape any public web page as if you were using a real browser, but via a simple API call.

ScrapingBee provides a Python SDK that makes it even easier to integrate their API into your Python applications. In this guide, we'll walk through getting started with the SDK to scrape some real-world websites.

Installing the SDK

Before you can start using ScrapingBee in your Python code, you'll need to install the SDK. You can do this with a simple pip command:

pip install scrapingbee

If you have multiple versions of Python installed, make sure to use the right pip command for the version you're using, e.g. pip3 instead of pip.
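
For example, invoking pip through the exact interpreter you plan to use removes any ambiguity:

python3 -m pip install scrapingbee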

The SDK is compatible with Python 3.6+. Once it's installed, you're ready to start making requests to the ScrapingBee API.

Your First API Request

To use ScrapingBee, you'll need to sign up for an account and get an API key. You can do this for free and get 1000 free API credits per month.

Once you have your API key, using the SDK to make a request looks like this:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.scrapingbee.com/blog", params={})

print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)

First we import the ScrapingBeeClient class from the SDK. Then we initialize it with our API key.

To actually make a request and scrape a web page, we use the get method on the client. We pass it the URL we want to scrape. This sends a GET request to the ScrapingBee API which will fetch the page and return the HTML response.

The get method returns a requests.Response object. This allows us to inspect the HTTP status code to check if the request was successful, and access the HTML content of the page via the content attribute.
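
For instance, since this is a standard requests.Response, a minimal sketch for guarding on the status code and working with the decoded HTML could look like this (reusing the same example URL):

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
response = client.get("https://www.scrapingbee.com/blog", params={})

if response.status_code == 200:
    html = response.text  # decoded HTML string, ready for any parser
    print(html[:500])     # preview the first 500 characters
else:
    print('Request failed with status', response.status_code)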

Request Parameters

The real power of ScrapingBee comes from the different parameters you can pass to configure how the page is scraped. These allow you to do things like:

  • Execute JavaScript on the page
  • Use stealth settings to avoid bot detection
  • Geotarget requests from a specific country
  • Set custom headers and user agents
  • Route requests through premium proxies
  • Capture screenshots
  • Click or type on page elements
  • Set a wait time for pages to load

And much more. You specify these options via the params argument. For example, let's use the screenshot parameter to take a full-page screenshot of a page:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.scrapingbee.com/blog", 
  params={
    'screenshot': True,
    'screenshot_full_page': True
  }
)

if response.ok:
    with open("./screenshot.png", "wb") as f:
        f.write(response.content)
else:
    print('Request failed:', response.status_code, response.content)

Here we're passing screenshot: True to tell ScrapingBee to capture a screenshot. We're also setting screenshot_full_page: True so it captures the full length of the page.

In the response handling, we check response.ok to make sure the request was successful. If it was, response.content will contain the binary data of the screenshot image, which we write to a file. If the request failed, we print out the error details.

Some other really useful parameters are:

  • forward_headers – Pass custom headers like user agent
  • premium_proxy – Route the request through a premium proxy in a specific country
  • js_scenario – Execute custom JavaScript on the page to interact with it
  • wait – Wait N milliseconds before returning the page response

Check the ScrapingBee Docs for the full list and more details about each one.
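
To illustrate how a few of these combine, here is a hedged sketch that routes a request through a premium proxy with a country code, waits before returning the page, and runs a small js_scenario click. The URL, CSS selector, and country are made-up examples, and you should check the docs for the exact parameter names and accepted values:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get(
    "https://example.com/products",      # hypothetical page used for illustration
    params={
        'premium_proxy': True,           # route the request through the premium proxy pool
        'country_code': 'de',            # geotarget the request (Germany in this example)
        'wait': 2000,                    # wait 2000 ms before returning the page
        'js_scenario': {
            'instructions': [
                {'click': '#load-more'}  # hypothetical selector to click before capture
            ]
        }
    }
)

print(response.status_code)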

Real-World Scraping Examples

Let's walk through a few more realistic examples of using the ScrapingBee SDK to scrape some actual websites.

Extracting Structured Data

Most of the time when you're scraping a website, you're looking to extract some specific structured data from it. For example, let's say we wanted to scrape search results from Google. Here's how we could use BeautifulSoup to parse the result titles and URLs:

from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

params = {
    'premium_proxy': 'true',
    'search_engine': 'google',
}

response = client.get("https://www.google.com/search?q=web+scraping", params=params)

soup = BeautifulSoup(response.content, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(f'{title}\n{link}\n')

Here we're using premium_proxy: true to route the request through ScrapingBee's premium proxy network. This will use a US residential IP by default to avoid Google blocking us. We're also setting the search_engine param to "google" to solve any CAPTCHAs Google may show.

Then in the response, we're using BeautifulSoup to extract the title and URL of each result on the Google SERP. Pretty cool!
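
In a real project you would usually collect these results into a data structure instead of printing them. Here's a small self-contained sketch of the same scrape that stores the titles and links and writes them to a CSV file with Python's standard csv module:

import csv
from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.google.com/search?q=web+scraping",
                      params={'premium_proxy': 'true', 'search_engine': 'google'})
soup = BeautifulSoup(response.content, 'lxml')

results = []
for result in soup.select('.tF2Cxc'):
    results.append({
        'title': result.select_one('.DKV0Md').text,
        'link': result.select_one('.yuRUbf a')['href'],
    })

# Write the scraped results to a CSV file
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(results)

print(f'Saved {len(results)} results to results.csv')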

Handling Pagination

Many websites split up large amounts of data across multiple pages. To scrape all of this data, you need to be able to navigate through the pagination links and send a request for each page.

Here's an example of how you could use ScrapingBee to scrape a list of quotes from a paginated page on Goodreads:

from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

def scrape_page(url):
    response = client.get(url)
    soup = BeautifulSoup(response.content, 'lxml')

    for quote in soup.select('.quoteText'):
        print(quote.get_text(strip=True, separator=' ')[:100] + '...')

    next_page = soup.select_one('.next_page')
    if next_page:
        next_url = 'https://www.goodreads.com' + next_page['href']
        scrape_page(next_url)

scrape_page('https://www.goodreads.com/quotes')

This recursively calls scrape_page with the next page URL until the "next" link no longer exists. It extracts the quote text from each page along the way. Using BeautifulSoup makes it easy to parse the pagination links as well as the data we want from the page HTML.
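
If you prefer to avoid recursion (Python's default recursion limit could become a problem on very long paginations), the same crawl can be written as a loop. This is just a sketch using the same selectors as above:

from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

def scrape_all_quotes(start_url):
    url = start_url
    while url:
        response = client.get(url)
        soup = BeautifulSoup(response.content, 'lxml')

        for quote in soup.select('.quoteText'):
            print(quote.get_text(strip=True, separator=' ')[:100] + '...')

        # Follow the "next" link if it exists, otherwise stop
        next_page = soup.select_one('.next_page')
        url = 'https://www.goodreads.com' + next_page['href'] if next_page else None

scrape_all_quotes('https://www.goodreads.com/quotes')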

JavaScript Rendering

Some websites these days are built entirely with front-end JavaScript frameworks like React, Angular, or Vue. Instead of returning fully rendered HTML, these sites serve a minimal page and make additional requests to APIs to fetch the data, then render it in the browser. This can make scraping difficult, since the initial HTML response is often empty or minimal.

To scrape these types of sites, you need a way to fully render the JavaScript before accessing the data. ScrapingBee can handle this for you with the render_js parameter:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get(
    'https://www.zyte.com/blog/',
    params={'render_js': True}
)

print(response.content)

Setting render_js: true tells ScrapingBee to render the page with a full browser before returning the HTML. This allows you to scrape sites built with JS frameworks just as easily as server-rendered HTML pages.
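
If the content you need only appears after the page's scripts have finished running, you can combine render_js with the wait parameter mentioned earlier. A minimal sketch (the 3-second wait is an arbitrary choice):

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get(
    'https://www.zyte.com/blog/',
    params={
        'render_js': True,  # render the page in a headless browser
        'wait': 3000        # give client-side scripts 3 seconds to finish
    }
)

print(len(response.content))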

Handling Errors

Web scraping is unpredictable by nature. The websites you're trying to scrape may go down, block your IP address, or change their HTML structure, among other issues that can cause your scraper to break.

To build a resilient scraper, you need to handle these failure cases gracefully in your code. Because the SDK returns standard requests.Response objects, you can check the status code and raise or handle errors yourself:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

try:
    response = client.get("https://httpbin.org/status/500")
    if not response.ok:
        raise RuntimeError(f'Request failed: {response.status_code} {response.text}')
except RuntimeError as e:
    print(e)

Here we're requesting a URL that returns a 500 server error. Since the response isn't OK, we raise an exception with the status code and response body, then catch it and print the error details.

You could also use a try/except block to catch any other exceptions that may occur, like network errors or timeouts. And if a request fails, you could retry it a few times before giving up.

It's important to add error handling and retry logic to your scraper to reduce the chances of missing data or having your script crash when issues occur.
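
Here is one hedged sketch of what that retry logic could look like, using a fixed delay between attempts; the retry count and delay are arbitrary choices:

import time
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

def get_with_retries(url, params=None, max_retries=3, delay=2):
    """Try a request up to max_retries times before giving up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            response = client.get(url, params=params or {})
            if response.ok:
                return response
            last_error = f'HTTP {response.status_code}'
        except Exception as exc:  # network errors, timeouts, etc.
            last_error = exc
        print(f'Attempt {attempt} failed ({last_error}), retrying in {delay}s...')
        time.sleep(delay)
    raise RuntimeError(f'All {max_retries} attempts failed: {last_error}')

try:
    response = get_with_retries("https://httpbin.org/status/500")
except RuntimeError as e:
    print(e)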

Advanced Usage

For more advanced scraping tasks, the ScrapingBee SDK provides some additional features.

Concurrency

By default, the SDK will send requests to the ScrapingBee API synchronously. This means each request will block until the previous one finishes. For better performance when you need to scrape many pages, you can enable concurrency.

Using the ScrapingBeeClient.concurrent method, you can create a client instance that will send requests asynchronously. Here's an example:

import time
from scrapingbee import ScrapingBeeClient

urls = [
    'https://httpbin.org/delay/5',
    'https://httpbin.org/delay/5',
    'https://httpbin.org/delay/5',
    'https://httpbin.org/delay/5',
    'https://httpbin.org/delay/5'
]

client = ScrapingBeeClient().concurrent(
    n_workers=5,
    api_key='YOUR_API_KEY'
)

start = time.time()

def extract_data(response):
    print(f"Scraped {response.url} in {time.time() - start:.2f} seconds")

client.get(urls, params={}, callback=extract_data)

In this example, we have a list of 5 URLs that each take 5 seconds to respond. Normally scraping these synchronously would take about 25 seconds.

But using ScrapingBeeClient.concurrent with 5 workers, we can scrape them all in parallel. Each URL will be fetched in a separate thread, reducing the total runtime to only about 5 seconds.

You just pass the list of URLs to the get method instead of a single URL string. The callback parameter lets you pass a function that will be called with each Response object as the requests complete. This is where you would handle extracting the data from each page.
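
If you would rather manage the parallelism yourself, the standard library's ThreadPoolExecutor works just as well with the regular client; here is a sketch using the same five URLs:

import time
from concurrent.futures import ThreadPoolExecutor
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

urls = ['https://httpbin.org/delay/5'] * 5

start = time.time()

def scrape(url):
    response = client.get(url, params={})
    return url, response.status_code

# Run up to 5 requests in parallel threads
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status in executor.map(scrape, urls):
        print(f'{url} -> {status} ({time.time() - start:.2f}s elapsed)')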

Utility Methods

The SDK also provides a few helper methods for common scraping needs:

  • client.search – Get search engine results from Google or Bing
  • client.extract_metadata – Extract SEO, Open Graph, JSON+LD metadata
  • client.extract_links – Extract unique hyperlinks from a page
  • client.create_scraper – Scrape and extract data based on CSS selectors

For example, here's how you can use the extract_links method to scrape all the URLs from a page:

from scrapingbee import ScrapingBeeClient
from pprint import pprint

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.extract_links(
    url='https://example.com',
    params={'extract_links': True}
)

pprint(response.json())

This will return a JSON object containing all the unique <a href> links found on the page, separated into internal and external links.
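
Continuing from the response above, you can iterate over whatever groups the JSON contains; the exact key names (e.g. internal vs. external) are an assumption based on the description, so this sketch just prints what comes back:

# The grouping keys are an assumption based on the description above;
# print whatever groups the API actually returns.
data = response.json()
for group, links in data.items():
    print(f'{group}: {len(links)} links')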

Conclusion

As you can see, the ScrapingBee Python SDK makes it really easy to integrate their web scraping API into your code. With just a few lines, you can scrape even the most complex, bot-protected websites.

Some of the key things to remember:

  • Install the SDK with pip install scrapingbee
  • Get your API key from the ScrapingBee dashboard
  • Initialize the ScrapingBeeClient with your API key
  • Use the get method to send requests and scrape web pages
  • Pass params to configure things like JS rendering, proxies, CAPTCHAs, etc.
  • Use BeautifulSoup or your favorite HTML parsing library to extract data
  • Enable concurrency for faster bulk scraping

The ScrapingBee docs have a ton more info about all the different features and options available. Be sure to check those out as you start using the API for your own projects.

You can find the full source code for all the examples used in this guide, along with installation instructions, in the GitHub repository.

Happy scraping!
