
How to Build a Powerful News Crawler with Python and the ScrapingBee API

As a developer, you may find yourself needing to aggregate news headlines from multiple sources on a regular basis. Maybe you want to stay on top of the latest stories in your industry, or perhaps you're building a news reader application for others to use. Either way, manually checking dozens of news sites each day is tedious and time-consuming.

The solution is to automate the process by building a web crawler that can scrape the latest headlines for you. A web crawler, also known as a spider bot, is a program that systematically browses websites and extracts data, following links from page to page.

In this in-depth tutorial, you'll learn how to build your own news crawling bot using Python. We'll leverage the power of the ScrapingBee API to handle the complexities of web scraping, and create a web application with Flask to display the aggregated headlines.

Here's what we'll cover:

  • Setting up your Python development environment
  • Identifying news sites to crawl and inspecting their HTML
  • Scraping headlines with the ScrapingBee API
  • Building a basic command line tool to fetch and display headlines
  • Creating a Flask web app to show news on a page
  • Scheduling the scraper to run automatically on an interval
  • Deploying your news bot
  • Tips and best practices for web scraping

By the end of this guide, you'll have a fully functional news crawler that you can adapt and expand to suit your needs. Let's get started!

Prerequisites

Before we dive in, make sure you have the following:

  • Python 3 installed on your machine
  • A ScrapingBee account and API key (a free trial is available when you sign up)
  • Basic familiarity with Python and the command line

Setting Up Your Environment

To kick things off, let's create a new directory for our project and set up a virtual environment to manage our Python dependencies.

Open up your terminal and run:

mkdir news-crawler
cd news-crawler

This creates a directory called news-crawler and navigates into it.

Next, create a virtual environment named .venv using the venv module:

python3 -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Your terminal prompt should now show (.venv), indicating the virtual environment is active. (On Windows, run .venv\Scripts\activate instead.)

Now we can install the libraries we need for this project:

pip install scrapingbee flask apscheduler

This will install:

  • scrapingbee – The official ScrapingBee client library
  • flask – A popular lightweight web framework
  • apscheduler – A library for scheduling jobs

We're ready to start building our crawler!

Configuring ScrapingBee

The ScrapingBee API makes it easy to scrape websites without having to deal with common challenges like IP rotation, CAPTCHAs, and inconsistent HTML rendering.

To use it, you'll need an API key. Sign in to the ScrapingBee dashboard at https://app.scrapingbee.com/login and find your API key under the API Key Management section.

Once you have your API key, set a SCRAPINGBEE_API_KEY environment variable containing its value:

export SCRAPINGBEE_API_KEY=your_api_key_here

Replace your_api_key_here with your real key. This will allow us to securely access the key from our Python code without hard-coding it.
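
To verify that Python can see the variable, you can run a quick throwaway check (not part of the project files):

import os

# Exit with a clear message if the key isn't visible in this shell session
if not os.environ.get("SCRAPINGBEE_API_KEY"):
    raise SystemExit("SCRAPINGBEE_API_KEY is not set")
print("API key found")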

Planning the Crawl Targets

For this example, let's crawl headlines from three major news websites:

  1. CNN – https://www.cnn.com
  2. Fox News – https://www.foxnews.com
  3. Reuters – https://www.reuters.com

Before we start scraping, we need to inspect each site to determine how to extract just the headlines. ScrapingBee uses "data extraction rules" to target specific elements on the page.

To get the rules, open up each URL in your browser, right-click a headline, and select "Inspect" to open the developer tools.

You should see the HTML for that headline highlighted in the "Elements" panel. It shows the element type (e.g. <h2>), plus any HTML attributes like the class name. We can use these to construct a CSS selector that uniquely identifies the headline.
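
For instance, the highlighted markup might look roughly like this (an illustrative fragment; real tag names and class names vary by site):

<h2 class="headline">Some headline text</h2>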

For example, on CNN.com the headlines seem to consistently have a class of .cd__headline. So our rule to get CNN headlines would look like:

{
  "headlines": {
    "selector": ".cd__headline",
    "type": "list",
    "output": "text"
  }
}

This tells ScrapingBee to find all elements matching the .cd__headline selector, return them as a list, and extract just the inner text content.
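
When the request succeeds, ScrapingBee returns JSON shaped by your rules. For the rule above, the response body would look something like this (the headline strings are placeholders):

{
  "headlines": [
    "First example headline",
    "Second example headline"
  ]
}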

Following this process for the other two sites, I determined these rules:

Fox News:

{
  "headlines": {
    "selector": "main h2.title",
    "type": "list", 
    "output": "text"
  }
}

Reuters:

{
  "headlines": {
    "selector": ".story-content a .text__text__1FZLe",
    "type": "list",
    "output": "text" 
  }
}

Keep in mind that news sites redesign frequently, so expect to revisit these selectors over time. With our data extraction rules defined, we're ready to start writing code!

Building the Basic Crawler

Create a new file called crawler.py in your news-crawler directory:

touch crawler.py

Open it in your preferred code editor and add the following:

import os
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key=os.environ['SCRAPINGBEE_API_KEY'])

CNN_URL = 'https://www.cnn.com'
FOX_URL = 'https://www.foxnews.com'
REUTERS_URL = 'https://www.reuters.com'

cnn_rules = {
  "headlines": {
    "selector": ".cd__headline",
    "type": "list",
    "output": "text"
  }
}

fox_rules = {
  "headlines": {
    "selector": "main h2.title",
    "type": "list", 
    "output": "text"
  }
}

reuters_rules = {
  "headlines": {
    "selector": ".story-content a .text__text__1FZLe",
    "type": "list",
    "output": "text" 
  }
}

def fetch_headlines(url, rules):
    response = client.get(url, params={
        "extract_rules": rules
    })

    return response.json()['headlines']

print("=== CNN ===")
print(fetch_headlines(CNN_URL, cnn_rules))

print("=== FOX ===") 
print(fetch_headlines(FOX_URL, fox_rules))

print("=== Reuters ===")
print(fetch_headlines(REUTERS_URL, reuters_rules))

Let's break this down:

First we import the os module to access environment variables, and the ScrapingBeeClient class from the scrapingbee package.

We create a new ScrapingBeeClient instance, passing in our API key from the SCRAPINGBEE_API_KEY environment variable.

Next, we define some constants with the URLs of the news sites we want to scrape, and our data extraction rules for each one.

The fetch_headlines function is where the magic happens. It takes a URL and extraction rules as parameters.

It calls the get method on the ScrapingBeeClient to make an HTTP request to the given URL. We pass the extraction rules in the params option.

Finally, it returns the extracted headlines from the JSON response.

The last part of the script prints out the headlines from each news source.
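
One caveat before we run it: fetch_headlines assumes every request succeeds. In practice, requests fail and selectors go stale, so for anything long-running you may want a more defensive variant along these lines (fetch_headlines_safe is an illustrative name, not part of the ScrapingBee library):

def fetch_headlines_safe(url, rules):
    try:
        response = client.get(url, params={"extract_rules": rules})
    except Exception as exc:
        print(f"Request to {url} failed: {exc}")
        return []
    if response.status_code != 200:
        print(f"Request to {url} returned status {response.status_code}")
        return []
    # Fall back to an empty list if the selector matched nothing
    return response.json().get("headlines") or []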

Let's test it out:

python crawler.py

You should see a list of the latest headlines from CNN, Fox News, and Reuters printed out in your terminal!

This is a great start, but viewing headlines in the terminal is not very user-friendly. Let's display them on a web page instead using Flask.

Serving Headlines on a Web Page

Create a new file called app.py:

touch app.py

Here's the code to display our crawled headlines on a web page:

from datetime import datetime
from flask import Flask, render_template

from crawler import fetch_headlines, CNN_URL, FOX_URL, REUTERS_URL, cnn_rules, fox_rules, reuters_rules

app = Flask(__name__)

def crawl():
    headlines = {
        "cnn": fetch_headlines(CNN_URL, cnn_rules),
        "fox": fetch_headlines(FOX_URL, fox_rules),
        "reuters": fetch_headlines(REUTERS_URL, reuters_rules)
    }

    return headlines

@app.route('/')
def index():
    headlines = crawl()
    return render_template('index.html', headlines=headlines, last_update=datetime.now())

if __name__ == '__main__':
    app.run()

We import the fetch_headlines function and other constants from our crawler.py module. We also import Flask and other dependencies.

The crawl function invokes fetch_headlines for each of our news sources and returns a dictionary mapping the source name to its headlines list.

The index route calls crawl to get the latest headlines and renders them using an index.html template. It also passes the current datetime to display when the headlines were last updated.

Finally, we use app.run() to start the Flask development server when this script is run directly.

Create a templates directory and add an index.html file inside it:

mkdir templates
touch templates/index.html

Add the following to templates/index.html:

<!doctype html>
<html>
  <head>
    <title>Top Headlines</title>
  </head>

  <body>

    <p>Last updated: {{ last_update }}</p>

    <h2>CNN</h2>
    <ul>
    {% for headline in headlines['cnn'] %}
      <li>{{ headline }}</li>
    {% endfor %}
    </ul>

    <h2>Fox News</h2>
    <ul>
    {% for headline in headlines['fox'] %}
      <li>{{ headline }}</li>
    {% endfor %}
    </ul>

    <h2>Reuters</h2>
    <ul>
    {% for headline in headlines['reuters'] %}
      <li>{{ headline }}</li>  
    {% endfor %}
    </ul>

  </body>
</html>

This is a simple HTML page that loops through the headlines dictionary passed from the index route and displays each one in an unordered list under a heading for the news source. It also displays the last updated time from the last_update variable.
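
Since the markup for each source is identical, you could also collapse the three blocks into a single loop over the dictionary. This variant (a sketch producing equivalent output) renders the raw keys like "cnn" as headings rather than display names:

{% for source, items in headlines.items() %}
  <h2>{{ source }}</h2>
  <ul>
  {% for headline in items %}
    <li>{{ headline }}</li>
  {% endfor %}
  </ul>
{% endfor %}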

Start the app:

python app.py

Visit http://localhost:5000 in your browser. You should see your top headlines from CNN, Fox News and Reuters laid out on a page!

While this is great progress, our app only fetches headlines when a user visits the page. Ideally, we want the headlines to refresh automatically at a set interval.

Crawling Periodically

To make our crawler run on a schedule, we'll use the APScheduler library.

Add the following to app.py, above the if __name__ == '__main__': block:

from apscheduler.schedulers.background import BackgroundScheduler

headlines = crawl()
last_update = datetime.now()

def refresh_headlines():
    # APScheduler discards a job's return value, so store results in globals
    global headlines, last_update
    headlines = crawl()
    last_update = datetime.now()

scheduler = BackgroundScheduler()
scheduler.add_job(func=refresh_headlines, trigger="interval", seconds=60)
scheduler.start()

First, we perform an initial crawl when the app starts and store the headlines, along with the current time, in module-level variables.

We then define a refresh_headlines function that repeats the crawl and updates those globals. Scheduling crawl directly would not work, because APScheduler discards a job's return value. Finally, we create a BackgroundScheduler instance and use its add_job method to run refresh_headlines every 60 seconds in the background.

Update the index route to read the global variables instead of calling crawl directly:

@app.route('/')
def index():
    return render_template('index.html', headlines=headlines, last_update=last_update)

This also makes the "Last updated" timestamp reflect when the headlines were actually fetched rather than when the page was rendered.

That's it! Our news crawler is now complete. It fetches the latest top headlines every minute and displays them on a web page.
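
One optional refinement: shut the scheduler down cleanly when the process exits. A minimal sketch using the standard library's atexit module:

import atexit

# Stop the background scheduler when the Flask process exits
atexit.register(lambda: scheduler.shutdown())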

Deployment

To make your news crawler accessible to others, you'll need to deploy it to a web server.

One easy option is to use a Platform as a Service (PaaS) provider like Heroku or PythonAnywhere. These allow you to deploy Python apps with minimal configuration.

Refer to their documentation for specific instructions, but in general the process involves:

  1. Creating an account with the provider
  2. Installing the provider's CLI tool
  3. Initializing a Git repository for your project
  4. Adding a requirements.txt file listing your Python dependencies
  5. Adding a Procfile specifying your app's startup command
  6. Pushing your code to the provider
  7. Configuring environment variables (for your API key)

Once deployed, your news crawler will have a public URL you can share. It will keep itself updated with fresh headlines automatically!
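
For steps 4 and 5 above, the two files might look like this (a sketch: gunicorn is an assumption, since any production WSGI server will do, and some providers supply their own defaults):

requirements.txt:

scrapingbee
flask
apscheduler
gunicorn

Procfile:

web: gunicorn app:app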

Of course, there are many additional features you could add, such as:

  • Storing headlines in a database
  • Adding more news sources
  • Allowing users to choose their sources
  • Extracting other details like images, dates, tags, etc.
  • Creating a custom front-end design

Web Scraping Tips

When building a production news crawler, there are some important tips to keep in mind:

  • Respect website terms of service and robots.txt files. Some sites prohibit scraping.
  • Limit your crawl rate to avoid impacting the performance of your target websites.
  • Use caching to avoid re-scraping unchanged pages.
  • Handle errors and edge cases gracefully. Some headlines may fail to extract.
  • Monitor your crawler and set up alerts if it stops working.
  • Rotate user agent strings and IP addresses to avoid triggering anti-bot measures.

ScrapingBee handles many of these issues for you by default, but it's still important to scrape ethically and with restraint.
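
For example, a simple way to honor the rate-limiting tip in this project is to pause between requests (an illustrative sketch; crawl_politely is a hypothetical helper, and the right delay depends on the site):

import time

from crawler import fetch_headlines

def crawl_politely(sources, delay_seconds=2):
    # sources maps a source name to a (url, rules) pair
    results = {}
    for name, (url, rules) in sources.items():
        results[name] = fetch_headlines(url, rules)
        time.sleep(delay_seconds)  # pause so we don't hammer target sites
    return results

Calling crawl_politely({"cnn": (CNN_URL, cnn_rules), "fox": (FOX_URL, fox_rules)}) would then space the two requests a couple of seconds apart.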

Conclusion

Congratulations! You now have a fully functional news crawler built with Python and powered by the ScrapingBee API. Let's recap the key steps:

  1. Create a ScrapingBee account and configure your API key
  2. Inspect your target websites to identify headline selectors
  3. Use the ScrapingBee client library to fetch and extract headlines
  4. Display your scraped headlines on a web page using Flask
  5. Schedule your crawler to fetch new headlines periodically
  6. Deploy your app to a hosting provider

With this foundation in place, you can extend your crawler to cover more sites, extract additional data, and scale it up for production use. The flexibility of ScrapingBee and the Python ecosystem makes it easy to customize your solution.

Web scraping opens up a world of possibilities for aggregating data and keeping tabs on important sources. By automating the tedious parts, you can focus on drawing insights and building valuable tools.

I hope this in-depth tutorial has been helpful! Feel free to reach out if you have any other questions.

Happy scraping!
