How to Turn Web Scrapers into Data APIs

Web scraping is a powerful technique for extracting large amounts of data from websites. However, serving scraped data on demand can be challenging. In this guide, we'll learn how to turn a Python web scraper into a real-time data API using FastAPI.

Why Build a Scraping API?

Here are some key benefits of delivering scraped data through an API:

Real-time data – APIs can scrape data on demand rather than relying on stale, cached batches.
Custom queries – APIs allow custom querying of data rather than just dumping entire websites.
Scalability – APIs can be scaled horizontally to absorb traffic spikes and serve thousands of users.
Reliability – APIs can retry failed requests and keep robust scraping logic in one place.
Flexibility – APIs can implement various data delivery methods like webhooks.
Abstraction – APIs hide complex scraping logic and infrastructure from API consumers.

Scraping API Architecture

At a high level, a web scraping API follows this architecture:

API Receives Request
Scraper Fetches Data
Data Gets Cached
API Returns Scraped Data

The key components are:

API Server – Handles client requests and delivers scraped data.
Scraper – Fetches data on demand from target sites.
Cache – Stores scraped data to avoid redundant scraping.
Database – Optionally stores scraped data for historical analysis.
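The request flow above can be sketched in a few lines of plain Python. This is only an illustration of the architecture, not the final API: `CACHE`, `fake_scrape` and `handle_request` are hypothetical names, and the scraper is a stand-in that returns canned data instead of fetching a page.

```python
import asyncio

# Hypothetical in-memory cache; a real deployment might use Redis instead.
CACHE: dict[str, dict] = {}


async def fake_scrape(symbol: str) -> dict:
    """Stand-in for a real scraper; returns canned data instead of fetching a page."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"symbol": symbol, "price": "0.00"}


async def handle_request(symbol: str, scrape=fake_scrape) -> dict:
    """API layer: serve from the cache when possible, otherwise scrape and store."""
    if symbol in CACHE:
        return CACHE[symbol]
    data = await scrape(symbol)
    CACHE[symbol] = data
    return data


if __name__ == "__main__":
    print(asyncio.run(handle_request("AAPL")))
```

Passing the scraper in as a parameter keeps the API layer independent of any one target site, which also makes it easy to swap in a fake scraper for tests.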

Why Use FastAPI?

There are many excellent API frameworks out there. For scrapers, I recommend FastAPI because:

FastAPI is built on Starlette and ASGI, so it handles many concurrent requests efficiently, which suits I/O-bound scraping.
It provides automatic docs, validation, serialization, etc.
It supports asyncio, so scrapers can run asynchronously.
It is simple to use and learn, and flexible enough for small to large APIs.
It pairs well with Python scraping tools like httpx and parsel.

Scraping API Setup

We'll use the following core libraries:

FastAPI – Our API framework
httpx – Asynchronous HTTP client
parsel – HTML/XML parser
loguru – Logging utility

Install the packages:

pip install fastapi uvicorn httpx parsel loguru

For this guide, we'll scrape some basic stock data from Yahoo Finance.

Create the API

Let's set up the initial FastAPI app with a single endpoint:

from fastapi import FastAPI

app = FastAPI() 

@app.get("/stock/{symbol}")  
async def get_stock(symbol: str):
    return { "stock": symbol }

This basic API just returns the stock symbol we provide.

Let's save this as main.py, start the server, and test it:

uvicorn main:app --reload

import httpx

print(httpx.get("http://localhost:8000/stock/AAPL").json())

# {'stock': 'AAPL'}

Our API is ready to receive requests. Next, let's make it scrape some data.

Scraping Stock Data

To get a stock's data, we'll:

Build the Yahoo Finance URL from the symbol
Fetch the HTML page
Parse values using XPath

# yahoo_finance.py
import httpx
from parsel import Selector  # for XPath parsing


async def scrape_stock(symbol: str) -> dict:
    url = f"https://finance.yahoo.com/quote/{symbol}"

    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    sel = Selector(text=response.text)

    # parse the summary table rows into label/value pairs
    data = {}
    for row in sel.xpath('//div[contains(@data-test,"summary-table")]//tr'):
        label = row.xpath("./td[1]/text()").get()
        value = row.xpath("./td[2]/text()").get()
        if label:  # skip rows without a label cell
            data[label] = value

    # parse the current price; $symbol is bound as an XPath variable
    data["price"] = sel.xpath('//fin-streamer[@data-symbol=$symbol]/@value', symbol=symbol).get()

    return data

This scraper returns a dictionary containing the parsed data. Let's connect it to our API:

from fastapi import FastAPI
from yahoo_finance import scrape_stock

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    data = await scrape_stock(symbol) 
    return data

Now when we call our API (here using the HTTPie command-line client), it'll fetch the latest data:

http http://localhost:8000/stock/AAPL

HTTP/1.1 200 OK
Content-Length: 340

{
  "52 Week Range": "142.00 - 182.94",
  "Beta (5Y Monthly)": "1.25", 
  "Diluted EPS (ttm)": "6.05",
  "Earnings Date": "Oct 27, 2022",
  "Ex-Dividend Date": "Aug 05, 2022",
  "Forward Dividend & Yield": "0.92 (0.59%)",
  "Market Cap": "2.44T",
  "Open": "156.76",
  "PE Ratio (ttm)": "25.60",
  "Previous Close": "153.72",
  "Volume": "53,978,024",
  "price": "155.33"
}
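Since the symbol travels straight from the request path into the target URL, it may be worth validating it before scraping. Below is a minimal sketch using only the standard library. `SYMBOL_PATTERN` and `build_quote_url` are hypothetical names, and the allowed character set is an assumption, since real ticker formats vary by exchange.

```python
import re
from urllib.parse import quote

# Assumed pattern: 1-10 letters, digits, dots, hyphens, carets or equals signs.
# Real ticker formats vary by exchange, so adjust this as needed.
SYMBOL_PATTERN = re.compile(r"[A-Za-z0-9.\-^=]{1,10}")


def build_quote_url(symbol: str) -> str:
    """Normalize and validate a symbol, then build the quote page URL."""
    symbol = symbol.strip().upper()
    if not SYMBOL_PATTERN.fullmatch(symbol):
        raise ValueError(f"invalid stock symbol: {symbol!r}")
    # percent-encode anything that is not safe inside a URL path segment
    return f"https://finance.yahoo.com/quote/{quote(symbol, safe='')}"


if __name__ == "__main__":
    print(build_quote_url("aapl"))
```

In the endpoint, the `ValueError` could be caught and turned into a client error, for example by raising FastAPI's `HTTPException` with a 400 status.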

Adding Caching

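Scraping on every request is wasteful. Let's add caching so we only scrape a stock once every 5 minutes.

We'll use a simple dict to store the scraped data, along with the time it was scraped, keyed by stock symbol:

import time

STOCK_CACHE = {}
CACHE_MAX_AGE = 300  # seconds

async def scrape_stock(symbol):
    entry = STOCK_CACHE.get(symbol)
    # serve the cached copy only while it is younger than CACHE_MAX_AGE
    if entry and time.time() - entry["ts"] < CACHE_MAX_AGE:
        return entry["data"]

    data = ...  # scrape
    STOCK_CACHE[symbol] = {"ts": time.time(), "data": data}

    return data

Now repeated requests within 5 minutes return cached data instead of scraping every time.

We can also periodically clear expired entries:

import asyncio

async def clear_expired_cache():
    while True:
        await asyncio.sleep(CACHE_MAX_AGE)  # run every 5 minutes
        curr_time = time.time()
        # iterate over a copy so entries can be deleted during the loop
        for symbol, entry in list(STOCK_CACHE.items()):
            if curr_time - entry["ts"] > CACHE_MAX_AGE:
                del STOCK_CACHE[symbol]

# start the cleanup loop once the server's event loop is running
# (newer FastAPI versions also offer a lifespan handler for this)
@app.on_event("startup")
async def start_cache_cleanup():
    app.state.clear_cache_task = asyncio.create_task(clear_expired_cache())

This ensures our cache won't grow unbounded.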
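The same expiry idea can be packaged as a small class, which keeps the timestamp bookkeeping out of the scraper. This is a sketch rather than a production cache: `TTLCache` is a hypothetical name, and the injectable `clock` parameter is an assumption added so the expiry logic can be tested without waiting.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Dict-backed cache whose entries expire after max_age seconds."""

    def __init__(self, max_age: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.max_age:
            del self._entries[key]  # drop stale entries lazily on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self.clock(), value)

    def purge(self) -> int:
        """Delete every expired entry and return how many were removed."""
        now = self.clock()
        expired = [k for k, (ts, _) in self._entries.items() if now - ts > self.max_age]
        for key in expired:
            del self._entries[key]
        return len(expired)
```

With this in place, the scraper would call `cache.get(symbol)` first and `cache.set(symbol, data)` after a fresh scrape, while the periodic task only needs to call `cache.purge()`.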

Adding Webhooks

For long scraping jobs, we can use webhooks to return results asynchronously:

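import asyncio
from typing import Optional

BACKGROUND_TASKS = set()  # strong references to in-flight webhook tasks

@app.get("/stock/{symbol}")
async def get_stock(symbol: str, webhook: Optional[str] = None):
    # webhook is an optional query parameter, e.g. /stock/AAPL?webhook=https://...
    if webhook:
        task = asyncio.create_task(fetch_stock(symbol, webhook))
        # keep a reference so the task is not garbage-collected mid-run
        BACKGROUND_TASKS.add(task)
        task.add_done_callback(BACKGROUND_TASKS.discard)
        return {"msg": "Fetching data"}

    data = await scrape_stock(symbol)
    return data

async def fetch_stock(symbol, url):
    data = await scrape_stock(symbol)

    # deliver the scraped data to the caller's webhook URL
    async with httpx.AsyncClient() as client:
        await client.post(url, json=data)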

Now instead of waiting for scraping to complete, our API will immediately return a status and deliver the data asynchronously to the callback webhook.

We can test this using a tool like Webhook.site.
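Webhook receivers can be briefly unavailable, so delivery may benefit from a few retries. The sketch below shows one possible approach with exponential backoff. `deliver_with_retry` and its parameters are hypothetical, and `send` is assumed to be any async callable that raises on failure, such as a thin wrapper around an HTTP POST.

```python
import asyncio


async def deliver_with_retry(send, payload, attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Call send(payload) until it succeeds or the attempts are exhausted.

    Returns True on success and False once every attempt has failed.
    """
    for attempt in range(attempts):
        try:
            await send(payload)
            return True
        except Exception:  # broad on purpose for a sketch; narrow this in real code
            if attempt == attempts - 1:
                return False
            # exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await sleep(base_delay * 2 ** attempt)
    return False
```

The `sleep` parameter defaults to `asyncio.sleep` and exists only so tests can replace it with a no-op and avoid real waiting.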

Scaling the Scraping API

As traffic increases, here are some scaling techniques:

Add API caching – Use a cache like Redis to reduce scraping load.
Run multiple processes – Scale across cores/servers with gunicorn.
Offload scraping – Move scrapers to background workers using a task queue like Celery, backed by a message broker such as RabbitMQ.
Use scraping services – Leverage scraping APIs like Scrapfly or ScraperAPI.
Optimize scrapers – Ensure scrapers are efficient and avoid bans.
Add databases – Store scraped data in databases for further analysis.
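One simple way to keep scrapers efficient without hammering the target site is to cap how many scrapes run at once. Here is a minimal sketch using `asyncio.Semaphore`. `scrape_many` and `MAX_CONCURRENT_SCRAPES` are hypothetical names, the limit of 5 is an arbitrary assumption, and `scrape` stands for any async scraping function.

```python
import asyncio

MAX_CONCURRENT_SCRAPES = 5  # assumed limit; tune it for the target site


async def scrape_many(symbols, scrape, limit=MAX_CONCURRENT_SCRAPES):
    """Scrape several symbols concurrently, never running more than limit at once."""
    symbols = list(symbols)
    semaphore = asyncio.Semaphore(limit)

    async def bounded(symbol):
        # wait here whenever limit scrapes are already in flight
        async with semaphore:
            return await scrape(symbol)

    results = await asyncio.gather(*(bounded(s) for s in symbols))
    return dict(zip(symbols, results))
```

All requests are still scheduled up front, but the semaphore keeps the number of in-flight scrapes at or below the limit.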

Conclusion

In this guide we built a web scraping API in Python using FastAPI. The key takeaways are:

FastAPI provides an excellent framework for scraping APIs.
We can wrap scrapers in API endpoints to fetch data on demand.
Caching and webhooks help overcome scraping limitations.
There are many optimization and scaling strategies as traffic grows.

Scraping APIs unlock the wealth of data on websites. By serving scraped data through an API instead of static batch dumps, we can deliver fresh, customized data at scale.
