
How to Turn Web Scrapers into Data APIs

Web scraping is a powerful technique to extract large amounts of data from websites. However, scraping data on demand can be challenging. In this comprehensive guide, we'll learn how to turn a Python web scraper into a real-time data API using FastAPI.

Why Build a Scraping API?

Here are some key benefits of delivering scraped data through an API:

  • Real-time data – APIs can scrape data on demand rather than relying on stale, cached batches.

  • Custom queries – APIs allow custom querying of data rather than just dumping entire websites.

  • Scalability – APIs handle traffic spikes and scale to thousands of users.

  • Reliability – APIs retry failed requests and implement robust scraping logic.

  • Flexibility – APIs can implement various data delivery methods like webhooks.

  • Abstraction – APIs hide complex scraping logic and infrastructure from API consumers.

Scraping API Architecture

At a high level, a web scraping API follows this architecture:


  1. API Receives Request
  2. Scraper Fetches Data
  3. Data Gets Cached
  4. API Returns Scraped Data

The key components are:

  • API Server – Handles client requests and delivers scraped data.

  • Scraper – Fetches data on demand from target sites.

  • Cache – Stores scraped data to avoid redundant scraping.

  • Database – Optionally stores scraped data for historical analysis.
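
To make this flow concrete, here is a minimal sketch of how a single request moves through these components. The names here (handle_request, CACHE, the scrape callable) are placeholders for illustration, not the code we build later in the guide:

import time

CACHE = {}            # symbol -> {"data": ..., "ts": ...}
CACHE_MAX_AGE = 300   # seconds

async def handle_request(symbol, scrape):
    # 1. the API receives a request and checks the cache first
    cached = CACHE.get(symbol)
    if cached and time.time() - cached["ts"] < CACHE_MAX_AGE:
        return cached["data"]
    # 2. the scraper fetches fresh data on demand
    data = await scrape(symbol)
    # 3. the result gets cached (and could also be written to a database)
    CACHE[symbol] = {"data": data, "ts": time.time()}
    # 4. the API returns the scraped data
    return data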

Why Use FastAPI?

There are many excellent API frameworks out there. For scrapers, I recommend FastAPI because:

  • FastAPI is one of the fastest Python web frameworks – useful when an API has to wait on live scrapes.

  • It provides automatic docs, validation, serialization, etc.

  • Supports asyncio for async scraping.

  • Simple to use and learn. Flexible for small to large APIs.

  • Pairs well with Python's scraping ecosystem (httpx, parsel, etc.).

Scraping API Setup

We'll use the following core libraries:

  • FastAPI – Our API framework

  • httpx – Asynchronous HTTP client

  • parsel – HTML/XML parser

  • loguru – Logging utility

Install the packages:

pip install fastapi uvicorn httpx parsel loguru

For this guide, we'll scrape some basic stock data from Yahoo Finance.

Create the API

Let's set up the initial FastAPI app with a single endpoint:

from fastapi import FastAPI

app = FastAPI() 

@app.get("/stock/{symbol}")  
async def get_stock(symbol: str):
    return { "stock": symbol } 

This basic API just returns the stock symbol we provide.

Let's start the server:

uvicorn main:app --reload

Then send a test request from a separate Python session:

import httpx

print(httpx.get("http://localhost:8000/stock/AAPL").json())

# {'stock': 'AAPL'}

Our API is ready to receive requests. Next, let's make it scrape some data.

Scraping Stock Data

To get a stock's data we'll:

  1. Build the Yahoo Finance URL from symbol
  2. Fetch the HTML page
  3. Parse values using XPath

import httpx
from parsel import Selector  # for XPath parsing

async def scrape_stock(symbol: str):
    url = f"https://finance.yahoo.com/quote/{symbol}"

    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    sel = Selector(response.text)

    # parse summary table values
    rows = sel.xpath('//div[contains(@data-test,"summary-table")]//tr')
    data = {}
    for row in rows:
        label = row.xpath("./td[1]/text()").get()
        value = row.xpath("./td[2]/text()").get()
        data[label] = value

    # parse current price
    data["price"] = sel.xpath(
        '//fin-streamer[@data-symbol=$symbol]/@value', symbol=symbol
    ).get()

    return data
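
Before wiring this into the API, it's worth sanity-checking the scraper on its own. A quick check (this assumes the function lives in yahoo_finance.py, the module imported in the next snippet):

import asyncio
from yahoo_finance import scrape_stock

# quick manual check of the scraper outside the API
print(asyncio.run(scrape_stock("AAPL")))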

This scraper returns a dictionary containing the parsed data. Let's connect it to our API:

from fastapi import FastAPI
from yahoo_finance import scrape_stock

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    data = await scrape_stock(symbol) 
    return data

Now when we call our API, it'll fetch the latest data:

http http://localhost:8000/stock/AAPL

HTTP/1.1 200 OK
Content-Length: 340

{
  "52 Week Range": "142.00 - 182.94",
  "Beta (5Y Monthly)": "1.25", 
  "Diluted EPS (ttm)": "6.05",
  "Earnings Date": "Oct 27, 2022",
  "Ex-Dividend Date": "Aug 05, 2022",
  "Forward Dividend & Yield": "0.92 (0.59%)",
  "Market Cap": "2.44T",
  "Open": "156.76",
  "PE Ratio (ttm)": "25.60",
  "Previous Close": "153.72",
  "Price": "155.33",  
  "Volume": "53,978,024"
}

Adding Caching

Scraping on every request is wasteful. Let's add caching so we only scrape a stock once every 5 minutes.

We'll use a simple dict to store the scraped data keyed by stock symbol:

import time

STOCK_CACHE = {}

async def scrape_stock(symbol):
    # serve a recent result from the cache instead of re-scraping
    if symbol in STOCK_CACHE:
        return STOCK_CACHE[symbol]["data"]

    data = ...  # scrape as before
    STOCK_CACHE[symbol] = {"data": data, "ts": time.time()}
    return data

Now repeated requests will return cached data instead of scraping every time.

We can also periodically clear out expired cache entries:

import asyncio
import time

CACHE_MAX_AGE = 300  # seconds

async def clear_expired_cache(period=CACHE_MAX_AGE):
    while True:
        curr_time = time.time()
        # copy the keys so we can safely delete entries while iterating
        for symbol in list(STOCK_CACHE):
            if curr_time - STOCK_CACHE[symbol]["ts"] > CACHE_MAX_AGE:
                del STOCK_CACHE[symbol]
        # run the sweep every 5 minutes
        await asyncio.sleep(period)

This ensures our cache won't grow unbounded.
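
The cleanup loop should be started once the application's event loop is running. A minimal sketch using FastAPI's startup event hook (assuming the same app object from earlier):

import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def start_cache_cleanup():
    # launch the cleanup loop in the background once the event loop exists
    asyncio.create_task(clear_expired_cache())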

Adding Webhooks

For long scraping jobs, we can use webhooks to return results asynchronously:

import asyncio

@app.get("/stock/{symbol}")
async def get_stock(symbol: str, webhook: str = None):
    # if a webhook URL is given, scrape in the background and return immediately
    if webhook:
        asyncio.create_task(fetch_stock(symbol, webhook))
        return {"msg": "Fetching data"}

    data = await scrape_stock(symbol)
    return data

async def fetch_stock(symbol, url):
    data = await scrape_stock(symbol)

    # POST the scraped data to the caller's webhook URL
    async with httpx.AsyncClient() as client:
        await client.post(url, json=data)

Now instead of waiting for scraping to complete, our API will immediately return a status and deliver the data asynchronously to the callback webhook.

We can test this using a tool like Webhook.site.
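
If you would rather test locally, a small receiver endpoint works just as well. A quick sketch (the /callback route and service are just an illustration, run as a separate app):

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/callback")
async def receive_webhook(request: Request):
    # print whatever payload the scraping API posted back to us
    payload = await request.json()
    print(payload)
    return {"received": True}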

Scaling the Scraping API

As traffic increases, here are some scaling techniques:

  • Add API caching – Use a shared cache like Redis to reduce scraping load (see the sketch after this list).

  • Run multiple processes – Scale across cores/servers with gunicorn.

  • Offload scraping – Move scraping into background workers using a task queue such as Celery (typically backed by a broker like RabbitMQ).

  • Use scraping services – Leverage scraping APIs like Scrapfly or ScraperAPI.

  • Optimize scrapers – Ensure scrapers are efficient and avoid bans.

  • Add databases – Store scraped data in databases for further analysis.
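
As an example of the first point, swapping the in-process dict for Redis lets several API processes share one cache. A rough sketch using redis-py's asyncio client (assumes a local Redis instance and the redis package installed; the key names and 5-minute TTL are illustrative):

import json
import redis.asyncio as redis

r = redis.Redis()  # assumes Redis on localhost:6379

async def scrape_stock_cached(symbol: str):
    cached = await r.get(f"stock:{symbol}")
    if cached:
        return json.loads(cached)
    data = await scrape_stock(symbol)  # the Yahoo Finance scraper from earlier
    # expire the key after 5 minutes, mirroring CACHE_MAX_AGE
    await r.set(f"stock:{symbol}", json.dumps(data), ex=300)
    return data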

Conclusion

In this guide we built a web scraping API in Python using FastAPI. The key takeaways are:

  • FastAPI provides an excellent framework for scraping APIs.

  • Scrapers can be wrapped in API endpoints that fetch data on demand.

  • Caching and webhooks help overcome scraping limitations.

  • There are many optimization and scaling strategies as traffic grows.

Scraping APIs unlock the wealth of data on websites. By exposing scrapers as APIs instead of shipping static scrape dumps, we can deliver low-latency, customized data at scale.
