Web scraping is a powerful technique for extracting large amounts of data from websites. However, serving scraped data on demand is challenging. In this comprehensive guide, we'll learn how to turn a Python web scraper into a real-time data API using FastAPI.
Why Build a Scraping API?
Here are some key benefits of delivering scraped data through an API:
Real-time data – APIs can scrape data on demand rather than relying on stale, cached batches.
Custom queries – APIs allow custom querying of data rather than just dumping entire websites.
Scalability – APIs handle traffic spikes and scale to thousands of users.
Reliability – APIs retry failed requests and implement robust scraping logic.
Flexibility – APIs can implement various data delivery methods like webhooks.
Abstraction – APIs hide complex scraping logic and infrastructure from API consumers.
Scraping API Architecture
At a high level, a web scraping API follows this architecture:
- API Receives Request
- Scraper Fetches Data
- Data Gets Cached
- API Returns Scraped Data
The key components are:
API Server – Handles client requests and delivers scraped data.
Scraper – Fetches data on demand from target sites.
Cache – Stores scraped data to avoid redundant scraping.
Database – Optionally stores scraped data for historical analysis.
Why Use FastAPI?
There are many excellent API frameworks out there. For scrapers, I recommend FastAPI because:
FastAPI is very fast – perfect for real-time data APIs.
It provides automatic docs, validation, and serialization.
It supports asyncio, so we can scrape asynchronously without blocking the server.
It's simple to learn and use, yet flexible enough for small and large APIs alike.
It plays well with Python's scraping ecosystem, including async-friendly tools like httpx and parsel.
Scraping API Setup
We'll use the following core libraries:
FastAPI – Our API framework
httpx – Asynchronous HTTP client
parsel – HTML/XML parser
loguru – Logging utility
Install the packages:
pip install fastapi uvicorn httpx parsel loguru
For this guide, we'll scrape some basic stock data from Yahoo Finance.
Create the API
Let's set up the initial FastAPI app with a single endpoint:
from fastapi import FastAPI

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    return {"stock": symbol}
This basic API just returns the stock symbol we provide.
Let's start the server and test it:
uvicorn main:app --reload
import httpx

print(httpx.get("http://localhost:8000/stock/AAPL").json())
# {'stock': 'AAPL'}
Our API is ready to receive requests. Next let's make it scrape some data.
Scraping Stock Data
To get a stock's data we'll:
- Build the Yahoo Finance URL from symbol
- Fetch the HTML page
- Parse values using XPath
import httpx
from parsel import Selector  # for XPath parsing

async def scrape_stock(symbol: str):
    url = f"https://finance.yahoo.com/quote/{symbol}"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    sel = Selector(response.text)
    # parse the summary table rows into label: value pairs
    data = {}
    for row in sel.xpath('//div[contains(@data-test,"summary-table")]//tr'):
        label = row.xpath("./td[1]/text()").get()
        value = row.xpath("./td[2]/text()").get()
        data[label] = value
    # parse the live price from the fin-streamer element
    data["price"] = sel.xpath(
        '//fin-streamer[@data-symbol=$symbol]/@value', symbol=symbol
    ).get()
    return data
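Before wiring the scraper into the API, we can sanity-check it on its own (a quick test, assuming the code above is saved as yahoo_finance.py, the module name we import from below):

import asyncio
from yahoo_finance import scrape_stock

# run the async scraper once and print the parsed fields
print(asyncio.run(scrape_stock("AAPL")))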
This scraper returns a dictionary containing the parsed data. Let's connect it to our API:
from fastapi import FastAPI
from yahoo_finance import scrape_stock

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    data = await scrape_stock(symbol)
    return data
Now when we call our API, it'll fetch the latest data:
http http://localhost:8000/stock/AAPL
HTTP/1.1 200 OK
Content-Length: 340
{
"52 Week Range": "142.00 - 182.94",
"Beta (5Y Monthly)": "1.25",
"Diluted EPS (ttm)": "6.05",
"Earnings Date": "Oct 27, 2022",
"Ex-Dividend Date": "Aug 05, 2022",
"Forward Dividend & Yield": "0.92 (0.59%)",
"Market Cap": "2.44T",
"Open": "156.76",
"PE Ratio (ttm)": "25.60",
"Previous Close": "153.72",
"Price": "155.33",
"Volume": "53,978,024"
}
Adding Caching
Scraping on every request is wasteful. Let's add caching so we only scrape a stock once every 5 minutes.
We'll use a simple dict to store the scraped data, keyed by stock symbol and stamped with the scrape time so entries can expire. A thin wrapper around scrape_stock handles the cache:
import time

STOCK_CACHE = {}

async def scrape_stock_cached(symbol):
    if symbol in STOCK_CACHE:
        return STOCK_CACHE[symbol]["data"]
    data = await scrape_stock(symbol)  # scrape on cache miss
    # store with a timestamp so stale entries can be expired later
    STOCK_CACHE[symbol] = {"data": data, "ts": time.time()}
    return data
Now repeated requests will return cached data instead of scraping every time.
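The endpoint should now call the cached wrapper instead of the scraper directly:

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    return await scrape_stock_cached(symbol)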
We can also periodically clear out expired cache entries:
import asyncio
import time

CACHE_MAX_AGE = 300  # seconds

async def clear_expired_cache():
    while True:
        curr_time = time.time()
        # collect expired keys first; deleting while iterating raises RuntimeError
        expired = [s for s, entry in STOCK_CACHE.items()
                   if curr_time - entry["ts"] > CACHE_MAX_AGE]
        for symbol in expired:
            del STOCK_CACHE[symbol]
        await asyncio.sleep(CACHE_MAX_AGE)  # run every 5 minutes

@app.on_event("startup")
async def schedule_cache_cleaner():
    asyncio.create_task(clear_expired_cache())
This ensures our cache won't grow unbounded.
Adding Webhooks
For long scraping jobs, we can use webhooks to return results asynchronously:
@app.get("/stock/{symbol}")
async def get_stock(symbol: str, webhook: str):
if webhook:
task = asyncio.create_task(fetch_stock(symbol, webhook))
return {"msg": "Fetching data"}
data = await scrape_stock(symbol)
return data
async def fetch_stock(symbol, url):
data = await scrape_stock(symbol)
async with httpx.AsyncClient() as client:
await client.post(url, json=data)
Now instead of waiting for scraping to complete, our API will immediately return a status and deliver the data asynchronously to the callback webhook.
We can test this using a tool like Webhook.site.
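For example, pointing the webhook parameter at the unique URL Webhook.site assigns you (the URL below is a placeholder):

http "http://localhost:8000/stock/AAPL?webhook=https://webhook.site/your-unique-url"

The API responds instantly with {"msg": "Fetching data"}, and the scraped JSON appears on the Webhook.site page moments later.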
Scaling the Scraping API
As traffic increases, here are some scaling techniques:
Add API caching – Use an external cache like Redis to reduce scraping load (see the sketch after this list).
Run multiple processes – Scale across cores/servers with gunicorn.
Offload scraping – Move scraping into background workers using a task queue like Celery, backed by a broker such as RabbitMQ.
Use scraping services – Leverage scraping APIs like Scrapfly or ScraperAPI.
Optimize scrapers – Ensure scrapers are efficient and avoid bans.
Add databases – Store scraped data in databases for further analysis.
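As a concrete example of the first technique, here's a minimal sketch of Redis-backed caching. It assumes a local Redis server and the redis package (pip install redis); the key prefix and TTL are illustrative:

import json
import redis.asyncio as redis

redis_client = redis.Redis(host="localhost", port=6379)

async def get_stock_cached(symbol: str):
    # try the shared cache first
    cached = await redis_client.get(f"stock:{symbol}")
    if cached:
        return json.loads(cached)
    data = await scrape_stock(symbol)
    # ex=300 tells Redis to expire the entry automatically after 5 minutes
    await redis_client.set(f"stock:{symbol}", json.dumps(data), ex=300)
    return data

Unlike the in-process dict, a Redis cache is shared across worker processes, so it also pairs naturally with multi-process serving, e.g. gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app.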
Conclusion
In this guide we built a web scraping API in Python using FastAPI. The key takeaways are:
FastAPI provides an excellent framework for scraping APIs.
We can write scrapers that fetch data on demand.
Caching and webhooks help overcome scraping limitations.
There are many optimization and scaling strategies as traffic grows.
Scraping APIs unlock the wealth of data on websites. By serving scraped data through an API instead of shipping static exports, we can deliver low-latency, customized data at scale.