Web scraping is a powerful technique to extract large amounts of data from websites. However, scraping data on demand can be challenging. In this comprehensive guide, we'll learn how to turn a Python web scraper into a real-time data API using FastAPI.
Why Build a Scraping API?
Here are some key benefits of delivering scraped data through an API:
- Real-time data – APIs can scrape data on demand rather than relying on stale, cached batches.
- Custom queries – APIs allow custom querying of data rather than just dumping entire websites.
- Scalability – APIs handle traffic spikes and scale to thousands of users.
- Reliability – APIs retry failed requests and implement robust scraping logic.
- Flexibility – APIs can implement various data delivery methods like webhooks.
- Abstraction – APIs hide complex scraping logic and infrastructure from API consumers.
Scraping API Architecture
At a high level, a web scraping API follows this architecture:
- API Receives Request
- Scraper Fetches Data
- Data Gets Cached
- API Returns Scraped Data
The key components are:
- API Server – Handles client requests and delivers scraped data.
- Scraper – Fetches data on demand from target sites.
- Cache – Stores scraped data to avoid redundant scraping.
- Database – Optionally stores scraped data for historical analysis.
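Putting these pieces together, the request flow can be sketched roughly like this. This is illustrative pseudocode only – handle_request, scrape_site, and CACHE are placeholder names we'll flesh out properly through the rest of the guide:

CACHE = {}

async def scrape_site(query):
    ...  # fetch and parse the target page (implemented later in this guide)

async def handle_request(query):
    # 1. serve from cache when we already have fresh data
    if query in CACHE:
        return CACHE[query]
    # 2. otherwise scrape the target site on demand
    data = await scrape_site(query)
    # 3. cache the result so repeat requests are cheap
    CACHE[query] = data
    # 4. return the scraped data to the API client
    return data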
Why Use FastAPI?
There are many excellent API frameworks out there. For scrapers, I recommend FastAPI because:
- FastAPI is very fast – perfect for data scraping APIs.
- It provides automatic docs, validation, serialization, etc.
- It supports asyncio for asynchronous scraping (see the example below).
- It is simple to learn and use, and flexible enough for small to large APIs.
- It pairs well with the Python scraping ecosystem – httpx, parsel, etc.
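To illustrate the asyncio point, here is a minimal sketch of fetching several pages concurrently with httpx. The fetch_all helper and the ticker list are purely illustrative and not part of the API we build below:

import asyncio
import httpx

async def fetch_all(urls):
    # fetch every URL concurrently instead of one at a time
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*[client.get(url) for url in urls])
    return [response.text for response in responses]

# example: grab three quote pages in parallel
pages = asyncio.run(fetch_all(
    [f"https://finance.yahoo.com/quote/{s}" for s in ["AAPL", "MSFT", "GOOG"]]
))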
Scraping API Setup
We'll use the following core libraries:
- FastAPI – Our API framework
- uvicorn – ASGI server for running the API
- httpx – Asynchronous HTTP client
- parsel – HTML/XML parser
- loguru – Logging utility
Install the packages:
pip install fastapi uvicorn httpx parsel loguru
For this guide, we'll scrape some basic stock data from Yahoo Finance.
Create the API
Let's set up the initial FastAPI app with a single endpoint:
from fastapi import FastAPI

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    return {"stock": symbol}
This basic API just returns the stock symbol we provide.
Let's start the server and test it:
uvicorn main:app --reload
import httpx

print(httpx.get("http://localhost:8000/stock/AAPL").json())
# {'stock': 'AAPL'}
Our API is ready to receive requests. Next, let's make it scrape some data.
Scraping Stock Data
To get a stock's data we'll:
- Build the Yahoo Finance URL from symbol
- Fetch the HTML page
- Parse values using XPath
import httpx
from parsel import Selector  # for XPath parsing

async def scrape_stock(symbol: str):
    url = f"https://finance.yahoo.com/quote/{symbol}"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    sel = Selector(response.text)

    # parse summary table values
    values = sel.xpath('//div[contains(@data-test,"summary-table")]//tr')
    data = {}
    for value in values:
        label = value.xpath("./td[1]/text()").get()
        val = value.xpath("./td[2]/text()").get()
        data[label] = val

    # parse current price
    data["price"] = sel.xpath('//fin-streamer[@data-symbol=$symbol]/@value', symbol=symbol).get()
    return data
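To quickly sanity-check the scraper on its own, we can run it directly with asyncio. This assumes the scraper lives in yahoo_finance.py, the module we import from in the next step:

import asyncio
from yahoo_finance import scrape_stock

# quick standalone check of the scraper
print(asyncio.run(scrape_stock("AAPL")))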
This scraper returns a dictionary containing the parsed data. Let's connect it to our API:
from fastapi import FastAPI
from yahoo_finance import scrape_stock

app = FastAPI()

@app.get("/stock/{symbol}")
async def get_stock(symbol: str):
    data = await scrape_stock(symbol)
    return data
Now when we call our API, it'll fetch the latest data:
http http://localhost:8000/stock/AAPL
HTTP/1.1 200 OK
Content-Length: 340
{
    "52 Week Range": "142.00 - 182.94",
    "Beta (5Y Monthly)": "1.25",
    "Diluted EPS (ttm)": "6.05",
    "Earnings Date": "Oct 27, 2022",
    "Ex-Dividend Date": "Aug 05, 2022",
    "Forward Dividend & Yield": "0.92 (0.59%)",
    "Market Cap": "2.44T",
    "Open": "156.76",
    "PE Ratio (ttm)": "25.60",
    "Previous Close": "153.72",
    "Volume": "53,978,024",
    "price": "155.33"
}
Adding Caching
Scraping on every request is wasteful. Let's add caching so we only scrape a stock once every 5 minutes.
We'll use a simple dict to store the scraped data keyed by stock symbol:
import time

STOCK_CACHE = {}

async def scrape_stock(symbol: str):
    # serve from cache if we already scraped this symbol
    cached = STOCK_CACHE.get(symbol)
    if cached:
        return cached["data"]
    data = ...  # scrape as before
    # store the result along with when it was scraped
    STOCK_CACHE[symbol] = {"data": data, "ts": time.time()}
    return data
Now repeated requests will return cached data instead of scraping every time.
We can also periodically clear out expired cache entries:
import asyncio
import time

CACHE_MAX_AGE = 300  # seconds

async def clear_expired_cache():
    # run forever, cleaning up every 5 minutes
    while True:
        curr_time = time.time()
        # iterate over a copy of the keys so we can delete while looping
        for symbol in list(STOCK_CACHE):
            if curr_time - STOCK_CACHE[symbol]["ts"] > CACHE_MAX_AGE:
                del STOCK_CACHE[symbol]
        await asyncio.sleep(CACHE_MAX_AGE)

@app.on_event("startup")
async def start_cache_cleaner():
    # launch the cleanup loop in the background when the API starts
    asyncio.create_task(clear_expired_cache())
This ensures our cache won't grow unbounded.
Adding Webhooks
For long scraping jobs, we can use webhooks to return results asynchronously:
@app.get("/stock/{symbol}")
async def get_stock(symbol: str, webhook: str = None):
    # if a webhook URL is provided, scrape in the background and POST the result to it
    if webhook:
        asyncio.create_task(fetch_stock(symbol, webhook))
        return {"msg": "Fetching data"}
    data = await scrape_stock(symbol)
    return data
async def fetch_stock(symbol, url):
    data = await scrape_stock(symbol)
    async with httpx.AsyncClient() as client:
        await client.post(url, json=data)
Now instead of waiting for scraping to complete, our API will immediately return a status and deliver the data asynchronously to the callback webhook.
We can test this using a tool like Webhook.site.
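For example, with a disposable URL from Webhook.site (the URL below is a placeholder – substitute your own):

import httpx

# the webhook URL is a placeholder – replace it with your own Webhook.site URL
httpx.get(
    "http://localhost:8000/stock/AAPL",
    params={"webhook": "https://webhook.site/your-unique-id"},
)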
Scaling the Scraping API
As traffic increases, here are some scaling techniques:
- Add API caching – Use an external cache like Redis to reduce scraping load (see the sketch after this list).
- Run multiple processes – Scale across cores/servers with gunicorn.
- Offload scraping – Move scraping jobs to background workers with Celery, backed by a broker like RabbitMQ.
- Use scraping services – Leverage scraping APIs like Scrapfly or ScraperAPI.
- Optimize scrapers – Ensure scrapers are efficient and avoid bans.
- Add databases – Store scraped data in databases for further analysis.
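Here is a minimal sketch of the first point, swapping the in-memory dict for Redis. It assumes a local Redis server and the redis package (pip install redis), and reuses the scrape_stock scraper from earlier:

import json
import redis.asyncio as redis
from yahoo_finance import scrape_stock

redis_client = redis.from_url("redis://localhost:6379")
CACHE_TTL = 300  # seconds

async def get_stock_cached(symbol: str):
    # check the shared cache first
    cached = await redis_client.get(f"stock:{symbol}")
    if cached:
        return json.loads(cached)
    # cache miss – scrape and store the result with a 5 minute expiry
    data = await scrape_stock(symbol)
    await redis_client.set(f"stock:{symbol}", json.dumps(data), ex=CACHE_TTL)
    return data

Because Redis lives outside the API process, every worker started by gunicorn shares the same cache.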
Conclusion
In this guide we built a web scraping API in Python using FastAPI. The key takeaways are:
- FastAPI provides an excellent framework for scraping APIs.
- We can write scrapers that fetch data on demand.
- Caching and webhooks help overcome scraping limitations.
- There are many optimization and scaling strategies as traffic grows.
Scraping APIs unlock the wealth of data on websites. By serving data through an API instead of static scrape dumps, we can deliver low-latency, customized data at scale.