Web scraping is the process of extracting data from websites automatically. It's a common technique used to gather large amounts of data for analysis, machine learning, and more. In Python, there are many great libraries that make web scraping easy. One popular option is HTTPX.
HTTPX is a powerful and modern HTTP client for Python. It is developed by Encode (the team behind Starlette and Uvicorn), takes a lot of inspiration from Requests, and adds new functionality like HTTP/2 and async support.
In this comprehensive guide, we'll explore how to effectively scrape websites in Python using HTTPX.
Getting Started with HTTPX
To get started, HTTPX can be installed via pip:
pip install httpx
Alternatively, Poetry can be used:
poetry add httpx
Once installed, HTTPX can be imported and used:
import httpx
response = httpx.get('https://example.com')
print(response.text)
This will make a GET request to example.com and print the HTML of the homepage.
HTTPX has a simple API and supports all popular HTTP verbs like GET, POST, PUT, DELETE, HEAD, OPTIONS, etc.
Some key features include:
- HTTP/1.1 and HTTP/2 support
- Async support via AsyncClient
- Connection pooling and keepalive
- Proxy support
- Timeout configuration
- Cookie persistence
- Familiar requests-style API
Next, let's look at some common usage patterns.
Making Requests with HTTPX
To make a GET request, the httpx.get() method can be used:
response = httpx.get('https://example.com')
Similarly, httpx.post(), httpx.put(), httpx.delete(), and so on can be used for the other HTTP verbs.
Parameters like headers, cookies, timeout, etc. can be passed in as keyword arguments:
response = httpx.get(
    'https://httpbin.org/headers',
    headers={'User-Agent': 'MyBot 1.0'},
    timeout=10.0
)
The response exposes properties and methods like status_code, headers, text, and json():
print(response.status_code)
print(response.headers)
print(response.text)
data = response.json()
You can also stream large responses incrementally rather than loading them into memory all at once; this is covered in the streaming section below.
Using an HTTPX Client
For most scraping tasks, it's recommended to use a persistent httpx.Client instance.
The client handles things like connection pooling, sessions, cookies, etc. across multiple requests.
import httpx
client = httpx.Client()
response = client.get('https://example.com')
print(response.text)

response = client.get('https://httpbin.org/cookies/set?name=foo')
print(response.text)

response = client.get('https://httpbin.org/cookies')
print(response.json())
Here we make multiple requests using the same client, which handles cookie persistence automatically.
You can also configure options like headers, proxies, auth, etc. when creating the client:
client = httpx.Client(
    headers={
        'User-Agent': 'MyBot 1.0',
        'Authorization': 'Bearer xxx'
    },
    proxies='http://192.168.1.1:8181/',
    auth=('username', 'password')
)
Now let's look at making requests asynchronously.
Asynchronous Requests with HTTPX
To make requests asynchronously, HTTPX provides an AsyncClient:
import httpx
async with httpx.AsyncClient() as client:
    response = await client.get('https://example.com')
We use async with to initialize the client, await the request, and automatically close the client afterwards.
To scrape multiple URLs concurrently, we can use asyncio.gather():
import httpx
import asyncio

async def get_url(url):
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

urls = ['https://example.com', 'https://httpbin.org', 'https://python.org']

async def main():
    responses = await asyncio.gather(*[get_url(url) for url in urls])
    print(responses)

asyncio.run(main())
asyncio.gather() concurrently awaits multiple coroutines and returns the results in the same order as the input awaitables. There are also other options like asyncio.as_completed() to process results as they complete:
async def main():
    tasks = [get_url(url) for url in urls]
    for result in asyncio.as_completed(tasks):
        print(await result)

asyncio.run(main())
Async IO enables fetching many pages concurrently, which is useful for speeding up scraping. A common refinement is to cap how many requests run at once, as sketched below.
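As a rough sketch (the URLs here are placeholders purely for illustration), bounded concurrency can be implemented with a single shared AsyncClient and an asyncio.Semaphore:

import asyncio
import httpx

# Placeholder URLs used only for illustration
urls = [f'https://httpbin.org/get?page={i}' for i in range(20)]

async def fetch(client, semaphore, url):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        response = await client.get(url)
        return response.status_code

async def main():
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*[fetch(client, semaphore, url) for url in urls])
    print(results)

asyncio.run(main())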
Next, let's look at scraping data from HTML and JSON responses.
Scraping HTML and JSON Responses
For HTML scraping, we can use a parser like Beautiful Soup to extract data:
from bs4 import BeautifulSoup
response = httpx.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.select('.toctext'):
    print(link.text.strip())
This prints the contents of the Wikipedia page's table of contents.
For JSON responses, HTTPX provides a built-in .json() method:
response = httpx.get('https://api.github.com/repos/encode/httpx')
data = response.json()
print(data['description'])
The json= parameter can also be used to send JSON data in a request body.
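For example, posting a JSON body is straightforward (the payload here is just an illustration; httpbin echoes it back under the json key):

payload = {'query': 'httpx', 'page': 1}  # illustrative payload
response = httpx.post('https://httpbin.org/post', json=payload)
print(response.json()['json'])  # httpbin echoes the JSON body back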
Together with a parser, we can build scrapers to extract data from APIs and websites.
Handling Issues and Errors
While scraping, issues like connection errors, timeouts, and rate limits often come up.
HTTPX provides exceptions and tools to handle them appropriately.
Timeouts
To handle slow responses, you can specify a custom timeout parameter. The default is 5 seconds.
response = httpx.get('https://example.com', timeout=10.0)
If this is exceeded, an httpx.TimeoutException is raised:
try:
    response = httpx.get('https://example.com', timeout=0.001)
except httpx.TimeoutException:
    print('Request timed out')
Longer timeouts may be needed for certain sites or pages.
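If you need finer control, httpx.Timeout lets you set separate connect, read, write and pool timeouts. A minimal sketch (the values are just examples):

# 10s default, but fail fast on hanging connections while allowing slow bodies
timeout = httpx.Timeout(10.0, connect=5.0, read=30.0)
client = httpx.Client(timeout=timeout)
response = client.get('https://example.com')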
HTTP Errors
For HTTP errors like 400 or 500, the response.raise_for_status() method can be used:
try:
    response = httpx.get('https://httpbin.org/status/500')
    response.raise_for_status()
except httpx.HTTPStatusError:
    print('HTTP error occurred')
This will raise an exception on any 4xx or 5xx status code.
Retrying Requests
To add retry logic, external packages like tenacity can be used:
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def make_request():
    response = httpx.get('https://example.com')
    response.raise_for_status()
    return response.json()

data = make_request()
Here we retry up to 3 times on any exception. More advanced retry logic can also be defined.
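For instance, here is a sketch of a more selective policy using tenacity's exponential backoff, retrying only on timeouts and HTTP errors (the attempt count and wait times are just examples):

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import httpx

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=30),  # back off 1s, 2s, 4s... capped at 30s
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.HTTPStatusError)),
)
def make_request(url):
    response = httpx.get(url)
    response.raise_for_status()
    return response.json()

data = make_request('https://httpbin.org/json')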
Parallel Requests Limits
When making many requests in parallel, you may encounter connection limits.
The limits parameter can be used to configure options like the maximum number of connections:
client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=20)
)
Tuning these limits to the target site helps avoid exhausting connections or overwhelming the server.
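As a sketch, httpx.Limits also accepts keep-alive settings, which control how many idle connections are kept open for reuse (the numbers here are just examples):

limits = httpx.Limits(
    max_connections=20,            # total simultaneous connections
    max_keepalive_connections=10,  # idle connections kept open for reuse
    keepalive_expiry=30.0          # seconds an idle connection stays open
)
client = httpx.AsyncClient(limits=limits)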
By handling these common issues, more resilient scrapers can be built.
Scraping Best Practices
Here are some tips for creating effective web scrapers with HTTPX:
- Use an HTTPX Client – Clients provide connection pooling, cookie persistence and other benefits.
- Scrape Politely – Limit request rates to avoid overwhelming servers. Use random delays and throttling (see the sketch after this list).
- Handle Errors – Use try/except blocks, status checks, and retries to handle problems.
- Use Async IO – Scrape pages concurrently to improve speed, but limit concurrency to avoid bans.
- Randomize User Agents – Rotate random user agent strings to appear more human.
- Use Proxies – Rotate different proxies/IPs to distribute requests.
- Cache and Persist Data – Save scraped data to files/databases to avoid re-scraping.
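As an example of polite scraping, here is a minimal sketch that adds a random delay between sequential requests (the URLs and delay range are placeholders):

import random
import time
import httpx

client = httpx.Client(headers={'User-Agent': 'MyBot 1.0'})

# Placeholder URLs purely for illustration
urls = [f'https://example.com/page/{i}' for i in range(1, 6)]

for url in urls:
    response = client.get(url)
    print(url, response.status_code)
    time.sleep(random.uniform(1.0, 3.0))  # random delay between requests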
By following best practices like these, more robust and maintainable scrapers can be built.
Advanced Scraping Techniques
Let's look at some more advanced scraping capabilities of HTTPX.
Scraping Authentication Pages
To scrape authenticated pages, HTTPX supports multiple auth types like Basic, Digest and Bearer auth:
client = httpx.Client(
    auth=('username', 'password')
)

response = client.get('https://api.example.com/users/me')
Authentication credentials are persisted across requests automatically.
Handling Cookies
The cookies parameter can be used to send custom cookies:
client = httpx.Client(
    cookies={
        'sessionId': 'xxxx'
    }
)

response = client.get('https://example.com/dashboard')
And cookies set by the server are automatically persisted in the client.
Streaming Responses
For large responses, you can stream the body incrementally with client.stream() instead of loading it all at once:
with client.stream('GET', 'https://example.com/bigfile') as response:
    for chunk in response.iter_bytes():
        process(chunk)
This avoids having to load the entire response into memory.
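For example, a large download can be written to disk chunk by chunk (the URL and filename here are placeholders):

with client.stream('GET', 'https://example.com/bigfile.zip') as response:
    with open('bigfile.zip', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)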
Proxy Support
To route requests through a proxy server, the proxies parameter can be used:
proxy = 'http://192.168.0.1:8888'
client = httpx.Client(proxies={'http://': proxy, 'https://': proxy})
Rotating different proxies helps distribute requests from different IPs.
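As a rough sketch, a small proxy pool can be cycled through by creating a client per proxy (the proxy URLs below are placeholders; note that newer HTTPX versions use the proxy= parameter instead of proxies=):

import itertools
import httpx

# Placeholder proxy pool; substitute your own proxy URLs
proxy_pool = itertools.cycle([
    'http://192.168.0.1:8888',
    'http://192.168.0.2:8888',
    'http://192.168.0.3:8888',
])

for _ in range(3):
    proxy = next(proxy_pool)
    with httpx.Client(proxies=proxy) as client:  # proxy= on newer HTTPX versions
        print(client.get('https://httpbin.org/ip').json())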
Custom Headers
You can spoof or randomize request headers like user agents:
client = httpx.Client(headers={
    'User-Agent': 'MyBot 1.0'
})
Setting realistic, browser-like headers makes requests look more like normal visitor traffic.
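A simple sketch of rotating user agents on a per-request basis (the user-agent strings below are just examples):

import random
import httpx

# Example desktop user-agent strings (illustrative only)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

client = httpx.Client()
response = client.get(
    'https://httpbin.org/headers',
    headers={'User-Agent': random.choice(user_agents)},  # per-request override
)
print(response.json()['headers']['User-Agent'])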
Through these more advanced features, robust scrapers can be built using HTTPX and Python.
Example Scrapers
Let's now look at some example scrapers built with HTTPX.
Reddit API Scraper
Here's a basic scraper for the Reddit API:
import httpx
client = httpx.Client()

subreddit = 'python'
listing = 'hot'
limit = 10

response = client.get(f'https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}')
data = response.json()['data']

for post in data['children']:
    title = post['data']['title']
    score = post['data']['score']
    print(f"{title} (Score: {score})")
This fetches the current hot posts from the Python subreddit. The API returns JSON which we can parse.
We could extend this scraper to extract data from multiple subreddits, store results in a database, etc.
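For example, here is a sketch of looping over several subreddits with the same client (the subreddit names are arbitrary, and a custom User-Agent is set because Reddit tends to throttle the default one):

subreddits = ['python', 'learnpython', 'webscraping']  # arbitrary examples

for name in subreddits:
    response = client.get(
        f'https://www.reddit.com/r/{name}/hot.json',
        params={'limit': 5},
        headers={'User-Agent': 'MyBot 1.0'},  # Reddit tends to throttle default user agents
    )
    for post in response.json()['data']['children']:
        print(name, '-', post['data']['title'])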
News Article Scraper
Here is a simple news scraper that extracts articles from a site:
from bs4 import BeautifulSoup
import httpx
client = httpx.Client()
response = client.get("https://example.com/news")
soup = BeautifulSoup(response.text, 'html.parser')

for article in soup.select('.article'):
    title = article.select_one('.article-title').text
    content = article.select_one('.article-content').text
    print(title)
    print(content)
    print()
This finds all .article elements, extracts the title and content of each, and prints the articles.
Again this could be expanded to scrape additional fields, parse dates, store in a database, etc.
Search Results Scraper
And here is an example scraper for Google search results:
import httpx
from bs4 import BeautifulSoup

query = "httpx python"

client = httpx.Client()
response = client.get("https://www.google.com/search", params={"q": query})

soup = BeautifulSoup(response.text, 'html.parser')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title)
    print(link)
    print()
This searches Google for a given query, parses the result links/titles, and prints them out.
Again many enhancements could be made like extracting search results counts, additional fields, scraping result pages, detecting captchas, etc.
These examples demonstrate common scraping patterns with HTTPX. The same techniques can be applied to build scrapers for many websites and APIs.
Summary
To summarize, HTTPX provides a powerful HTTP client for building Python web scrapers. Here are some key points:
- HTTPX has a simple, requests-style API for making requests.
- Async support allows making requests concurrently.
- Robust error handling with timeouts, retries and status checks.
- HTML pages can be scraped with Beautiful Soup, and JSON APIs parsed easily.
- Persistent clients provide connection pooling, sessions and cookie handling.
- Advanced techniques like proxies, headers and auth enable sophisticated scrapers.
- Follow best practices like throttling, random delays and rotating user agents.
HTTPX makes it easy to start scraping with Python. With robust error handling and asynchronous concurrency, scalable scrapers can be developed.
Give HTTPX a try on your next Python web scraping project!