
How to Web Scrape with HTTPX and Python

Web scraping is the process of extracting data from websites automatically. It's a common technique for gathering large amounts of data for analysis, machine learning, and more. Python has many great libraries that make web scraping easy, and one popular option is HTTPX.

HTTPX is a powerful, modern HTTP client for Python developed by the Encode team. It takes a lot of inspiration from Requests while adding new functionality such as HTTP/2 support and a first-class async API.

In this comprehensive guide, we'll explore how to effectively scrape websites in Python using HTTPX.

Getting Started with HTTPX

To get started, HTTPX can be installed via pip:

pip install httpx

Alternatively, Poetry can be used:

poetry add httpx

Once installed, HTTPX can be imported and used:

import httpx

response = httpx.get('https://example.com')
print(response.text)

This will make a GET request to example.com and print the HTML of the homepage.

HTTPX has a simple API and supports all popular HTTP verbs like GET, POST, PUT, DELETE, HEAD, OPTIONS, etc.

Some key features include:

  • HTTP/1.1 and HTTP/2 support
  • Async support via AsyncClient
  • Connection pooling and keepalive
  • Proxy support
  • Timeout configuration
  • Cookie persistence
  • Familiar requests-style API

Next let's look at some common usage patterns.

Making Requests with HTTPX

To make a GET request, the httpx.get() method can be used:

response = httpx.get('https://example.com')

Similarly, httpx.post(), httpx.put(), httpx.delete(), etc. can be used for other HTTP verbs.
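For example, here is a minimal sketch of a POST request sending form-encoded data, using httpbin.org as a convenient echo endpoint:

import httpx

# Form-encoded body via data=; httpbin echoes it back under the 'form' key
response = httpx.post('https://httpbin.org/post', data={'key': 'value'})
print(response.status_code)
print(response.json()['form'])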

Parameters like headers, cookies, timeout, etc. can be passed in as keyword arguments:

response = httpx.get(
  'https://httpbin.org/headers',
  headers={'User-Agent': 'MyBot 1.0'},
  timeout=10.0
)

The response contains properties like status_code, headers, text, json(), etc:

print(response.status_code)
print(response.headers)
print(response.text)
data = response.json()

You can also stream large responses incrementally using httpx.stream() or client.stream() (covered in the Streaming Responses section below).

Using an HTTPX Client

For most scraping tasks, it's recommended to use a persistent httpx.Client instance.

The client handles things like connection pooling, sessions, cookies, etc. across multiple requests.

import httpx

client = httpx.Client()

response = client.get('https://example.com')
print(response.text)

response = client.get('https://httpbin.org/cookies/set?name=foo')
print(response.text)

response = client.get('https://httpbin.org/cookies')
print(response.json())

Here we make multiple requests using the same client, which handles cookie persistence automatically.

You can also configure options like headers, proxies, auth, etc. when creating the client:

client = httpx.Client(
  headers={
    'User-Agent': 'MyBot 1.0',
    'Authorization': 'Bearer xxx'
  },
  proxies='http://192.168.1.1:8181/',
  auth=('username', 'password')
)

Now let's look at making requests asynchronously.

Asynchronous Requests with HTTPX

To make requests asynchronously in Python, HTTPX provides an AsyncClient:

import httpx
import asyncio

async def main():
  async with httpx.AsyncClient() as client:
    response = await client.get('https://example.com')

asyncio.run(main())

Inside a coroutine, we use async with to initialize the client, await the request, and automatically close the client afterwards.

To scrape multiple URLs concurrently, we can use asyncio.gather():

import httpx
import asyncio

async def get_url(url):
  async with httpx.AsyncClient() as client:
    response = await client.get(url)
    return response.text

urls = ['https://example.com', 'https://httpbin.org', 'https://python.org']

async def main():
  responses = await asyncio.gather(*[get_url(url) for url in urls])
  print(responses)

asyncio.run(main())

asyncio.gather() concurrently awaits multiple coroutines and returns results in the order of the awaitables.

There are also other options like asyncio.as_completed() to process them as they complete:

async def main():
  tasks = [get_url(url) for url in urls]
  for result in asyncio.as_completed(tasks):
    print(await result)

asyncio.run(main())

Async IO enables fetching multiple pages concurrently, which is useful for speeding up scraping.

Next let's look at scraping data from HTML and JSON responses.

Scraping HTML and JSON Responses

For HTML scraping, we can use a parser like Beautiful Soup to extract data:

from bs4 import BeautifulSoup

response = httpx.get('https://en.wikipedia.org/wiki/Python_(programming_language)')

soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.select('.toctext'):
  print(link.text.strip())

This prints the contents of the Wikipedia page's table of contents.

For JSON responses, HTTPX provides a built-in .json() method:

response = httpx.get('https://api.github.com/repos/encode/httpx')

data = response.json()
print(data['description'])

The json= parameter can also be used to serialize JSON data in requests.
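For instance, a quick sketch of sending a JSON body, again using httpbin.org as an echo endpoint:

import httpx

# json= serializes the dict and sets the Content-Type header to application/json
response = httpx.post('https://httpbin.org/post', json={'name': 'foo', 'tags': ['a', 'b']})
print(response.json()['json'])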

Together with a parser, we can build scrapers to extract data from APIs and websites.

Handling Issues and Errors

While scraping, issues like connection errors, timeouts, and rate limits often come up.

HTTPX provides exceptions and tools to handle them appropriately.

Timeouts

To handle slow responses, you can specify a custom timeout parameter. The default is 5 seconds.

response = httpx.get('https://example.com', timeout=10.0)

If the timeout is exceeded, an httpx.TimeoutException is raised:

try:
  response = httpx.get('https://example.com', timeout=0.001)
except httpx.TimeoutException:
  print('Request timed out')

Longer timeouts may be needed for certain sites or pages.
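Timeouts can also be configured more granularly with httpx.Timeout; the values below are illustrative:

import httpx

# 10s default, but allow only 5s to connect and up to 30s to read the body
timeout = httpx.Timeout(10.0, connect=5.0, read=30.0)

client = httpx.Client(timeout=timeout)
response = client.get('https://example.com')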

HTTP Errors

For HTTP errors like 400 or 500, the response.raise_for_status() method can be used:

try:
  response = httpx.get('https://httpbin.org/status/500')
  response.raise_for_status()
except httpx.HTTPStatusError:
  print('HTTP Error occurred')

This will raise an exception on any 4xx or 5xx status code.

Retrying Requests

To add retry logic, external packages like tenacity can be used:

from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def make_request():
  response = httpx.get('https://example.com')
  response.raise_for_status()
  return response.json()

data = make_request()

Here we retry up to 3 times on any exception. More advanced retry logic can also be defined.
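For example, here is one possible tenacity configuration; this is a sketch, and the backoff values and exception choices are illustrative. It retries only on transport or HTTP status errors, with exponential backoff between attempts:

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry up to 5 times, waiting roughly 1s, 2s, 4s... (capped at 30s) between attempts,
# and only for network errors or bad status codes
@retry(
  stop=stop_after_attempt(5),
  wait=wait_exponential(multiplier=1, max=30),
  retry=retry_if_exception_type((httpx.TransportError, httpx.HTTPStatusError))
)
def make_request():
  response = httpx.get('https://example.com')
  response.raise_for_status()
  return response.json()

data = make_request()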

Parallel Requests Limits

When making many requests in parallel, you may encounter connection limits.

The limits parameter can be used to configure options like max connections:

client = httpx.AsyncClient(
  limits=httpx.Limits(max_connections=20)
)

Tuning this parameter based on the target site can help avoid limits.
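Here is a sketch combining a connection pool cap with an asyncio.Semaphore to throttle how many requests are in flight at once; the specific numbers are arbitrary:

import asyncio
import httpx

# Cap the connection pool; keep a smaller number of idle keep-alive connections
limits = httpx.Limits(max_connections=20, max_keepalive_connections=10)

async def fetch(client, semaphore, url):
  # The semaphore limits how many coroutines hit the network simultaneously
  async with semaphore:
    response = await client.get(url)
    return response.status_code

async def main():
  semaphore = asyncio.Semaphore(5)
  urls = ['https://example.com'] * 10
  async with httpx.AsyncClient(limits=limits) as client:
    results = await asyncio.gather(*[fetch(client, semaphore, url) for url in urls])
    print(results)

asyncio.run(main())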

By handling these common issues, more resilient scrapers can be built.

Scraping Best Practices

Here are some tips for creating effective web scrapers with HTTPX:

  • Use an HTTPX Client – Clients provide connection pooling, cookie persistence and other benefits.

  • Scrape Politely – Limit request rates to avoid overwhelming servers. Use random delays and throttling (see the sketch after this list).

  • Handle Errors – Use try/except blocks, status checks, and retries to handle problems.

  • Use Async IO – Scrape pages concurrently to improve speed. But limit concurrency to avoid bans.

  • Randomize User Agents – Rotate random user agent strings to appear more human.

  • Use Proxies – Rotate different proxies/IPs to distribute requests.

  • Cache and Persist Data – Save scraped data to files/databases to avoid re-scraping.
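
As a concrete illustration of the polite-scraping point above, here is a minimal sketch that sleeps a random interval between requests; the URLs and delay range are placeholders:

import random
import time
import httpx

client = httpx.Client(headers={'User-Agent': 'MyBot 1.0'})

# Hypothetical list of pages to scrape
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
  response = client.get(url)
  print(url, response.status_code)
  # Wait 1-3 seconds before the next request to avoid hammering the server
  time.sleep(random.uniform(1, 3))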

By following best practices like these, more robust and maintainable scrapers can be built.

Advanced Scraping Techniques

Let's look at some more advanced scraping capabilities of HTTPX.

Scraping Authenticated Pages

To scrape authenticated pages, HTTPX supports auth schemes like Basic and Digest out of the box, while token-based schemes such as Bearer auth can be handled via headers or custom auth classes:

client = httpx.Client(
  auth=('username', 'password')
)

response = client.get('https://api.example.com/users/me')

Authentication credentials are persisted across requests automatically.

Handling Cookies

The cookies parameter can be used to send custom cookies:

client = httpx.Client(
  cookies={
    'sessionId': 'xxxx'
  }
)

response = client.get('https://example.com/dashboard')

And cookies set by the server are automatically persisted in the client.

Streaming Responses

For large responses, you can stream the body incrementally with client.stream(), iterating over chunks as they arrive:

with client.stream('GET', 'https://example.com/bigfile') as response:
  for chunk in response.iter_bytes():
    process(chunk)

This avoids having to load the entire response into memory.

Proxy Support

To route requests through a proxy server, the proxies parameter can be used:

proxy = 'http://192.168.0.1:8888'

client = httpx.Client(proxies={'http://': proxy, 'https://': proxy})

Rotating different proxies helps distribute requests from different IPs.
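For example, a simple rotation sketch that picks a random proxy from a pool for each client; the proxy addresses are placeholders:

import random
import httpx

# Hypothetical proxy pool - replace with real proxy endpoints
proxy_pool = [
  'http://192.168.0.1:8888',
  'http://192.168.0.2:8888',
  'http://192.168.0.3:8888',
]

# Route both HTTP and HTTPS traffic through a randomly chosen proxy
proxy = random.choice(proxy_pool)
client = httpx.Client(proxies={'http://': proxy, 'https://': proxy})
response = client.get('https://example.com')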

Custom Headers

You can spoof or randomize request headers like user agents:

client = httpx.Client(headers={
  'User-Agent': 'MyBot 1.0'
})

Setting realistic header values helps your requests blend in with normal browser traffic.
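To randomize user agents across requests, one simple approach is to pick from a small pool per client; the strings below are truncated placeholders, not complete browser user agents:

import random
import httpx

# Placeholder user-agent strings - in practice use complete, current browser UAs
user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

client = httpx.Client(headers={'User-Agent': random.choice(user_agents)})
response = client.get('https://example.com')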

Through these more advanced features, robust scrapers can be built using HTTPX and Python.

Example Scrapers

Let's now look at some example scrapers built with HTTPX.

Reddit API Scraper

Here's a basic scraper for the Reddit API:

import httpx

client = httpx.Client()

subreddit = 'python'
listing = 'hot'
limit = 10

response = client.get(f'https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}')

data = response.json()['data']

for post in data['children']:
  title = post['data']['title']
  score = post['data']['score']
  print(f"{title} (Score: {score})")

This fetches data on the hot posts from the Python subreddit. The API returns JSON, which we can parse directly.

We could extend this scraper to extract data from multiple subreddits, store results in a database, etc.
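For instance, here is a sketch of the multi-subreddit idea; the subreddit names are arbitrary, and a descriptive User-Agent header is assumed to keep Reddit happy:

import httpx

client = httpx.Client(headers={'User-Agent': 'MyBot 1.0'})

# Collect hot post titles from several subreddits
for subreddit in ['python', 'programming', 'learnpython']:
  response = client.get(f'https://www.reddit.com/r/{subreddit}/hot.json?limit=5')
  data = response.json()['data']
  for post in data['children']:
    print(f"[{subreddit}] {post['data']['title']}")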

News Article Scraper

Here is a simple news scraper that extracts articles from a site:

from bs4 import BeautifulSoup
import httpx

client = httpx.Client()

response = client.get("https://example.com/news")
soup = BeautifulSoup(response.text, 'html.parser')

for article in soup.select('.article'):
  title = article.select_one('.article-title').text
  content = article.select_one('.article-content').text

  print(title) 
  print(content)
  print() 

This finds all .article elements, extracts the title and content fields, and prints the articles.

Again this could be expanded to scrape additional fields, parse dates, store in a database, etc.

Search Results Scraper

And here is an example scraper for Google search results:

import httpx
from bs4 import BeautifulSoup

query = "httpx python"
url = "https://www.google.com/search"

client = httpx.Client()
response = client.get(url, params={"q": query})

soup = BeautifulSoup(response.text, 'html.parser')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']

  print(title)
  print(link)
  print()

This searches Google for a given query, parses the result links/titles, and prints them out.

Again many enhancements could be made like extracting search results counts, additional fields, scraping result pages, detecting captchas, etc.

These examples demonstrate common scraping patterns with HTTPX. The same techniques can be applied to build scrapers for many websites and APIs.

Summary

To summarize, HTTPX provides a powerful HTTP client for building Python web scrapers. Here are some key points:

  • HTTPX has a simple, requests-style API for making requests.

  • Async support allows making requests concurrently.

  • Robust error handling with timeouts, retries and status checks.

  • Scrape HTML pages with Beautiful Soup and JSON APIs easily.

  • Persistent clients provide connection pooling, sessions and cookie handling.

  • Advanced techniques like proxies, headers and auth enable sophisticated scrapers.

  • Follow best practices like using throttling, random delays and user-agents.

HTTPX makes it easy to start scraping with Python. With robust error handling and asynchronous concurrency, scalable scrapers can be developed.

Give HTTPX a try on your next Python web scraping project!
