If you're looking to take your web scraping projects to the next level in 2024, using proxies with Python Requests is an essential skill to master. Proxies allow you to mask your IP address and make requests from multiple locations, which can help you avoid IP bans and rate limits when scraping data from websites.
In this in-depth guide, we'll walk you through everything you need to know to start using proxies with Python Requests effectively. Whether you're a beginner or an experienced developer, you'll come away with actionable tips and full code examples you can implement in your own projects. Let's dive in!
What Are Proxies and Why Use Them for Web Scraping?
First, let's clarify what proxies are and why they are so useful for web scraping. A proxy server acts as a middleman between your computer and the internet. When you use a proxy, your requests are routed through the proxy server first before reaching the destination website.
The key benefits of using proxies for web scraping include:
- IP masking: Proxies hide your real IP address from the websites you scrape, making it harder for them to identify and block your scraper (see the quick check sketched after this list).
- IP rotation: By using a pool of multiple proxy IP addresses and rotating them for each request, you can distribute your requests and prevent rate limiting. IP rotation is essential for large-scale scraping.
- Geotargeting: Proxies allow you to send requests from IP addresses in different locations around the world. This is useful if you need to scrape location-specific data or test how a website behaves for users in different regions.
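To see IP masking in action, you can compare the IP address a site reports with and without a proxy. The minimal sketch below uses the public httpbin.org/ip echo endpoint; the proxy address is a placeholder you would swap for one of your own.
import requests

proxies = {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:3128"}  # placeholder proxy

print(requests.get("https://httpbin.org/ip").json())                   # shows your real IP
print(requests.get("https://httpbin.org/ip", proxies=proxies).json())  # shows the proxy's IP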
While you can find free proxy lists, the reality is that free proxies are often slow, unreliable, and even dangerous. Free proxies can inject ads into pages, steal your data, or stop working unexpectedly.
For these reasons, I recommend investing in a reputable paid proxy service if you're serious about web scraping. The top providers I recommend based on recent tests are Bright Data, IPRoyal, Proxy-Seller, SOAX, Smartproxy, Proxy-Cheap, and HydraProxy.
These providers offer reliable, fast proxies and helpful customer support to get you unstuck quickly. In my experience, residential proxies sourced from real devices tend to be the least likely to get blocked when scraping.
Step-by-Step Guide: How to Use Proxies with Python Requests
Now that you understand the importance of proxies for web scraping, let's walk through how to actually use them with the Python Requests library. We'll cover setting up proxies, using a session to reuse proxy settings, setting environment variables, and full code examples.
1. Install the Requests library
First, make sure you have the Requests library installed. You can install it using pip:
pip install requests
2. Define your proxy settings
Next, define your proxy settings in a Python dictionary. You'll need to specify the protocol (http or https), IP address, port, and authentication details if required.
Here's an example of how to define proxy settings for HTTP and HTTPS:
proxies = {
    "http": "http://user:pass@10.10.1.10:3128",
    "https": "http://user:pass@10.10.1.10:1080",
}
If your proxy doesn't require authentication, you can leave out the username and password:
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
3. Make a request using the proxies
To make a request using your proxies, simply pass the proxies dictionary to the proxies parameter of requests.get(), requests.post(), or any other request method.
import requests
response = requests.get("http://example.com", proxies=proxies)
print(response.text)
This will route your request through the specified proxy server. You can inspect the response to make sure it succeeded.
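You can also handle failures explicitly: an unreachable or misconfigured proxy typically raises requests.exceptions.ProxyError or a timeout, and a 4xx/5xx response can be surfaced with raise_for_status(). A hedged sketch, reusing the proxies dictionary defined above:
import requests

try:
    response = requests.get("http://example.com", proxies=proxies, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print(response.status_code, len(response.text))
except requests.exceptions.ProxyError:
    print("Could not connect to the proxy")
except requests.exceptions.Timeout:
    print("The proxied request timed out")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")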
4. Reuse proxy settings with a Session
If you need to make multiple requests using the same proxy settings, it's more efficient to create a Session object. A Session allows you to reuse the same TCP connection for multiple requests, which can significantly speed things up.
Here's how to create a Session and use it with proxies:
import requests
session = requests.Session()
session.proxies = {
    "http": "http://user:pass@10.10.1.10:3128",
    "https": "http://user:pass@10.10.1.10:1080",
}
response = session.get("http://example.com")
print(response.text)
The session will now reuse the same proxy settings for all requests made through it. This is handy if you need to sign in to a website or maintain state between requests.
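If a single call needs a different proxy, you can still pass a proxies argument to that call; request-level settings take precedence over the session's for that request. A quick sketch with placeholder addresses:
# Uses the proxies configured on the session
response = session.get("http://example.com")

# Overrides the session proxies for this one request only
override = {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:3128"}
response = session.get("http://example.com", proxies=override)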
5. Set proxy environment variables
For greater flexibility, you can store your proxy settings in environment variables. This allows you to easily switch between different proxy configurations without changing your code.
Here's how to set proxy environment variables on Linux/macOS:
export HTTP_PROXY="http://user:pass@10.10.1.10:3128"
export HTTPS_PROXY="http://user:pass@10.10.1.10:1080"
And here's how to set them on Windows:
set HTTP_PROXY=http://user:pass@10.10.1.10:3128
set HTTPS_PROXY=http://user:pass@10.10.1.10:1080
Then in your Python script, you can read the proxy settings from the environment variables:
import os
import requests
proxies = {
    "http": os.environ.get("HTTP_PROXY"),
    "https": os.environ.get("HTTPS_PROXY"),
}
response = requests.get("http://example.com", proxies=proxies)
print(response.text)
This makes it easy to switch between different proxy configurations for different scraping tasks, without having to modify your scripts.
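Note that Requests also reads HTTP_PROXY and HTTPS_PROXY on its own (its trust_env behavior is enabled by default), so the explicit dictionary above mainly makes the configuration visible in your code. A short sketch of the implicit behavior and how to switch it off:
import requests

# With HTTP_PROXY / HTTPS_PROXY set in the environment, this request is proxied automatically
response = requests.get("http://example.com")

# To ignore environment proxy variables entirely, disable trust_env on a Session
session = requests.Session()
session.trust_env = False
response = session.get("http://example.com")  # sent directly, without a proxy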
Rotate Proxies to Distribute Requests
To really leverage the power of proxies for web scraping, you'll want to spread your requests over multiple IP addresses. This is known as IP rotation, and it helps prevent your scrapers from getting rate limited or blocked by websites.
The basic process is:
- Create a pool of available proxy IP addresses to use
- Choose a random proxy from the pool for each new request
- Rotate through the proxy pool, distributing requests evenly
Here's a full code example that rotates through a pool of proxies:
import requests
from random import choice

# Placeholder proxy URLs -- replace these with your own provider's addresses
proxy_pool = [
    "http://user:pass@10.10.1.10:80",
    "http://user:pass@10.10.1.11:80",
    "http://user:pass@10.10.1.12:80",
    "http://user:pass@10.10.1.13:80",
    "http://user:pass@10.10.1.14:80",
]

for i in range(10):
    # Pick a random proxy from the pool for each request
    proxy = {
        "http": choice(proxy_pool),
        "https": choice(proxy_pool),
    }
    try:
        response = requests.get("http://example.com", proxies=proxy, timeout=5)
        print(f"Proxy {proxy['http']} succeeded: {response.status_code}")
    except requests.exceptions.RequestException:
        print(f"Proxy {proxy['http']} failed")
This script will make 10 requests to http://example.com, randomly choosing a different proxy from the pool for each one. I've wrapped the request in a try/except block (catching requests.exceptions.RequestException) to handle failures gracefully if a particular proxy stops working.
With a large enough pool of proxies, IP rotation makes it very unlikely for your scrapers to get blocked, even if you're scraping a lot of pages from a single website. Just make sure to space out your requests a bit to avoid putting too much strain on any one proxy.
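One simple way to space requests out is to sleep for a short, randomized interval inside the loop; the 1-3 second range below is an arbitrary example you would tune for the target site and the size of your proxy pool.
import time
from random import uniform

# Add at the end of each loop iteration
time.sleep(uniform(1, 3))  # pause 1-3 seconds before the next request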
Wrapping Up
You should now have a solid understanding of how to use proxies with Python Requests to anonymize and scale your web scraping projects. To recap, the key things to remember are:
- Proxies mask your IP address and allow you to distribute requests from multiple locations. This helps prevent IP bans and rate limiting.
- Investing in a reliable paid proxy service is worth the money for stability and performance. I recommend Bright Data, IPRoyal, Proxy-Seller, SOAX, Smartproxy, Proxy-Cheap, or HydraProxy.
- You can specify proxy settings per request, on a Session, or via environment variables for flexibility. Follow the code examples in this guide to get started.
- Use a proxy pool and IP rotation to distribute your requests over many IP addresses. This is key for large-scale scraping.
For more tips, check out my other guides on the best proxies for web scraping and evaluating free proxy services. You can find all the code examples from this tutorial on my GitHub page.
If you have any other questions, feel free to reach out on Twitter @yourusername or in the comments below. Happy scraping!