Introduction
If you're doing any kind of web scraping or automated interaction with websites, chances are you'll need to use proxies at some point. A proxy acts as an intermediary between your computer and the internet, making requests on your behalf. There are several key benefits to routing your requests through a proxy server:
- Anonymity – The target website will see the proxy's IP address instead of yours, helping keep your identity private.
- Security – Proxies provide an additional layer of protection between your machine and the internet.
- Bypassing restrictions – If a website has blocked your IP address, you can use proxies from a different location to regain access. Proxies are also useful for circumventing regional content blocks and censorship.
In this guide, we'll walk through how to use proxies with the popular Python requests library, including how to rotate through multiple proxy IP addresses to avoid detection and bans while scraping. Let's get started!
Prerequisites
Before we dive in, make sure you have the following:
- Python 3 installed on your local machine
- The requests library installed
You can check if requests is already installed by opening a terminal and running:
pip freeze
Look through the list of packages to see if requests is there. If not, you can install it by running:
pip install requests
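Alternatively, if you'd rather not scan the full package list, you can print the installed version directly (this will raise an ImportError if requests is missing):
python -c "import requests; print(requests.__version__)"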
Using a Proxy with Python Requests
Now let's see how to actually use a proxy when making HTTP requests with Python. First, make sure to import the requests library at the top of your script:
import requests
Next, we need to define the proxy servers we want to route our requests through. Create a proxies dictionary that maps protocols to proxy URLs like this:
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080'
}
Here we're specifying the proxy servers to use for HTTP and HTTPS connections, along with the credentials needed to authenticate. The URL format is:
protocol://user:password@host:port
If your proxy doesn't require authentication, you can omit the user:pass portion.
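For example, an entry for an unauthenticated proxy might look like this (placeholder host and port):
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}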
To use these proxies, simply pass the proxies argument when making a request:
response = requests.get('http://example.com', proxies=proxies)
This uses the proxies defined in the proxies dict based on the protocol of the target URL. All the standard request methods are supported:
response = requests.get(url, proxies=proxies)
response = requests.post(url, data=payload, proxies=proxies)
response = requests.put(url, data=payload, proxies=proxies)
response = requests.patch(url, data=payload, proxies=proxies)
response = requests.delete(url, proxies=proxies)
If you find yourself making many requests through the same proxies, you can avoid repetition by using a Session object. Sessions allow you to persist certain parameters across requests, like cookies and proxies:
session = requests.Session()
session.proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080'
}

response = session.get('http://example.com')
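The session-level proxies then apply to every request made through that session. If a single request needs to go out through a different proxy, a proxies argument passed to the individual call takes precedence over session.proxies. A minimal sketch, using a hypothetical second proxy host:
# Per-request proxies override the session-level setting for this call only
other_proxy = {
    'http': 'http://user:pass@backup-proxy.example.com:8080',
    'https': 'http://user:pass@backup-proxy.example.com:8080'
}
response = session.get('http://example.com', proxies=other_proxy)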
For convenience, you can also set your proxy URLs as environment variables:
import os

os.environ['HTTP_PROXY'] = 'http://user:pass@proxy.example.com:8080'
os.environ['HTTPS_PROXY'] = 'http://user:pass@proxy.example.com:8080'
Then you can omit the proxies argument when making requests, and the environment proxies will be applied automatically.
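For example, with those variables set, a plain request is routed through the proxy with no extra arguments (requests honors HTTP_PROXY/HTTPS_PROXY by default; if you ever need a Session to ignore them, you can set its trust_env attribute to False):
# No proxies argument needed; requests picks up HTTP_PROXY/HTTPS_PROXY automatically
response = requests.get('http://example.com')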
Finally, to access the response data from your proxied request:
response.text # response body as string
response.content # response body as bytes
response.json() # parse response body as JSON
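It's also worth checking the status code before parsing the body, since a misbehaving proxy often shows up as an HTTP error rather than an exception:
response.status_code # numeric HTTP status code, e.g. 200
response.raise_for_status() # raise an HTTPError if the status is 4xx/5xx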
Rotating Proxies
When scraping a website, using the same proxy repeatedly can quickly get your IP address blocked. To circumvent this, you can rotate through a pool of proxy servers, making each request from a different IP address.
Here's a basic script to randomly select a proxy from a list for each request:
import requests
import random

# Pool of proxies to rotate through (placeholder hosts and credentials)
proxies = [
    {
        'http': 'http://user:pass@proxy1.example.com:8080',
        'https': 'http://user:pass@proxy1.example.com:8080'
    },
    {
        'http': 'http://user:pass@proxy2.example.com:8080',
        'https': 'http://user:pass@proxy2.example.com:8080'
    }
]

def random_proxy():
    return random.choice(proxies)

for i in range(10):
    proxy = random_proxy()
    try:
        print(f'Request #{i}, using proxy {proxy}')
        response = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=5)
        print(response.json())
    except Exception as e:
        print(f'Request failed: {e}')
This selects a random proxy from the list for each request. The timeout argument specifies the number of seconds to wait for a response before giving up, which is useful when some proxies in your pool may be unresponsive. We wrap each request in a try/except block to catch any errors that may occur.

Keep in mind that free proxy lists often contain many outdated or non-functional proxies. For production scraping, it's usually worth paying for a private proxy service that offers a large, reliable pool of IP addresses to maximize your success rate.
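A natural extension of this idea is to retry a failed request through a different proxy instead of giving up immediately. Here's a minimal sketch reusing the proxies list and random_proxy() from above (the fetch_with_retries name and max_retries parameter are just illustrative):
def fetch_with_retries(url, max_retries=3):
    # Try the request through up to max_retries randomly chosen proxies
    for attempt in range(max_retries):
        proxy = random_proxy()
        try:
            return requests.get(url, proxies=proxy, timeout=5)
        except requests.RequestException as e:
            print(f'Attempt {attempt + 1} with proxy {proxy} failed: {e}')
    raise RuntimeError(f'All {max_retries} attempts failed for {url}')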
Using ScrapingBee's Proxy Mode
If you don't want to deal with the hassle of finding and configuring proxies yourself, ScrapingBee's Proxy Mode provides an easy alternative. It's a proxy frontend for the ScrapingBee API that allows you to funnel requests through their proxy servers.
You'll first need to sign up for a free ScrapingBee account to get an API key. Then you can make proxied requests by specifying your API key in the proxy URL:
import requests
proxies = {
    'http': 'http://YOUR_API_KEY:render_js=False&premium_proxy=True@proxy.scrapingbee.com:8886',
    'https': 'http://YOUR_API_KEY:render_js=False&premium_proxy=True@proxy.scrapingbee.com:8887'
}

response = requests.get('http://httpbin.org/ip', proxies=proxies, verify=False)
print(response.text)
The render_js and premium_proxy parameters are optional API flags. See the ScrapingBee API documentation for the full list of available options.

Note the verify=False argument to disable SSL verification, which is required when using ScrapingBee's proxies.

With ScrapingBee's Proxy Mode, you get access to a large pool of reliable, fast proxies managed by their service, with 1,000 free API calls to start. This allows you to offload the complexities of proxy rotation and focus on your scraping logic.
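One side note on verify=False: it causes urllib3 to emit an InsecureRequestWarning for every request. If that output gets noisy, you can suppress the warning explicitly (optional snippet):
import urllib3

# Silence the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)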
Conclusion
You should now have a solid understanding of how to use proxies with Python's requests library for anonymous and efficient web scraping. A few key takeaways:
- Proxies help keep your scraping undetected by masking your true IP address.
- Rotating proxies further reduces the chance of your scrapers getting blocked.
- Elite anonymous proxies are best for avoiding detection, while transparent proxies should generally be avoided.
- Using a managed proxy service like ScrapingBee can save a lot of time and hassle versus maintaining your own proxy pools.
I encourage you to try applying these techniques to your own scraping projects. Start by making a few test requests through different proxy servers and verifying the IP address. Then set up a basic rotation script to cycle through all your available proxies.
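A quick way to run that first check is to compare the IP address httpbin.org/ip reports with and without a proxy (the proxy URL below is just a placeholder for one of yours):
import requests

proxies = {'http': 'http://user:pass@proxy.example.com:8080'}

# httpbin.org/ip echoes back the IP address it sees, so these two
# results should differ if the proxy is actually being used
print(requests.get('http://httpbin.org/ip', timeout=5).json())
print(requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5).json())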
With proxies in your toolkit, you'll be able to scrape larger volumes of data from more sources without triggering bans or CAPTCHAs. The next step is learning how to inspect response headers and handle different types of authentication. But you're now well on your way to becoming a professional web scraper!
Happy scraping!