If you're writing Python scrapers or crawlers, configuring proxy support should be high on your list. Proxies allow you to route your requests through intermediary servers, hiding your real location. This opens up many possibilities like scraping sites at scale without getting blocked or accessing content restricted to certain regions.
In this comprehensive guide, I'll cover everything you need to know to use proxies with Python's requests module. I'll explain why proxies are useful, how they work, where to get them, best practices for authentication and security, and how to implement proxy rotation. Follow along and I'll make you a proxy pro!
Why Proxies are Essential for Python Scrapers
Let's first look at why proxies are so important for Python scrapers:
Avoid Getting Blocked when Scraping
The #1 reason to use proxies with Python requests is avoiding IP bans. Many sites have protections in place to block scrapers and bots. They may allow a certain number of requests per minute from a given IP before blacklisting it.
Scraping from a rotating pool of proxy servers makes you look like many different users. Sites will have a harder time detecting and blocking you compared to scraping from a single residential IP.
To give you a sense of scale, a site may allow 60-100 requests per minute per IP before triggering a ban. With even just a handful of proxies you can easily multiply your scraping capacity several times over. Proxies enable scalability.
Access Geo-Restricted Content
Another benefit of proxies is being able to access content limited to certain regions. For example, the BBC iPlayer video streaming service is only available to UK residents. By routing your traffic through a UK proxy, you can view the site as if you were in London.
Other prominent examples include Hulu (US only), Channel 4 (UK), and NHK World (Japan). Proxies give you options for accessing region-restricted content from anywhere.
Anonymity and Security
Hiding your real IP address also enhances privacy while scraping. Sites will not be able to easily trace requests back to your location or identity.
Threat actors may also leverage proxies to mask attacks, but that's beyond the scope here. We'll focus on the positives of enhancing privacy and anonymity for web scraping.
Scale Python Scrapers
Proxies allow you to make significantly more parallel requests since you aren't limited by a single residential IP's capacity.
Rather than hitting threading limits or getting blocked with a single IP, you can route requests through multiple proxies to multiply the requests per minute you can make.
If each proxy allows 60 reqs/min, 4 proxies would give you capacity for 240 reqs/min. 10 proxies scale to 600 reqs/min. Proxies are essential for building distributed, high-volume scrapers in Python.
Now that you see why proxies are useful, let's dive into the details of how they work…
How Proxies Work: Anonymizing Your Requests
Proxies act as intermediaries for requests between you and the destination server.
Instead of connecting directly from your IP to the target site, your requests are first routed through the proxy server. This masks your real IP from the destination.
This works by configuring your HTTP requests to use the proxy's IP address instead of your own. We'll cover exactly how to configure this in Python later on.
Some key notes on how proxies function:
- The proxy has its own unique IP that traffic appears to come from. This hides your real IP, replacing it with the proxy's.
- Proxies can be chained together for additional anonymity. You can route traffic through multiple proxies to further obfuscate the origin.
- Proxy protocols like HTTP and SOCKS handle passing traffic through. For requests, this is configured at the application layer.
Now that you understand how proxies work at a fundamental level, let's go over the different types of proxies available.
HTTP vs SOCKS Proxies
The two main proxy protocols are HTTP and SOCKS. Let's compare them:
HTTP Proxies
HTTP proxies are the most common type you'll encounter. Some key attributes:
- Only works for HTTP/HTTPS traffic (not lower level TCP/UDP)
- Simple to set up – compatible with most libraries and tools
- Typically used for web scraping and general web access
HTTP proxies essentially intercept HTTP requests made by the client and forward them on to the destination. They are limited to HTTP traffic only.
SOCKS Proxies
SOCKS is a more full-featured proxy protocol that operates on lower network layers.
Some features:
- Works for any TCP traffic, including HTTP, HTTPS, FTP etc.
- Added authentication and security features like username/password auth.
- Typically used for full network access and anonymity.
Whereas HTTP proxies only operate at the application level, SOCKS sits lower at the network/transport layer. This allows SOCKS to proxy pretty much any TCP traffic.
Which Should You Use?
For most web scraping use cases, an HTTP proxy is just fine. It's simpler to set up, and you only care about routing your HTTP requests through proxies.
If you need full network access routing for lower level traffic beyond HTTP, use SOCKS instead. SOCKS is also better if you prioritize added security and need authentication.
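If you do go the SOCKS route, note that requests can use SOCKS proxies through the optional PySocks dependency (installed with pip install requests[socks]). Here is a minimal sketch with a placeholder proxy address:
import requests

# Requires: pip install requests[socks]
proxies = {
    'http': 'socks5://username:password@10.10.1.10:1080',
    'https': 'socks5://username:password@10.10.1.10:1080',
}

# Use the socks5h:// scheme instead if you want DNS resolution to happen on the proxy
response = requests.get('https://example.com', proxies=proxies)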
For our purposes, focusing on Python web scrapers, HTTP proxies are perfectly suitable. Now let's look at where to obtain proxy servers.
Where to Get Proxies for Web Scraping
There are a few main methods of acquiring proxies to use with Python requests:
1. Buy Proxies from a Proxy Provider
The easiest way is to purchase proxies from a proxy service. Some top providers include:
- BrightData – My favorite provider overall, with high-quality residential IPs worldwide. Fast connections and reliable uptime.
- Oxylabs – Datacenter proxies available for all regions to support large volumes. Affordable pricing.
- GeoSurf – Specializes in residential proxies for specific countries to access geo-restricted content.
Expect to pay around $1-$5 per proxy monthly, depending on provider quality and locations. Proxy service APIs make it easy to load lists of fresh proxies to integrate into your code.
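Many providers expose their proxy lists over a simple HTTP endpoint, so loading them looks something like the sketch below; the URL and response format are placeholders, so check your provider's documentation for the real endpoint:
import requests

# Hypothetical endpoint returning one proxy URL per line (substitute your provider's actual API)
PROXY_LIST_URL = 'https://proxy-provider.example.com/api/proxies?format=txt'

resp = requests.get(PROXY_LIST_URL, timeout=10)
proxy_list = [line.strip() for line in resp.text.splitlines() if line.strip()]
print(f'Loaded {len(proxy_list)} proxies')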
2. Find Publicly Available Proxies
You can also find public proxies available for free online. Beware that these are lower quality since they are shared. Public proxies have high usage and often go offline.
Useful places to find public proxies:
- Checking public proxy lists
- Extracting proxies from sites like ProxyScrape
- Finding proxies using Google dorks searches
I don't recommend relying solely on public proxies, but they can augment paid ones in a pinch. Expect lower uptime and speeds.
3. Deploy Your Own Proxies
You can also create your own private proxies by deploying proxy server software on infrastructure you control, such as cloud instances, VPNs, or rotating residential IPs.
This gives you full control but requires more effort to configure and maintain the servers. For simplicity, you'd typically outsource proxy provisioning to a provider instead.
In summary, I recommend purchasing proxies from a reputable provider like BrightData unless your budget is severely limited. The reliability and quality outweigh the hassle of dealing with tricky public proxies.
Next let's dive into the code to see how to configure Python requests using proxies…
Setting a Proxy – Python Requests Examples
Python requests makes it straightforward to direct your traffic through proxy servers.
You specify proxies by creating a proxies dict that maps URL schemes to proxy URLs:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
Then pass this proxies dict when making requests:
response = requests.get('https://example.com', proxies=proxies)
This will route all HTTP and HTTPS requests through the specified proxies.
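A quick way to confirm traffic is actually flowing through the proxy is to hit an IP-echo service such as httpbin.org/ip and check that the reported origin is the proxy's address rather than your own (assuming the placeholder proxy addresses above point at a live proxy):
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# The echoed origin should be the proxy's public IP, not yours
response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())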
You can also set proxies globally for all requests or on a per-request basis. Let's look at examples of different proxy configurations with Python requests.
Global Proxy for All Requests
To apply a proxy globally to all requests made through the requests session, set the proxies dict at the session level:
import requests
session = requests.Session()
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

session.proxies = proxies

response = session.get('https://example.com')
# https:// request, so this goes through the HTTPS proxy http://10.10.1.10:1080
You can also do this by setting the HTTP_PROXY and HTTPS_PROXY environment variables before running your script.
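requests picks those variables up automatically (Session.trust_env is enabled by default), so a sketch using the same placeholder proxy addresses might look like this:
import os
import requests

# Set the proxy environment variables before making any requests
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:1080'

# No proxies argument needed; requests reads the environment variables
response = requests.get('https://example.com')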
Proxy per Request
To use a proxy for only a specific request, pass the proxies dict as a parameter just for that call:
import requests
response = requests.get('https://example.com')  # no proxy

proxied_response = requests.get('https://example.com', proxies={
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
})  # uses proxy
This overrides the global proxy just for this one request.
Proxy for Specific Domain
To proxy traffic only for certain domains, specify the domain in your proxies dict:
proxies = {
    'http://scrape-site.com': 'http://10.10.1.10:3128',
    'https://api.example.com': 'http://10.10.1.10:1080',
}

requests.get('http://scrape-site.com/', proxies=proxies)  # uses proxy
requests.get('http://no-proxy-domain.com', proxies=proxies)  # no proxy
This allows granular control over which sites are proxied and which are not.
Now that you know how to apply proxies, let's discuss how to authenticate with proxies…
Authenticating with Proxies
Many proxies will require authentication to use them. This involves passing username/password credentials in your proxy URLs.
Here is an example HTTP proxy URL with authentication:
http://myusername:mypassword@123.45.6.7:8080
Simple enough, but there is an extra consideration if your username or password contains special characters.
Many special characters like @ and : are invalid in basic URL syntax. To handle these cases, we need to URL encode the credentials with the urllib library:
from urllib.parse import quote

username = 'proxyuser@example.com'
password = 'pass#123'

proxy_url = f'http://{quote(username)}:{quote(password)}@123.45.6.7:8080'
This will properly encode those values so they can be passed in the URL.
Now your credentials can contain special characters and you can successfully authenticate.
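To tie it together, here is a minimal sketch (reusing the placeholder credentials and proxy address from above) that passes the encoded proxy URL to requests:
import requests
from urllib.parse import quote

username = 'proxyuser@example.com'  # placeholder credentials for illustration
password = 'pass#123'
proxy_url = f'http://{quote(username)}:{quote(password)}@123.45.6.7:8080'

proxies = {'http': proxy_url, 'https': proxy_url}

# The proxy authenticates us using the URL-encoded credentials
response = requests.get('https://example.com', proxies=proxies)
print(response.status_code)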
With that squared away, let's move on to discuss rotating proxies…
Rotating Proxies to Avoid Bans
When scraping websites, you'll want to rotate your requests across multiple proxy IPs. This prevents you from getting banned by sites for making too many requests from a single IP.
Here is one way to implement proxy rotation in Python:
import requests
from random import choice
proxy_list = [
    'http://123.45.6.7:8080',
    'http://98.76.54.32:8080',
    'http://103.47.99.2:8080',
]
for _ in range(10):
    proxy = choice(proxy_list)
    response = requests.get('https://example.com', proxies={
        'http': proxy,
        'https': proxy,
    })
    # Do something with response...
We maintain a list of proxy URLs. Before each request, we randomly choose a proxy using Python's random.choice(). This rotates proxies with each request.
You can load your list of proxies from a file, proxy API, database, or other source. Refresh it periodically to cycle in new proxies as old ones go bad.
Ideally, use at least ten proxies and swap them out at least every 100 requests to be safe. The more, the better.
Be sure to implement similar proxy rotation in your production scrapers to stay under the radar.
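As a rough sketch of what that can look like in production (the proxies.txt filename, retry count, and helper names below are placeholders, not from any particular library), you might load the pool from a file and drop proxies that stop responding:
import requests
from random import choice

def load_proxies(path='proxies.txt'):
    # One proxy URL per line; refresh this file periodically from your provider
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def fetch(url, pool, retries=3):
    # Try the request through randomly chosen proxies, discarding any that fail
    for _ in range(retries):
        proxy = choice(pool)
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.RequestException:
            pool.remove(proxy)  # drop the bad proxy and try another
            if not pool:
                raise
    raise RuntimeError('All retries failed')

pool = load_proxies()
response = fetch('https://example.com', pool)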
Final Thoughts on Proxies with Python Requests
And there you have it – a comprehensive guide to using proxies with Python's requests module!
We covered the importance of proxies for Python web scrapers, how they work under the hood, where to obtain proxies, how to configure requests to use proxies, authentication, and proxy rotation patterns.
Proxies are crucial for scraping sites successfully at scale and avoiding IP bans. With the techniques outlined here, you can leverage proxies like a pro!
For even more advanced proxy usage, refer to the requests documentation. Now go forth and use your newfound proxy powers for good! Let me know if you have any other proxy questions.