How to follow redirect using cURL? | ScrapingBee - Web Scraping Site

How to Follow Redirects using cURL: The Ultimate Guide

Introduction

If you've ever used the command line to interact with websites and APIs, you've probably used cURL. cURL is a powerful tool for transferring data using various network protocols. It's especially handy for testing APIs, debugging network issues, and automating tasks.

One of cURL's many useful features is its ability to automatically follow HTTP redirects. This means if you request a URL that redirects to another location, cURL will detect this and follow the redirection for you. In this guide, we'll take an in-depth look at using cURL to handle redirects.

HTTP Redirect Overview

Before we dive into the specifics of cURL, let's review what HTTP redirects are and why they're used.

An HTTP redirect is a way for a server to tell a client (like your web browser or cURL) that a resource has been moved to a different URL. When a client makes a request to a URL that has been redirected, the server sends back a special response code and the new URL location.

The main types of redirects are:

301 Moved Permanently: Indicates that the resource has been permanently moved to a new URL. Clients should update their links and bookmarks.
302 Found: Means the resource is temporarily located at a different URL. The client should continue to use the original URL in the future.
307 Temporary Redirect: Similar to 302, but specifically states that the client should not change the HTTP method (POST, GET, etc.) when following the redirect.
308 Permanent Redirect: Like 301, but also specifies the HTTP method should not change.
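
The practical difference between these codes shows up when the original request is a POST. As a rough sketch of how cURL behaves with -L by default, the helper below (method_after_redirect is a hypothetical name, not part of cURL) reports which method the follow-up request would use. It also covers 303 See Other, which is not in the list above but is treated the same way as 301 and 302.

```shell
# Hypothetical helper: given a redirect status code returned to a POST,
# print the method curl -L uses for the follow-up request by default.
method_after_redirect() {
  case "$1" in
    301|302|303) echo "GET" ;;   # curl switches the POST to a GET
    307|308)     echo "POST" ;;  # method and body are preserved
    *)           echo "unknown"; return 1 ;;
  esac
}
```

For example, method_after_redirect 302 prints GET, while method_after_redirect 307 prints POST.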

Some common scenarios where redirects are used include:

Moving a website to a new domain
Forcing the use of HTTPS for security
Redirecting mobile users to a mobile-optimized version of a site
URL shortening services

Using cURL to Follow Redirects

By default, cURL does not follow redirects. If you request a URL that returns a redirect, cURL prints whatever body the redirect response carries (often empty or a short HTML notice) and stops. The status code and Location header only show up if you ask for headers with -i or -v.
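
If you want to look at a redirect without following it, one option is cURL's --write-out (-w) option. The sketch below wraps it in a small function; peek_redirect is a hypothetical helper name, and the URL in the usage comment is only a placeholder.

```shell
# Hypothetical helper: print the status code and the redirect target
# for a URL without following the redirect.
peek_redirect() {
  # -s            : silent, no progress meter
  # -o /dev/null  : discard the response body
  # -w            : print the status code and the URL a redirect points to
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "$1"
}

# Usage (needs network access):
#   peek_redirect http://example.com
```

If the response is not a redirect, the second field is simply empty.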

To instruct cURL to follow redirects, pass the -L or --location command-line option. Here's an example:

curl -L http://example.com

With the -L option, cURL will detect any redirects and follow them automatically, eventually returning the content of the final URL in the redirect chain.

If we run the command with -v for verbose output, we can see the redirects happening:

curl -v -L http://httpbin.org/redirect/2

The response will include lines like:

< HTTP/1.1 302 FOUND
< Location: /redirect/1

< HTTP/1.1 302 FOUND 
< Location: /get

< HTTP/1.1 200 OK

This shows that cURL first got a 302 redirect to /redirect/1, then another 302 to /get, which returned a final 200 OK response.
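
Verbose output is noisy, so when you only care about the chain itself it can help to filter it down. The following is a minimal sketch (show_redirect_chain is a hypothetical helper name) that requests headers only and keeps just the status lines and Location headers.

```shell
# Hypothetical helper: print only the status line and Location header
# of every hop in a redirect chain.
show_redirect_chain() {
  # -s : silent, -I : HEAD request (headers only), -L : follow redirects
  curl -sIL "$1" | grep -iE '^(HTTP/|location:)'
}

# Usage (needs network access):
#   show_redirect_chain http://httpbin.org/redirect/2
```

Because -I sends HEAD requests, a server that treats HEAD differently from GET may show a different chain than a normal request would.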

Configuring cURL's Redirect Behavior

cURL provides a number of options to customize how it handles redirects:

--max-redirs NUM: Sets the maximum number of redirects that cURL will follow. The default is 50.

--proto-redir PROTOCOLS: Limits which protocols cURL may be redirected to. Recent versions (7.65.2 and later) allow only HTTP, HTTPS, FTP, and FTPS on redirects by default.

--post302: Keeps the request method as POST after a 302 redirect. Without it, cURL with -L switches a POST to a GET on 301, 302, and 303 responses. The related --post301 and --post303 options cover the other two codes.

For example, to allow a maximum of 5 redirects and only allow redirects to HTTPS URLs (the leading = replaces the permitted set instead of adding to it):

curl --max-redirs 5 --proto-redir =https -L http://example.com
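
The --post302 option from the list above is easiest to understand in a full command. Here is a small sketch of a POST that stays a POST across 301 and 302 redirects; post_following_redirects is a hypothetical helper name, and the URL and form fields in the usage comment are placeholders.

```shell
# Hypothetical helper: send a form POST and keep it a POST when the
# server answers with a 301 or 302 redirect.
post_following_redirects() {
  # $1: URL, $2: form-encoded body
  # --post301 / --post302 stop curl from switching to GET on those codes
  curl -sL --post301 --post302 -d "$2" "$1"
}

# Usage (needs network access):
#   post_following_redirects https://example.com/submit "name=value"
```

Whether re-sending the body to the redirect target is appropriate depends on the API, so treat this as something to enable deliberately rather than by default.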

Handling Cookies with Redirects

One potential complication with redirects is handling cookies. If a server sets cookies on the initial response or a redirect response, you often need to store those cookies and include them in the redirected request.

With cURL, you can use the -c option to specify a file to store cookies, and the -b option to pass those cookies on the next request:

curl -c cookies.txt -L http://example.com
curl -b cookies.txt -L http://example.com

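The first command stores any cookies it receives in cookies.txt. The second command reads the cookies from that file and includes them in the request. Note that -c also activates cURL's cookie engine, so cookies set partway through a redirect chain are sent on the later requests of that same command when the domain and path match.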
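
The two options can also be combined in a single command so the same file is both read and updated. A minimal sketch, assuming a jar file named cookies.txt by default (fetch_with_cookie_jar is a hypothetical helper name):

```shell
# Hypothetical helper: fetch a URL, following redirects, while reading
# cookies from and saving cookies to the same jar file.
fetch_with_cookie_jar() {
  # $1: URL, $2: cookie jar path (defaults to cookies.txt)
  cookie_jar="${2:-cookies.txt}"
  touch "$cookie_jar"   # make sure the jar exists before curl reads it
  curl -sL -b "$cookie_jar" -c "$cookie_jar" "$1"
}

# Usage (needs network access):
#   fetch_with_cookie_jar http://example.com
```

This keeps a session going across repeated calls without having to remember which command wrote the file and which one reads it.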

Debugging and Troubleshooting

When you're working with complicated redirection scenarios, it can be helpful to get more details on what cURL is doing under the hood. We already saw the -v option for verbose output.

For even more details, you can use --trace or --trace-ascii to log a full trace of the request/response data:

curl --trace-ascii trace.log -L http://example.com

This will create a trace.log file with extensive debugging information about the request and response headers and data.
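
A full trace can be more than you need. For a quick summary of how many redirects happened and how much time they cost, a sketch using --write-out variables might look like this (redirect_timing is a hypothetical helper name):

```shell
# Hypothetical helper: summarize redirect count and timing for a URL.
redirect_timing() {
  # -o /dev/null discards the body; -w prints the summary after the transfer
  curl -sL -o /dev/null \
    -w 'redirects: %{num_redirects}\nredirect time: %{time_redirect}s\ntotal time: %{time_total}s\n' \
    "$1"
}

# Usage (needs network access):
#   redirect_timing http://example.com
```

A large gap between redirect time and total time suggests the final page is the slow part, not the redirect chain.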

Real-World Examples

Let's walk through a few more practical examples of using cURL with redirects.

Following a redirect to the mobile/desktop version of a site:

# Desktop user-agent
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0" -L https://example.com

# Mobile user-agent  
curl -A "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1" -L https://example.com

Many sites will redirect mobile or desktop users to different versions of the site. By setting the user-agent string with -A, we can simulate different devices.

Handling redirects when submitting a login form:

curl -c cookies.txt -L -d "username=user&password=pass" https://example.com/login
curl -b cookies.txt -L https://example.com/profile

Here we submit a login form and store the session cookies. We then use those cookies to access a profile page that requires authentication.

Following redirects through URL shorteners:

curl -IL https://bit.ly/3kd8dwz

URL shorteners work by redirecting to the full, long URL. By using -I to make HEAD requests and -L to follow redirects, we can see the headers of every hop; the last Location header shows the final destination URL.
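
If you only want the destination and not the headers of every hop, the url_effective write-out variable is one way to get it. A minimal sketch (expand_url is a hypothetical helper name, and the short link in the usage comment is a placeholder):

```shell
# Hypothetical helper: print the final URL a short link resolves to.
expand_url() {
  # -I sends HEAD requests, -L follows redirects,
  # -o /dev/null hides the headers, -w prints the last URL fetched
  curl -sIL -o /dev/null -w '%{url_effective}\n' "$1"
}

# Usage (needs network access):
#   expand_url https://example.com/short-link
```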

Redirect Considerations for Web Scraping

Redirects can sometimes complicate web scraping tasks. It's important to be aware that websites may use redirects for various reasons:

Redirecting to a new version of the page
A/B testing different variations of content
Anti-bot measures that redirect suspicious traffic
Paywalls or login walls

To successfully scrape content behind redirects, you'll need to ensure your scraping tool (whether that's cURL or a headless browser) is configured to follow them through to the final page content.

Another thing to watch out for is redirect loops or long redirect chains. Some sites may intentionally redirect in a loop to frustrate scrapers. Setting a maximum redirect limit can help avoid getting stuck.
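
In a script, you can detect this situation from cURL's exit status: code 47 means the redirect limit was reached. The sketch below assumes a limit of 5 unless told otherwise; fetch_with_limit is a hypothetical helper name.

```shell
# Hypothetical helper: fetch a URL with a redirect cap and report
# when the cap is hit (curl exit code 47 = too many redirects).
fetch_with_limit() {
  # $1: URL, $2: maximum number of redirects (defaults to 5)
  curl -sL --max-redirs "${2:-5}" "$1"
  rc=$?
  if [ "$rc" -eq 47 ]; then
    echo "too many redirects: $1" >&2
  fi
  return "$rc"
}

# Usage (needs network access):
#   fetch_with_limit http://example.com 3 || echo "fetch failed"
```

A scraper can use the non-zero return value to skip the URL or log it for later inspection instead of hanging on it.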

Conclusion

We've covered a lot of ground in this guide to using cURL with redirects. To recap the key points:

HTTP redirects are a way for servers to tell clients a resource has moved
cURL can automatically follow redirects with the -L or --location option
You can limit and control cURL's redirect behavior with additional options
Proper cookie handling is often needed when dealing with redirects
Verbose output and tracing provide helpful debugging details
Real-world redirect scenarios include handling mobile/desktop redirects, login forms, and URL shorteners
Redirects are an important consideration when web scraping

I encourage you to try out the examples in this guide and refer to the cURL man pages to learn even more about its capabilities. With the ability to follow redirects and all its other features, cURL is an indispensable tool for anyone working with websites and APIs.
