Python is a versatile language used for a wide variety of applications, from web development to data analysis to automation. One common task you may encounter is the need to download files or web pages from the internet. While Python provides built-in libraries like urllib for this purpose, sometimes an external tool can make the job even easier. That's where wget comes in.
In this article, you'll learn how to use Python to interface with the popular wget utility for downloading web content. We'll cover the advantages of this approach, walk through installation and setup, and demonstrate how to use wget in Python to download both individual files and entire websites. Let's get started!
What is wget?
Wget is a free command line utility for downloading files from the web. It supports the HTTP, HTTPS, and FTP protocols, and can retrieve content recursively, making it a powerful tool for mirroring websites. Wget is widely used for automating downloads and for retrieving content from the command line.
Some key features of wget include:
- Resuming partially completed downloads
- Converting links for local viewing of downloaded content
- Robustly handling network issues and slow connections
- Supporting HTTP cookies and authentication
- Customizing HTTP headers and user agents
- Rate limiting and wait times between requests
Wget is a mature open source project that originated in 1996. It is widely available on Unix-like systems, and ports exist for Windows and other platforms as well.
Why Use wget with Python?
If you're working in Python, you may be wondering why you would want to use an external program like wget instead of a native Python library. Here are a few reasons:
- Simplicity: wget is invoked with a simple command line interface, so you don't need to write much code to use it in Python. This can be faster and more straightforward than using a Python library directly.
- Robustness: wget is a time-tested tool that can handle a variety of error conditions. It has configurable timeout and retry options to help ensure downloads succeed.
- Recursion: one of wget's most powerful features is its ability to recursively download content by following links. This enables you to easily mirror an entire website.
- Shell parity: if you're used to using wget on the command line, you can maintain a similar interface within Python.
- Subprocesses: using wget allows you to offload the work of downloading to a subprocess. This can help prevent your main Python process from blocking on long-running downloads.
That said, using an external tool like wget also has some downsides. It adds a dependency to your project, and may complicate installation. You also have less programmatic control than with a native library.
For basic download tasks, native Python libraries like urllib, requests, or aiohttp are likely a better choice. But for cases where you need recursion or advanced configuration, wget can be a helpful tool in your Python toolkit.
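To see the difference in practice, here's a rough sketch of what a basic single-file download looks like using only the standard library's urllib; the URL and output path are placeholders:

import os
import urllib.request

# Placeholder URL and output location, purely for illustration
url = 'https://example.com/file.zip'
output_path = os.path.join('downloads', 'file.zip')

os.makedirs('downloads', exist_ok=True)

# urlretrieve fetches the resource and writes it straight to disk
urllib.request.urlretrieve(url, output_path)

For a single file this is arguably just as simple as shelling out to wget; the trade-offs appear once you need recursion, retries, or link conversion.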
Installing wget
Before you can use wget in your Python programs, you'll need to install it on your system if it's not already available. How you do this depends on your operating system.
On many Unix-like systems, including Linux and macOS, wget may already be installed. You can check by opening a terminal and typing wget --version. If you see version information, you're all set. If not, you can install wget using your system's package manager.
For example, on Debian/Ubuntu systems, run:
sudo apt update
sudo apt install wget
On macOS, you can install wget with Homebrew:
brew install wget
On Windows, wget binaries are available, but are not included by default. To install, download the latest wget.exe from the official GnuWin32 site. Add the directory containing wget.exe to your PATH environment variable so you can execute it from any location.
Once you have wget installed, you can move on to calling it from Python.
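As an optional sanity check, you can ask Python whether a wget executable is actually on the PATH using the standard library's shutil.which. A minimal sketch:

import shutil

# shutil.which returns the full path to the executable, or None if it is not on the PATH
if shutil.which('wget') is None:
    raise SystemExit('wget is not installed or not on your PATH')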
Running wget in Python
To use wget in a Python script, we'll need a way to spawn a new process and execute shell commands. Python provides a few ways to do this, but we'll use the standard subprocess module.
Here's a simple example of running wget from Python to download a file:
import subprocess

url = 'https://example.com/file.zip'
output_dir = 'downloads'

subprocess.run(['wget', '-P', output_dir, url])
This code imports the subprocess module and defines the target URL and output directory. It then uses subprocess.run() to execute wget with a few arguments:
- -P downloads specifies the output directory
- The URL to download comes last
The subprocess.run() function waits for the wget process to finish before continuing. Note that if a file with the same name already exists in the output directory, wget saves the new download under a numbered name (such as file.zip.1) rather than overwriting it; pass the -N flag if you only want to download when the remote file is newer than your local copy.
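If you would rather not block while a long download runs, one option is to start wget with subprocess.Popen instead and collect the result later. A minimal sketch, using a placeholder URL:

import subprocess

url = 'https://example.com/file.zip'  # placeholder URL

# Popen starts wget and returns immediately instead of waiting for it to finish
process = subprocess.Popen(['wget', '-P', 'downloads', url])

# ... the script can do other work here while the download runs ...

# Later, wait for wget to exit and check its return code
if process.wait() == 0:
    print('Download finished')

For the rest of this article we'll stick with subprocess.run().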
You can extend this technique to handle error cases and capture output:
import subprocess

url = 'https://example.com/file.zip'
output_dir = 'downloads'

try:
    result = subprocess.run(['wget', '-P', output_dir, url],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            text=True)
    if result.returncode == 0:
        print('Download successful')
    else:
        print('Error occurred:')
        print(result.stderr)
except FileNotFoundError:
    print('Wget not found. Please install it on your system.')
Here we use a try/except block to handle the case where wget is not found on the system. We also capture the output and error streams by passing the stdout and stderr arguments, and setting text=True to decode them as strings.
After the subprocess completes, we check its return code to determine if the download was successful. If not, we print the error message captured from stderr.
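As a side note, on Python 3.7 and later you can pass capture_output=True as a shorthand for setting both stdout and stderr to PIPE, so the call above could also be written as:

result = subprocess.run(['wget', '-P', output_dir, url],
                        capture_output=True, text=True)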
Recursive Downloads
Downloading individual files is straightforward, but one of wget's most powerful features is its ability to recurse through links and download entire directory structures. You can enable this with the -r flag.
For example, here's how you might mirror an entire website with Python and wget:
import subprocess

url = 'https://example.com'
output_dir = 'website'

try:
    subprocess.run([
        'wget',
        '--recursive',
        '--page-requisites',
        '--html-extension',
        '--convert-links',
        '--no-parent',
        '-P', output_dir,
        url
    ], check=True)
except FileNotFoundError:
    print('Wget not found. Please install it on your system.')
except subprocess.CalledProcessError:
    print('Error downloading website')
This code uses several wget flags to customize the recursive download:
- --recursive enables recursive downloading through links
- --page-requisites downloads the CSS, JS, and images needed to render each page
- --html-extension saves files with a .html extension
- --convert-links updates links to point to local files
- --no-parent avoids ascending to parent directories
The result is a local copy of the website saved in the specified output directory, suitable for offline browsing. Note this may take a long time and consume significant disk space for large sites.
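If you want to keep a recursive download bounded, wget's --level flag limits how many links deep it follows and --quota caps the total amount retrieved. A rough sketch with arbitrary limits:

import subprocess

# Arbitrary example limits: follow links at most two levels deep and
# stop retrieving new files once roughly 100 MB has been downloaded
subprocess.run([
    'wget',
    '--recursive',
    '--level=2',
    '--quota=100m',
    '--no-parent',
    '-P', 'website',
    'https://example.com'
])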
Other wget Options
Wget has many other flags to customize its behavior. Here are a few you may find useful:
- -O specifies a custom output filename
- -t sets the number of retries on failure
- --limit-rate throttles the download speed
- --wait adds a delay between requests
- --user-agent sets a custom User-Agent header
- --header adds arbitrary HTTP headers to the request
You can include these and other flags as arguments to subprocess.run(). See the wget manual page for a full list of options.
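For instance, here's a sketch of a single-file download that combines a few of these options; the filename, retry count, rate limit, and User-Agent string are all just illustrative values:

import subprocess

subprocess.run([
    'wget',
    '-O', 'report.pdf',                 # save under a custom filename
    '-t', '3',                          # retry up to 3 times on failure
    '--limit-rate=200k',                # throttle to roughly 200 KB/s
    '--user-agent=my-downloader/1.0',   # illustrative User-Agent string
    'https://example.com/report.pdf'    # placeholder URL
])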
Alternatives to wget
While wget is a great choice for many download tasks, it's not the only option, even when working with Python. Here are a few alternatives to consider:
- curl: another popular command line tool for downloading files and making HTTP requests. Has a Python interface through the pycurl library.
- urllib: the standard Python library for working with URLs. Provides functions for downloading files and making HTTP requests.
- requests: a popular third-party library that provides a nicer API for making HTTP requests from Python. Great for interacting with web APIs.
- scrapy: a framework for writing web spiders that can crawl and download content from websites. Provides a high-level API for managing the crawling process.
Each of these tools has its own strengths and weaknesses. For simple one-off downloads, urllib or requests may be all you need. For more complex tasks involving authentication, cookies, or sessions, you might prefer requests or pycurl. And for large-scale web scraping tasks, scrapy is hard to beat.
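For example, a basic streaming download with requests (installed separately with pip install requests) might look something like this; the URL and filename are placeholders:

import requests

url = 'https://example.com/file.zip'  # placeholder URL

# Stream the response so large files are not held in memory all at once
response = requests.get(url, stream=True)
response.raise_for_status()

with open('file.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)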
That said, wget remains a solid choice for many cases due to its simplicity, robustness, and wide availability. It's a valuable tool to have in your Python arsenal.
Conclusion
In this article, we've seen how to use wget and Python together to download files and websites. Wget is a powerful tool for automating downloads, and its recursive mode makes it easy to mirror entire sites.
By using the subprocess module, we can invoke wget from a Python script and customize its behavior. This allows us to integrate wget into larger Python applications and workflows.
While there are many other options for downloading files in Python, wget remains a solid choice, especially for complex tasks that require recursion or advanced configuration. With a little bit of Python glue code, wget can be a valuable addition to your web scraping toolkit.