Python is a versatile language used for a wide variety of applications, from web development to data analysis to automation. One common task you may encounter is the need to download files or web pages from the internet. While Python provides built-in libraries like urllib for this purpose, sometimes an external tool can make the job even easier. That's where wget comes in.
In this article, you'll learn how to use Python to interface with the popular wget utility for downloading web content. We'll cover the advantages of this approach, walk through installation and setup, and demonstrate how to use wget in Python to download both individual files and entire websites. Let's get started!
What is wget?
Wget is a free command line utility for downloading files from the web. It supports the HTTP, HTTPS, and FTP protocols, and can retrieve content recursively, making it a powerful tool for mirroring websites. Wget is widely used for automating downloads and for retrieving content from the command line.
Some key features of wget include:
- Resuming partially completed downloads
- Converting links for local viewing of downloaded content
- Robustly handling network issues and slow connections
- Supporting HTTP cookies and authentication
- Customizing HTTP headers and user agents
- Rate limiting and wait times between requests
Wget is a mature open source project that originated in 1996. It is widely available on Unix-like systems, and ports exist for Windows and other platforms as well.
Why Use wget with Python?
If you're working in Python, you may be wondering why you would want to use an external program like wget instead of a native Python library. Here are a few reasons:
- Simplicity: wget is invoked with a simple command line interface, so you don't need to write much code to use it in Python. This can be faster and more straightforward than using a Python library directly.
- Robustness: wget is a time-tested tool that can handle a variety of error conditions. It has configurable timeout and retry options to help ensure downloads succeed.
- Recursion: one of wget's most powerful features is its ability to recursively download content by following links. This enables you to easily mirror an entire website.
- Shell parity: if you're used to using wget on the command line, you can maintain a similar interface within Python.
- Subprocesses: using wget allows you to offload the work of downloading to a subprocess. This can help prevent your main Python process from blocking on long-running downloads.
That said, using an external tool like wget also has some downsides. It adds a dependency to your project, and may complicate installation. You also have less programmatic control than with a native library.
For basic download tasks, native Python libraries like urllib, requests, or aiohttp are likely a better choice. But for cases where you need recursion or advanced configuration, wget can be a helpful tool in your Python toolkit.
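To see the difference in practice, here's a rough sketch of what a basic single-file download looks like using only the standard library's urllib; the URL and output path are placeholders:

import os
import urllib.request

# Placeholder URL and output location, purely for illustration
url = 'https://example.com/file.zip'
output_path = os.path.join('downloads', 'file.zip')

os.makedirs('downloads', exist_ok=True)

# urlretrieve fetches the resource and writes it straight to disk
urllib.request.urlretrieve(url, output_path)

For a single file this is arguably just as simple as shelling out to wget; the trade-offs appear once you need recursion, retries, or link conversion.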
Installing wget
Before you can use wget in your Python programs, you'll need to install it on your system if it's not already available. How you do this depends on your operating system.
On many Unix-like systems, including Linux and macOS, wget may already be installed. You can check by opening a terminal and typing wget --version. If you see version information, you're all set. If not, you can install wget using your system's package manager.
For example, on Debian/Ubuntu systems, run:
sudo apt update
sudo apt install wget
On macOS, you can install wget with Homebrew:
brew install wget
On Windows, wget binaries are available, but are not included by default. To install, download the latest wget.exe from the official GnuWin32 site. Add the directory containing wget.exe to your PATH environment variable so you can execute it from any location.
Once you have wget installed, you can move on to calling it from Python.
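As an optional sanity check, you can ask Python whether a wget executable is actually on the PATH using the standard library's shutil.which. A minimal sketch:

import shutil

# shutil.which returns the full path to the executable, or None if it is not on the PATH
if shutil.which('wget') is None:
    raise SystemExit('wget is not installed or not on your PATH')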
Running wget in Python
To use wget in a Python script, we'll need a way to spawn a new process and execute shell commands. Python provides a few ways to do this, but we'll use the standard subprocess module.
Here's a simple example of running wget from Python to download a file:
import subprocess

url = 'https://example.com/file.zip'
output_dir = 'downloads'

subprocess.run(['wget', '-P', output_dir, url])
This code imports the subprocess module and defines the target URL and output directory. It then uses subprocess.run() to execute wget with a few arguments:
- -P downloads specifies the output directory
- The URL to download comes last
The subprocess.run() function waits for the wget process to finish before continuing. Note that if a file with the same name already exists in the output directory, wget saves the new download under a numbered name (such as file.zip.1) rather than overwriting it; pass the -N flag if you only want to download when the remote file is newer than your local copy.
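If you would rather not block while a long download runs, one option is to start wget with subprocess.Popen instead and collect the result later. A minimal sketch, using a placeholder URL:

import subprocess

url = 'https://example.com/file.zip'  # placeholder URL

# Popen starts wget and returns immediately instead of waiting for it to finish
process = subprocess.Popen(['wget', '-P', 'downloads', url])

# ... the script can do other work here while the download runs ...

# Later, wait for wget to exit and check its return code
if process.wait() == 0:
    print('Download finished')

For the rest of this article we'll stick with subprocess.run().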
You can extend this technique to handle error cases and capture output:
import subprocess

url = 'https://example.com/file.zip'
output_dir = 'downloads'

try:
    result = subprocess.run(['wget', '-P', output_dir, url],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            text=True)
    if result.returncode == 0:
        print('Download successful')
    else:
        print('Error occurred:')
        print(result.stderr)
except FileNotFoundError:
    print('Wget not found. Please install it on your system.')
Here we use a try/except block to handle the case where wget is not found on the system. We also capture the output and error streams by passing the stdout and stderr arguments, and setting text=True to decode them as strings.
After the subprocess completes, we check its return code to determine if the download was successful. If not, we print the error message captured from stderr.
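As a side note, on Python 3.7 and later you can pass capture_output=True as a shorthand for setting both stdout and stderr to PIPE, so the call above could also be written as:

result = subprocess.run(['wget', '-P', output_dir, url],
                        capture_output=True, text=True)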
Recursive Downloads
Downloading individual files is straightforward, but one of wget's most powerful features is its ability to recurse through links and download entire directory structures. You can enable this with the -r flag.
For example, here's how you might mirror an entire website with Python and wget:
import subprocess

url = 'https://example.com'
output_dir = 'website'

try:
    subprocess.run([
        'wget',
        '--recursive',
        '--page-requisites',
        '--html-extension',
        '--convert-links',
        '--no-parent',
        '-P', output_dir,
        url
    ], check=True)
except FileNotFoundError:
    print('Wget not found. Please install it on your system.')
except subprocess.CalledProcessError:
    print('Error downloading website')
This code uses several wget flags to customize the recursive download:
- --recursive enables recursive downloading through links
- --page-requisites downloads the CSS, JS, and images needed to render each page
- --html-extension saves files with a .html extension
- --convert-links updates links to point to local files
- --no-parent avoids ascending to parent directories
The result is a local copy of the website saved in the specified output directory, suitable for offline browsing. Note this may take a long time and consume significant disk space for large sites.
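If you want to keep a recursive download bounded, wget's --level flag limits how many links deep it follows and --quota caps the total amount retrieved. A rough sketch with arbitrary limits:

import subprocess

# Arbitrary example limits: follow links at most two levels deep and
# stop retrieving new files once roughly 100 MB has been downloaded
subprocess.run([
    'wget',
    '--recursive',
    '--level=2',
    '--quota=100m',
    '--no-parent',
    '-P', 'website',
    'https://example.com'
])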
Other wget Options
Wget has many other flags to customize its behavior. Here are a few you may find useful:
- -O specifies a custom output filename
- -t sets the number of retries on failure
- --limit-rate throttles the download speed
- --wait adds a delay between requests
- --user-agent sets a custom User-Agent header
- --header adds arbitrary HTTP headers to the request
You can include these and other flags as arguments to subprocess.run(). See the wget manual page for a full list of options.
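For instance, here's a sketch of a single-file download that combines a few of these options; the filename, retry count, rate limit, and User-Agent string are all just illustrative values:

import subprocess

subprocess.run([
    'wget',
    '-O', 'report.pdf',                 # save under a custom filename
    '-t', '3',                          # retry up to 3 times on failure
    '--limit-rate=200k',                # throttle to roughly 200 KB/s
    '--user-agent=my-downloader/1.0',   # illustrative User-Agent string
    'https://example.com/report.pdf'    # placeholder URL
])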
Alternatives to wget
While wget is a great choice for many download tasks, it's not the only option, even when working with Python. Here are a few alternatives to consider:
- curl: another popular command line tool for downloading files and making HTTP requests. Has a Python interface through the pycurl library.
- urllib: the standard Python library for working with URLs. Provides functions for downloading files and making HTTP requests.
- requests: a popular third-party library that provides a nicer API for making HTTP requests from Python. Great for interacting with web APIs.
- scrapy: a framework for writing web spiders that can crawl and download content from websites. Provides a high-level API for managing the crawling process.
Each of these tools has its own strengths and weaknesses. For simple one-off downloads, urllib or requests may be all you need. For more complex tasks involving authentication, cookies, or sessions, you might prefer requests or pycurl. And for large-scale web scraping tasks, scrapy is hard to beat.
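For example, a basic streaming download with requests (installed separately with pip install requests) might look something like this; the URL and filename are placeholders:

import requests

url = 'https://example.com/file.zip'  # placeholder URL

# Stream the response so large files are not held in memory all at once
response = requests.get(url, stream=True)
response.raise_for_status()

with open('file.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)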
That said, wget remains a solid choice for many cases due to its simplicity, robustness, and wide availability. It's a valuable tool to have in your Python arsenal.
Conclusion
In this article, we've seen how to use wget and Python together to download files and websites. Wget is a powerful tool for automating downloads, and its recursive mode makes it easy to mirror entire sites.
By using the subprocess module, we can invoke wget from a Python script and customize its behavior. This allows us to integrate wget into larger Python applications and workflows.
While there are many other options for downloading files in Python, wget remains a solid choice, especially for complex tasks that require recursion or advanced configuration. With a little bit of Python glue code, wget can be a valuable addition to your web scraping toolkit.