If you need to automate downloading files from websites, the Playwright library for Python makes it easy. Playwright allows you to launch and control a real browser programmatically. You can navigate to web pages, interact with elements, and trigger downloads, just like a human user would.
In this guide, we'll walk through how to use Playwright in Python to download files from the web. We'll cover two methods:
- Clicking a download link/button directly
- Extracting the download URL and using requests
We'll also look at some tips and best practices to make your download automation robust and efficient. Let's get started!
Launching a Browser and Navigating to a Download Page
First, make sure you have Playwright installed. Install the package with pip, then download the browser binaries that Playwright controls:
pip install playwright
playwright install
Now let's write some Python code to launch a browser instance and navigate to a web page containing a file we want to download. Here's a basic example:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/download")
    # Rest of code here
    browser.close()
This code launches a new Chromium browser window (you can also use Firefox or WebKit) and navigates to https://example.com/download. The headless=False argument makes the browser visible so you can see what's happening.
Downloading Files by Clicking Links/Buttons
Most of the time, downloading a file is as simple as clicking a link or button on the page. With Playwright, you can automate this process.
Let's say the download page has a button with the ID "download-btn" that triggers the file download when clicked. Here's how we can click it using Playwright:
download_button = page.locator("#download-btn")
download_button.click()
The locator method creates a locator that describes how to find an element on the page based on a selector (in this case, an ID selector). We store it in the download_button variable. Then we simply call click() on it; Playwright resolves the element at that moment, waits until it is clickable, and simulates a mouse click.
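That's all it takes to trigger the download. Note, however, that Playwright does not save the file to your system's usual downloads folder: it writes the file to a temporary location under a generated name and deletes it when the browser context closes. To keep the file, capture the Download object and call its save_as() method with a destination path.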
If the download link is an <a> tag instead of a button, you can use the same approach to click it:
download_link = page.locator("a.download-link")
download_link.click()
Waiting for Downloads to Finish
For short downloads, the file will probably be saved almost instantly. But for larger files, you may need to wait for the download to finish before moving on in your script.
Playwright provides a helpful expect_download method that captures a download as soon as it starts and gives you an object you can use to wait for it to complete. Here's how you can use it:
# Start listening for the download before triggering it
with page.expect_download() as download_info:
    download_button.click()
# The Download object is available once the download has started
download = download_info.value
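The with block tells Playwright to watch for the next download that is triggered inside it. When the block exits, the download has started, but it has not necessarily finished. Calling download.path() or download.save_as() afterwards blocks until the file has been fully written.
The download object has some useful members, such as download.suggested_filename (the file name proposed by the server or page), download.url, and download.path(), a method that waits for the download to complete and returns the path of the temporary file.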
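To make the saving step concrete, here is a minimal sketch that keeps a captured download under its suggested file name. The helper names save_download and download_from_page, the "downloads" folder, and the "#download-btn" selector are assumptions for illustration; suggested_filename and save_as() come from Playwright's Download object.

```python
from pathlib import Path


def save_download(download, target_dir):
    """Save a Playwright Download into target_dir and return the final path.

    save_as() blocks until the download has finished, so no extra wait is
    needed. (save_download is an illustrative helper, not a Playwright API.)
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / download.suggested_filename
    download.save_as(destination)
    return destination


def download_from_page(url, selector, target_dir):
    """Open url, click selector, and save the resulting download."""
    # Imported here so save_download stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Start listening before the click so the download is not missed.
        with page.expect_download() as download_info:
            page.locator(selector).click()
        # Save before closing the browser; temporary files are removed on close.
        saved = save_download(download_info.value, target_dir)
        browser.close()
    return saved
```

A call such as download_from_page("https://example.com/download", "#download-btn", "downloads") would then return the path of the saved file, assuming the page and selector exist.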
Downloading Files Using a URL and requests
Sometimes clicking a download link or button doesn't actually download the file, but opens it in the browser instead. In this case, you'll need to get the actual download URL and retrieve the file data manually.
You can extract the URL from link elements like this:
download_url = page.locator("a.download-link").get_attribute("href")
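This finds an <a> element with the class "download-link" and gets the value of its href attribute, which contains the URL. If the href is relative (for example /files/report.pdf), resolve it against page.url before requesting it.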
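Because an href is often relative, it may help to normalize it before handing it to an HTTP client. The sketch below uses the standard library's urljoin; the helper name resolve_download_url and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_download_url(page_url, href):
    """Turn an href taken from a link into an absolute URL.

    Absolute hrefs are returned unchanged; relative ones are resolved
    against the URL of the page the link was found on.
    """
    if not href:
        # get_attribute() returns None when the attribute is missing.
        raise ValueError("link has no href attribute")
    return urljoin(page_url, href)


# Example values; in a script these would come from page.url and
# page.locator("a.download-link").get_attribute("href").
print(resolve_download_url("https://example.com/download", "/files/report.pdf"))
# prints https://example.com/files/report.pdf
```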
Once you have the URL, you can use the requests library to download the file data. First install requests if you don't have it already:
pip install requests
Then add this code to your script:
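import requests

# Download the file data and fail loudly on an HTTP error status
response = requests.get(download_url, timeout=30)
response.raise_for_status()
file_data = response.content
# Save the data to a local file
with open("downloaded_file.pdf", "wb") as file:
    file.write(file_data)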
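This sends a GET request to the download URL, retrieves the raw file data, and saves it to a local file named "downloaded_file.pdf". The "wb" mode ensures the file is written in binary format, which is necessary for files like PDFs and images. Keep in mind that requests runs outside the browser and does not share its cookies, so this works as-is only for URLs that don't require the browser's logged-in session.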
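If the file sits behind a login that was completed in the browser, one possible approach is to copy the browser's cookies into the requests call and stream the response to disk. This is a sketch under assumptions: the helper names cookies_for_requests and download_with_browser_session are hypothetical, and some sites also check headers such as the user agent, which this does not replicate.

```python
def cookies_for_requests(browser_cookies):
    """Convert Playwright's cookie list into a name -> value dict.

    Playwright returns cookies as a list of dicts with "name" and "value"
    keys (among others); requests accepts a plain dict.
    """
    return {cookie["name"]: cookie["value"] for cookie in browser_cookies}


def download_with_browser_session(page, url, destination, chunk_size=65536):
    """Fetch url with requests, reusing the cookies of the page's context."""
    # Imported here so cookies_for_requests works without requests installed.
    import requests

    # Passing the URL limits the result to cookies that apply to it.
    cookies = cookies_for_requests(page.context.cookies(url))
    with requests.get(url, cookies=cookies, stream=True, timeout=30) as response:
        response.raise_for_status()
        # Write in chunks so large files are not held in memory at once.
        with open(destination, "wb") as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
    return destination
```

Streaming with iter_content also avoids loading a large file into memory in one piece, which the simpler .content approach does.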
Configuring Where Files are Downloaded
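By default, Playwright writes downloads to a temporary folder rather than your system's downloads folder. If you want them written to a specific folder instead, you can pass a downloads_path argument when launching the browser:
browser = p.chromium.launch(downloads_path="/path/to/download/folder")
Files downloaded by this browser instance will be written to the specified folder under generated names. In either case, they are deleted when the browser context that produced them closes, so call download.save_as() with your own path for any file you want to keep.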
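When saving files into a folder of your own, repeated runs can overwrite earlier downloads that share a name. The following sketch picks a non-clashing destination; unique_destination is a hypothetical helper and the "report (1).pdf" naming scheme is just one possible convention.

```python
from pathlib import Path


def unique_destination(directory, filename):
    """Return a path inside directory that does not overwrite an existing file.

    report.pdf becomes "report (1).pdf", "report (2).pdf", ... if needed.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Keep only the final component so the name cannot escape the directory.
    filename = Path(filename).name
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate
```

With a captured download, this could be used as download.save_as(unique_destination("downloads", download.suggested_filename)).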
Handling Different File Types
The basic download process is the same for all types of files. However, you may need to adapt your code slightly depending on the file type.
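For example, when saving raw file data using requests, remember that response.content is always bytes, so the mode you open the local file in must match what you write:
- For binary files (PDF, XLSX, JPG, etc.), use "wb" mode and write response.content
- For text files (TXT, CSV, etc.), "wb" with response.content also works and preserves the original encoding; use "w" mode only if you write the decoded response.text instead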
If you're extracting download URLs from links on the page, pay attention to the file extension in the URL so you know how to save it locally. You can also look at the Content-Type header in the download response to determine the file type.
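As a rough illustration of using the Content-Type header, the standard library's mimetypes module can map a MIME type to a file extension. The helper name extension_from_content_type and the ".bin" fallback are assumptions, and servers sometimes send inaccurate headers, so treat the result as a hint.

```python
import mimetypes


def extension_from_content_type(content_type, default=".bin"):
    """Map a Content-Type header value to a file extension.

    Falls back to default when the header is missing or unrecognized.
    """
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type) or default


print(extension_from_content_type("application/pdf"))  # prints .pdf
```

With requests, the header value would come from response.headers.get("Content-Type").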
Troubleshooting Download Issues
Downloads can sometimes fail for various reasons. Here are a few things to check if your script isn't downloading files correctly:
- Make sure you're waiting for the download to finish before trying to access or save the file. Use expect_download to capture the download, then call download.path() or download.save_as(), both of which block until it completes.
- Check that your locator is finding the correct download link/button on the page. Print the element text or attributes to verify.
- Look for any errors or unexpected behaviors in the browser window, especially if running in non-headless mode. You may need to add delays or wait for certain elements to appear before clicking.
- Inspect the network activity in the browser's Developer Tools to see if the download request is being made and what the server response is. You can also log the download URL and make sure it's correct.
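Some of these checks can be scripted. The sketch below reports whether a click produced a download at all and whether that download failed; describe_download and click_and_report are hypothetical helper names, and the 15-second timeout is an arbitrary choice.

```python
def describe_download(download):
    """Return a one-line status for a Playwright Download object.

    failure() waits for the download to end and returns None on success
    or an error message otherwise.
    """
    error = download.failure()
    if error is not None:
        return f"FAILED {download.url}: {error}"
    return f"OK {download.url} -> {download.suggested_filename}"


def click_and_report(page, selector, timeout_ms=15000):
    """Click selector and report whether a download started and finished."""
    # Imported here so describe_download can be used without Playwright.
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

    try:
        with page.expect_download(timeout=timeout_ms) as download_info:
            page.locator(selector).click()
    except PlaywrightTimeoutError as exc:
        # Either the element could not be clicked or no download began in time.
        return f"Timed out for {selector}: {exc}"
    return describe_download(download_info.value)
```

Printing the result of click_and_report(page, "#download-btn") helps distinguish a wrong selector or a missing download from a download that started but failed.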
Conclusion
Downloading files using Playwright and Python is a powerful technique for automating data extraction from websites. With just a few lines of code, you can launch a browser, navigate to download pages, click links and buttons to trigger downloads, and retrieve files programmatically.
Whether you need to download files to collect datasets, back up web content, audit website resources, or for any other purpose, Playwright makes it easy and flexible. Combine it with other Python libraries like requests and you can build robust download systems.
The key things to remember are:
- Launch a browser with sync_playwright and navigate to the download page
- Locate and click the download link/button using page.locator(...).click()
- Capture the download with expect_download, then keep the file with download.save_as(...), which waits for it to finish
- Alternatively, extract the download URL and use requests.get(...).content to retrieve the file data
- Write downloaded bytes to disk in binary mode ("wb")
With these techniques, you'll be able to automate file downloads from most websites. Happy downloading!