How to Download Files Using Playwright and Python

If you need to automate downloading files from websites, the Playwright library for Python makes it easy. Playwright allows you to launch and control a real browser programmatically. You can navigate to web pages, interact with elements, and trigger downloads, just like a human user would.

In this guide, we'll walk through how to use Playwright in Python to download files from the web. We'll cover two methods:

Clicking a download link/button directly
Extracting the download URL and using requests

We'll also look at some tips and best practices to make your download automation robust and efficient. Let's get started!

Launching a Browser and Navigating to a Download Page

First, make sure you have Playwright installed. Install the package with pip, then download the browser binaries that Playwright drives:

pip install playwright
playwright install

Now let's write some Python code to launch a browser instance and navigate to a web page containing a file we want to download. Here's a basic example:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/download")

    # Rest of code here

    browser.close()

This code launches a new Chromium browser window (you can also use Firefox or WebKit) and navigates to https://example.com/download. The headless=False argument makes the browser visible so you can see what's happening.

Downloading Files by Clicking Links/Buttons

Most of the time, downloading a file is as simple as clicking a link or button on the page. With Playwright, you can automate this process.

Let's say the download page has a button with the ID "download-btn" that triggers the file download when clicked. Here's how we can click it using Playwright:

download_button = page.locator("#download-btn")
download_button.click()

The locator method returns a Locator for a selector (in this case, an ID selector), which we store in the download_button variable. The locator does not look up the element right away: when we call click() on it, Playwright finds the element, waits until it is ready to be clicked, and then simulates a mouse click.

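That's all it takes to trigger the download. Note, however, that Playwright does not save the file to your system's downloads folder: downloaded files go to a temporary location and are deleted when the browser context that produced them closes. To keep a file, you need to capture the Download object and call its save_as method with a destination path.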

If the download link is an <a> tag instead of a button, you can use the same approach to click it:

download_link = page.locator("a.download-link")
download_link.click()

Waiting for Downloads to Finish

For short downloads, the file will probably be saved almost instantly. But for larger files, you may need to wait for the download to finish before moving on in your script.

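Playwright provides an expect_download method that captures the download triggered inside its with block. Here's how you can use it:

# Start listening for the download before triggering it
with page.expect_download() as download_info:
    download_button.click()

# The Download object is available as soon as the download starts
download = download_info.value

# path() blocks until the download has finished
print(download.path())

# Copy the file out of Playwright's temporary folder to keep it
download.save_as(download.suggested_filename)

The with block tells Playwright to listen for the next download started on the page. When the block exits, Playwright has waited for the download to begin (not to finish), and download_info.value returns a Download object.

The Download object has some useful members: download.path() waits for the download to finish and returns the temporary path of the file, download.suggested_filename is the file name proposed by the server or page, and download.save_as(path) copies the file to a location of your choice.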
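The click-and-save steps can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: download_via_click and its parameters are hypothetical names chosen for illustration, and it assumes the page object is a Playwright sync-API Page on which the selector triggers exactly one download.

```python
from pathlib import Path


def download_via_click(page, selector, dest_dir):
    """Click `selector` on `page` and save the resulting download in `dest_dir`.

    Hypothetical helper: `page` is assumed to be a Playwright sync-API Page.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Start listening before the click so the download event is not missed.
    with page.expect_download() as download_info:
        page.locator(selector).click()

    download = download_info.value

    # Keep only the final path component of the suggested name so the file
    # cannot be written outside dest_dir.
    target = dest_dir / Path(download.suggested_filename).name

    # save_as waits for the download to finish, then copies the file.
    download.save_as(target)
    return target
```

Inside the with sync_playwright() block, a call such as download_via_click(page, "#download-btn", "downloads") would return the path of the saved file.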

Downloading Files Using a URL and requests

Sometimes clicking a download link or button doesn't actually download the file, but opens it in the browser instead. In this case, you'll need to get the actual download URL and retrieve the file data manually.

You can extract the URL from link elements like this:

download_url = page.locator("a.download-link").get_attribute("href")

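This finds an <a> element with the class "download-link" and gets the value of its href attribute, which contains the URL. Note that get_attribute returns None if the attribute is missing, and that href may hold a relative URL, which must be resolved against the page URL before it can be requested.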
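If the href turns out to be relative, the standard library can resolve it against the address of the page it came from. The following is a small sketch; resolve_download_url is a hypothetical helper name, and the URLs are placeholders.

```python
from urllib.parse import urljoin


def resolve_download_url(page_url, href):
    """Turn an href taken from a page into an absolute URL."""
    if not href:
        raise ValueError("the link has no href attribute")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL of the page they were found on.
    return urljoin(page_url, href)


print(resolve_download_url("https://example.com/download", "/files/report.pdf"))
```

In a Playwright script, page.url gives the current page address, so the call would look like resolve_download_url(page.url, download_url).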

Once you have the URL, you can use the requests library to download the file data. First install requests if you don't have it already:

pip install requests

Then add this code to your script:

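import requests

# Download the file data, failing fast on a stalled server or an HTTP error
response = requests.get(download_url, timeout=30)
response.raise_for_status()
file_data = response.content

# Save the data to a local file
with open("downloaded_file.pdf", "wb") as file:
    file.write(file_data)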

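This sends a GET request to the download URL, retrieves the raw file data, and saves it to a local file named "downloaded_file.pdf". The "wb" mode ensures the file is written in binary format, which is necessary for files like PDFs and images. Keep in mind that requests does not share the browser's cookies or login state, so this approach works as written only for URLs that are accessible without the browser session.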
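For large files, holding the whole body in memory may not be desirable. One option is to stream the response to disk in chunks. The sketch below assumes the URL is reachable without authentication; save_streamed_response and download_file are hypothetical helper names, and the 30-second timeout and 64 KiB chunk size are arbitrary choices.

```python
from pathlib import Path


def save_streamed_response(response, destination, chunk_size=64 * 1024):
    """Write a streamed requests response to `destination` chunk by chunk."""
    response.raise_for_status()
    destination = Path(destination)
    with destination.open("wb") as file:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                file.write(chunk)
    return destination


def download_file(url, destination, timeout=30):
    """Fetch `url` with requests and stream the body to `destination`."""
    # Imported here so save_streamed_response can be reused (and tested)
    # without requests being installed.
    import requests

    # stream=True defers reading the body until iter_content is called.
    with requests.get(url, stream=True, timeout=timeout) as response:
        return save_streamed_response(response, destination)
```

A call such as download_file(download_url, "downloaded_file.pdf") would then replace the requests.get(...).content step.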

Configuring Where Files are Downloaded

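By default, Playwright downloads files to a temporary folder that it deletes when the browser is closed. If you want the browser to use a different folder, you can pass a downloads_path argument when launching it:

browser = p.chromium.launch(downloads_path="/path/to/download/folder")

All files downloaded by this browser instance will be written to the specified folder. Playwright still stores them under generated names and deletes them when the browser context that produced them closes, so call download.save_as to keep a copy under a meaningful name.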

Handling Different File Types

The basic download process is the same for all types of files. However, you may need to adapt your code slightly depending on the file type.

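For example, when saving data retrieved with requests, match the file mode to the kind of data you write:

For text files (TXT, CSV, etc.), write requests.get(...).text using "w" mode, or write requests.get(...).content using "wb" mode to keep the original bytes unchanged
For binary files (PDF, XLSX, JPG, etc.), write requests.get(...).content using "wb" mode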

If you're extracting download URLs from links on the page, pay attention to the file extension in the URL so you know how to save it locally. You can also look at the Content-Type header in the download response to determine the file type.
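As a rough illustration of the Content-Type approach, the standard mimetypes module can map a header value to a file extension. This is only a sketch: extension_from_content_type is a hypothetical helper, the ".bin" fallback is an arbitrary choice, and the mapping depends on the MIME types known to your Python installation.

```python
import mimetypes


def extension_from_content_type(content_type, fallback=".bin"):
    """Guess a file extension from a Content-Type header value."""
    if not content_type:
        return fallback
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type) or fallback


print(extension_from_content_type("application/pdf"))
```

With requests, the header value is available as response.headers.get("Content-Type").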

Troubleshooting Download Issues

Downloads can sometimes fail for various reasons. Here are a few things to check if your script isn't downloading files correctly:

Make sure you're waiting for the download to finish before trying to access or save the file. Capture it with expect_download, then call download.path() or download.save_as(...), both of which wait for completion.
Check that your locator is finding the correct download link/button on the page. Print the element text or attributes to verify.
Look for any errors or unexpected behaviors in the browser window, especially if running in non-headless mode. You may need to wait for certain elements to appear before clicking.
Inspect the network activity in the browser's Developer Tools to see if the download request is being made and what the server response is. You can also log the download URL and make sure it's correct.
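To check what a locator actually matches, a short diagnostic along these lines may help. It is a sketch under the assumption that page is a Playwright sync-API Page; describe_match is a hypothetical helper name.

```python
def describe_match(page, selector):
    """Summarise what `selector` matches on `page`, for debugging."""
    locator = page.locator(selector)
    count = locator.count()
    if count == 0:
        return f"{selector!r} matched no elements"

    # Inspect only the first match; more than one match usually means the
    # selector is too broad for a reliable click.
    first = locator.first
    text = (first.text_content() or "").strip()
    href = first.get_attribute("href")
    return f"{selector!r} matched {count} element(s); first text={text!r}, href={href!r}"
```

Printing describe_match(page, "a.download-link") just before the click shows whether the selector matches nothing, one element, or several.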

Conclusion

Downloading files using Playwright and Python is a powerful technique for automating data extraction from websites. With just a few lines of code, you can launch a browser, navigate to download pages, click links and buttons to trigger downloads, and retrieve files programmatically.

Whether you need to download files to collect datasets, backup web content, audit website resources, or any other purpose, Playwright makes it easy and flexible. Combine it with other Python libraries like requests and you can build robust download systems.

The key things to remember are:

Launch a browser with sync_playwright and navigate to the download page
Locate and click the download link/button using page.locator(...).click()
Capture the download with expect_download and keep the file with download.save_as(...)
Alternatively, extract the download URL and use requests.get(...).content to retrieve the file data
Save downloaded files in binary mode for non-text formats

With these techniques, you'll be able to automate file downloads from most websites. Happy downloading!
