How to Load Local Files in Playwright for Web Scraping

Playwright is a popular browser automation library for web scraping and end-to-end testing. One of the handy features of Playwright is the ability to load local files on your filesystem instead of making requests to remote servers. This allows you to test and debug your Playwright scripts offline using local test pages.

In this comprehensive guide, we'll cover everything you need to know about loading local files in Playwright, including:

What are the benefits of loading local files?
How to load HTML, JSON, images, and other files.
Tips for handling relative file paths.
Example code snippets for Python, JavaScript, and TypeScript.
Common pitfalls and troubleshooting advice.
Best practices for using local files in a CI/CD pipeline.

After reading, you'll know how to work with local files in Playwright for mocking responses, building scrapers, and more.

Benefits of Loading Local Files

Here are some of the main benefits of loading local files in Playwright:

Work offline: Test and develop scripts without an internet connection. No need to rely on remote servers being available.
Faster performance: Fetching from the local disk is faster than making network requests.
Control test data: Have full control over test file contents instead of relying on unpredictable live sites.
Mock responses: Stub remote API responses with local JSON files.
Privacy: Avoid sending requests to third-party sites during development.
Prototype scrapers: Build scrapers against a local HTML copy before targeting live sites.
Consistent tests: Local files behave the same every time, giving reliable automated tests.

For these reasons, loading local files can boost productivity and test stability when working with Playwright.

Loading HTML Files

To load an HTML file in Playwright, use a file:// URL and provide the absolute file path.

Here is an example in Python:

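from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # file:// plus an absolute path gives three slashes on Linux/macOS
    page.goto("file:///home/user/local-test.html")

    # Print the serialized HTML of the loaded page
    print(page.content())

    browser.close()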

And in JavaScript:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('file:///home/user/local-test.html');

  console.log(await page.content());

  await browser.close();
})();

The key things to note are:

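Use a file:// URL. Combined with an absolute Unix path this gives three slashes (file:///home/...); on Windows the form is file:///C:/path/to/file.html.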
Specify the absolute file path, not a relative path.
The path should point to a static HTML file, not a directory.
Playwright will load and process the HTML just like a normal web page.

Once loaded, you can interact with the DOM, call page methods like page.click(), assert contents, and more.
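Because the URL must be absolute, it helps to build it from a path instead of typing it by hand. The sketch below uses pathlib from the Python standard library. The helper names to_file_url and print_local_title are made up for this illustration, and the Playwright import is deferred so the URL helper can be used on its own.

```python
from pathlib import Path


def to_file_url(path: str) -> str:
    """Turn a relative or absolute filesystem path into a file:// URL."""
    # resolve() makes the path absolute; as_uri() adds the scheme and
    # percent-encodes characters such as spaces.
    return Path(path).resolve().as_uri()


def print_local_title(path: str) -> None:
    """Open a local HTML file in Chromium and print its <title>."""
    # Imported here so to_file_url() works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(to_file_url(path))
        print(page.title())
        browser.close()
```

Calling print_local_title("local-test.html") from the directory that contains the file should then work regardless of where that directory lives on disk.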

Accessing Relative File Paths

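When you load an HTML file with page.goto() and a file:// URL, relative paths inside it resolve against the directory that contains the file. For example:

<!-- local-test.html -->

<img src="images/logo.png">

If the page is loaded from file:///home/user/local-test.html, the image is requested from file:///home/user/images/logo.png. The link fails only if that file does not exist, or if the HTML was injected with page.set_content(), in which case the page URL is about:blank and there is no directory to resolve against.

The base_url context option (baseURL in JavaScript) is a separate feature. It lets you pass relative URLs to methods such as page.goto(), and it does not change how references inside the HTML resolve:

context = browser.new_context(base_url="file:///home/user/")
page = context.new_page()
page.goto("local-test.html")  # should resolve to file:///home/user/local-test.html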
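A quick way to check where a relative reference will point is to resolve it against the page URL the way a browser would. The sketch below uses urljoin from the Python standard library. The helper name resolve_reference is illustrative, and the paths are the example paths used above.

```python
from urllib.parse import urljoin


def resolve_reference(page_url: str, reference: str) -> str:
    """Resolve a reference found in a page against that page's URL."""
    return urljoin(page_url, reference)


page_url = "file:///home/user/local-test.html"

# A reference next to the page stays inside /home/user/
print(resolve_reference(page_url, "images/logo.png"))

# A reference with ".." climbs one directory above the page
print(resolve_reference(page_url, "../shared/style.css"))
```

If the resolved URL does not match a file that actually exists on disk, the browser will fail to load that resource.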

Loading JSON Files

Local JSON files can be useful for stubbing API responses during development.

To answer a request with JSON instead of letting it reach the network, use page.route() to intercept it:

import json

# Fulfill any request whose URL ends in /data.json with mock data
page.route("**/data.json", lambda route: route.fulfill(
    content_type="application/json",
    body=json.dumps({
        "mock_key": "mock_response"
    })
))

Now any request whose URL ends in /data.json is fulfilled with the mock JSON data.

The same approach works for mocking images, PDFs, or any other file type: pass the file's bytes as body, or pass path= to route.fulfill() to have Playwright read the file from disk and infer the content type from its extension.
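To keep the mock data in a file on disk, one option is to read and validate it once, then hand the text to route.fulfill(). This is a minimal sketch: load_mock_body and mock_json are hypothetical helper names, and page is assumed to be a Playwright Page created elsewhere.

```python
import json
from pathlib import Path


def load_mock_body(path: str) -> str:
    """Read a JSON file and re-serialize it so malformed mocks fail early."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return json.dumps(data)


def mock_json(page, url_pattern: str, path: str) -> None:
    """Fulfill every request matching url_pattern with the JSON stored at path."""
    body = load_mock_body(path)  # read once, reused for every matching request
    page.route(
        url_pattern,
        lambda route: route.fulfill(
            status=200,
            content_type="application/json",
            body=body,
        ),
    )
```

A call such as mock_json(page, "**/data.json", "mocks/data.json") would be made before page.goto(), so the route is in place when the page issues its request. The mocks/ directory is an assumed location.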

Loading Images and Other Files

To load a local image or other file, reference it from an HTML page that was itself loaded over file://:

<!-- local-test.html -->

<img src="file:///home/user/image.png">

Or navigate to the file directly, and the browser displays it as it would any image URL:

page.goto("file:///home/user/image.png")

Note that page.set_content() is not suitable for this. It expects an HTML string, not raw file bytes.

For files such as PDFs that the page requests over HTTP, fulfill the request with route.fulfill() and set the appropriate content type.
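If you want to show a local image through page.set_content(), one workable approach is to embed the bytes in the HTML as a data: URI, so the page needs no file access at all. The sketch below uses only the standard library, and the helper names image_data_uri and image_page_html are made up for this example.

```python
import base64


def image_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data: URI usable in an <img> src."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_page_html(data: bytes, mime_type: str = "image/png") -> str:
    """Build a tiny HTML document that displays the given image bytes."""
    src = image_data_uri(data, mime_type)
    return f'<!DOCTYPE html><img alt="local image" src="{src}">'
```

With a page already open, the call would look like page.set_content(image_page_html(open("image.png", "rb").read())). This suits small images. Large files inflate the HTML by roughly a third because of base64 encoding.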

Playwright Code Examples

Here are short snippets for loading different file types in Playwright. Each assumes a page object has already been created.

HTML

Python

page.goto("file:///home/user/local-test.html")

JavaScript

await page.goto('file:///home/user/local-test.html');

TypeScript

await page.goto('file:///home/user/local-test.html');

JSON

Python

import json

page.route("**/data.json", lambda route: route.fulfill(
  content_type="application/json",
  body=json.dumps({"mock_key": "mock_value"})
))

JavaScript

await page.route('**/data.json', async route => {
  await route.fulfill({
    contentType: 'application/json',
    body: JSON.stringify({ mock_key: 'mock_value' }),
  });
});

TypeScript

await page.route('**/data.json', async route => {
  await route.fulfill({
    contentType: 'application/json',
    body: JSON.stringify({ mock_key: 'mock_value' }),
  });
});

Images

Python

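from pathlib import Path

# Navigate straight to the image; as_uri() needs an absolute path
page.goto(Path("image.png").resolve().as_uri())

JavaScript

const { pathToFileURL } = require('url');

// pathToFileURL() resolves the path against the current working directory
await page.goto(pathToFileURL('image.png').href);

TypeScript

import { pathToFileURL } from 'url';

await page.goto(pathToFileURL('image.png').href);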

PDFs

Python

# Answer any request for doc.pdf with the local file
page.route("**/doc.pdf", lambda route: route.fulfill(
    path="doc.pdf",
    content_type="application/pdf",
))

JavaScript

// Answer any request for doc.pdf with the local file
await page.route('**/doc.pdf', route => route.fulfill({
  path: 'doc.pdf',
  contentType: 'application/pdf',
}));

TypeScript

// Answer any request for doc.pdf with the local file
await page.route('**/doc.pdf', route => route.fulfill({
  path: 'doc.pdf',
  contentType: 'application/pdf',
}));

As you can see, the approach is very similar across languages. The main differences are naming conventions: snake_case options in Python, camelCase in JavaScript and TypeScript.
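When a page needs many local files, the per-file routes above can be folded into a single handler that maps a made-up origin onto a directory. The following is a sketch under stated assumptions, not an official Playwright recipe: local_file_for and serve_directory are hypothetical helpers, https://local.test is an invented origin, and page is a Playwright Page created elsewhere.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlparse


def local_file_for(url: str, root: Path) -> Optional[Path]:
    """Map the path part of url to a file under root.

    Returns None when the file does not exist or the path escapes root.
    """
    relative = unquote(urlparse(url).path).lstrip("/")
    root = root.resolve()
    candidate = (root / relative).resolve()
    if root not in candidate.parents or not candidate.is_file():
        return None
    return candidate


def serve_directory(page, origin: str, root: Path) -> None:
    """Fulfill every request to origin from files under root."""

    def handler(route):
        target = local_file_for(route.request.url, root)
        if target is None:
            route.fulfill(status=404, body="not found")
        else:
            # With path=, Playwright reads the file and infers the
            # content type from the file extension.
            route.fulfill(path=str(target))

    page.route(f"{origin}/**", handler)
```

After serve_directory(page, "https://local.test", Path("fixtures")), a call to page.goto("https://local.test/index.html") would be answered from fixtures/index.html without any network traffic. Serving over an https-looking origin also sidesteps the restrictions browsers place on file:// pages.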

Troubleshooting Local File Loading

Here are some common issues and solutions when loading local files with Playwright:

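File Not Found

A missing local file does not produce an HTTP 404; in Chromium, page.goto() fails with net::ERR_FILE_NOT_FOUND instead.

Double check the file path is absolute, not relative.
Verify the file exists at that location on disk.
Check the filename's case, especially on Linux, where file systems are typically case-sensitive.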

Cross-Origin Request Blocked

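Chromium treats pages loaded from file:// as having a unique origin, so fetch() or XMLHttpRequest calls made by the page, including calls to other local files, are often blocked as cross-origin. Intercept those requests with page.route(), or serve the files from a local HTTP server.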

Mixed Content Warnings

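These appear when a page served over https:// loads resources over http://. They typically do not apply to pages loaded from file://, so if you see one, check that the page was actually loaded from disk and not from an HTTPS origin.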

Allow File Access in Chrome

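Chromium blocks a file:// page from reading other local files through fetch() or XMLHttpRequest unless it is started with the --allow-file-access-from-files flag, which you can pass through the args launch option.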

Sandbox Issues

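Some environments like Docker restrict what the browser can do. If Chromium fails to start in a container, you may need to launch it with --no-sandbox. Also make sure the test files are copied or mounted into the container, since file:// paths refer to the container's own filesystem.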

Relative Paths Not Working

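Relative references resolve against the directory of the HTML file loaded over file://. If they fail, check that the referenced files sit where the page expects them and that the page was loaded with page.goto(), not page.set_content().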

Encoding Issues

Reading binary files such as images or PDFs in text mode can corrupt them. Open them in binary mode and pass the resulting bytes as the body of route.fulfill().

Checking for these common problems will help resolve most local file loading issues.
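Several of these problems can be caught before the browser is involved. The sketch below checks a path and returns a file:// URL only when it points at an existing file. checked_file_url is a hypothetical helper name, and only the standard library is used.

```python
from pathlib import Path


def checked_file_url(path: str) -> str:
    """Validate a local path and return it as a file:// URL.

    Raises FileNotFoundError or IsADirectoryError with the resolved path,
    which is usually easier to read than a browser navigation error.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.exists():
        raise FileNotFoundError(f"No such file: {resolved}")
    if resolved.is_dir():
        raise IsADirectoryError(f"Expected a file, found a directory: {resolved}")
    return resolved.as_uri()
```

Passing the result to page.goto() means a wrong working directory or a typo in the filename shows up as a Python exception that names the exact path that was tried.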

Local File Best Practices

Here are some best practices to follow when using local files in Playwright:

Keep production and test code separate – Don't use local files in your main codebase. Only use them in tests.
Commit local files to source control – Add your local test files to Git/GitHub to share with other developers.
Use descriptive filenames – Like mock-api-response.json instead of file1.json.
Load once, reuse everywhere – Load local files in setup hooks or fixtures (for example beforeAll() in Playwright Test or a pytest fixture) and reuse them across tests.
Use variables for paths – Avoid hardcoding file paths; use variables like LOCAL_HTML_PATH instead.
Serve files locally – For full end-to-end tests, run a local dev server to serve test files.
Clean up when done – Delete temporary local files after your test run finishes.

By following these tips, you can robustly incorporate local files into your Playwright test suites.
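The advice about temporary files and cleanup can be sketched with a context manager that writes a page to a temporary directory and removes it afterwards. This is an illustration only: temporary_page is a made-up helper built on the standard library tempfile module.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_page(html: str, name: str = "page.html"):
    """Write html to a temporary file and yield its Path.

    The directory and the file are deleted when the with-block exits,
    even if the test inside it raises.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / name
        path.write_text(html, encoding="utf-8")
        yield path
```

Inside a test, the usage would be along the lines of "with temporary_page(html) as path: page.goto(path.as_uri())". The browser must finish with the file before the block exits, because the file no longer exists afterwards.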

Using Local Files in CI/CD

For CI/CD environments like GitHub Actions, there are several useful techniques for dealing with local test files:

Commit files directly – Add test files directly to your repo. Then GitHub Actions can access them.
Bundle files in workflow – Upload test files as workflow artifacts that get passed between jobs.
Generate files dynamically – Have CI workflow generate files on the fly to avoid committing them.
Use file server – Run a local file server within CI and access files over HTTP.
Cache files – Cache local test files between CI runs for faster performance.

Overall, it's best to avoid depending on files that exist only on a particular CI machine. Committing files directly keeps things simple in most cases.
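The file server option can be done with the Python standard library alone. The sketch below starts http.server on a free port in a background thread and shuts it down afterwards. serve_files is a hypothetical helper name, and binding to 127.0.0.1 is an assumption that suits a single CI job.

```python
import functools
import threading
from contextlib import contextmanager
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class _QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not print a log line per request."""

    def log_message(self, format, *args):
        pass


@contextmanager
def serve_files(directory: str):
    """Serve directory over HTTP on a free local port and yield the base URL."""
    handler = functools.partial(_QuietHandler, directory=directory)
    # Port 0 asks the operating system for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_address[1]}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

A test could then run "with serve_files('tests/fixtures') as base_url:" and call page.goto(base_url + "/index.html"). The tests/fixtures directory is an assumed location. Pages served this way behave like normal HTTP pages, which avoids the file:// restrictions described in the troubleshooting section.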

Conclusion

Loading local files is a handy trick for creating faster, more reliable tests with Playwright. Mocking responses, previewing scrapers, and working offline are just a few benefits.

With Playwright's support for file:// URLs, its request routing, and its content manipulation APIs, you have all the tools needed to incorporate local files into your browser automation scripts. Just be sure to use absolute file paths and handle binary data with care.

Following the examples and best practices in this guide should make local file loading in Playwright routine. So ditch those remote servers while developing your next web scraping or testing tool!
