
How to save and load cookies in Playwright? | ScrapingBee

Saving and Loading Cookies in Playwright for Web Scraping

When performing web scraping and automation tasks with Playwright, being able to save and load cookies can be extremely useful. Cookies allow websites to store session information and maintain a logged-in state. By saving cookies and loading them in subsequent script runs, you can avoid having to repeatedly log in, maintain session continuity, and even bypass certain anti-bot measures.

In this article, we'll take an in-depth look at how to work with cookies in Playwright using Python. I'll show you exactly how to save cookies to a file and load them again later. We'll walk through a full code example to demonstrate the process from start to finish.

Whether you're new to Playwright or an experienced user looking to take your web automation to the next level, understanding how to manage cookies is an essential skill. It will help make your scripts more efficient, robust, and stealthy. Let's dive in!

What are cookies and why are they important?

Cookies are small pieces of data that websites store in a user's browser. They contain information about the user's interactions with the site, such as login credentials, preferences, and session data. When a user visits the site again, the browser sends back any relevant cookies, allowing the website to recognize the user and maintain their logged-in session.

For web scraping, being able to save and load cookies provides several key benefits:

  1. Maintain logged-in sessions: If your scraping task requires being logged in to access certain pages or data, you can avoid having to enter credentials each time by saving the auth cookies and loading them in subsequent runs. This saves time and makes your script more efficient.

  2. Bypass rate limiting: Many websites use rate limiting and other anti-bot measures that track IP addresses and cookies to throttle suspected bots. By rotating your cookies and IP with each request, you can avoid triggering these defenses and collect data more reliably.

  3. Improve performance: Logging in programmatically can take some time, especially if you need to wait for pages to load or solve CAPTCHAs. Loading pre-saved cookies allows you to skip this process and get straight to scraping.

Now that we understand why cookies matter for web scraping, let's look at how to actually save and load them in Playwright.

How to save cookies in Playwright

Playwright provides a simple way to retrieve all cookies for a given browser context using the context.cookies() method. This returns an array of cookie objects, each containing the name, value, domain, path, expiration, and other attributes of the cookie.

Here's a basic example of getting the current cookies and saving them to a JSON file:

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context() 
    page = context.new_page()
    page.goto("https://example.com")

    # Get cookies and save to file
    cookies = context.cookies()
    with open("cookies.json", "w") as f:
        json.dump(cookies, f)

    browser.close()

This code launches a new browser instance, creates a context and page, and navigates to a URL. We then call context.cookies() to get all current cookies for the context. Finally, we use the json module to save the cookies array to a file named cookies.json.

The JSON format is ideal for storing cookies because it can fully represent their nested structure and data types. Here's an example of what the contents of cookies.json might look like:

[
  {
    "name": "session_id",
    "value": "abc123",
    "domain": "example.com",
    "path": "/",
    "expires": 1622851200.111111, 
    "httpOnly": true,
    "secure": false, 
    "sameSite": "Lax"
  },
  {
    "name": "user_token",  
    "value": "def456",
    "domain": "example.com",
    "path": "/",
    "expires": 1625443200.222222,
    "httpOnly": true, 
    "secure": true,
    "sameSite": "Strict"    
  }
]

Each cookie is represented by a dictionary object containing its name-value pair and additional attributes controlling when and how the cookie is sent by the browser.
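Because every cookie attribute maps to a plain JSON type (string, number, or boolean), a cookie survives a dump-and-load round trip unchanged. A quick self-contained sketch with made-up values:

```python
import json

# A cookie dict in the shape Playwright's context.cookies() returns
cookie = {
    "name": "session_id",
    "value": "abc123",
    "domain": "example.com",
    "path": "/",
    "expires": 1622851200.111111,
    "httpOnly": True,
    "secure": False,
    "sameSite": "Lax",
}

# Serialize and parse back: str, float, and bool all survive intact
restored = json.loads(json.dumps([cookie]))[0]
assert restored == cookie
```

This is exactly what happens between the save script's json.dump() and the load script's json.load().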

How to load cookies in Playwright

Loading saved cookies in Playwright is done using the context.add_cookies() method. This takes an array of cookie objects and adds them to the browser context.

Continuing our example from earlier, here's how we can load the saved cookies from cookies.json:

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()

    # Load cookies from file 
    with open("cookies.json", "r") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)

    # Cookies now loaded, navigate to page
    page = context.new_page()
    page.goto("https://example.com") 

    browser.close()

We start by launching the browser and creating a new context as before. But before navigating to a page, we load the cookies from cookies.json using json.load(). This parses the JSON data into a Python array of cookie objects, which we then pass to context.add_cookies().

The add_cookies() method adds each cookie in the array to the context, overwriting any existing cookies with the same name, domain, and path. Once the cookies are loaded, we can navigate to a page and the browser will include them in its requests. This allows us to pick up where we left off in terms of auth and session state.
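One caveat: cookies in the file may have expired between runs. Before calling add_cookies(), it can be worth dropping stale entries. Here is a small sketch; fresh_cookies is a hypothetical helper of our own, not part of Playwright's API, and it relies on Playwright marking session cookies with an expires value of -1:

```python
import time

def fresh_cookies(cookies, now=None):
    """Drop cookies whose 'expires' timestamp is already in the past.
    Session cookies (expires == -1 or missing) are kept."""
    now = time.time() if now is None else now
    return [
        c for c in cookies
        if c.get("expires", -1) == -1 or c["expires"] > now
    ]
```

You would then call context.add_cookies(fresh_cookies(cookies)) instead of passing the raw list.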

Full code example

Here's the full code for saving and loading cookies in Playwright, with extra comments and print statements added for clarity:

import json
from playwright.sync_api import sync_playwright

# Function to save cookies to a file
def save_cookies(context, file_path):
    cookies = context.cookies()
    with open(file_path, "w") as f:
        json.dump(cookies, f)
    print(f"Cookies saved to {file_path}")

# Function to load cookies from a file 
def load_cookies(context, file_path):
    with open(file_path, "r") as f:  
        cookies = json.load(f)
    context.add_cookies(cookies)
    print(f"Cookies loaded from {file_path}")

# Example usage
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()

    # Navigate to login page and fill form
    page = context.new_page() 
    page.goto("https://example.com/login")
    page.fill("input[name='username']", "user")
    page.fill("input[name='password']", "pass")
    page.click("button[type='submit']")

    # Wait for navigation to complete
    page.wait_for_load_state("networkidle")
    print("Logged in")

    # Save cookies to file
    save_cookies(context, "cookies.json")

    # Close page but keep context open
    page.close()  

    # Load the saved cookies back into the context
    load_cookies(context, "cookies.json")
    page2 = context.new_page()

    # Navigate to page, should still be logged in
    page2.goto("https://example.com/dashboard") 
    print(page2.title())

    browser.close()

This example shows the full flow of logging into a site, saving the cookies, and loading them into a new page to maintain the logged-in session. We've split the saving and loading of cookies into separate functions for reusability.

After launching the browser, we create a context and page, navigate to the login URL and fill in the username and password fields. Once logged in, we grab the cookies and save them to cookies.json.

We then close the page but keep the same context open. This ensures that the loaded cookies will be associated with the same context. We create a new page, load the cookies from file, and navigate to an internal page. If the cookies were loaded correctly and are still valid, we should remain logged in and be able to access the dashboard. Printing the page title allows us to confirm this.

Of course, the specific login flow and page URLs will differ based on the site you're working with. But the general principles of saving cookies after logging in and loading them to maintain auth state remain the same.
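Worth knowing as an alternative: Playwright ships a built-in shortcut for this exact pattern. context.storage_state(path=...) persists cookies and localStorage together in one file, and browser.new_context(storage_state=path) restores them. A minimal sketch wrapping that API (the helper names here are our own, not Playwright's):

```python
def save_state(context, path="state.json"):
    """Persist cookies *and* localStorage in one call
    using Playwright's context.storage_state(path=...)."""
    context.storage_state(path=path)

def restore_context(browser, path="state.json"):
    """Recreate a browser context with the saved state applied
    via browser.new_context(storage_state=...)."""
    return browser.new_context(storage_state=path)
```

If your target site keeps auth tokens in localStorage rather than cookies, storage_state covers cases the cookie-only approach in this article would miss.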

Tips and best practices

Here are a few tips and best practices to keep in mind when working with cookies in Playwright:

  1. Rotate user agent and IP with cookies. Cookies are not the only way websites track and identify users. They often look at user agent strings, IP addresses, and other browser fingerprinting signals as well. For the most robust and stealthy scraping, it's a good idea to rotate these attributes along with cookies. Playwright allows you to easily set a custom user agent and proxy for each browser context.

  2. Periodically refresh cookies. Cookies can expire or become invalid over time, especially if the website changes its session handling. To avoid getting logged out unexpectedly, it's a good idea to periodically log in and get fresh cookies, saving them to your cookies file. You can automate this process and run it on a schedule to keep your cookies up to date.

  3. Use different cookie profiles for different sites and tasks. If you're scraping multiple websites or performing different types of tasks, it's best to maintain separate cookie profiles for each. This prevents any cross-contamination and keeps your scraping more focused and efficient. You can save cookies to different files or directories based on the website or job.

  4. Be mindful of cookie consent banners and popups. Many websites now display cookie consent notices due to GDPR and other regulations. These can sometimes interfere with your scraping by covering content or limiting functionality. You may need to add code to accept or dismiss these popups before proceeding with your task. Playwright's page.click() and page.wait_for_selector() methods can help automate interacting with these UI elements.
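To make tips 1 and 3 concrete, here is a short sketch. The helper names are our own, and the user agent string and proxy URL passed in would be placeholders you supply; user_agent and proxy are standard options of Playwright's browser.new_context():

```python
from pathlib import Path

def cookie_file(site: str, base: str = "cookies") -> Path:
    """Tip 3: one cookie file per site keeps profiles separate."""
    folder = Path(base)
    folder.mkdir(exist_ok=True)
    return folder / f"{site}.json"

def new_disguised_context(browser, user_agent: str, proxy_server: str):
    """Tip 1: pair rotated cookies with a matching user agent and proxy."""
    return browser.new_context(
        user_agent=user_agent,
        proxy={"server": proxy_server},  # e.g. "http://myproxy.example:8080"
    )
```

You could then pass cookie_file("example.com") as the file_path argument to the save_cookies() and load_cookies() functions from the full example above.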

Conclusion

Saving and loading cookies is a powerful technique for performing efficient and stealthy web scraping with Playwright. By persisting auth state across sessions, you can avoid repeatedly logging in, bypass rate limits, and collect data more quickly and reliably.

Playwright's context.cookies() and context.add_cookies() methods make saving and loading cookies straightforward. We walked through a complete example of retrieving current cookies, storing them as JSON, and loading them into a new page to maintain session continuity.

To get the most out of working with cookies in Playwright, remember to rotate other identifiers like user agent and IP address along with cookies. Keep your cookies fresh by periodically logging in and getting new ones, and use separate cookie profiles for different websites and scraping tasks.

With the techniques and best practices covered in this guide, you'll be well on your way to mastering cookies in Playwright and taking your web scraping to the next level. So go forth and scrape responsibly!
