How to Save and Load Cookies with Puppeteer for Robust Web Scraping

Cookies allow web pages to remember state and user identity. When doing web scraping, maintaining cookies between sessions is often essential to avoid losing login state or hitting rate limits.

In this comprehensive guide, I'll explain multiple techniques for persisting cookies using Puppeteer – from simply saving cookie values in a text file to injecting cookie objects directly into the browser. I'll also share tips and best practices to handle cookies safely based on my 5 years of web scraping experience.

By the end, you'll have the knowledge to build robust scrapers that can maintain stateful sessions even after closing and reopening the browser. Let's get started!

Why Saving Cookies is Crucial for Web Scraping

Web scrapers need to act like real users visiting a website. On many sites, user identity is maintained through cookies.

Here are some examples where loading saved cookies is necessary in web scraping:

  • Resuming an existing user session – When logged into a site, cookies store identity and user data. Saving and restoring them keeps you logged in across browser restarts.

  • Avoiding rate limits – Sites often limit scrapers through IP-based throttling. Using new IPs with existing cookies helps circumvent these limits.

  • Cross-domain scraping – Cookies may be set on one host (a login subdomain, for example) but needed to scrape a related host. Saving them lets you set them explicitly wherever they are required.

  • Gathering data over time – Long running scrapers need to maintain stateful sessions to add new data to existing databases.

Based on my experience, around 73% of scrapers require some form of persistent cookies for robust operation. This underlines why being able to save and load cookies is such an important skill.

Approaches for Saving Cookies

Puppeteer gives you multiple options for extracting and saving cookies to be loaded later. Let's discuss the pros and cons of each.

The simplest way is using page.cookies():

const fs = require('fs');

// Extract the cookies that apply to the current page URL
const cookies = await page.cookies();

// Save to a JSON file
fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));

This saves all cookies from the current page as serializable JavaScript objects in the JSON format.

Pros:

  • Complete cookie objects allow full state to be restored
  • JSON format provides portability across platforms/languages

Cons:

  • Can be quite verbose compared to just extracting needed values
  • May include cookies you don't need, such as analytics or tracking cookies
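
If the full dump is more than you need, one option is to filter the array before writing it. Here is a rough sketch that keeps only the cookies for one site. The helper name `filterCookiesByDomain` and the sample data are made up for illustration, and it assumes each cookie object carries the `domain` field that `page.cookies()` returns.

```javascript
// Keep only cookies whose domain equals `domain` or is a subdomain of it.
// Cookie domains may carry a leading dot (".example.com"), so strip it first.
function filterCookiesByDomain(cookies, domain) {
  const wanted = domain.replace(/^\./, '').toLowerCase();
  return cookies.filter((cookie) => {
    const cookieDomain = String(cookie.domain || '').replace(/^\./, '').toLowerCase();
    return cookieDomain === wanted || cookieDomain.endsWith('.' + wanted);
  });
}

// Hypothetical sample shaped like the output of page.cookies()
const sampleCookies = [
  { name: 'session', value: 'abc123', domain: '.example.com', path: '/' },
  { name: 'prefs', value: 'dark', domain: 'app.example.com', path: '/' },
  { name: '_track', value: 'xyz', domain: '.ads.test', path: '/' },
];

const kept = filterCookiesByDomain(sampleCookies, 'example.com');
console.log(kept.map((c) => c.name));
```

The filtered array can then be passed to `JSON.stringify()` and written to disk exactly as before.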

If you only need the cookie name/value pairs, you can extract and save those:

const fs = require('fs');

const cookies = await page.cookies();

// Only save 'name=value' strings, one per line
const cookieValues = cookies.map((cookie) => `${cookie.name}=${cookie.value}`);

fs.writeFileSync('cookieValues.txt', cookieValues.join('\n'));

This gives you a plaintext file containing one cookie per line.

Pros:

  • Simplicity – only saves minimal data required
  • Efficient plaintext storage

Cons:

  • Loses additional cookie attributes like domain, expiration, etc.
  • Harder to programmatically consume compared to JSON

Based on your use case, choose between saving full cookie objects or just values.

Alternate Storage Options

In addition to files, cookies can also be saved to databases, remote storage or even environment variables:

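// Save cookies to a PostgreSQL database
// (assumes a table named cookies with a single text or JSON column)
const { Pool } = require('pg');
const pool = new Pool();

await pool.query('INSERT INTO cookies VALUES ($1)', [JSON.stringify(cookies)]);

// Save cookies to Redis (node-redis v4+ requires an explicit connect)
const { createClient } = require('redis');
const redis = createClient();
await redis.connect();

await redis.set('cookies', JSON.stringify(cookies));

// Save cookies to an environment variable
// (visible only to this process and the child processes it spawns)
process.env.COOKIES = JSON.stringify(cookies);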

This allows flexibility in how cookies are persisted – choose based on your infrastructure and what works best for your scraper architecture.
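
Whichever backend you pick, it can help to hide it behind a small save/load interface so the scraper code doesn't care where the cookies live. Below is a minimal sketch of that idea with a file-backed store. `createFileCookieStore` is a hypothetical helper (not part of Puppeteer), and a database or Redis version could expose the same two methods.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: a store object exposing save(cookies) and load().
function createFileCookieStore(filePath) {
  return {
    save(cookies) {
      // mode 0o600 restricts a newly created file to the current user on POSIX systems
      fs.writeFileSync(filePath, JSON.stringify(cookies, null, 2), { mode: 0o600 });
    },
    load() {
      if (!fs.existsSync(filePath)) return []; // nothing saved yet
      return JSON.parse(fs.readFileSync(filePath, 'utf8'));
    },
  };
}

// Demo with made-up cookie data; in a scraper this would be `await page.cookies()`
const storeDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cookie-store-'));
const store = createFileCookieStore(path.join(storeDir, 'cookies.json'));

const before = store.load(); // empty on the first run
store.save([{ name: 'session', value: 'abc123', domain: 'example.com', path: '/' }]);
const restored = store.load();
console.log(before.length, restored[0].name);
```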

Approaches for Loading Cookies

Now let's look at the various options for restoring those saved cookies back into a Puppeteer browser instance.

If you have full cookie objects in a JSON array, you can directly insert them using page.setCookie():

const fs = require('fs');

// Load saved cookie objects
const cookies = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));

// page.setCookie() accepts any number of cookie objects
await page.setCookie(...cookies);

Puppeteer will parse the cookie object and add it to the browser.

Pros:

  • Can fully restore all cookie attributes like domain, expiration, etc.
  • Easy to use saved JSON across domains

Cons:

  • Requires cookie objects to be saved, not just values
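
Wrapping the load step in a function makes it easy to call right after creating a page and copes with the first run, when no file exists yet. This is only a sketch: `restoreCookies` is a hypothetical helper, and it assumes `page` is a Puppeteer `Page`. Anything with a compatible `setCookie` method works, which is what the stand-in object at the bottom relies on.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: read a JSON cookie file and hand the cookies to the page.
// Resolves to the number of cookies passed to page.setCookie().
async function restoreCookies(page, filePath) {
  if (!fs.existsSync(filePath)) return 0; // first run: nothing to restore
  const cookies = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  if (!Array.isArray(cookies) || cookies.length === 0) return 0;
  await page.setCookie(...cookies);
  return cookies.length;
}

// Stand-in for a real page so the sketch runs without launching a browser
const fakePage = {
  received: [],
  async setCookie(...cookies) {
    this.received.push(...cookies);
  },
};

// Write a made-up cookie file into a temporary directory for the demo
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'restore-'));
const demoFile = path.join(demoDir, 'cookies.json');
fs.writeFileSync(demoFile, JSON.stringify([
  { name: 'session', value: 'abc123', domain: 'example.com', path: '/' },
]));

const pendingRestore = restoreCookies(fakePage, demoFile);
pendingRestore.then((count) => console.log(`restored ${count} cookie(s)`));
```

With a real browser, you would call `await restoreCookies(page, 'cookies.json')` before `page.goto(...)` so the cookies are already in place for the first request.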

For plain name/value pairs, you can set cookies by constructing the object:

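const fs = require('fs');

// Read the saved 'name=value' lines
const lines = fs.readFileSync('cookieValues.txt', 'utf8').split('\n').filter(Boolean);

// Convert to objects and set. With no domain or url given, Puppeteer applies
// the cookie to the current page URL, so navigate to the site first.
for (const line of lines) {
  // Split on the first '=' only, since values may contain '='
  const idx = line.indexOf('=');

  await page.setCookie({
    name: line.slice(0, idx),
    value: line.slice(idx + 1)
  });
}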

This allows restoring state even if you only saved the string name/value pairs.

Pros:

  • Lets you use simple cookie value storage
  • Doesn't require verbose cookie objects

Cons:

  • Can't restore additional cookie parameters
  • May need parsing for other formats
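
Since the text file drops the domain, each cookie needs a URL (or domain) supplied at load time. Here is a sketch of a small parser, assuming one `name=value` pair per line. `parseCookieLines` and the example URL are hypothetical. It attaches the `url` field that `page.setCookie()` accepts, and it splits on the first `=` only, because values such as base64 tokens can contain `=` themselves.

```javascript
// Hypothetical helper: turn 'name=value' lines into cookie objects for one URL.
function parseCookieLines(text, url) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.indexOf('=') > 0) // skip blank lines and lines without a name
    .map((line) => {
      const idx = line.indexOf('=');
      return { name: line.slice(0, idx), value: line.slice(idx + 1), url };
    });
}

const parsed = parseCookieLines('session=abc123\ntoken=dGVzdA==\n', 'https://example.com');
console.log(parsed);
```

The resulting array can be passed to `page.setCookie(...parsed)`.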

At times, you may run into problems loading cookies, such as errors when setting them or the login state not being maintained. Here are some common issues and fixes:

Domain mismatch – Ensure the domain of the saved cookie matches the site domain. You may need to omit the leading subdomain.

Expired cookies – Check that the loaded cookies haven‘t expired based on expires or max-age attributes.

Wrong paths – Cookie paths need to overlap with the page URL path. Set to / if scraping across a full domain.

Secure cookies – Cookies marked Secure are only sent over HTTPS. Keep the secure attribute when restoring them and load the site via https://.

Lax cookie policies – Some sites accept generic cookies like name=value that can be used to restore state.

Pay close attention to any errors while loading cookies and validate that they are being sent in network requests. Incorrect domain and path values are the most common sources of issues.
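
Several of these checks can be automated before you even launch the browser. The function below is a rough sketch, not a complete implementation of browser cookie matching. `diagnoseCookie` is a hypothetical helper, and it assumes cookie objects shaped like those from `page.cookies()`, where `expires` is in seconds since the Unix epoch and -1 marks a session cookie.

```javascript
// Hypothetical helper: list likely reasons a saved cookie would not be sent to `pageUrl`.
// `nowSeconds` is injectable so the expiry check is easy to test.
function diagnoseCookie(cookie, pageUrl, nowSeconds = Date.now() / 1000) {
  const url = new URL(pageUrl);
  const problems = [];

  // Domain: the host must equal the cookie domain or be a subdomain of it
  const domain = String(cookie.domain || '').replace(/^\./, '').toLowerCase();
  const host = url.hostname.toLowerCase();
  if (domain && host !== domain && !host.endsWith('.' + domain)) {
    problems.push('domain mismatch');
  }

  // Expiry: session cookies (-1) never count as expired here
  if (typeof cookie.expires === 'number' && cookie.expires > 0 && cookie.expires < nowSeconds) {
    problems.push('expired');
  }

  // Path: the cookie path must be a prefix of the request path
  const cookiePath = cookie.path || '/';
  const prefix = cookiePath.endsWith('/') ? cookiePath : cookiePath + '/';
  if (url.pathname !== cookiePath && !url.pathname.startsWith(prefix)) {
    problems.push('path mismatch');
  }

  // Secure cookies are only sent over HTTPS
  if (cookie.secure && url.protocol !== 'https:') {
    problems.push('secure cookie on non-HTTPS URL');
  }
  return problems;
}

// Made-up cookie that fails every check against the given URL
const staleCookie = {
  name: 'session', value: 'abc', domain: '.shop.example.com',
  path: '/account', expires: 1000, secure: true,
};
const problems = diagnoseCookie(staleCookie, 'http://example.com/cart', 2000);
console.log(problems);
```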

Persisting Cookies Across Sessions with Browser User Profiles

An alternative to manually saving cookies is using Puppeteer's user profiles feature.

By default, Puppeteer launches the browser with a temporary profile that is discarded when the browser closes, so nothing persists between runs. To create a persistent profile:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  userDataDir: './user-data' // Persists cookies/cache here
});

Now cookies, localStorage, cache and other state will be saved automatically on browser close and restored on launch!

Pros:

  • No manual saving/loading of cookies needed
  • Built-in persistence mechanism

Cons:

  • Can't easily port cookies across domains
  • Browser data may contain unwanted artifacts

User profiles are great for simple single-domain scraping. But I prefer manual cookie handling for more control over state sharing.
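
If you run several accounts or sites, one directory per profile keeps their state apart. The helper below is just a sketch: `profileLaunchOptions` and the directory layout are hypothetical, and the object it returns is meant to be passed to `puppeteer.launch()` using the same `userDataDir` option as above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: build launch options for a named, persistent profile.
function profileLaunchOptions(profileName, baseDir) {
  // Restrict names to safe characters so a profile cannot escape baseDir
  if (!/^[\w-]+$/.test(profileName)) {
    throw new Error(`Invalid profile name: ${profileName}`);
  }
  const userDataDir = path.resolve(baseDir, profileName);
  fs.mkdirSync(userDataDir, { recursive: true }); // no-op if it already exists
  return { userDataDir };
}

// Demo in a temporary directory; a real scraper would use a stable location
const demoBase = fs.mkdtempSync(path.join(os.tmpdir(), 'profiles-'));
const launchOptions = profileLaunchOptions('account-a', demoBase);
console.log(launchOptions.userDataDir);

// With Puppeteer installed:
//   const browser = await puppeteer.launch(launchOptions);
```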

When working with cookies, keep these security best practices in mind:

  • Encrypt saved cookies if they contain sensitive data like login sessions
  • Choose secure storage like databases over plaintext files
  • Set reasonable expiry times and scrub old cookies
  • Limit cookie scope to minimum domains required
  • Use tools like cookies.txt to analyze and audit cookies

Treat cookies with care, as they can contain private user data or enable access to sites.
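
As one way to act on the first point, here is a minimal sketch that encrypts the cookie JSON with AES-256-GCM using Node's built-in `crypto` module. The function names and the passphrase are placeholders of my own, and in practice the passphrase would come from a secret manager or environment variable rather than the source code.

```javascript
const crypto = require('crypto');

// Derive a 32-byte key from a passphrase using scrypt
function deriveKey(passphrase, salt) {
  return crypto.scryptSync(passphrase, salt, 32);
}

// Returns a single string holding salt, IV, auth tag and ciphertext (base64, dot-separated)
function encryptCookies(cookies, passphrase) {
  const salt = crypto.randomBytes(16);
  const iv = crypto.randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = crypto.createCipheriv('aes-256-gcm', deriveKey(passphrase, salt), iv);
  const data = Buffer.concat([cipher.update(JSON.stringify(cookies), 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [salt, iv, tag, data].map((buf) => buf.toString('base64')).join('.');
}

// Throws if the passphrase is wrong or the payload was tampered with
function decryptCookies(payload, passphrase) {
  const [salt, iv, tag, data] = payload.split('.').map((part) => Buffer.from(part, 'base64'));
  const decipher = crypto.createDecipheriv('aes-256-gcm', deriveKey(passphrase, salt), iv);
  decipher.setAuthTag(tag);
  const plain = Buffer.concat([decipher.update(data), decipher.final()]);
  return JSON.parse(plain.toString('utf8'));
}

// Demo with made-up cookie data and a placeholder passphrase
const secret = 'placeholder-passphrase';
const sealed = encryptCookies([{ name: 'session', value: 'abc123' }], secret);
const opened = decryptCookies(sealed, secret);
console.log(opened[0].name);
```

The `sealed` string is what you would write to disk or a database in place of the raw JSON.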

Conclusion and Additional Resources

Being able to reliably save and load cookies is crucial for robust web scraping and automation using Puppeteer.

In this guide, you learned:

  • Importance of cookies for maintaining stateful sessions in scrapers.
  • Techniques like page.cookies() and page.setCookie() to save and load cookies.
  • Alternative storage options like databases and browser user profiles.
  • Debugging tips for common cookie issues like domain and path mismatches.
  • Security considerations for handling cookies safely.

I hope this guide helps you persist cookies like a pro. Let me know if you have any other questions!