If you're into web scraping, browser automation, or just like to experiment with new coding tools, you may have heard of Puppeteer and Jupyter notebooks. But have you ever considered using them together? In this post, we'll explore how combining these two powerful tools can open up a whole new world of possibilities.
What are Puppeteer and Jupyter Notebooks?
Puppeteer is a Node.js library that lets you control a Chrome or Chromium browser programmatically, typically in headless mode. It's perfect for tasks like web scraping, automated testing, and generating screenshots or PDFs of web pages. With Puppeteer, you can write scripts to automate just about anything you can do manually in a browser.
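For a sense of how little code that takes, here's a minimal sketch (using the public example.com page) that captures both a screenshot and a PDF:

```javascript
// Minimal sketch: capture a screenshot and a PDF of a page.
// Assumes Puppeteer is installed locally (npm install puppeteer).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' }); // capture the viewport
  await page.pdf({ path: 'example.pdf' });        // render the page as a PDF
  await browser.close();
})();
```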
Jupyter notebooks, on the other hand, are interactive coding environments that run in your web browser. They allow you to write and execute code, see the results inline, and mix in text, images, and other media. Jupyter started as a Python-only tool but now supports over 40 programming languages.
Both of these tools have seen rapid adoption in recent years. According to the State of JS 2020 survey, Puppeteer is used by 41% of respondents, making it the most popular browser automation tool. Jupyter, meanwhile, is used by over 8 million data scientists, researchers, and developers worldwide.
The Challenge: Async Await in Jupyter
So, you might be thinking: why not just spin up a Jupyter notebook with a JavaScript kernel, `npm install puppeteer`, and start automating? Well, it's not quite that simple.
The issue lies in the way Puppeteer is designed. It relies heavily on the async/await syntax for handling asynchronous operations, like waiting for pages to load or for elements to appear on the page. This is essential for reliable browser automation: without it, your script would just plow ahead without waiting for things to happen, leading to errors and flaky behavior.
Unfortunately, the stock iJavaScript kernel for Jupyter doesn't support `await` at the top level of a notebook cell. There are workarounds using async wrapper functions and promise chains, but they get messy fast and don't provide the seamless, cell-by-cell experience notebooks are known for.
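To make the difference concrete, here's a sketch of the two cell styles side by side (it assumes `puppeteer` was required in an earlier cell):

```javascript
// Style 1: with top-level await, a cell reads like a straight-line script.
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Style 2: without it, 'await' outside an async function is a SyntaxError,
// so each cell needs an async IIFE, and values must be stashed on outer
// variables to survive into later cells.
let page2;
(async () => {
  const browser2 = await puppeteer.launch();
  page2 = await browser2.newPage();
})();
```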
The Solution: An Async-Friendly JavaScript Kernel
Fortunately, where there's a will, there's a way. And in this case, the way is a special version of the iJavaScript Jupyter kernel that's been patched to support async/await.
Here‘s how to set it up:
- First, make sure you have Jupyter installed. If not, you can install it using pip: `pip install jupyter`
- Next, install the patched iJavaScript kernel globally using npm: `npm install -g ijavascript-await`
- Finally, in the directory where you'll be creating your notebooks, install Puppeteer: `npm install puppeteer`
That's it! You can now launch Jupyter using the command `ijsnotebook` and create a new notebook with the "JavaScript (async)" kernel.
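To verify the setup, try a quick smoke-test cell. Note the top-level `await`: if the cell runs without a wrapper function, the patched kernel is doing its job (a minimal sketch; the exact version string you see will vary):

```javascript
// Smoke test: launch a headless browser, print its version, shut it down.
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
console.log(await browser.version()); // e.g. "HeadlessChrome/119.0.0.0"
await browser.close();
```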
Putting Puppeteer to Work
Now for the fun part: automating the web! Here's a slightly more complex example that demonstrates some of what Puppeteer can do:
```javascript
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com');

    // Get the top 10 story titles.
    // Note: HN's markup changes occasionally; on the current layout the
    // story links live under .titleline (they used to carry a .storylink class).
    const titles = await page.evaluate(() => {
      return [...document.querySelectorAll('.titleline > a')]
        .slice(0, 10)
        .map(link => link.textContent);
    });

    console.log('Top 10 Hacker News Stories:');
    console.log(titles);

    // Click the "More" link and wait for the next page to load.
    // (waitForSelector alone would resolve immediately here, because
    // story links already exist on the current page.)
    await Promise.all([
      page.waitForNavigation(),
      page.click('a.morelink'),
    ]);

    // Take a screenshot of the page
    await page.screenshot({path: 'hackernews.png'});

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
```
This script does the following:
- Launches a new browser instance and opens a new page
- Navigates to the Hacker News website
- Scrapes the titles of the top 10 stories using `page.evaluate`
- Clicks the "More" link and waits for the next page of stories to load
- Takes a screenshot of the page
- Closes the browser
Running this in a Jupyter notebook, you'll see the top 10 story titles printed out, and a screenshot of the second page of Hacker News will be saved to your notebook directory.
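A notebook-specific bonus: you don't have to open the PNG from disk to look at it. The stock iJavaScript kernel exposes a global `$$` display object with a `png` helper; assuming the patched kernel preserves that API, you can render the screenshot inline:

```javascript
// Display the saved screenshot directly in the notebook output.
const fs = require('fs');
$$.png(fs.readFileSync('hackernews.png').toString('base64'));
```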
Tips and Tricks
Here are a few things I've learned from using Puppeteer in Jupyter:
- Always `await` promise-returning functions. This is crucial for proper sequencing of your automation steps. If you forget an `await`, your script will keep running without waiting for the promise to resolve, often leading to errors.
- Use `try/catch` blocks for error handling. Wrapping your main automation logic in a try/catch makes it easier to debug issues. You can log errors, take screenshots on failure, or add other debugging logic.
- Modularize complex scripts. For longer automations, it's a good idea to break your code into smaller functions. You can define these in separate cells and then call them from your main automation function, as in the sketch after this list. This makes your code more readable and maintainable.
- Take advantage of Jupyter's interactivity. Jupyter notebooks are great for iterating on your Puppeteer scripts. You can run cells individually, inspect variables, and make changes on the fly. It's a lot faster than the traditional edit-run-debug cycle.
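Here's a minimal sketch of that modular style. The helper names (`getTitles`, `screenshotPage`) are hypothetical, and it assumes Puppeteer was already required in an earlier cell:

```javascript
// Cell 1: small, reusable helpers.
async function getTitles(page, limit = 10) {
  return page.evaluate((n) => {
    return [...document.querySelectorAll('.titleline > a')]
      .slice(0, n)
      .map(link => link.textContent);
  }, limit);
}

async function screenshotPage(page, path) {
  await page.screenshot({ path });
}

// Cell 2: the main flow just composes the helpers.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com');
console.log(await getTitles(page));
await screenshotPage(page, 'front-page.png');
await browser.close();
```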
Advanced Tricks
One of the coolest things about using Puppeteer in Jupyter is the ability to combine it with Python's rich data science ecosystem. A notebook runs a single kernel, so you can't mix JavaScript and Python cells directly, but a simple file-based handoff works well: scrape data with Puppeteer, write it out as JSON, then load it in a Python notebook, where you can analyze it with Pandas, visualize it with Matplotlib, or even train a machine learning model on it with scikit-learn.
Here's a simple example of how that handoff might look. First, the JavaScript side (the `.data-row`, `.name`, and `.value` selectors are placeholders for whatever page you're actually scraping):

```javascript
// Scrape some tabular data with Puppeteer and write it to disk.
const fs = require('fs');

const data = await page.evaluate(() => {
  return [...document.querySelectorAll('.data-row')].map(row => {
    return {
      name: row.querySelector('.name').textContent,
      value: parseInt(row.querySelector('.value').textContent)
    };
  });
});

fs.writeFileSync('scraped-data.json', JSON.stringify(data));
```

Then, in a second notebook running the standard Python kernel:

```python
# Load the scraped data and explore it with Pandas.
import pandas as pd

df = pd.read_json('scraped-data.json')
print(df.head())
print(f"Mean value: {df['value'].mean()}")
```
In this example, we scrape some tabular data from a web page with Puppeteer and save it as JSON. On the Python side, we load the file into a Pandas DataFrame, print the first few rows, and calculate the mean value.
This is just a taste of what's possible. With a bit of creativity, you can build powerful data pipelines that leverage the strengths of both JavaScript and Python.
Alternative Tools
While Puppeteer is a great tool for browser automation, it's not the only game in town. Another popular option is Playwright, which is similar to Puppeteer but supports multiple browser engines (Chromium, Firefox, and WebKit) out of the box.
Playwright doesn't have a dedicated Jupyter kernel, but you can still use it in notebooks with the async iJavaScript kernel we set up for Puppeteer. The API is very similar, so switching between the two is fairly straightforward.
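For comparison, here's roughly what the earlier title-scraping step might look like with Playwright in the same async kernel (assuming you've run `npm install playwright` in the notebook directory):

```javascript
// Same idea as the Puppeteer example, using Playwright's API instead.
const { chromium } = require('playwright');

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com');

// $$eval runs the callback in the page against all matching elements.
const titles = await page.$$eval('.titleline > a', links =>
  links.slice(0, 10).map(link => link.textContent));
console.log(titles);

await browser.close();
```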
There are also Python-native solutions for browser automation, like Selenium and Pyppeteer (a port of Puppeteer to Python). These can be used in standard Python-based Jupyter notebooks without the need for a special kernel.
Conclusion
Using Puppeteer in Jupyter notebooks might not be the conventional approach, but it opens up a world of possibilities for interactive browser automation. With the async-enabled JavaScript kernel, you can leverage the full power of Puppeteer in a notebook environment, and even combine it with Python for data analysis and visualization.
Whether you're a web developer looking to test your frontend code, a data scientist needing to scrape websites, or just someone who likes to automate tedious web tasks, give Puppeteer in Jupyter a try. It might just become your new favorite tool.
Happy automating!