Web scraping and browser automation have become essential techniques for businesses to gather and analyze data from the web. In this comprehensive tutorial, we'll learn how to use Crawlee, a powerful toolkit for scraping websites and automating browsers.
Overview of Crawlee
Crawlee is an open-source web scraping and automation framework built on Node.js. It provides a flexible API and unified interface for scraping data with simple HTTP requests or with full headless browsers automated through Puppeteer and Playwright.
Some key features of Crawlee:
- Supports both simple DOM scraping and complex scraping scenarios involving JavaScript rendering, cookies, proxies, etc.
- Handles all the browser automation so you can focus on writing the scraping logic
- Integrated queue to crawl multiple URLs sequentially or in parallel
- Customizable middleware architecture to inject logic before/after requests
- Pluggable storage options, allowing easy export of structured data
- Works with popular browser automation libraries like Puppeteer and Playwright (headless Chromium, Firefox, and WebKit)
- Can be deployed and scaled using Docker containers
- Built-in proxy management and rotation to help prevent IP blocking
In short, Crawlee takes care of the tedious automation and orchestration work, allowing you to focus on writing scrapers quickly. The simple API abstracts away the complexities of managing browsers, proxies, and concurrency.
Installing Crawlee
Crawlee can be installed via npm:
npm install crawlee
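If you plan to use the browser-based crawlers, the automation library itself is a peer dependency, so install it alongside Crawlee. Shown here for Playwright, including its browser binaries:
npm install crawlee playwright
npx playwright install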
However, the easiest way to get started is using the Crawlee CLI tool to bootstrap a new project:
npx crawlee create my-scraper
This will create a new folder my-scraper with a basic scraper template for you to start customizing. The template includes a package.json with Crawlee installed, as well as example code in src/ that showcases common scraping tasks.
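Assuming the generated project uses the standard template's start script, you can run it right away (run npm install first if the CLI did not install dependencies for you):
cd my-scraper
npm start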
Basic Usage
The fundamental building blocks of any Crawlee scraper are:
- Crawler: Responsible for coordinating and orchestrating the scraping job
- Request Handler: Contains the scraping logic executed on each page
Let's see a simple crawler in action:
// Import Crawlee classes
const { CheerioCrawler } = require('crawlee');

// Instantiate the crawler
const crawler = new CheerioCrawler({
    // The request handler receives a context object with the Cheerio instance
    requestHandler: async ({ $, request }) => {
        // Use Cheerio to extract data
        const titles = $('h2.post-title')
            .map((i, el) => $(el).text())
            .get();

        console.log(request.url, titles);
    },
});

// Run the crawler on a list of start URLs
crawler.run(['https://blog.scrapinghub.com']);
This implements a basic scraper that uses CheerioCrawler to extract all post titles from a blog homepage.
The requestHandler callback is invoked for each page and receives a context object that includes the $ Cheerio instance for querying the DOM. We grab all <h2> elements with the .post-title class, extract their text, and log the results.
Finally, we start the crawl on the target URLs. Crawlee handles requesting the pages, executing the handler, and managing the request queue; to follow links, you enqueue them from the handler, as shown below.
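Here is a minimal sketch of link following with enqueueLinks; the selector and glob pattern are just illustrative:
const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        console.log(`Visiting ${request.url}`);

        // Add links found on the page to the crawler's queue;
        // globs restricts which URLs get enqueued (illustrative pattern)
        await enqueueLinks({
            selector: 'a',
            globs: ['https://blog.scrapinghub.com/**'],
        });
    },
});

crawler.run(['https://blog.scrapinghub.com']);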
There are also PuppeteerCrawler and PlaywrightCrawler classes available to leverage real headless browsers for dynamic scraping.
Scraping Dynamic Content
For modern sites relying heavily on JavaScript, a simple HTTP scraper isn't enough. Crawlee supports using headless browsers driven by Puppeteer and Playwright to render full pages with dynamic content.
For example:
// Import the browser-based crawler
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    // Configure the browser launch
    launchContext: {
        launchOptions: {
            headless: false,
        },
    },

    // Browser request handler
    requestHandler: async ({ page, enqueueLinks }) => {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-element');

        // Extract data from the rendered page
        const titles = await page.$$eval('.post-title', (els) => els.map((e) => e.textContent));
        console.log(titles);

        // Find links on the page and enqueue them for crawling
        await enqueueLinks({ selector: 'a' });
    },
});

// Run the crawler
crawler.run(['https://example.com']);
The above demo showcases:
- Launching a real Chromium browser in headful mode using launchContext
- Waiting for elements to appear with page.waitForSelector()
- Extracting dynamic content with page.$$eval()
- Finding links and enqueueing them for crawling with enqueueLinks()
Using headless browsers unlocks the ability to scrape complex SPAs, interact with elements, fill forms, execute API calls, and more!
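For example, a handler can interact with the page before extracting data. Here is a brief sketch that clicks a hypothetical "Load more" button; the .load-more and .post-title selectors are placeholders, not a real site's markup:
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Click a "Load more" button if it is present (placeholder selector)
        const loadMore = await page.$('.load-more');
        if (loadMore) {
            await loadMore.click();
            // Wait for the newly loaded items to appear (placeholder selector)
            await page.waitForSelector('.post-title');
        }

        const titles = await page.$$eval('.post-title', (els) => els.map((e) => e.textContent));
        console.log(titles);
    },
});

crawler.run(['https://example.com']);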
Managing Sessions and State
For some sites with advanced bot protection, we need to mimic real user behavior like preserving cookies and session state across page navigation.
Crawlee's SessionPool class handles creating and managing sessions. A crawler uses it through the useSessionPool and sessionPoolOptions options:
// The session pool is enabled directly on the crawler
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    // Enable the built-in SessionPool
    useSessionPool: true,

    // Tie cookies to each session so they survive across requests
    persistCookiesPerSession: true,

    // Configure the pool
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            // Retire a session after it has handled 50 requests
            maxUsageCount: 50,
        },
    },

    // The current session is available in the request handler
    requestHandler: async ({ page, session }) => {
        // ...
    },
});
The key points:
- useSessionPool turns on Crawlee's built-in SessionPool, which manages a pool of sessions for the crawler
- New sessions are created on demand up to the maxPoolSize limit
- persistCookiesPerSession keeps each session's cookies across all the requests it handles
- Sessions help mimic real user behavior like maintaining cookies and logins
This makes it easy to handle complex sites needing stateful sessions and logins maintained across page navigations!
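When a site does start blocking a particular session, you can retire it so it is not reused. A small sketch, where treating an HTTP 403 as a block signal is just one possible heuristic:
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,

    requestHandler: async ({ page, response, session }) => {
        // Treat a 403 response as a sign that this session has been blocked
        if (response && response.status() === 403) {
            // Remove the session from the pool so it is not reused
            session.retire();
            throw new Error('Blocked, retiring session and retrying');
        }

        // ... normal scraping logic ...
    },
});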
Handling Blocking and CAPTCHAs
Websites can detect and block scrapers through IP analysis, browser fingerprints, etc. To avoid getting blocked, it's useful to route traffic through proxies and rotate them automatically.
Crawlee has native proxy support through its ProxyConfiguration class, which works with any provider (such as Oxylabs) that gives you a list of proxy URLs:
// Import the proxy configuration and crawler classes
const { ProxyConfiguration, PlaywrightCrawler } = require('crawlee');

// Create the proxy configuration (replace with your provider's proxy URLs)
const proxies = new ProxyConfiguration({
    proxyUrls: [
        'http://username:password@proxy-1.example.com:8000',
        'http://username:password@proxy-2.example.com:8000',
        // ...
    ],
});

// Inject the proxies into the crawler
const crawler = new PlaywrightCrawler({
    // ...
    proxyConfiguration: proxies,
});
Because requests are routed through different proxies, the target website sees no single IP address that it can associate with your crawler and block.
Proxies also help bypass geographic restrictions and access content from different locations. Crawlee handles proxy rotation, configuration, and management automatically!
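You can also ask a ProxyConfiguration for URLs directly, which is handy for checking how rotation behaves. A minimal sketch; passing the same session id is expected to return a consistent proxy for that session:
const { ProxyConfiguration } = require('crawlee');

const proxies = new ProxyConfiguration({
    proxyUrls: [
        'http://username:password@proxy-1.example.com:8000',
        'http://username:password@proxy-2.example.com:8000',
    ],
});

(async () => {
    // Rotates through the configured URLs
    console.log(await proxies.newUrl());
    // The same session id keeps getting the same proxy
    console.log(await proxies.newUrl('session-1'));
    console.log(await proxies.newUrl('session-1'));
})();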
Storing Scraped Data
Crawlee makes it simple to store scraped data using its built-in, pluggable storages. The three main storage types are:
Dataset
An append-only store of structured results (one record per scraped item). By default it is written as JSON files under the local ./storage directory and can be exported to JSON or CSV.
Key-value store
Stores arbitrary records such as screenshots, raw HTML, or crawler state under named keys.
Request queue
Persists the URLs waiting to be crawled, so an interrupted crawl can resume where it left off.
For other destinations, such as a SQL database, you can write rows from the request handler with your own database client.
Usage is straightforward: push records from your request handler and Crawlee persists them for you:
// Import the crawler and the Dataset class
const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        // Save one record per page into the default dataset
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
        });
    },
});
Crawlee handles persisting the scraped data, freeing you to focus on writing the scraping logic!
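Once a crawl finishes, the default dataset can also be exported in one go. A short sketch, assuming Crawlee v3's export helpers are available:
const { Dataset } = require('crawlee');

(async () => {
    const dataset = await Dataset.open();

    // Write the whole dataset into the default key-value store
    // as a single JSON record and a single CSV record
    await dataset.exportToJSON('RESULTS');
    await dataset.exportToCSV('RESULTS');
})();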
Configuring and Customizing Crawlers
Crawlee provides many options to customize scraping behavior for different sites and use cases:
Tuning crawling speed
Control request concurrency, delays, retry limits, timeouts etc.
Setting priority
Configure priority queues for high vs low priority requests.
Customizing crawling logic
Override request lifecycles with custom logic using hooks.
Navigation hooks
Inject pre- and post-navigation hooks to modify requests and responses.
Pluggable storage
Choose where and how to export data.
Events
Subscribe to Crawlee events for monitoring and analytics.
Browser management
Customize browsers, devices, locations and other options.
Integrations
Use Crawlee as a part of your larger scraping infrastructure.
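Many of these knobs are plain constructor options. A quick sketch with arbitrary values:
const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
    // Tuning crawling speed
    minConcurrency: 2,
    maxConcurrency: 20,
    maxRequestsPerMinute: 120,

    // Retry and timeout behaviour
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 30,

    // Stop after a fixed number of pages
    maxRequestsPerCrawl: 500,

    requestHandler: async ({ $, request }) => {
        // ... scraping logic ...
    },
});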
Between the flexibility of the API and built-in tools like proxy management, Crawlee provides everything you need to build robust and production-ready scrapers.
Putting It All Together
Let's walk through a more complete web scraping script using some of Crawlee's advanced features:
// Import Crawlee classes
const { PlaywrightCrawler, Dataset } = require('crawlee');

// Instantiate the crawler
const crawler = new PlaywrightCrawler({
    // Use the built-in session pool to persist state between requests
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10,
    },

    // Limit the crawl to 50 requests per minute
    maxRequestsPerMinute: 50,

    // Configure the Playwright browser
    launchContext: {
        userAgent: 'Custom User Agent',
        launchOptions: {
            headless: true,
        },
    },

    // Pre-navigation hook for per-request customization (e.g. extra headers)
    preNavigationHooks: [
        async ({ page }) => {
            await page.setExtraHTTPHeaders({ 'accept-language': 'en-US' });
        },
    ],

    // Request handler logic
    requestHandler: async ({ page, session, request }) => {
        // Log in once per session and remember it on the session
        if (!session.userData.loggedIn) {
            await page.fill('#username', 'user');
            await page.fill('#password', 'pass');
            await page.click('#login');

            // Set the session flag
            session.userData.loggedIn = true;
        }

        // Rest of the scraping logic, e.g. saving results
        await Dataset.pushData({ url: request.url });
    },
});

// Start crawling
crawler.run(['https://targetwebsite.com']);
This showcases:
- Persisting state across requests with the built-in session pool
- Setting a custom User-Agent and extra headers via launchContext and a pre-navigation hook
- Saving results to the default dataset on disk
- Limiting the request rate with maxRequestsPerMinute
- Remembering login status on the session
- Normal scraping logic with the Playwright browser
The crawler handles all the orchestration, while we can focus on writing the important scraping code!
Conclusion
In this tutorial, we covered the fundamentals of using Crawlee for web scraping and browser automation tasks:
- Crawlee provides a flexible framework for building robust web scrapers.
- It handles browser automation and orchestration so you can focus on writing scrapers.
- Supports scraping dynamic content, managing stateful sessions, rotating proxies, storing data and more.
- Can be fully customized and configured based on your specific scraping needs.
Crawlee lets you get up and running with scrapers quickly, while also providing advanced tools to scale and run them reliably in production.
Some next steps to continue learning:
- Read the in-depth Crawlee docs
- Look at the example scripts
- Join the Crawlee Discord to ask questions
- Check my GitHub for more web scraping content!
Happy scraping!