Crawlee Tutorial: Easy Web Scraping and Browser Automation

Web scraping and browser automation have become essential techniques for businesses to gather and analyze data from the web. In this comprehensive tutorial, we'll learn how to use Crawlee, a powerful toolkit for scraping websites and automating browsers.

Overview of Crawlee

Crawlee is an open-source web scraping and automation framework built on Node.js. It provides a flexible API and unified interface to scrape data using simple HTTP requests or fully automated headless browsers like Puppeteer and Playwright.

Some key features of Crawlee:

  • Supports both simple DOM scraping and complex scenarios involving JavaScript rendering, cookies, proxies and more
  • Handles all the browser automation so you can focus on writing the scraping logic
  • Integrated request queue to crawl multiple URLs sequentially or in parallel
  • Pre- and post-navigation hooks to inject logic before/after requests
  • Pluggable storage, allowing easy export of structured data
  • Works with popular headless browser libraries Puppeteer and Playwright (Chromium, Firefox and WebKit)
  • Can be deployed and scaled using Docker containers
  • Built-in proxy rotation to reduce the risk of IP blocking

In short, Crawlee takes care of the tedious automation and orchestration work, allowing you to focus on writing scrapers quickly. The simple API abstracts away the complexities of managing browsers, proxies, and concurrency.

Installing Crawlee

Crawlee can be installed via npm:

npm install crawlee

However, the easiest way to get started is using the Crawlee CLI tool to bootstrap a new project:

npx crawlee create my-scraper

This will create a new folder my-scraper with a basic scraper template to start customizing!

The template includes a package.json with Crawlee installed, as well as example code in src/ to showcase common scraping tasks.
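Assuming the generated project keeps the standard start script from the template (an assumption about the current CLI output), you can run it right away:

cd my-scraper
npm start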

Basic Usage

The fundamental building blocks of any Crawlee scraper are:

  • Crawler: Responsible for coordinating and orchestrating the scraping job
  • Request Handler: Contains the scraping logic executed on each page

Let's see a simple crawler in action:

// Import Crawlee classes
const { CheerioCrawler } = require('crawlee');

// Instantiate crawler
const crawler = new CheerioCrawler({
  // Configure request handler
  requestHandler: async ({ $ }) => {
    // Use Cheerio to extract data
    const titles = $('h2.post-title')
      .map((i, el) => $(el).text())
      .get();

    console.log(titles);
  },
});

// Run crawler on list of URLs
crawler.run(['https://blog.scrapinghub.com']);

This implements a basic scraper using CheerioCrawler to extract all post titles from a blog homepage.

The requestHandler callback is invoked for each page and receives a context object whose $ property is a Cheerio instance for querying the DOM. We grab all <h2> elements with the .post-title class, extract their text, and log the results.

Finally, we start the crawl on the target URLs. Crawlee automatically handles requesting the pages, running the handler, retrying failures, and managing concurrency. To follow links to further pages, you enqueue them explicitly, as shown in the sketch below.
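In this minimal sketch (the a.next-page selector and the start URL are illustrative), the enqueueLinks helper from the handler context finds matching links on the current page and adds them to the crawler's queue:

const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
  requestHandler: async ({ $, request, enqueueLinks }) => {
    console.log(`Scraping ${request.url}`);

    // Add every link matching the selector to the crawl queue
    await enqueueLinks({ selector: 'a.next-page' });
  },
});

crawler.run(['https://blog.scrapinghub.com']);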

There are also PuppeteerCrawler and PlaywrightCrawler available to leverage real headless browsers for dynamic scraping.

Scraping Dynamic Content

For modern sites relying heavily on JavaScript, a simple HTTP scraper isn't enough. Crawlee supports using headless browsers like Puppeteer and Playwright to render full pages with dynamic content.

For example:

// Import the Playwright-based crawler
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({

  // Configure browser launch (headed mode so you can watch it work)
  launchContext: {
    launchOptions: {
      headless: false,
    },
  },

  // Browser request handler
  requestHandler: async ({ page, enqueueLinks }) => {

    // Wait for dynamic content to load
    await page.waitForSelector('.dynamic-element');

    // Extract data from the rendered page
    const titles = await page.$$eval('.post-title', els => els.map(e => e.textContent));

    console.log(titles);

    // Discover links on the page and enqueue them for crawling
    await enqueueLinks({ selector: 'a' });

  },

});

// Run crawler
crawler.run(['https://example.com']);

The above demo showcases:

  • Launching a real Chromium browser in headed mode using launchContext
  • Waiting for elements to appear using page.waitForSelector()
  • Extracting dynamic content with page.$$eval()
  • Discovering links on the page and enqueueing them for crawling with enqueueLinks()

Using headless browsers unlocks the ability to scrape complex SPAs, interact with elements, fill forms, execute API calls, and more!
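As a small, hedged sketch of such an interaction (the .load-more and .post-title selectors are made up for illustration), here is a handler that clicks a "Load more" button before extracting results:

const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page }) => {
    // Click the button if it exists, then wait for new items to render
    const loadMore = await page.$('.load-more');
    if (loadMore) {
      await loadMore.click();
      await page.waitForSelector('.post-title');
    }

    // Extract the now-visible items
    const titles = await page.$$eval('.post-title', els => els.map(e => e.textContent));
    console.log(titles);
  },
});

crawler.run(['https://example.com']);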

Managing Sessions and State

For some sites with advanced bot protection, we need to mimic real user behavior like preserving cookies and session state across page navigation.

Crawlee's built-in SessionPool handles creating and rotating sessions for you, and the crawlers can use it via a few options:

// Import the crawler class
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({

  // Enable the built-in session pool
  useSessionPool: true,

  // Configure how many sessions may exist at once
  sessionPoolOptions: {
    maxPoolSize: 10,
  },

  // Tie cookies to their session so they survive across requests
  persistCookiesPerSession: true,

  // ...

});

The key points:

  • The session pool manages a pool of reusable sessions
  • New sessions are created up to the maxPoolSize limit
  • persistCookiesPerSession ties cookies to their session so they are reused across requests
  • Sessions help mimic user behavior like maintaining cookies and logins

This makes it easy to handle complex sites needing stateful sessions and logins maintained across page navigations!
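The request handler also receives the current session, so you can give the pool feedback about it. A minimal sketch (treating an HTTP 403 as a block is an assumption about the target site):

const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  persistCookiesPerSession: true,

  requestHandler: async ({ page, response, session }) => {
    // Retire a session that got blocked so the pool replaces it;
    // otherwise mark it good so it keeps being reused.
    if (response && response.status() === 403) {
      session.retire();
    } else {
      session.markGood();
    }

    // ... scraping logic
  },
});

crawler.run(['https://example.com']);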

Handling Blocking and CAPTCHAs

Websites can detect and block scrapers through IP analysis, browser fingerprints, etc. To avoid getting blocked, it's useful to route traffic through proxies and rotate them automatically.

Crawlee's ProxyConfiguration class lets you plug in proxy providers such as Oxylabs and rotate through their IPs automatically:

// Import proxy configuration helper and crawler
const { ProxyConfiguration, PlaywrightCrawler } = require('crawlee');

// Create proxy config
const proxies = new ProxyConfiguration({
  proxyUrls: [
    'http://user:pass@proxy-1.example.com:10000',
    'http://user:pass@proxy-2.example.com:10000',
    // ...
  ],
});

// Inject proxies into crawler
const crawler = new PlaywrightCrawler({
  // ...
  proxyConfiguration: proxies,
});

By routing requests through different proxies, you leave websites no single IP address to associate with your scraper and block.

Proxies also help bypass geographic restrictions and access content from different locations. Crawlee handles proxy rotation, configuration, and management automatically!
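You can also query the configuration directly. A hedged sketch: newUrl() picks a proxy from the list, and passing a session id keeps that session pinned to the same proxy across requests (the id here is just an example):

const { ProxyConfiguration } = require('crawlee');

const proxies = new ProxyConfiguration({
  proxyUrls: ['http://user:pass@proxy-1.example.com:10000'],
});

(async () => {
  // Ask the configuration which proxy a given session would use
  const proxyUrl = await proxies.newUrl('my-session-id');
  console.log(proxyUrl);
})();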

Storing Scraped Data

Crawlee makes it simple to store scraped data using its built-in storages. Out of the box it provides three types:

Dataset

Append-only storage for structured results, with one JSON object per scraped item. Ideal for tabular output.

Key-value store

Stores arbitrary records such as screenshots, raw HTML snapshots or crawler state under named keys.

Request queue

Keeps track of the URLs to crawl, including links discovered and enqueued along the way.

By default everything is persisted as JSON files under the local ./storage directory, and datasets can be exported to JSON or CSV.

Usage is straightforward: push data to the default dataset from inside the request handler:

// Import the crawler and the Dataset storage
const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
  requestHandler: async ({ $, request }) => {
    // Save one structured record per page to the default dataset
    await Dataset.pushData({
      url: request.url,
      title: $('title').text(),
    });
  },
});

crawler.run(['https://blog.scrapinghub.com']);

Crawlee handles persisting and exporting the scraped data, freeing you to focus on writing the scraping logic!
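When the defaults aren't enough, you can open named storages explicitly. A short sketch (the names and values are illustrative):

const { Dataset, KeyValueStore } = require('crawlee');

(async () => {
  // Open (or create) a named dataset and push a record into it
  const products = await Dataset.open('products');
  await products.pushData({ name: 'Example product', price: 9.99 });

  // Store an arbitrary record, such as crawl metadata, in a key-value store
  const store = await KeyValueStore.open('run-info');
  await store.setValue('started-at', new Date().toISOString());
})();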

Configuring and Customizing Crawlers

Crawlee provides many options to customize scraping behavior for different sites and use cases:

Tuning crawling speed

Control request concurrency, delays, retry limits, timeouts and so on (see the combined sketch at the end of this section).

Setting priority

Push high-priority requests to the front of the request queue.

Customizing crawling logic

Override request lifecycles with custom logic using hooks.

Hooks

Inject pre- and post-navigation hooks to modify requests and responses.

Pluggable storage

Choose where and how to export data.

Events

Subscribe to Crawlee lifecycle events for monitoring and analytics.

Browser management

Customize browsers, devices, locations and other options.

Integrations

Use Crawlee as a part of your larger scraping infrastructure.

Between the flexibility of the API and built-in tools like proxy management, Crawlee provides everything you need to build robust and production-ready scrapers.
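As a combined sketch of the speed-related options mentioned above (the values are arbitrary), several of them can be set directly on the crawler:

const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
  maxConcurrency: 10,            // at most 10 requests in parallel
  maxRequestRetries: 3,          // retry failed requests up to 3 times
  requestHandlerTimeoutSecs: 60, // abort a handler after 60 seconds
  maxRequestsPerCrawl: 1000,     // stop after 1000 requests in total
  maxRequestsPerMinute: 120,     // throttle the overall request rate

  requestHandler: async ({ $, request, log }) => {
    log.info(`Processing ${request.url}`);
  },
});

crawler.run(['https://example.com']);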

Putting It All Together

Let's walk through a more complete web scraping script using some of Crawlee's advanced features:

// Import Crawlee classes
const { PlaywrightCrawler, Dataset } = require('crawlee');

// Instantiate crawler
const crawler = new PlaywrightCrawler({

  // Reuse sessions (cookies, login state) across requests
  useSessionPool: true,
  persistCookiesPerSession: true,

  // Limit to 50 requests per minute
  maxRequestsPerMinute: 50,

  // Pre-navigation hook to inject a custom header
  preNavigationHooks: [
    async ({ page }) => {
      await page.setExtraHTTPHeaders({ 'x-custom-header': 'my-value' });
    },
  ],

  // Configure Playwright browser
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },

  // Request handler logic
  requestHandler: async ({ page, request, session }) => {

    // Check if this session is already logged in
    if (session.userData.loggedIn) {
      // Go straight to the account page
    } else {
      // Log in first
      await page.type('#username', 'user');
      await page.type('#password', 'pass');
      await page.click('#login');

      // Remember the login state on the session
      session.userData.loggedIn = true;
    }

    // Rest of the scraping logic, e.g. push results to the default dataset
    await Dataset.pushData({ url: request.url });

  },

});

// Start crawling
crawler.run(['https://targetwebsite.com']);

This showcases:

  • Reusing sessions via the built-in session pool
  • Injecting custom headers with a pre-navigation hook
  • Limiting the request rate
  • Persisting state on sessions, like login status
  • Pushing results to the default dataset
  • Normal scraping logic with a Playwright browser

The crawler handles all the orchestration, while we can focus on writing the important scraping code!

Conclusion

In this tutorial, we covered the fundamentals of using Crawlee for web scraping and browser automation tasks:

  • Crawlee provides a flexible framework for building robust web scrapers.
  • It handles browser automation and orchestration so you can focus on writing scrapers.
  • Supports scraping dynamic content, managing stateful sessions, rotating proxies, storing data and more.
  • Can be fully customized and configured based on your specific scraping needs.

Crawlee lets you get up and running with scrapers quickly, while also providing advanced tools to scale and run them reliably in production.

Some next steps to continue learning: explore the official Crawlee documentation and example projects at https://crawlee.dev, and try adapting the snippets above to a site you care about.

Happy scraping!
