
Building Serverless Web Scrapers with AWS Lambda and TypeScript

Web scraping can be a useful technique for gathering data from websites. However, running scalable and reliable scrapers can require a lot of infrastructure. In this comprehensive guide, we'll explore how a serverless architecture using AWS Lambda and TypeScript can simplify the process.

What is Serverless Computing?

Before diving into the details, let's quickly review the serverless concept.

Serverless computing allows you to build and run applications without having to manage infrastructure. Instead of provisioning virtual servers, you deploy application code that runs in stateless containers managed by the cloud provider.

Your code is broken into individual functions that can be triggered by various events like HTTP requests, database changes, queues, schedules, etc. The cloud provider automatically handles provisioning containers to run each function invocation.

Some key benefits of serverless:

  • No server management – The cloud provider handles capacity, scaling, patching, etc
  • Event-driven – Functions invoked on-demand in response to triggers
  • Subsecond scaling – Containers spin up in milliseconds to handle each function call
  • Pay per use – Pay only for the compute time used when code is running
  • Automated high availability – No worries about redundancy or fault tolerance

According to Cloudflare, adoption of serverless has been rapidly growing with a 200% increase in usage among surveyed organizations over the past year.

AWS Lambda is one of the most widely used serverless platforms. It allows you to easily publish code that automatically scales with demand.

Next, let's see how Lambda can be useful for web scraping projects.

Web Scraping Architecture Overview

Before jumping into the code, let's briefly discuss common web scraping architectures.

A typical scraper involves the following components (sketched in TypeScript just below):

  • Request Generator – Generates URLs or search parameters to scrape
  • Request Dispatcher – Sends HTTP requests to the target URLs
  • Downloader – Downloads the HTML content from URLs
  • Parser – Extracts data from the HTML using something like XPath or regex
  • Data Store – Saves extracted data, often to a database or file storage

Web scraper architecture
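
To make these roles concrete, here is a minimal TypeScript sketch of how the components might be typed. The interface and method names are purely illustrative, not part of any library:

interface RequestGenerator {
  // Produce the next batch of URLs (or search parameters) to scrape
  nextUrls(): Promise<string[]>;
}

interface RequestDispatcher {
  // Fan requests out to downloaders (sequentially, in parallel, or via a queue)
  dispatch(urls: string[]): Promise<void>;
}

interface Downloader {
  // Fetch the raw HTML for a single URL
  fetchHtml(url: string): Promise<string>;
}

interface Parser<T> {
  // Extract structured data from the downloaded HTML
  parse(html: string): T;
}

interface DataStore<T> {
  // Persist extracted records, e.g. to a database or object storage
  save(records: T[]): Promise<void>;
}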

Running this at scale often requires provisioning and maintaining servers to handle the load. Complex scrapes may also get your IP address blocked.

Serverless architectures help by allowing you to:

  • Scale automatically – No capacity planning required
  • Distribute workload – Spread across 1000s of containers
  • Reduce costs – No charges when code isn't running
  • Improve resiliency – Built-in availability and fault tolerance

Next, let's look at how AWS Lambda can be used in a serverless scraping architecture.

Scraping with AWS Lambda

AWS Lambda allows you to deploy "functions" – self-contained blocks of code invoked in response to events.

For web scraping, Lambda functions work well for:

  • Request Generation – Schedule or invoke Lambda on demand
  • Request Dispatching – Trigger Lambdas in parallel to scrape pages
  • Data Parsing – Process and extract data from HTML
  • Saving Data – Insert into DynamoDB, S3, etc

You only pay for the compute time used when the code executes. And Lambda can scale to thousands of concurrent executions as needed.

Let's walk through a hands-on example to see Lambda web scraping in action.

Project Overview

We'll build a simple serverless web scraper that extracts the title and meta description from a page. For this example, we'll use:

  • Lambda – For the scraping function
  • TypeScript – For cleaner code over vanilla JavaScript
  • Cheerio – Fast HTML parsing library
  • Axios – Promise-based HTTP client
  • AWS SAM – For packaging and deployment

Here is an overview of the steps:

  1. Set up the Lambda scraping function locally
  2. Test the function and validate the output
  3. Package the function with SAM
  4. Deploy the packaged template to AWS

Let's get started!

Prerequisites

Before we begin, you'll need:

  • Node.js 12.x or higher
  • AWS CLI – configured with your access keys
  • AWS SAM CLI – installed via the official installer, Homebrew, or pip (see the AWS SAM CLI installation guide)

Project Setup

First, create a new directory:

$ mkdir lambda-webscraper
$ cd lambda-webscraper

Initialize npm:

$ npm init -y

Install the NPM packages:

$ npm install typescript cheerio axios @types/node @types/cheerio

Implementing the Scraper

Under the lambda-webscraper directory, create an index.ts file with the following:

import axios from 'axios';
import * as cheerio from 'cheerio';

interface ScrapeResult {
  title: string;
  description: string;
}

const scrapeSite = async (): Promise<ScrapeResult> => {
  // Download the page HTML
  const response = await axios.get('https://apify.com');

  // Load the HTML into cheerio for querying
  const $ = cheerio.load(response.data);

  const title = $('title').text();

  // attr() can return undefined if the tag is missing, so fall back to an empty string
  const description = $('meta[name="description"]').attr('content') ?? '';

  return { title, description };
};

export const handler = async () => {
  const result = await scrapeSite();

  const response = {
    statusCode: 200,
    body: JSON.stringify(result),
  };

  return response;
};

This implements our cheerio-based scraper and a handler for the Lambda function.

Local Development

To test locally, compile the TypeScript and invoke the exported handler:

npx tsc index.ts --target es2019 --module commonjs --esModuleInterop
node -e "require('./index.js').handler().then((res) => console.log(res.body))"

This compiles index.ts to index.js, invokes the handler, and outputs the result:

{
  "title": "Apify - Automate web scraping and data extraction",
  "description": "Apify scrapers extract data from websites. Build a web scraper in 5 mins. Get high quality datasets. Scrape the web using proxies & headless browsers"
}

Packaging and Deployment

Now that our function works, we can package it for deployment using SAM.

Create a template.yaml:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Simple web scraper example

Resources:
  ScraperFunction: 
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: .
      Handler: index.handler
      Runtime: nodejs12.x
      Architectures:
        - x86_64

This defines the Lambda resource and points it to our index.js handler.

With index.ts compiled to index.js (from the local test step above), we can now deploy the stack:

sam deploy --guided

Follow the prompts to deploy the function. Once deployed, our scraper will be live on AWS Lambda!

Running and Monitoring

To execute the scraper, we can configure triggers such as a scheduled CloudWatch Events rule, or invoke it on demand:

aws lambda invoke --function-name ScraperFunction out.json

This will run the scraper and save the output to out.json. Note that SAM usually prefixes the deployed function's name with the stack name, so check the Lambda console for the exact value to pass to --function-name.
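
You can also invoke the function programmatically. Here is a minimal sketch using the AWS SDK for JavaScript v3; the @aws-sdk/client-lambda package is an extra dependency that was not installed earlier, and the function name and region are assumptions you should adjust:

import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({ region: 'us-east-1' }); // adjust to your region

const invokeScraper = async () => {
  // Use the deployed function's physical name (SAM usually prefixes it with the stack name)
  const response = await lambda.send(new InvokeCommand({ FunctionName: 'ScraperFunction' }));

  // The payload comes back as bytes; decode and parse it to get the handler's response
  const payload = JSON.parse(new TextDecoder().decode(response.Payload));
  console.log(payload.body);
};

invokeScraper().catch(console.error);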

We can view metrics and logs for the function in the Lambda console or CloudWatch. This helps monitor performance and debug issues.

Additional Patterns and Services

Let's discuss some additional ways to build out a production-grade solution:

Scheduled Scrapes

Use CloudWatch Events to trigger the Lambda scraper on a schedule. This allows regularly scraping websites without managing cron servers.
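
For example, a schedule can be attached directly to the function in template.yaml; the rate below is just an illustration:

  ScraperFunction:
    Type: AWS::Serverless::Function
    Properties:
      # ...existing properties from the template above...
      Events:
        DailyScrape:
          Type: Schedule
          Properties:
            Schedule: rate(1 day)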

Queue-Based Workflow

Have one Lambda function enqueue scrape URLs onto an SQS queue, and let separate Lambda functions pull jobs off the queue and handle the scraping. This helps coordinate and scale complex workflows.
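
As a rough sketch, the producer side might look like the snippet below, while the consumer is a Lambda subscribed to the queue. The @aws-sdk/client-sqs dependency, the queue URL, and the region are assumptions for illustration:

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' }); // adjust to your region
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs'; // hypothetical queue

// Producer: enqueue one URL per message for the scraper Lambdas to pick up
export const enqueueUrls = async (urls: string[]) => {
  for (const url of urls) {
    await sqs.send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: url }));
  }
};

// Consumer: a Lambda subscribed to the queue receives batches of messages
export const queueHandler = async (event: { Records: { body: string }[] }) => {
  for (const record of event.Records) {
    const url = record.body;
    // ...download and parse the page as in the scraper above...
    console.log(`Scraping ${url}`);
  }
};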

Distributed Scraping

Launch scraper Lambdas across multiple regions and availability zones to increase throughput and redundancy.

Error Handling

Use DLQs (dead-letter queues) to capture failed events for inspection and retry. Alerting can also be configured with CloudWatch alarms.
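
In SAM, a dead-letter queue can be attached to the function in template.yaml, for example (the queue ARN is a placeholder):

  ScraperFunction:
    Type: AWS::Serverless::Function
    Properties:
      # ...existing properties from the template above...
      DeadLetterQueue:
        Type: SQS
        TargetArn: arn:aws:sqs:us-east-1:123456789012:scraper-dlq  # placeholder ARN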

Data Storage

Save scrape results to S3, DynamoDB or tools like AWS Glue. Integrate services like Lambda, SQS, and Kinesis for ETL pipelines.
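
For instance, writing each result to a DynamoDB table might look roughly like this; the ScrapeResults table name and the @aws-sdk/lib-dynamodb dependency are assumptions, not part of the project above:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Persist one scrape result, keyed by the URL it came from
export const saveResult = async (url: string, title: string, description: string) => {
  await client.send(
    new PutCommand({
      TableName: 'ScrapeResults', // hypothetical table name
      Item: { url, title, description, scrapedAt: new Date().toISOString() },
    })
  );
};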

Recap

Let's review what we covered:

  • Serverless concepts – overview of Lambda and benefits for web scraping
  • Project walkthrough – creating a scraper handler with TypeScript
  • Local testing – validating the function works before deployment
  • Deployment – using SAM to package and deploy the Lambda function
  • Execution – running the scraper on AWS
  • Monitoring – tracking metrics and logs to observe behavior
  • Additional services – other relevant AWS tools to build robust solutions

Adopting a serverless architecture using Lambda and TypeScript can simplify creating and operating web scrapers at scale. The core benefits are automatic scaling, reduced cost, and lower maintenance overhead compared to traditional servers.

Let me know if you have any other questions!
