How to Build an E-Commerce Scraper with Cheerio + JavaScript

E-commerce websites are a treasure trove of valuable data. Product information, prices, inventory levels, and more can be extracted and utilized for competitive intelligence, market research, pricing optimization, dropshipping automation, and many other use cases.

In this comprehensive tutorial, you'll learn how to use Cheerio and JavaScript to build a fully automated scraper for an e-commerce website. We'll walk through analyzing the target site, configuring Apify's Cheerio Scraper, writing custom page handling logic, integrating with Gmail, and scheduling recurring runs.

By the end, you'll have a scalable scraper that extracts key e-commerce data and delivers it to your inbox on autopilot. Let's dive in!

Analyzing the Target E-Commerce Website

Before writing a single line of code, we need to understand the structure and key elements of our target e-commerce site. The goal is to identify:

The base URL and parameters for search queries
URL patterns for browsing product categories
URL patterns for individual product pages
Locations of the data we want to extract (title, description, price, etc.)

This analysis will inform how we configure our Cheerio scraper.

The easiest way to do this is to explore the site manually: search for products, visit category and item pages, and use the browser's developer tools to inspect elements and URLs.

For example, on an Amazon product page, we can see:

The product title is in an H1 tag
The ASIN (Amazon's product identifier) is listed in the element with id="productDetails_detailBullets_sections1"
The price is in span id="priceblock_ourprice"

Do this reconnaissance on your target site to find CSS selectors for the data you need.

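Also check whether the site relies heavily on JavaScript. One trick is to disable JavaScript in your browser and see whether the content still loads. If it doesn't, the data is rendered client-side. Because Cheerio only parses the HTML returned by the server and does not execute scripts, we may then need a browser automation tool like Puppeteer instead.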
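The same check can be scripted: fetch the raw HTML, which is all Cheerio would see, and look for text you can see in the browser. This is only a rough heuristic sketch. `appearsInRawHtml` is a hypothetical helper (not part of Cheerio or Apify), and the product name in the usage comment is a placeholder.

```javascript
// Hypothetical helper: returns true when every expected text snippet
// (e.g. a product name visible in the browser) is present in the raw,
// un-rendered HTML that a plain HTTP client receives.
function appearsInRawHtml(html, expectedSnippets) {
  const haystack = html.toLowerCase();
  return expectedSnippets.every((s) => haystack.includes(s.toLowerCase()));
}

// Usage sketch (needs network access and Node 18+ for the built-in fetch):
// const res = await fetch('https://www.examplestore.com/product/1234/');
// const html = await res.text();
// console.log(appearsInRawHtml(html, ['Some Product Name']));
```

If the helper returns false for content that is clearly visible in the browser, the page is probably rendered by JavaScript.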

Let's assume we have completed our analysis of the hypothetical ExampleStore.com and gathered the following:

Base URL: https://www.examplestore.com
Search URL: https://www.examplestore.com/search?q={searchTerm}&page={pageNumber}
Product URL: https://www.examplestore.com/product/{productId}/
Title Selector: h1.product-title
Description Selector: div.product-desc
Price Selector: span.price
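It can help to capture these findings in one place before configuring anything. The sketch below is one possible way to do that. The names `SITE`, `buildSearchUrl`, and `buildProductUrl` are our own, chosen for illustration.

```javascript
// Findings from the (hypothetical) ExampleStore.com analysis.
const SITE = {
  baseUrl: 'https://www.examplestore.com',
  selectors: {
    title: 'h1.product-title',
    description: 'div.product-desc',
    price: 'span.price',
  },
};

// Fill in the {searchTerm} and {pageNumber} placeholders of the search URL.
function buildSearchUrl(searchTerm, pageNumber = 1) {
  return `${SITE.baseUrl}/search?q=${encodeURIComponent(searchTerm)}&page=${pageNumber}`;
}

// Fill in the {productId} placeholder of the product URL.
function buildProductUrl(productId) {
  return `${SITE.baseUrl}/product/${encodeURIComponent(productId)}/`;
}

console.log(buildSearchUrl('running shoes', 2));
// https://www.examplestore.com/search?q=running%20shoes&page=2
console.log(buildProductUrl(1234));
// https://www.examplestore.com/product/1234/
```

Encoding the search term keeps spaces and special characters from breaking the query string.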

Armed with this information, we're ready to configure our Cheerio scraper!

Configuring Cheerio Scraper

Head over to Apify and create a new Cheerio Scraper actor. We'll use the following key configuration fields:

Start URLs – We'll set this to our search URL with concrete values filled in for the searchTerm and pageNumber parameters, for example https://www.examplestore.com/search?q=shoes&page=1.

Link Selector – We want to extract links from each page, so we'll use a simple "a" selector. The Glob Patterns then decide which of those links actually get enqueued.

Pseudo URLs – No need for these since we're using Glob Patterns.

Glob Patterns – We'll create two patterns here:

To match pagination URLs like https://www.examplestore.com/search?q=shoes&page=2
To match product URLs like https://www.examplestore.com/product/1234/

User Data – Useful for labeling request types, e.g. "SEARCH", "LISTING", or "PRODUCT", so the Page Function can tell them apart.

Page Function – This is where we'll write custom logic to handle each page type.
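Put together, the fields above might look roughly like the input sketched below. The field names (`startUrls`, `linkSelector`, `globs`) and the `userData` attached to each entry are assumptions about the actor's input schema, so check them against the actor's Input tab. `globToRegExp` and `labelFor` are hypothetical helpers for trying patterns locally. They are a simplified stand-in, not Apify's actual matching logic.

```javascript
// Sketch of a Cheerio Scraper input. Field names are assumed from the
// actor's input schema; verify them in the actor's Input tab.
const input = {
  startUrls: [
    { url: 'https://www.examplestore.com/search?q=shoes&page=1', userData: { label: 'SEARCH' } },
  ],
  linkSelector: 'a',
  globs: [
    { glob: 'https://www.examplestore.com/search?q=*&page=*', userData: { label: 'SEARCH' } },
    { glob: 'https://www.examplestore.com/product/*/', userData: { label: 'PRODUCT' } },
  ],
};

// Simplified glob matching for local experiments only:
// "*" matches any run of characters except "/", everything else is literal.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '[^/]*')}$`);
}

// Return the label of the first glob that matches, or null if none does.
function labelFor(url) {
  const hit = input.globs.find(({ glob }) => globToRegExp(glob).test(url));
  return hit ? hit.userData.label : null;
}

console.log(labelFor('https://www.examplestore.com/search?q=shoes&page=2')); // SEARCH
console.log(labelFor('https://www.examplestore.com/product/1234/'));         // PRODUCT
console.log(labelFor('https://www.examplestore.com/about'));                 // null
```

Trying patterns against a handful of real URLs like this is a cheap way to catch a glob that is too broad or too narrow before starting a run.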

Let's save this configuration as an Apify task, and we're ready to move on to writing the Page Function next!

Writing the Page Function

The Page Function runs on each page loaded by Cheerio Scraper. Here we can analyze the current request, perform actions like data extraction or enqueueing more URLs, and return scraped data.

First we'll check request.userData.label to determine the page type:

if (request.userData.label === 'SEARCH') {
  // Search page logic
} else if (request.userData.label === 'LISTING') {
  // Category listing page logic
} else if (request.userData.label === 'PRODUCT') {
  // Product detail page logic
}

For the search page, we may want to scrape the total number of products found and log it.

On listing pages, we can enqueue more product detail URLs to scrape.
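The search and listing branches could be sketched as follows. This assumes the page function's context exposes `$`, `request`, `log`, and `enqueueRequest`, as Cheerio Scraper's documentation describes. The selectors `span.result-count` and `a.product-link` and the function name `handleNonProductPage` are hypothetical placeholders, not taken from a real site.

```javascript
// Sketch of the SEARCH and LISTING branches. The selectors
// 'span.result-count' and 'a.product-link' are hypothetical placeholders.
async function handleNonProductPage(context) {
  const { request, $, log, enqueueRequest } = context;
  const { label } = request.userData;

  if (label === 'SEARCH') {
    // Log how many products the search page reports.
    const total = $('span.result-count').text().trim();
    log.info(`Search ${request.url} reported ${total} products`);
  } else if (label === 'LISTING') {
    // Collect product links and queue each one as a PRODUCT request.
    const hrefs = $('a.product-link')
      .map((i, el) => $(el).attr('href'))
      .get();
    for (const href of hrefs) {
      await enqueueRequest({
        url: new URL(href, request.url).href, // resolve relative links
        userData: { label: 'PRODUCT' },
      });
    }
  }
  // No value is returned for these page types; only product pages
  // are meant to produce dataset items.
}
```

Resolving each href against `request.url` matters because listing pages often use relative links, which cannot be enqueued as-is.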

Finally, on product pages we can use our selectors to extract critical data fields, build a results object, and return it:

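// Selectors gathered during the site analysis
const selectors = {
  title: 'h1.product-title',
  description: 'div.product-desc',
  price: 'span.price',
};

// Build one result object per product page; trim() strips stray whitespace
const results = {
  url: request.url,
  title: $(selectors.title).text().trim(),
  description: $(selectors.description).text().trim(),
  price: $(selectors.price).text().trim(),
};

return results;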

The objects returned from each product page will be stored in the resulting dataset.

This page function logic allows us to handle each page type differently and customize our scraping behavior.

Integrating with Gmail to Receive Scraped Data

Now that our scraper is extracting data, we likely want to receive the results without having to manually export them from Apify after each run.

Luckily, Apify provides 60+ integrations including email, cloud storage, databases, APIs, and more.

Let's set up a Gmail integration so our scraped dataset gets emailed to us automatically.

In your actor, go to the Integrations tab:

Click "Add integration" and select Gmail
Login with your Google account
Select the email address to send results to
Customize subject line and body as needed
Hit Save and enable the integration

Now each time our actor runs, we'll receive an email with the scraped data file attached!

Scheduling the Scraper to Run Automatically

The last step is to set up a schedule so our scraper runs automatically on a recurring basis.

In the actor page, click "Schedules" and then "Add schedule".

We can give the schedule a label, set a cron expression for recurring runs (e.g. daily), and select a timezone.
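For reference, here are a few common cron expressions, assuming the scheduler accepts standard five-field cron syntax. The object keys and the `hasFiveFields` helper are our own, for illustration only.

```javascript
// Five-field cron syntax: minute hour day-of-month month day-of-week.
const cronExamples = {
  dailyAt6am: '0 6 * * *',       // every day at 06:00
  everySixHours: '0 */6 * * *',  // at 00:00, 06:00, 12:00 and 18:00
  mondaysAt8am: '0 8 * * 1',     // every Monday at 08:00
};

// Tiny sanity check: a standard cron expression has exactly five fields.
function hasFiveFields(expression) {
  return expression.trim().split(/\s+/).length === 5;
}

for (const [name, expression] of Object.entries(cronExamples)) {
  console.log(name, expression, hasFiveFields(expression));
}
```

The times are interpreted in the timezone selected for the schedule, so a daily run at 06:00 follows that timezone rather than UTC.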

Now our Cheerio e-commerce scraper will run like clockwork according to the schedule, scrape updated data, and deliver the results right to our inbox!

Conclusion

In this tutorial we:

Analyzed an e-commerce site to identify key data fields
Configured a Cheerio Scraper with Start URLs, Glob Patterns, and Page Function
Wrote custom page logic to handle search, listing, and product pages
Integrated Gmail to automatically receive scraped datasets
Scheduled the actor to run on a recurring basis

You now have a complete blueprint for building scalable e-commerce web scrapers with Cheerio!

Some next steps to extend your scraper:

Scrape additional data fields like reviews, images, inventory, etc.
Save scraped data to a database or API instead of just email
Deploy the scraper on Apify Cloud for serverless scalability

We've only scratched the surface of what's possible with web scraping and automation using Apify. To learn more, check out:

Documentation for all Apify SDKs and tools like Puppeteer Scraper and Crawlee
Web Scraping 101 resource including free ebooks and video courses
Apify's Community Forum to ask questions and meet fellow developers

Happy scraping!

