Web Scraping with Client-Side Vanilla JavaScript

Web scraping is the process of extracting data from websites programmatically. While many popular web scraping tools utilize server-side technologies like Python and Node.js, it's also possible to scrape the web using only client-side JavaScript. In this post, we'll explore how to leverage your existing knowledge of Vanilla JS to start scraping without learning any new frameworks.

Why Scrape with Vanilla JavaScript?

Here are some of the key benefits of scraping with Vanilla JS:

Low barrier to entry – If you already know JavaScript, you can get started with web scraping quickly without learning a new language. Vanilla JS scraping has a gentle learning curve.
Front-end focus – For developers working mostly on front-end projects, Vanilla JS scraping allows you to reuse your existing skills.
Lightweight – Client-side scraping avoids the overhead of setting up and maintaining servers to run your scrapers.
Portability – Vanilla JS scrapers can run directly in the browser, making it easy to share and deploy your scrapers.
Stealth – Requests come from a real browser session, so they can look more like ordinary traffic than requests sent by a server-side script, although websites can still detect and block automated access.

So if you want a simple way to start extracting data from the web with JavaScript, Vanilla JS scraping is a great option! Next let's look at how it works under the hood.

How Client-Side Web Scraping Works

The basic steps for web scraping with Vanilla JS are:

Use fetch() to download the HTML of a page.
Parse the HTML with the DOM API to extract the desired data.
Transform and store the extracted data.
Repeat steps 1-3 for additional pages.

The key is that everything happens directly in the browser instead of on a server. The fetch() method allows us to make requests to download HTML, and the DOM API provides methods like document.querySelector() to analyze the HTML and pull out the data we want.

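We can initiate the scraping process whenever we want by running our JavaScript scraper code, for example from the browser's DevTools console or from a script on the page. The code runs client-side under the browser's security model, so it can only do what a web page is allowed to do. In particular, the same-origin policy limits which responses it can read (see Common Gotchas).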

Now let's look at a simple example of a Vanilla JS scraper in action!

A Simple Example

Let's say we want to scrape some product data from an ecommerce website. Here's how we could do it with plain JavaScript:

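// Fetch the HTML of the product page
fetch('https://example.com/products/1')
  .then(response => {
    // fetch() resolves even for 404/500 responses, so check the status
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    return response.text();
  })
  .then(html => {

    // Parse the HTML string into a detached document
    const doc = new DOMParser().parseFromString(html, 'text/html');

    // querySelector() returns null when nothing matches, so use ?. to
    // get undefined instead of a TypeError for missing elements

    // Extract the product title
    const title = doc.querySelector('h1.product-title')?.textContent.trim();

    // Extract the product description
    const description = doc.querySelector('div.product-description')?.textContent.trim();

    // Extract the product price
    const price = doc.querySelector('span.price')?.textContent.trim();

    // Store the data somewhere, e.g. log to console
    const product = {
      title,
      description,
      price
    };

    console.log(product);
  })
  .catch(error => console.error('Scraping failed:', error));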

And that's really all there is to it! With just a few lines of Vanilla JS, we were able to scrape key data from a product page.

The great thing about this approach is that it directly leverages the standard web platform APIs that front-end developers are already familiar with. No special scraping libraries required!
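The "transform" step from the list of basic steps is not shown in the example above, so here is a minimal sketch of what it might look like. The helper name normalizeProduct and the sample values are made up for illustration, and the price parsing assumes "." is the decimal separator and "," groups thousands, which will not hold for every site.

```javascript
// Hypothetical helper: turn raw scraped strings into clean, typed values.
function normalizeProduct(raw) {
  // Collapse runs of whitespace left over from HTML indentation.
  const clean = (text) => (text ?? '').replace(/\s+/g, ' ').trim();

  // Keep only digits and the decimal point before parsing the number.
  // Assumes "." is the decimal separator and "," groups thousands.
  const priceText = clean(raw.price);
  const amount = Number.parseFloat(priceText.replace(/[^0-9.]/g, ''));

  return {
    title: clean(raw.title),
    description: clean(raw.description),
    // Use null rather than NaN when no number could be read.
    price: Number.isNaN(amount) ? null : amount,
  };
}

// Sample input that mimics messy textContent values.
const product = normalizeProduct({
  title: '  Blue   Widget\n',
  description: 'A sturdy\n    widget.',
  price: ' $1,299.99 ',
});
console.log(product);
```

Keeping this logic in its own function means the extraction code only has to worry about finding elements, and the cleanup rules can be adjusted per site.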

Let's dig into the key steps a bit more.

Fetching Pages

The first step is downloading the HTML of the page we want to scrape. The modern way to make HTTP requests from JavaScript is with the Fetch API.

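We can use fetch() to request any URL, but the browser only lets our script read the response if it comes from the same origin as the current page or the server allows cross-origin access through CORS headers: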

fetch('https://example.com')
  .then(response => response.text())
  .then(html => {
    // now we have the HTML of the page in the html variable
  });

The fetch() method returns a promise which resolves to a Response object containing the response data. Calling .text() on the response returns a promise that resolves with the content as text.

We provide a callback to .then() to run our scraping logic whenever the HTML is ready.
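One detail worth handling early: the promise returned by fetch() only rejects on network failure, so a 404 or 500 response still reaches our .then() callback. The sketch below wraps fetch() in a small helper that checks the status and gives up after a timeout. The name fetchHtml, the 10 second default, and the injectable fetchImpl parameter are assumptions made for this example, not part of the Fetch API.

```javascript
// Hypothetical helper: fetch a page as text, failing loudly on HTTP errors.
// `fetchImpl` is injectable so the function can be tried without a network.
async function fetchHtml(url, { timeoutMs = 10000, fetchImpl = fetch } = {}) {
  // AbortController lets us cancel a request that takes too long.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetchImpl(url, { signal: controller.signal });

    // fetch() resolves for any HTTP status, so check response.ok ourselves.
    if (!response.ok) {
      throw new Error(`Request for ${url} failed with status ${response.status}`);
    }
    return await response.text();
  } finally {
    // Always clear the timer so it cannot fire after we are done.
    clearTimeout(timer);
  }
}
```

Calling fetchHtml(url) returns a promise for the page's HTML, and any error status, timeout, or network failure ends up in a single .catch() or try/catch.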

Parsing HTML

Once we have the HTML, the next step is parsing it to extract the data we want. The best API for programmatically analyzing HTML documents in the browser is the DOM API.

We can parse an HTML string into a document using the DOMParser class:

const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');

This doc variable now contains a document object representing the parsed HTML.

We can use DOM methods like querySelector() to analyze and extract data from the document:

// Select elements
const headers = doc.querySelectorAll('h2');

// Get text content
const headerText = headers[0].textContent;

// Get attribute values
const linkUrl = doc.querySelector('a.link').getAttribute('href');

The DOM API is quite extensive: anything you can locate by inspecting a page in the browser's developer tools, you can also select and read programmatically.
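Two things tend to trip up the snippets above on real pages: querySelector() returns null when nothing matches, and href attributes are often relative. Here is a small sketch of helpers for both. The names extractLinks and textOrNull are made up for this post, and pageUrl is assumed to be the URL the HTML was fetched from. Reading the raw attribute and resolving it with new URL() avoids relying on the parsed document's own base URL, which for a DOMParser document is not the scraped page.

```javascript
// Hypothetical helper: collect every link in a parsed document as
// { text, url } objects, resolving relative hrefs against the page URL.
function extractLinks(doc, pageUrl) {
  const links = [];

  for (const anchor of doc.querySelectorAll('a[href]')) {
    const href = anchor.getAttribute('href');
    try {
      // new URL(relative, base) throws on malformed input, so guard it.
      links.push({
        text: anchor.textContent.trim(),
        url: new URL(href, pageUrl).href,
      });
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Hypothetical helper: read text from a selector that may not match anything.
function textOrNull(root, selector) {
  const element = root.querySelector(selector);
  return element ? element.textContent.trim() : null;
}
```

With these in place, extractLinks(doc, 'https://example.com/products/1') gives absolute URLs that can be fed straight back into fetch(), and textOrNull(doc, 'span.price') returns null instead of throwing when a page is missing a field.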

See this guide for more on using the DOM API for parsing and traversing HTML documents.

Storing Scraped Data

Once we've extracted the data we want from the page, the next step is storing it somewhere. Simple options include:

Logging to the console – good for debugging
Saving to a JavaScript variable or data structure
Storing in localStorage – persists across sessions
Sending to a server via AJAX – e.g. to save scraped data in a database

For example:

// Log to console
console.log(extractedData);

// Store in memory
let scrapedData = [];
scrapedData.push(extractedData);

// Save to localStorage (values must be strings, so serialize to JSON)
localStorage.setItem('data', JSON.stringify(extractedData));

// Send to server as JSON
fetch('/api/data', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(extractedData)
});

So those are some common patterns for persisting the scraped data client-side.
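Calling localStorage.setItem() with the same key overwrites the previous value, so collecting results across several pages needs a read, append, and write cycle. Below is a rough sketch of that pattern. The function name appendRecord, the scrapedAt field, and the storage parameter are assumptions for this example. The parameter exists only so the helper can be tried with any object that has getItem() and setItem().

```javascript
// Hypothetical helper: append one record to a JSON array kept in web storage.
// `storage` defaults to localStorage but any object with getItem/setItem works.
function appendRecord(key, record, storage = globalThis.localStorage) {
  let records;
  try {
    // getItem() returns null for a missing key, and JSON.parse(null) is null.
    records = JSON.parse(storage.getItem(key)) ?? [];
  } catch {
    // The stored value was not valid JSON, so start over with an empty list.
    records = [];
  }
  if (!Array.isArray(records)) records = [];

  // Timestamp each record so later runs can tell how old the data is.
  records.push({ ...record, scrapedAt: new Date().toISOString() });

  try {
    storage.setItem(key, JSON.stringify(records));
    return true;
  } catch (error) {
    // setItem() throws when the storage quota is exceeded.
    console.warn(`Could not save to storage: ${error.message}`);
    return false;
  }
}
```

Since web storage is limited to a few megabytes per origin in most browsers, the boolean return value gives the caller a chance to fall back to sending the data to a server instead.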

Scraping Multiple Pages

To scrape multiple pages, we wrap our scraping logic in a function that we can call iteratively:

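async function scrapePage(url) {
  // Fetch HTML
  // Parse HTML
  // Extract data
  // Store data
}

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  // ...
];

// Option 1: scrape each page sequentially
async function scrapeSequentially() {
  for (const url of urls) {
    await scrapePage(url);
  }
}

// Option 2: scrape all pages concurrently
async function scrapeConcurrently() {
  await Promise.all(urls.map(url => scrapePage(url)));
}

// Pick one of the two options (await only works inside async
// functions or modules, hence the wrappers)
scrapeSequentially().catch(console.error);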

We can sequentially loop through and scrape each page, or use Promise.all() to scrape multiple pages concurrently.

This allows us to scrape entire websites programmatically!
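In practice, neither extreme is ideal for a long list of URLs: a sequential loop is slow, while Promise.all() fires every request at once and rejects as soon as one page fails. A common middle ground is a small pool of workers with a pause between requests. The sketch below shows one way to do that. The names scrapeAll and sleep and the default values for concurrency and delayMs are assumptions chosen for illustration.

```javascript
// Resolve after `ms` milliseconds; used to space out requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `scrapePage` over `urls` with at most `concurrency`
// requests in flight, pausing `delayMs` after each one. Failures are recorded
// rather than thrown, so one bad page does not abort the whole run.
async function scrapeAll(urls, scrapePage, { concurrency = 2, delayMs = 500 } = {}) {
  const results = new Array(urls.length);
  let next = 0;

  async function worker() {
    // Each worker repeatedly claims the next unprocessed index.
    while (next < urls.length) {
      const index = next++;
      try {
        results[index] = { url: urls[index], data: await scrapePage(urls[index]) };
      } catch (error) {
        results[index] = { url: urls[index], error: error.message };
      }
      await sleep(delayMs);
    }
  }

  // Start the workers together and wait until all of them run out of URLs.
  const workerCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}
```

The results array keeps the same order as urls, and each entry has either a data or an error property, which makes it easy to retry just the failures later.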

Going Headless for More Scale

The examples so far run the scraping logic right in the browser. For more scale and runtime control, we can run our JavaScript scraper in a headless browser environment using a tool like Puppeteer.

Puppeteer provides a Node.js API for controlling Chrome (or Chromium) programmatically via the DevTools Protocol. This allows us to execute scraping scripts on a server while leveraging a real browser rendering engine, including JavaScript execution.

Here is an example Puppeteer script:

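const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    await page.goto('https://example.com');

    // Extract data from the page with page.$eval(), which runs the
    // callback in the browser against the first element matching the selector
    const heading = await page.$eval('h1', el => el.textContent.trim());
    console.log(heading);
  } finally {
    // Always close the browser, even if scraping fails
    await browser.close();
  }
})();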

So with Puppeteer managing the browser environment, we can scale up our client-side scrapers and run them on servers in the cloud.

There are also services like Apify and Playwright Cloud that provide hosted environments optimized for running large-scale scraping jobs.

Common Gotchas

Here are some common challenges to look out for when scraping pages with Vanilla JS:

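Same-origin policy – A page's script cannot read responses from other origins unless the target server opts in with CORS headers. Running the scraper on the target site itself (for example from the DevTools console), routing requests through a proxy you control, or using services like Apify can help.
Async execution – Network requests are asynchronous, so you need to sequence scraping steps properly with promises or async/await.
Dynamic page content – Content rendered by JavaScript is not present in the HTML returned by fetch(), and scripts inside a document created with DOMParser are never executed. For such pages you may need a headless browser, or the API endpoint the page itself calls to load its data.
Browser differences – Scripts may execute differently across browsers. Testing multiple browsers is advised.
Detecting scrapers – Websites may try to detect and block scrapers with methods like browser fingerprinting. Rotating proxies/user-agents can help, though these are generally only configurable in server-side or headless setups, not from a script running in a page.
Robot exclusion standards – Scrapers should respect standards like robots.txt. You can fetch a site's /robots.txt file and check its rules before requesting a page.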

So those are some things to be aware of. Overall though, Vanilla JavaScript provides a very useful and accessible way to start scraping the web!
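To make the robots.txt point a little more concrete, here is a deliberately simplified sketch of a checker. The function name isPathAllowed is made up for this post. It only reads groups addressed to "User-agent: *" and treats rule values as plain path prefixes, so wildcard patterns and agent-specific groups are ignored. Treat it as a starting point rather than a complete implementation of the standard.

```javascript
// Simplified robots.txt check: honors only "User-agent: *" groups and plain
// path prefixes. Wildcards (* and $) and agent-specific groups are ignored.
function isPathAllowed(robotsTxt, path) {
  const rules = [];
  let inWildcardGroup = false;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      inWildcardGroup = lastWasAgent ? inWildcardGroup || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (inWildcardGroup && value && (field === 'allow' || field === 'disallow')) {
        rules.push({ allow: field === 'allow', prefix: value });
      }
    }
  }

  // The longest matching prefix wins; Allow wins a tie. No match means allowed.
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    if (!best || rule.prefix.length > best.prefix.length ||
        (rule.prefix.length === best.prefix.length && rule.allow)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

The robots.txt text itself would come from a request to the site's /robots.txt path, which is subject to the same cross-origin restrictions as any other fetch() call. The path to check can be taken from new URL(pageUrl).pathname.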

Scraping Ethics

It's important to note that while web scraping itself is generally legal, what you do with the scraped data may not be.

Be sure to scrape ethically and responsibly. Avoid causing excessive load on websites, respect robots.txt and any UI restrictions, and don't violate websites' terms of service.

Only collect data that is publicly accessible, and never share private data from scraped websites. Use scraped data only for personal or research purposes, not commercial benefit.

Adhering to these ethical principles helps ensure the longevity of web scraping as a useful technique.

Conclusion

Here are some key points we've covered about web scraping with client-side JavaScript:

Web scraping involves programmatically extracting data from websites.
Vanilla JavaScript provides an accessible way to start scraping using standard browser APIs.
The Fetch API can retrieve page HTML, and the DOM API parses and extracts data.
Storing and transforming the extracted data, then repeating the process across pages, lets you scrape websites at scale.
Headless browser tools like Puppeteer provide more power and control.
Following ethical principles is important when web scraping.

So leveraging your existing Vanilla JS skills is a quick way to start extracting useful data from web pages. The sky is the limit once you understand the fundamental techniques!

Happy (ethical) scraping!
