Web Scraping with Elixir: An In-Depth Tutorial

Web scraping is the process of extracting data from websites. It allows you to collect information from across the web at a scale that would be impractical to gather manually. Web scraping has a wide range of use cases, from monitoring prices and inventory, to aggregating data for analysis, to generating leads and contacts.

While you can scrape websites using any programming language, Elixir offers some unique advantages that make it well-suited for web scraping. Elixir runs on the Erlang virtual machine (the BEAM), which provides high concurrency, fault tolerance, and distributed computing capabilities out of the box. This allows Elixir web scrapers to efficiently crawl and extract data from many web pages in parallel.
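
As a quick illustration of that parallelism (a standalone sketch, separate from the scraper we build below; fetch_page here is just a placeholder for a real HTTP client call), Task.async_stream processes a list of URLs concurrently with a bounded pool of workers:

# Placeholder "fetcher" standing in for a real HTTP client call
fetch_page = fn url -> {url, :fetched} end

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

# Run up to 50 fetches at a time, each with a 15-second timeout
urls
|> Task.async_stream(fetch_page, max_concurrency: 50, timeout: 15_000)
|> Enum.map(fn {:ok, result} -> result end)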

In this in-depth tutorial, we'll walk through how to build a highly scalable web scraper in Elixir from the ground up. We'll cover everything you need to know – from initial project setup to extracting and parsing the scraped data.

By the end, you'll have a fully working web scraper that can crawl product listings on Amazon to find the lowest prices on graphics cards. You'll also gain a solid foundation in web scraping concepts and techniques that you can apply to scrape data from any website for your own projects and applications.

Setting Up an Elixir Project for Web Scraping

First, make sure you have Elixir and Mix installed on your system. Then create a new Mix project for the web scraper:

mix new price_spider --sup

The --sup option generates an OTP application skeleton with a supervision tree, which we'll use to manage the crawler processes.
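
The generated lib/price_spider/application.ex looks roughly like this (Crawly supervises its own worker processes, so the children list can stay empty for this tutorial):

defmodule PriceSpider.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # No children needed yet; Crawly runs its own supervision tree
    children = []

    opts = [strategy: :one_for_one, name: PriceSpider.Supervisor]
    Supervisor.start_link(children, opts)
  end
end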

Add the Crawly web scraping framework and Floki HTML parser to your dependencies in mix.exs:

defp deps do  
  [
    {:crawly, "~> 0.13.0"},
    {:floki, "~> 0.26.0"}
  ]
end

Fetch the dependencies:

mix deps.get

Configure Crawly by creating a config/config.exs file:

import Config

config :crawly,
  middlewares: [
    {Crawly.Middlewares.UserAgent, user_agents: [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/74.0.3729.169"  
    ]}
  ],
  pipelines: [
    {Crawly.Pipelines.WriteToFile, extension: "jl"}
  ]

This sets a browser-like user agent and configures Crawly to write output to a JSON lines file.
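
Depending on your Crawly version, you may also need the JSONEncoder pipeline so that items are serialized to JSON before WriteToFile writes them, and a couple of extra middlewares help avoid duplicate or off-site requests. A fuller sketch (module and option names follow Crawly's documentation; confirm them for the version you install):

import Config

config :crawly,
  middlewares: [
    # drop requests that leave the spider's base_url domain
    Crawly.Middlewares.DomainFilter,
    # avoid requesting the same URL twice
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/74.0.3729.169"
    ]}
  ],
  pipelines: [
    # serialize each scraped item to JSON before it is written
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]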

Building a Basic Crawler with Crawly

Crawly spiders define the crawler logic specific to a website. Basic spiders implement three callbacks:

  • init/0 – Initializes the spider with starting URLs
  • base_url/0 – Specifies the base URL of the website being crawled
  • parse_item/1 – Parses the response and extracts data to output

Create a new spider in lib/price_spider/spiders/basic_spider.ex:

defmodule PriceSpider.BasicSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.amazon.com"

  @impl Crawly.Spider  
  def init() do
    [
      start_urls: [
        "https://www.amazon.com/ZOTAC-Graphics-IceStorm-Advanced-ZT-A30810B-10P/dp/B09QPT6H1G",  
        "https://www.amazon.com/ASUS-Graphics-DisplayPort-Axial-tech-2-9-Slot/dp/B096L7M4XR",
        # ... more URLs
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    %Crawly.ParsedItem{:items => [], :requests => []}  
  end
end

The init/0 callback specifies the initial URLs to crawl. We've supplied a few Amazon URLs for RTX 3080 graphics cards.

The parse_item/1 callback currently just returns an empty ParsedItem struct. This is where we'll add code later to extract the scraped data.

Start an IEx shell inside the project with iex -S mix, then run the spider:

iex> Crawly.Engine.start_spider(PriceSpider.BasicSpider)

Crawly will request each of the start URLs and pass the responses to parse_item/1. However, nothing useful is extracted yet.

Extracting Data from HTML with Floki

To extract the relevant data from each crawled Amazon page, we'll use Floki to parse the HTML and locate the data with CSS selectors.

Floki provides a convenient API for working with HTML similar to jQuery. We can load HTML and search it with:

iex> {:ok, document} = Floki.parse_document(html)

iex> Floki.find(document, "css-selector") 

This will return the HTML elements matching the given CSS selector.
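
For example, on a simplified snippet of HTML (not Amazon's real markup), the calls we'll rely on look like this:

iex> html = ~s(<span id="productTitle"> ZOTAC Gaming GeForce RTX 3080 Ti </span>)

iex> {:ok, document} = Floki.parse_document(html)

iex> document |> Floki.find("span#productTitle") |> Floki.text() |> String.trim()
"ZOTAC Gaming GeForce RTX 3080 Ti"

iex> document |> Floki.find("span#productTitle") |> Floki.attribute("id")
["productTitle"]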

Looking at the source of the Amazon pages, we can see the product name is in a span#productTitle tag and the price is in .a-price .a-offscreen.

Update the parse_item/1 callback to extract this data:

@impl Crawly.Spider
def parse_item(response) do
  {:ok, document} = 
    response.body
    |> Floki.parse_document()

  title = 
    document
    |> Floki.find("span#productTitle") 
    |> Floki.text()
    |> String.trim()

  price =
    document 
    |> Floki.find(".a-price .a-offscreen")
    |> Floki.text()
    |> String.trim()

  %Crawly.ParsedItem{
    :items => [
      %{title: title, price: price, url: response.request_url}    
    ],
    :requests => []
  }
end

This code parses the HTML body, extracts the product title and price, and returns them in the items field of a ParsedItem struct. The original URL crawled is included too.
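
One caveat: Amazon changes its markup frequently, and if a selector matches nothing, Floki.find/2 returns an empty list and the item is emitted with blank fields. A small helper makes that case explicit (extract_field/2 is a name introduced here for illustration, not part of Crawly or Floki):

defp extract_field(document, selector) do
  case Floki.find(document, selector) do
    [] -> nil
    nodes -> nodes |> Floki.text() |> String.trim()
  end
end

You could then skip items whose title or price comes back as nil rather than writing blank records to the output file.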

When you run the spider again, it will now output the scraped data for each product to a file like tmp/price_spider_TIMESTAMP.jl:

{"title":"ZOTAC Gaming GeForce RTX 3080 Ti 12GB","price":"$1,199.88","url":"https://www.amazon.com/ZOTAC-Graphics-IceStorm-Advanced-ZT-A30810B-10P/dp/B09QPT6H1G"}

{"title":"ASUS ROG Strix NVIDIA GeForce RTX 3080 Ti OC","price":"$1,786.99","url":"https://www.amazon.com/ASUS-Graphics-DisplayPort-Axial-tech-2-9-Slot/dp/B096L7M4XR"}

Discovering and Crawling New Pages

So far the spider can only scrape the specific product pages we give it. To make it more useful, let's have it discover product links from an initial Amazon search results page.

Create a new spider in lib/price_spider/spiders/amazon_spider.ex:

defmodule PriceSpider.AmazonSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.amazon.com"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [  
        "https://www.amazon.com/s?k=rtx+3080"
      ]  
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = 
      response.body
      |> Floki.parse_document()

    # Extract and follow pagination links  
    requests =
      document
      |> Floki.find("a.s-pagination-item")
      |> Floki.attribute("href")
      |> Enum.map(fn url -> 
        url |> build_absolute_url(response.request_url) 
      end)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Extract product URLs to crawl    
    product_requests = 
      document
      |> Floki.find("a.a-link-normal.s-no-outline")   
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        url |> build_absolute_url(response.request_url)
      end)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Parse product details if this is a product page
    {items, requests} =
      case Regex.match?(~r{/dp/}, response.request_url) do
        true -> 
          title = extract_title(document)
          price = extract_price(document)

          {[%{title: title, price: price, url: response.request_url}], []}
        false ->
          {[], product_requests ++ requests}
      end

    %Crawly.ParsedItem{:items => items, :requests => requests}
  end

  # Convert relative URLs to absolute
  defp build_absolute_url(url, request_url), do:
    URI.merge(request_url, url) |> to_string()

  defp extract_title(document) do
    document 
    |> Floki.find("span#productTitle") 
    |> Floki.text(deep: false)
    |> String.trim()
  end

  defp extract_price(document) do
    document
    |> Floki.find(".a-price .a-offscreen") 
    |> Floki.text(deep: false)
    |> String.trim()
  end  
end

This spider starts with an Amazon search URL for "rtx 3080". The parse_item/1 callback handles both the search results pages and individual product pages.

For search results pages, it extracts the pagination links and converts them to requests. It also extracts the links to each product listing and converts those to requests.

When a product page is parsed, it extracts the product title and price as before.

A ParsedItem is returned with the scraped product data in items and any discovered links to crawl in requests.

Now when you run PriceSpider.AmazonSpider, it will start with the search results, discover product and pagination links, crawl the products to scrape their details, and continue until it has crawled all the results pages. The scraped data will be output to a file.
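
Start it from IEx the same way as the basic spider; if a crawl runs longer than you want, Crawly's engine can also stop it:

iex> Crawly.Engine.start_spider(PriceSpider.AmazonSpider)

iex> Crawly.Engine.stop_spider(PriceSpider.AmazonSpider)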

Responsible Web Scraping

When scraping websites, it's important to do so responsibly and ethically. Some best practices include:

  • Respect robots.txt. Many sites specify crawling rules in a robots.txt file. Crawly can automatically obey these.

  • Limit concurrent requests and crawl rate. Scraping too aggressively can put excessive load on websites. Start slowly and throttle your crawl speed.

  • Set a descriptive user agent. This allows site owners to contact you if there are issues with your scraper.

  • Cache pages and data. Avoid repeatedly scraping the same content unnecessarily. Save responses locally.

  • Use APIs when available. If a site provides a public API, that is usually a better option than scraping it.

Web scraping is a powerful tool, but comes with responsibility. Scrapers can inadvertently cause issues like increased server costs or even downtime for websites. Always consider the impact your web scraping may have.
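
Crawly provides configuration hooks for several of the practices above. A minimal sketch for config/config.exs (middleware and option names follow Crawly's documentation; confirm them for the version you are running):

config :crawly,
  # keep the load on any single site low
  concurrent_requests_per_domain: 2,
  middlewares: [
    # honor the site's robots.txt rules
    Crawly.Middlewares.RobotsTxt,
    # identify your scraper so site owners can reach you
    {Crawly.Middlewares.UserAgent, user_agents: [
      "PriceSpiderBot/1.0 (+https://example.com/contact)"
    ]}
  ]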

Taking Your Elixir Web Scraper Further

We've built a basic but functional Elixir web scraper for Amazon products. There are many ways you could expand on it:

  • Scrape additional data like ratings, reviews, specifications, etc.
  • Support inputting arbitrary search terms
  • Crawl other major retailers like Newegg, Best Buy, etc. to compare prices
  • Set up a recurring job to periodically check prices (see the sketch after this list)
  • Push notifications when prices drop below a certain threshold
  • Integrate with a frontend app to display the scraped data
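
As a starting point for the recurring-job idea, here is a minimal sketch using a plain GenServer and Process.send_after; the six-hour interval and the PriceSpider.Scheduler module are placeholders introduced here, not part of Crawly:

defmodule PriceSpider.Scheduler do
  use GenServer

  @interval :timer.hours(6)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule_crawl()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:crawl, state) do
    # kick off a fresh crawl, then queue the next one
    Crawly.Engine.start_spider(PriceSpider.AmazonSpider)
    schedule_crawl()
    {:noreply, state}
  end

  defp schedule_crawl do
    Process.send_after(self(), :crawl, @interval)
  end
end

Add PriceSpider.Scheduler to the children list in PriceSpider.Application so the supervision tree restarts it if it ever crashes.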

Web scraping opens up all sorts of possibilities for building useful applications on top of data publicly available on the internet. With the foundation of web scraping in Elixir you've learned in this tutorial, you're well equipped to tackle your own scraping projects, while enjoying the performance, reliability, and scalability benefits of the Elixir ecosystem. Happy scraping!
