OCaml Web Scraping: From Beginner to Advanced

OCaml may not be the first language that comes to mind for web scraping, but this modern functional programming language is an excellent choice. Its static typing and expressive features like pattern matching can make scraping code more robust and maintainable. Let's dive into scraping websites with OCaml, from simple static sites to dynamic pages rich with JavaScript.

Setting Up an OCaml Environment for Web Scraping

Before we start coding, you'll need to install OCaml and some useful tools and libraries. The easiest way is via the opam package manager:

  1. Install opam by following the official instructions for your operating system.
  2. Initialize opam and install the latest OCaml compiler:

opam init
opam switch create 4.13.1
eval $(opam env)

  3. Install the dune build system:

opam install dune

  4. Install the libraries we'll be using:

opam install cohttp-lwt-unix lambdasoup

Now create a new dune project for your scraper:

dune init proj my_scraper
cd my_scraper

Open the bin/dune file and add the libraries to the executable stanza:

(executable
 (name main)
 (libraries cohttp-lwt-unix lambdasoup))

You're all set to start writing your OCaml scraper!

Scraping Static Sites with Cohttp and Lambdasoup

As an example, let's scrape quotes from the static site http://quotes.toscrape.com. Open bin/main.ml and replace its contents with this code:

open Lwt.Infix
open Cohttp_lwt_unix
open Soup

let main () =
  Client.get (Uri.of_string "http://quotes.toscrape.com") >>= fun (_resp, body) ->
  body |> Cohttp_lwt.Body.to_string >|= fun html ->
  let soup = Soup.parse html in
  soup $$ "div.quote" |> Soup.to_list
  |> List.iter (fun quote ->
       let text = quote $ "span.text" |> Soup.R.leaf_text in
       let author = quote $ "small.author" |> Soup.R.leaf_text in
       Printf.printf "%s - %s\n" text author)

let () = Lwt_main.run (main ())

This code does the following:

  1. Sends a GET request to the quotes website using Cohttp
  2. Converts the response body to a string
  3. Parses the HTML using Lambdasoup
  4. Finds all quote div elements using the CSS selector div.quote
  5. For each quote, extracts the text and author using CSS selectors
  6. Prints out each quote's text and author

Build and run it:

dune exec bin/main.exe

You should see output like:

"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling
...

With just a few lines of OCaml, we fetched the quotes page, parsed its HTML, and extracted structured data using CSS selectors. Lambdasoup provides many more options for traversing and manipulating HTML elements – see its full documentation for details.
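For instance, the optional selector `$?` and the `attribute` accessor avoid exceptions when markup varies. Here is a short sketch against the same quotes page (the `a.tag` selector matches the tag links on that site; the first selector is deliberately bogus to show the option case):

```ocaml
(* Sketch: exception-free traversal with Lambdasoup.
   `$?` yields an option instead of raising when nothing matches,
   and `attribute` reads an attribute off an element. *)
let explore html =
  let open Soup in
  let soup = parse html in
  (* Optional select: safe when the element may be missing *)
  (match soup $? "div.no-such-thing" with
   | None -> print_endline "no match, and no exception"
   | Some _ -> print_endline "found one");
  (* Collect the href of every tag link *)
  soup $$ "a.tag"
  |> to_list
  |> List.filter_map (attribute "href")
  |> List.iter print_endline
```

Call it with the `html` string fetched earlier. Anything that might be absent is worth reading through `$?` or `attribute` rather than the raising `$` and `R` variants.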

Some additional best practices to consider:

  • Set a custom User-Agent header on your requests to clearly identify your scraper
  • Limit your request rate to avoid overloading servers
  • Handle errors and corner cases, like missing elements or connection issues
  • Export your data to a structured format like CSV or JSON for further analysis
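Putting a few of these into practice, here is a minimal sketch; the User-Agent string, contact address, and one-second delay are illustrative placeholders, not requirements:

```ocaml
open Lwt.Infix
open Cohttp_lwt_unix

(* A polite GET: identifies the scraper, pauses between requests,
   and surfaces connection errors instead of crashing. *)
let polite_get url =
  let headers =
    Cohttp.Header.init_with "User-Agent" "my_scraper/0.1 (contact@example.com)"
  in
  Lwt.catch
    (fun () ->
      Client.get ~headers (Uri.of_string url) >>= fun (resp, body) ->
      let status = Cohttp.Response.status resp in
      Cohttp_lwt.Body.to_string body >>= fun html ->
      (* Rate-limit: pause before the caller issues the next request *)
      Lwt_unix.sleep 1.0 >|= fun () ->
      Some (status, html))
    (fun exn ->
      Printf.eprintf "Request to %s failed: %s\n" url (Printexc.to_string exn);
      Lwt.return_none)
```

From there, exporting to CSV can be as simple as printing comma-separated fields to a file channel, taking care to escape any embedded quotes or commas.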

Scraping Dynamic Sites with ocaml-webdriver

Many modern websites rely heavily on JavaScript to load data and render pages. The ocaml-webdriver library lets you drive a real browser to scrape such dynamic sites.

First, you'll need to set up a Selenium WebDriver for your browser of choice by following the official Selenium documentation. For this example, we'll use Firefox via geckodriver.

Once you have geckodriver running, install the ocaml-webdriver library:

opam pin add webdriver https://github.com/art-w/ocaml-webdriver.git

We'll scrape the dynamic movie list from https://www.scrapethissite.com/pages/ajax-javascript as an example. Here's the full code:

open Webdriver_cohttp_lwt_unix
open Infix

(* A helper to iterate through lists of elements *)
let rec iter_elements f lst =
  match lst with
  | [] -> return ()
  | h :: t -> 
    let* () = f h in
    iter_elements f t  

(* Scraping action *)
let scrape_movies () =
  (* Go to the dynamic page *)
  let* _ = go "https://www.scrapethissite.com/pages/ajax-javascript" in

  (* Find and click the 2015 link to load those movies *)
  let* year_link = find_el_by_css "a[href='#2015']" in
  let* _ = click_el year_link in

  (* Wait for the movie list to load *)
  let* _ = wait_for_css ".film-title" in

  (* Extract and print the movie titles *)
  let* titles = find_els_by_css ".film-title" in
  iter_elements (fun title ->
    let* title_text = text title in
    Printf.printf "%s\n" title_text;
    return ()
  ) titles

(* Set up browser *)
let () =
  Lwt_main.run (
    with_session firefox_headless (fun _env ->
      catch
        (fun () -> scrape_movies ())
        (fun exn ->
          (match exn with
           | Webdriver _ ->
             Printf.printf "Webdriver failure: %s\n" (Printexc.to_string exn)
           | _ -> raise exn);
          return ())
    )
  )

This code:

  1. Defines a helper iter_elements to loop through element lists

  2. Defines the scraping action:

    • Navigates to the dynamic movie page
    • Clicks the 2015 link to load those movies
    • Waits for the movie titles to appear
    • Finds all movie title elements
    • Prints out each movie title
  3. Sets up a headless Firefox browser session and runs the scraping action, handling any WebDriver errors.

Build and run it:

dune exec bin/main.exe

You should see output like:

Spotlight
The Big Short  
Bridge of Spies
...  

ocaml-webdriver provides many functions for finding elements, interacting with them, and waiting for conditions. Refer to its API documentation to support your specific scraping needs.

Some tips for browser scraping:

  • Use waits judiciously to handle content loading, but don't wait too long
  • Avoid complex actions like drag-and-drop that may break
  • Keep your scrapers single-purpose for maintainability
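On the first tip, a bounded retry loop keeps waits short when content is already present. The helper below is hypothetical, written against plain Lwt rather than any ocaml-webdriver API; the name and the defaults are illustrative choices:

```ocaml
(* Retry [check] until it succeeds or the [attempts] budget runs out,
   sleeping [delay] seconds between tries. Re-raises the last failure. *)
let rec wait_until ?(attempts = 10) ?(delay = 0.5) check =
  Lwt.catch check (fun exn ->
      if attempts <= 1 then Lwt.fail exn
      else
        Lwt.bind (Lwt_unix.sleep delay) (fun () ->
            wait_until ~attempts:(attempts - 1) ~delay check))
```

With ten attempts at half a second each, the worst case is a five-second wait, but a page that loads immediately costs nothing extra.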

Alternatives to Web Scraping

While web scraping is useful, it's not always the best approach. Some alternatives:

  • Many sites provide official APIs to access their data in a structured format. Using APIs is usually faster and more reliable than scraping.
  • For simpler extraction needs, no-code tools like ScrapingBee can quickly fetch and process pages without writing code.
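When a JSON API is available, the same Cohttp stack covers it. A sketch, assuming yojson is installed (`opam install yojson`); the URL and the "name" field are placeholders for whatever the real endpoint returns:

```ocaml
open Lwt.Infix
open Cohttp_lwt_unix

(* Fetch a JSON endpoint and pull one field out of each record.
   Expects the response body to be a JSON array of objects. *)
let fetch_names url =
  Client.get (Uri.of_string url) >>= fun (_resp, body) ->
  Cohttp_lwt.Body.to_string body >|= fun raw ->
  match Yojson.Safe.from_string raw with
  | `List items ->
    List.filter_map
      (fun item -> Yojson.Safe.Util.(item |> member "name" |> to_string_option))
      items
  | _ -> []
```

Compared with scraping, there is no HTML to parse and no selectors to break when the site's markup changes.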

Scraping should be a last resort when no APIs are available. When you do scrape, respect website terms of service and robots.txt restrictions.

Conclusion

OCaml is a powerful language for web scraping tasks from simple to advanced. Its strong typing, expressiveness, and active library ecosystem make building maintainable scrapers a joy.

We walked through basic scraping of static pages using Cohttp and Lambdasoup, as well as dynamic scraping with Ocaml-webdriver. You should now have a foundation to scrape most sites, whether simple quote lists or modern JavaScript-heavy apps.

There's much more to learn about OCaml and scraping, from concurrent requests to browser fingerprinting. Here are some resources to dive deeper:

  • Real World OCaml, a comprehensive book on OCaml
  • HTTrack, a popular website copying tool
  • ScrapingBee's blog, with web scraping guides and tutorials

Happy scraping with OCaml!
