Web Scraping with Haskell: A Comprehensive Guide

Web scraping, the process of programmatically extracting data from websites, is an incredibly useful skill for data scientists, researchers, marketers, and developers alike. While Python tends to be the go-to language for scraping, Haskell's strong static typing, purity, and expressive semantics make it a great alternative, especially for those who value correctness and maintainability.

In this in-depth guide, we'll walk through how to scrape websites using popular Haskell libraries like Scalpel and webdriver. By the end, you'll have the knowledge and tools to extract, process, and store data from any webpage. Let's dive in!

Basic Scraping with Scalpel

For simple, static websites, Scalpel is a lightweight and easy-to-use choice. Scalpel provides a declarative API for building "scrapers" – objects that specify what parts of an HTML document to extract.

Here's a basic example of using Scalpel to scrape a list of cities from Wikipedia:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Text.HTML.Scalpel

-- The record each table row is parsed into
data City = City
  { name       :: Text
  , country    :: Text
  , population :: Text
  } deriving Show

-- Make an HTTP request and scrape the list of cities
allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" cities

-- Select the table of cities and return a list of City records
cities :: Scraper Text [City]
cities = chroot ("table" @: [hasClass "wikitable"]) $
         chroots "tr" city

-- Extract data from each table row into a City record
city :: Scraper Text City
city = do
  name    <- text "td"
  country <- text ("td" // "a")
  -- The population lives in a data-sort-value attribute on one of the cells
  pop     <- attr "data-sort-value" ("td" @: [match (\k _ -> k == "data-sort-value")])
  return $ City name country pop

Let's break this down. The cities scraper selects the wikitable and returns a list of City records by applying the city scraper to each table row. The city scraper extracts the city name, country, and population from specific cells, returning a City record. Finally, allCities kicks off the actual scraping, making an HTTP request and feeding the response body to cities.

One of Haskell's key advantages shines here: the type system. The City data type guides the entire scraping process, providing structure, documentation, and compile-time guarantees about the shape of the data. In Python, a misspelled field or a missing value would only surface at runtime, if at all.

Storing and Processing Data

To store our list of scraped cities, we could write them to a CSV file using the Cassava library:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv

instance ToNamedRecord City where
    toNamedRecord (City name country pop) =
        namedRecord [ "name"       .= name
                    , "country"    .= country
                    , "population" .= pop ]

writeCities :: [City] -> IO ()
writeCities cities =
    BL.writeFile "cities.csv" $
        encodeByName (header ["name", "country", "population"]) cities

Here we declare a ToNamedRecord instance for our City type so Cassava knows how to serialize it. We can then pass the list of cities to writeCities to save them to a CSV file.
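
Going the other way, a FromNamedRecord instance lets Cassava parse the CSV back into City records. A minimal sketch, assuming the cities.csv file written above:

import qualified Data.Vector as V

instance FromNamedRecord City where
    parseNamedRecord r =
        City <$> r .: "name" <*> r .: "country" <*> r .: "population"

-- Read cities.csv back in, discarding the header
readCities :: IO (Either String [City])
readCities = do
    csv <- BL.readFile "cities.csv"
    return $ fmap (V.toList . snd) (decodeByName csv)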

For heavier data processing, Haskell has great libraries like Conduit for building pipelines that stream, filter, and transform data (a small streaming sketch follows the snippet below). For a list that fits in memory, plain list functions are enough. For example, to find the top 10 cities by population:

import Data.List (sortOn)
import Data.Maybe (fromMaybe)
import Data.Ord (Down (..))

main :: IO ()
main = do
    cities <- fromMaybe [] <$> allCities
    -- readInt is assumed to parse the population field's Text into an Int
    let top10 = take 10 $ sortOn (Down . readInt . population) cities
    mapM_ print top10
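
The snippet above is plain in-memory list processing. As a sketch of the streaming style Conduit enables, here is a small pipeline, assuming the cities.csv file produced earlier, that filters rows without ever loading the whole file:

import Conduit
import qualified Data.Text as T

-- Stream cities.csv, keep only the lines mentioning "Japan", and write them
-- to a second file; only the current chunk is held in memory.
filterJapan :: IO ()
filterJapan = runConduitRes $
       sourceFile "cities.csv"
    .| decodeUtf8C
    .| linesUnboundedC
    .| filterC (T.pack "Japan" `T.isInfixOf`)
    .| unlinesC
    .| encodeUtf8C
    .| sinkFile "japan_cities.csv"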

Handling Dynamic Pages with Selenium

For JavaScript-heavy single-page apps, we need to use a real browser to scrape dynamic content. Haskell's webdriver bindings allow automating browsers like Chrome via Selenium.

Here's how we might scrape reviews from Airbnb:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM, void)
import Data.Text (Text, unpack)
import Test.WebDriver
import Test.WebDriver.Commands.Wait (waitUntil)

data Review = Review { author :: Text, date :: Text, body :: Text }
  deriving Show

-- Run the session in Chrome rather than the default Firefox
chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig

scrapeReviews :: Text -> IO [Review]
scrapeReviews url = runSession chromeConfig $ do
    openPage (unpack url)

    -- Wait up to 10 seconds for the first review card to appear
    void $ waitUntil 10 $ findElem $ ByCSS "div._1k01n3v"

    reviewElems <- findElems $ ByCSS "div._eeq7h0"
    reviews <- forM reviewElems $ \el -> do
        name <- getText =<< findElemFrom el (ByCSS "div._1rgfcp0d")
        date <- getText =<< findElemFrom el (ByCSS "div._1jlr0ynl")
        text <- getText =<< findElemFrom el (ByCSS "div._lhzyenl")
        return $ Review name date text

    closeSession
    return reviews

After navigating to the listing URL, we wait for the reviews to be dynamically loaded by repeatedly querying for a specific DOM node for up to 10 seconds. Once the reviews appear, we find each review "card" and extract the author, date, and text into a Review record.

A few things to note when automating a browser:

  • It's slower than sending raw HTTP requests, so be judicious and cache results when you can.
  • Selenium tests can be brittle to HTML changes. Use IDs and data attributes to select elements when possible.
  • Respect the site owners and don't hammer their servers. Throttle your request rate when doing large-scale scraping (see the sketch below).
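
Even a fixed delay between page loads goes a long way toward keeping request rates polite. A minimal sketch, reusing the scrapeReviews function defined above (the two-second pause is an arbitrary example value):

import Control.Concurrent (threadDelay)
import Control.Monad (forM)

-- Scrape a list of URLs sequentially, sleeping between requests
politeScrape :: [Text] -> IO [[Review]]
politeScrape urls = forM urls $ \url -> do
    threadDelay (2 * 1000000)  -- threadDelay takes microseconds
    scrapeReviews url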

Error Handling & Monitoring

Web scraping is inherently a bit fragile since you're depending on the structure of third-party websites outside your control. It's important to build in resilience to errors and changes.

Some tips:

Use Haskell's Either and Maybe to represent and handle errors:

data ScrapeError = HttpError HttpException 
                 | ParseError Text
                 | EmptyError  

scrapeReviews :: Text -> IO (Either ScrapeError [Review])
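
A minimal sketch of how this looks in practice, reusing the Scalpel-based allCities from earlier: network failures are caught with try and mapped into ScrapeError, and an empty result becomes EmptyError.

import Control.Exception (try)
import Network.HTTP.Client (HttpException)

allCitiesSafe :: IO (Either ScrapeError [City])
allCitiesSafe = do
    result <- try allCities
    return $ case result of
        Left err          -> Left (HttpError err)  -- network-level failure
        Right Nothing     -> Left EmptyError       -- fetched, but nothing scraped
        Right (Just rows) -> Right rows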

Log and monitor your scraper using a service like Rollbar to get notified of failures.

If a site's HTML changes and breaks your scraper, consider falling back to an older scraper implementation targeting the previous structure. You can gradually phase it out as you update your main scraper.
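
A minimal sketch of that idea, assuming hypothetical scrapeReviewsNew and scrapeReviewsOld functions that target the current and previous page structures:

-- Try the scraper for the current markup first; if it finds nothing,
-- fall back to the one written for the old structure (hypothetical helpers).
scrapeWithFallback :: Text -> IO [Review]
scrapeWithFallback url = do
    reviews <- scrapeReviewsNew url
    if null reviews
        then scrapeReviewsOld url
        else return reviews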

Comparing to Other Languages

Haskell certainly has fewer web scraping resources and libraries than giants like Python or JavaScript. However, the gap is narrowing with solid libraries like Scalpel, Wreq, HandsomeSoup, and the webdriver Selenium bindings.

Haskell's advantages really pay dividends for large, complex scraping projects:

  • Strong typing catches errors early and acts as guardrails and documentation
  • Purity makes code easier to test and reason about
  • Lazy evaluation and streaming abstractions like Conduit make it possible to process huge datasets efficiently
  • Great concurrency support for parallel scraping (see the sketch below)
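
As a sketch of that last point, the async package can fan the scrapeReviews function from earlier out over several listing URLs; in practice you would also bound concurrency and throttle, as discussed above:

import Control.Concurrent.Async (mapConcurrently)

-- Scrape several pages concurrently; results come back in input order
scrapeMany :: [Text] -> IO [[Review]]
scrapeMany = mapConcurrently scrapeReviews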

But for quick-and-dirty scraping jobs, the extensive ecosystems and imperative nature of Python and JavaScript make them better choices. Use the best tool for the job!

Wrapping Up

In this guide, we've seen how to robustly scrape static and dynamic websites using Haskell. While not as mainstream as other languages, Haskell's powerfully expressive semantics make it a joy for web scraping.

To learn more, check out the documentation for Scalpel, Cassava, Conduit, and webdriver on Hackage.

Happy scraping!
