Web scraping, the process of programmatically extracting data from websites, is an incredibly useful skill for data scientists, researchers, marketers, and developers alike. While Python tends to be the go-to language for scraping, Haskell's strong static typing, purity, and expressive semantics make it a great alternative, especially for those who value correctness and maintainability.
In this in-depth guide, we'll walk through how to scrape websites using popular Haskell libraries like Scalpel and webdriver. By the end, you'll have the knowledge and tools to extract, process, and store data from a wide range of webpages. Let's dive in!
Basic Scraping with Scalpel
For simple, static websites, Scalpel is a lightweight and easy-to-use choice. Scalpel provides a declarative API for building "scrapers" – values that describe which parts of an HTML document to extract.
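Throughout the examples that follow, assume a small City record to hold each scraped row, along with the usual Scalpel imports. The field names, and the choice to keep the population as raw Text, are just illustrative:
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel
import Data.Text (Text)

-- A simple record to hold one scraped table row
data City = City
  { cityName   :: Text
  , country    :: Text
  , population :: Text
  } deriving Show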
Here‘s a basic example of using Scalpel to scrape a list of cities from Wikipedia:
-- Make an HTTP request and scrape the list of cities
allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" cities

-- Select the table of cities and return a list of City records
cities :: Scraper Text [City]
cities = chroot ("table" @: [hasClass "wikitable"]) $
  chroots "tr" city

-- Extract data from each table row into a City record
city :: Scraper Text City
city = do
  name    <- text "td"
  country <- text ("td" // "a")
  -- match selects cells that carry a data-sort-value attribute and reads its value
  pop     <- attr "data-sort-value" $ "td" @: [match (\k _ -> k == "data-sort-value")]
  return $ City name country pop
Let's break this down. The cities scraper selects the wikitable and returns a list of City records by applying the city scraper to each table row. The city scraper extracts the city name, country, and population from specific cells, returning a City record. Finally, allCities kicks off the actual scraping, making an HTTP request and feeding the response body to cities.
One of Haskell's key advantages shines here: the type system. The City data type guides the entire scraping process, providing structure, documentation, and a compile-time guarantee that every field is extracted and has the expected type. In a dynamically typed language like Python, a misspelled field or a missing value would only surface at runtime, if at all.
Storing and Processing Data
To store our list of scraped cities, we could write them to a CSV file using the Cassava library:
import Data.Csv (ToNamedRecord (..), namedRecord, encodeByName, header, (.=))
import qualified Data.ByteString.Lazy as BL

instance ToNamedRecord City where
  toNamedRecord (City name country pop) =
    namedRecord [ "name"       .= name
                , "country"    .= country
                , "population" .= pop ]

writeCities :: [City] -> IO ()
writeCities cities =
  BL.writeFile "cities.csv" $
    encodeByName (header ["name", "country", "population"]) cities
Here we declare a ToNamedRecord instance for our City type so Cassava knows how to serialize it. We can then pass the list of cities to writeCities to save them to a CSV file.
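Tying the scraper and the writer together is then a one-liner. A rough sketch, reusing allCities from the previous section:
-- Scrape the page and, if anything came back, save it as CSV
scrapeAndSave :: IO ()
scrapeAndSave = allCities >>= maybe (putStrLn "Scraping failed") writeCities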
For more complex data processing tasks, Haskell has great libraries like Conduit for building pipelines to stream, filter, and transform data (a small Conduit sketch follows the next example). Even with plain lists, simple analyses stay concise. For example, to find the top 10 cities by population:
import Data.List  (sortBy)
import Data.Maybe (fromMaybe)
import Data.Ord   (Down (..), comparing)

main :: IO ()
main = do
  cities <- fromMaybe [] <$> allCities
  -- readInt (not shown) parses the population Text into an Int;
  -- Down reverses the comparison so we sort descending
  let top10 = take 10 $ sortBy (comparing (Down . readInt . population)) cities
  mapM_ print top10
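For larger datasets, the same kind of processing can be written as a streaming pipeline. Here's a minimal sketch with the conduit package, reusing the City fields assumed earlier and the same hypothetical readInt helper (the 10,000,000 threshold is arbitrary):
import Conduit

-- Stream cities through a pipeline: keep the large ones, project out their names
largeCityNames :: [City] -> [Text]
largeCityNames cs =
  runConduitPure $
       yieldMany cs
    .| filterC (\c -> readInt (population c) > 10000000)
    .| mapC cityName
    .| sinkList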
Handling Dynamic Pages with Selenium
For JavaScript-heavy single-page apps, we need to use a real browser to scrape dynamic content. Haskell's webdriver bindings allow automating browsers like Chrome via Selenium.
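The example below assumes a small Review record and a chromeConfig that points webdriver at a local Selenium server driving Chrome. This is only a sketch; the field names are placeholders:
{-# LANGUAGE OverloadedStrings #-}

import Test.WebDriver
import Test.WebDriver.Commands.Wait (waitUntil)
import Control.Monad (forM)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical record for one scraped review
data Review = Review
  { reviewer   :: Text
  , reviewDate :: Text
  , reviewBody :: Text
  } deriving Show

-- Drive Chrome through the default local Selenium endpoint
chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig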
Here‘s how we might scrape reviews from Airbnb:
scrapeReviews :: Text -> IO [Review]
scrapeReviews url = runSession chromeConfig $ do
  openPage (T.unpack url)   -- openPage takes a String, so unpack the Text
  -- Wait up to 10 seconds for the reviews to load
  _ <- waitUntil 10 $ findElem $ ByCSS "div._1k01n3v"
  reviewElems <- findElems $ ByCSS "div._eeq7h0"
  reviews <- forM reviewElems $ \el -> do
    name <- findElemFrom el (ByCSS "div._1rgfcp0d") >>= getText
    date <- findElemFrom el (ByCSS "div._1jlr0ynl") >>= getText
    body <- findElemFrom el (ByCSS "div._lhzyenl") >>= getText
    return $ Review name date body
  closeSession
  return reviews
After navigating to the listing URL, we wait for the reviews to be dynamically loaded by repeatedly querying for a specific DOM node for up to 10 seconds. Once the reviews appear, we find each review "card", extracting the author, date, and text into a Review record.
A few things to note when automating a browser:
- It's slower than sending raw HTTP requests, so be judicious and cache results when you can.
- Selenium tests can be brittle to HTML changes. Use IDs and data attributes to select elements when possible.
- Respect the site owners and don't hammer their servers. Throttle your request rate if you're scraping at scale (a simple throttling sketch follows this list).
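As a sketch of that last point, even a fixed pause between Selenium sessions goes a long way for small jobs. threadDelay takes microseconds; the two-second figure here is arbitrary:
import Control.Concurrent (threadDelay)
import Control.Monad (forM)

-- Visit each listing in turn, pausing between Selenium sessions
scrapeAllListings :: [Text] -> IO [[Review]]
scrapeAllListings urls = forM urls $ \u -> do
  reviews <- scrapeReviews u
  threadDelay 2000000  -- 2,000,000 microseconds = 2 seconds
  return reviews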
Error Handling & Monitoring
Web scraping is inherently a bit fragile since you're depending on the structure of third-party websites outside your control. It's important to build in resiliency to errors and changes.
Some tips:
Use Haskell's Either and Maybe to represent and handle errors:
data ScrapeError = HttpError HttpException
                 | ParseError Text
                 | EmptyError

scrapeReviews :: Text -> IO (Either ScrapeError [Review])
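One way to produce that Either is to catch exceptions around the raw scraper. The sketch below assumes the Selenium function from earlier has been renamed scrapeReviewsRaw (a hypothetical name, so the wrapper can keep the scrapeReviews name) and that connection failures surface as http-client's HttpException:
import Control.Exception (try)
import Network.HTTP.Client (HttpException)

-- Run the raw scraper and translate failures into ScrapeError values
scrapeReviews :: Text -> IO (Either ScrapeError [Review])
scrapeReviews url = do
  result <- try (scrapeReviewsRaw url) :: IO (Either HttpException [Review])
  return $ case result of
    Left err      -> Left (HttpError err)
    Right []      -> Left EmptyError
    Right reviews -> Right reviews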
Log and monitor your scraper using a service like Rollbar to get notified of failures.
If a site's HTML changes and breaks your scraper, consider falling back to an older scraper implementation targeting the previous structure, and gradually phase it out as you update your main scraper (see the sketch below).
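With Scalpel specifically, Scraper has an Alternative instance, so the new and legacy selectors can be chained in one expression. The class names below are placeholders:
import Control.Applicative ((<|>))

-- Try the current markup first; if it fails, fall back to the old layout
citiesWithFallback :: Scraper Text [City]
citiesWithFallback = newLayout <|> oldLayout
  where
    newLayout = chroot ("table" @: [hasClass "wikitable"]) (chroots "tr" city)
    oldLayout = chroot ("table" @: [hasClass "sortable"])  (chroots "tr" city)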
Comparing to Other Languages
Haskell certainly has fewer web scraping resources and libraries compared to giants like Python or JavaScript. However, the gap is narrowing with solid libraries like Scalpel, Wreq, HandsomeSoup, and the webdriver Selenium bindings.
Haskell's advantages really pay dividends for large, complex scraping projects:
- Strong typing catches errors early and acts as guardrails and documentation
- Purity makes code easier to test and reason about
- Lazy evaluation and powerful abstractions like Conduit allow efficiently processing huge datasets
- Great concurrency support for parallel scraping (see the sketch after this list)
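As an illustration of the last point, the async package makes it a one-liner to scrape several pages in parallel. A minimal sketch, reusing the cities scraper from earlier:
import Control.Concurrent.Async (mapConcurrently)

-- Fetch and scrape several pages concurrently
scrapeManyPages :: [String] -> IO [Maybe [City]]
scrapeManyPages urls = mapConcurrently (\u -> scrapeURL u cities) urls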
But for quick-and-dirty scraping jobs, the extensive ecosystems and imperative nature of Python and JavaScript can make them better choices. Use the best tool for the job!
Wrapping Up
In this guide, we've seen how to robustly scrape static and dynamic websites using Haskell. While not as mainstream as other languages, Haskell's powerful, expressive semantics make it a joy for web scraping.
To learn more, check out these resources:
- Scalpel documentation
- webdriver Selenium bindings
- Haskell web scraping cookbooks
Happy scraping!