Web scraping is an essential skill for data professionals looking to gather information from the vast expanse of the internet. One powerful tool in the web scraping arsenal is HtmlUnit – an open-source "GUI-less browser for Java programs". As a headless browser, HtmlUnit allows you to automate interaction with websites, filling out forms and clicking buttons, while providing convenient methods to extract data using CSS and XPath selectors.
In this comprehensive guide, we'll walk through how to use HtmlUnit to scrape scrapethissite.com, a website specifically designed for web scraping practice with increasingly complex challenges. By the end, you'll have a solid understanding of HtmlUnit's capabilities and how to apply them to real-world scraping tasks.
Why HtmlUnit is Ideal for Scraping Scrapethissite.com
HtmlUnit is particularly well-suited for scraping scrapethissite.com for several reasons:
- JavaScript Support: Many of the realistic examples on scrapethissite.com, such as the AJAX/JavaScript page, heavily utilize JavaScript to dynamically load content. HtmlUnit has built-in JavaScript processing, allowing it to execute scripts and wait for dynamic content to load before extracting data (see the configuration sketch after this list). This is crucial for scraping modern web applications.
- Form Interaction: Scrapethissite.com includes exercises for submitting forms, such as a pagination example with a search box. With HtmlUnit, automating form submission is straightforward. You can easily locate form elements, input text, select dropdowns, click checkboxes, and submit the form, all with just a few lines of code.
- Handling Authentication: Some scraping tasks require logging into a website first. Scrapethissite.com has a login page example to practice this. HtmlUnit can automate the login process by filling out the login form and persisting cookies across requests, much like a real browser (a hedged login sketch appears at the end of this section).
- Clean, Well-Structured Pages: Compared to many real-world websites, the HTML on scrapethissite.com is generally clean and well-structured. This makes it an ideal place to practice using HtmlUnit's CSS and XPath selector methods to pinpoint the data you want to extract.
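To give a sense of what that JavaScript support looks like, here is a minimal sketch of a WebClient configured to execute scripts and wait for AJAX content before scraping. The 10-second wait and the exact URL are illustrative assumptions, not values prescribed by HtmlUnit:
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(true);            // execute scripts on the page
client.getOptions().setThrowExceptionOnScriptError(false); // tolerate scripting errors instead of aborting
HtmlPage page = client.getPage("https://scrapethissite.com/pages/ajax-javascript/");
// Block until background JavaScript (e.g. AJAX calls) finishes, waiting up to 10 seconds
client.waitForBackgroundJavaScript(10_000);
// The live page DOM now reflects the dynamically loaded content and can be queried as usual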
While HtmlUnit is useful for scraping a wide variety of sites, scrapethissite.com provides a perfect training ground to learn its ins and outs.
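And before we set up a project, one more taste: a hedged sketch of automating that login example, reusing the client from the previous sketch. The URL, field names, and button name below are hypothetical placeholders; check the actual form markup on the site before relying on them.
HtmlPage loginPage = client.getPage("https://scrapethissite.com/pages/login/"); // hypothetical URL
HtmlForm loginForm = loginPage.getForms().get(0);            // grab the first form on the page
loginForm.getInputByName("user").type("myUsername");         // hypothetical field name
loginForm.getInputByName("pass").type("myPassword");         // hypothetical field name
HtmlPage dashboard = loginForm.getButtonByName("submit").click(); // hypothetical button name
// The WebClient's cookie manager now holds the session cookie, so
// subsequent client.getPage(...) calls run as the logged-in user.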
Setting Up a Java Project with HtmlUnit
Before we dive into scraping, let's set up a basic Java project with HtmlUnit. You'll need:
- Java Development Kit (JDK) 8 or above installed
- A build tool like Maven or Gradle (this guide will use Maven)
- An IDE or text editor
First, create a new Maven project from the command line:
mvn archetype:generate -DgroupId=com.example -DartifactId=htmlunit-scraper -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Navigate into the project directory:
cd htmlunit-scraper
Open the pom.xml file and add the HtmlUnit dependency inside the <dependencies> block:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.64.0</version>
</dependency>
You're now ready to start writing a scraper! (Note: this guide uses the HtmlUnit 2.x line, which lives under the net.sourceforge.htmlunit groupId and the com.gargoylesoftware package namespace; newer 3.x releases moved to org.htmlunit for both.)
Scraping Scrapethissite.com: A Basic Example
Let's start with a simple task: scraping the list of countries and their capital cities from the Countries page.
Create a new Java class named Scraper inside the src/main/java/com/example directory. Add the following code:
package com.example;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class Scraper {
    public static void main(String[] args) {
        try (WebClient client = new WebClient()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);

            String baseUrl = "https://scrapethissite.com/pages/simple/";
            HtmlPage page = client.getPage(baseUrl);

            // getByXPath returns List<?>, so cast each node before reading its text
            List<String> countries = page.getByXPath("//h3[@class='country-name']")
                    .stream()
                    .map(node -> ((DomNode) node).getTextContent().trim())
                    .collect(Collectors.toList());

            List<String> capitals = page.getByXPath("//span[@class='country-capital']")
                    .stream()
                    .map(node -> ((DomNode) node).getTextContent())
                    .collect(Collectors.toList());

            for (int i = 0; i < countries.size(); i++) {
                System.out.printf("%s - %s\n", countries.get(i), capitals.get(i));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Here's what's happening:
- We create a new WebClient instance, disabling CSS and JavaScript support since this page doesn't require it. This improves performance.
- We load the Countries page using client.getPage(url), which returns an HtmlPage object.
- To extract the country names, we use an XPath selector //h3[@class='country-name'] to find all <h3> elements with the class country-name. Since getByXPath returns an untyped list, we cast each result to DomNode, extract its text content, trim whitespace, and collect into a list of strings.
- Similarly, we extract the capital cities using the XPath //span[@class='country-capital'].
- Finally, we loop through both lists in parallel, printing out each country and its capital.
Run this with:
mvn compile exec:java -Dexec.mainClass="com.example.Scraper"
You should see output like:
Andorra - Andorra la Vella
United Arab Emirates - Abu Dhabi
Afghanistan - Kabul
...
With just a few lines of HtmlUnit code, we've extracted structured data from the page. However, this example only scratches the surface of what's possible.
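Incidentally, the same data could be pulled with CSS selectors instead of XPath. Here is a brief sketch using HtmlUnit's querySelectorAll method, assuming the same page object as above:
// Requires: import com.gargoylesoftware.htmlunit.html.DomNodeList;
DomNodeList<DomNode> names = page.querySelectorAll("h3.country-name");
for (DomNode name : names) {
    System.out.println(name.getTextContent().trim());
}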
Advanced Scraping: Forms and Pagination
Scrapethissite.com includes more complex examples that require interacting with the page before scraping. Let's tackle the Pagination + Search exercise.
The goal is to search for a specific team, "New York Rangers", navigate through the paginated results, and extract each year and number of wins.
Here's how we can implement this:
// Additional imports needed for this example (alongside those from the basic scraper):
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

HtmlPage startPage = client.getPage("https://scrapethissite.com/pages/forms/");

// Fill out the search form
HtmlForm form = startPage.getFormByName("hockey-form");
form.getInputByName("team").type("New York Rangers");
HtmlSubmitInput submitButton = form.getInputByName("submit");
HtmlPage currentPage = submitButton.click();

int totalWins = 0;
int pageNumber = 1;

// Keep crawling until we reach the last page
while (true) {
    System.out.printf("--- Page %d ---\n", pageNumber);

    // getByXPath returns List<?>, so an unchecked cast is needed here
    List<HtmlElement> rows = (List<HtmlElement>) currentPage.getByXPath("//table/tbody/tr");
    for (HtmlElement row : rows) {
        HtmlElement yearCell = row.getFirstByXPath(".//td[@class='year']");
        HtmlElement winsCell = row.getFirstByXPath(".//td[@class='wins']");
        String year = yearCell.getTextContent().trim();
        String wins = winsCell.getTextContent().trim();
        totalWins += Integer.parseInt(wins);
        System.out.printf("%s - %s wins\n", year, wins);
    }

    // Check if there's a next page link
    HtmlAnchor nextLink = currentPage.getFirstByXPath("//a[@class='next']");
    if (nextLink == null) {
        break;
    }

    // If there is, click it to navigate to the next page
    currentPage = nextLink.click();
    pageNumber++;
}

System.out.printf("Total wins: %d\n", totalWins);
Step by step:
- We start by loading the forms page and locating the search form by its name, "hockey-form".
- We find the text input for the team name, fill it with "New York Rangers", and click the submit button to load the first page of results.
- We enter a loop that continues until there are no more "next page" links. Inside the loop:
  - We find all the table rows containing team data using an XPath selector.
  - For each row, we extract the year and number of wins, parse the wins as an integer, and add it to a running total.
  - We print out the year and wins for each row.
  - After processing all rows, we look for a "next page" link. If found, we click it to navigate to the next page. If not, we exit the loop.
- Finally, we print the total number of wins across all pages.
This example demonstrates how HtmlUnit can automate form interaction and handle pagination, two common challenges in web scraping.
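The earlier feature list also mentioned dropdowns and checkboxes. For completeness, here is a sketch of how those controls are driven, reusing the form object from the example above; the element names per_page and exact-match are hypothetical placeholders, not taken from scrapethissite.com:
// Requires: import com.gargoylesoftware.htmlunit.html.HtmlCheckBoxInput;
//           import com.gargoylesoftware.htmlunit.html.HtmlSelect;

// Hypothetical <select name="per_page">: select the option whose value is "100"
HtmlSelect perPage = form.getSelectByName("per_page");
perPage.setSelectedAttribute("100", true);

// Hypothetical <input type="checkbox" name="exact-match">: tick it before submitting
HtmlCheckBoxInput exactMatch = form.getInputByName("exact-match");
exactMatch.setChecked(true);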
Exporting Scraped Data
Scraping data is only half the battle – to be useful, it often needs to be saved in a structured format like CSV for further analysis. Let's modify our previous example to write the scraped data to a CSV file.
First, add the OpenCSV dependency to your pom.xml:
<dependency>
<groupId>com.opencsv</groupId>
<artifactId>opencsv</artifactId>
<version>5.5.2</version>
</dependency>
Then update the scraper code:
import com.opencsv.CSVWriter;

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// ... (scraping code remains the same)

// Write data to CSV
String csv = "data.csv";
try (CSVWriter writer = new CSVWriter(new FileWriter(csv))) {
    String[] header = {"Year", "Wins"};
    writer.writeNext(header);

    for (HtmlElement row : rows) {
        HtmlElement yearCell = row.getFirstByXPath(".//td[@class='year']");
        HtmlElement winsCell = row.getFirstByXPath(".//td[@class='wins']");
        String[] rowData = {yearCell.getTextContent().trim(), winsCell.getTextContent().trim()};
        writer.writeNext(rowData);
    }

    System.out.printf("Data written to %s\n", csv);
}
We create a CSVWriter instance pointing to a file named "data.csv". We write a header row with column names, then loop through the scraped rows, writing the year and wins data to the CSV. (As written, this captures the rows from a single results page; to export every page, open the writer before the pagination loop and call writeNext inside it.) The resulting file will look like:
Year,Wins
1926,0
1927,0
...
By exporting scraped data to a standard format, you can easily load it into other tools for analysis, visualization, or storage.
Best Practices and Troubleshooting
As you work with HtmlUnit, keep these best practices in mind:
- Use XPath and CSS selectors judiciously. Overly complex selectors can break if the page structure changes slightly. Where possible, rely on IDs, class names, and element attributes to target elements precisely.
- Limit concurrent requests to avoid overloading servers or triggering rate limiting. HtmlUnit doesn't automatically throttle requests, so add delays yourself (see the sketch after this list).
- Set timeouts appropriately, especially when JavaScript is enabled. Pages that load data asynchronously may need longer timeouts.
- Handle exceptions gracefully, particularly when scraping large numbers of pages. Log errors and continue scraping where possible rather than halting the entire program.
- Regularly monitor and maintain your scrapers. Websites change over time, so expect to update selectors and logic periodically.
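To make the throttling and timeout advice concrete, here is a minimal sketch assuming a configured WebClient named client; the 30-second timeout and one-second delay are illustrative values, not recommendations baked into HtmlUnit:
// Requires: import java.util.Arrays; import java.util.List;

// Cap how long HtmlUnit waits for any single HTTP response (milliseconds)
client.getOptions().setTimeout(30_000);

List<String> urls = Arrays.asList(
        "https://scrapethissite.com/pages/simple/",
        "https://scrapethissite.com/pages/forms/");

for (String url : urls) {
    HtmlPage page = client.getPage(url);
    // ... extract data from the page ...

    // Pause between requests so we don't hammer the server
    // (Thread.sleep throws InterruptedException, which must be handled or declared)
    Thread.sleep(1_000);
}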
Common issues you may encounter include:
- Inconsistent page loading: If HtmlUnit fetches a page before JavaScript has fully executed, you may get incomplete data. Increase timeouts or use explicit waits for specific elements to appear.
- CAPTCHAs and bot detection: Some sites use CAPTCHAs or other techniques to block scrapers. HtmlUnit can't solve CAPTCHAs automatically. For these sites, you may need to investigate alternate scraping approaches or use a CAPTCHA-solving service.
- Memory leaks: HtmlUnit can consume significant memory, especially when JavaScript is enabled. Call webClient.close() when you're done with a client instance to free resources (or use try-with-resources, as in the examples above).
- Slow performance: For large-scale scraping jobs, HtmlUnit may be slower than alternatives like Puppeteer or Scrapy. Consider optimizations like disabling images, CSS, and JavaScript where possible, as in the sketch below.
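For reference, here is a sketch of a WebClient tuned for speed and resilience. These are all real WebClientOptions setters; which ones you disable should depend on what the target pages actually need:
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);                        // skip CSS processing
client.getOptions().setJavaScriptEnabled(false);                 // skip script execution
client.getOptions().setDownloadImages(false);                    // never fetch images
client.getOptions().setThrowExceptionOnScriptError(false);       // log script errors instead of failing
client.getOptions().setThrowExceptionOnFailingStatusCode(false); // treat 4xx/5xx responses as pages, not exceptions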
By being aware of these challenges and following best practices, you can build robust, maintainable HtmlUnit scrapers.
Beyond HtmlUnit: Scaling Up
HtmlUnit is an excellent tool for scraping small to medium websites, but it has limitations. As your scraping needs grow, you may run into challenges like:
- JavaScript-heavy sites that HtmlUnit struggles to render correctly
- Anti-bot measures that block scrapers
- Rate limiting and CAPTCHAs
- Sheer scale of data to be scraped
For these situations, consider leveling up to a full browser automation tool like Selenium or Puppeteer, which provide more advanced features and flexibility. Alternatively, for large-scale scraping across many sites, a dedicated web scraping framework like Scrapy can simplify development and handle challenges like request throttling, retries, and proxy rotation out of the box.
Finally, if you prefer to focus on data analysis rather than scraper development, consider a web scraping API or no-code tool. These services handle the complexities of scraping behind the scenes, providing clean, structured data via a simple API.
Conclusion
In this guide, we've explored the power of HtmlUnit for web scraping, using scrapethissite.com as our training ground. We've learned how to:
- Set up a Java project with HtmlUnit
- Scrape static pages using CSS and XPath selectors
- Automate form submissions and handle pagination
- Extract and export structured data to CSV
- Apply best practices and troubleshoot common issues
- Evaluate HtmlUnit's limitations and alternatives for large-scale scraping
With this knowledge, you're well-equipped to tackle a wide variety of web scraping tasks using HtmlUnit. Remember, scraping is an art as much as a science – the key is to experiment, iterate, and learn from each website you encounter. Happy scraping!