Web scraping is a powerful technique that allows you to extract data from websites programmatically. While Python is a popular language for web scraping, Java is also a great option thanks to its performance, strong typing, and extensive ecosystem of libraries. One standout Java library for web scraping is Jaunt.
Jaunt is a fast, lightweight Java library that makes web scraping easy with a simple, intuitive API. Compared to other web scraping solutions, Jaunt has several benefits:
- Single JAR file with no dependencies for easy setup
- Automatically handles cookies, redirects, form submission
- Parses HTML, XML, and JSON, including responses from REST and AJAX endpoints
- Works in headless mode without a GUI

Note that Jaunt does not execute JavaScript, so heavily client-side-rendered single page apps are better handled with a browser-driving tool such as Selenium (more on alternatives in the conclusion). The pages we scrape in this tutorial are rendered on the server, so Jaunt alone is enough.
In this tutorial, you'll learn how to install Jaunt and use it to scrape both static websites and dynamic multi-page web apps. We'll cover basic techniques like extracting elements using CSS selectors, as well as more advanced topics like form handling, navigation between pages, and pagination.
By the end, you'll be equipped to scrape a wide variety of websites using Java and Jaunt. Let's get started!
Prerequisites
To follow along with this tutorial, you'll need to have Java installed. This tutorial uses Java 8, but Jaunt works with Java 7+.
You'll also need to download the Jaunt JAR file. Head over to the official download page and grab the Core API package.
Once you have the jaunt-1.6.1.jar file (or similar), place it on your Java project's classpath. In IntelliJ, right-click the project, select "Open Module Settings", and add the Jaunt JAR under the Dependencies tab; in Eclipse, use Build Path > Configure Build Path > Add External JARs.
Now you're ready to start using Jaunt for web scraping in Java!
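If you want to confirm the JAR is wired up before moving on, a quick smoke test like the one below should compile and print a page title. This is just a sketch; the class name and target URL are arbitrary placeholders.

import com.jaunt.*;

public class JauntSmokeTest {
    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        try {
            // Fetch a simple page and print its <title> text
            userAgent.visit("https://example.com");
            System.out.println(userAgent.doc.findFirst("<title>").getTextContent());
        } catch(JauntException e) {
            System.err.println(e);
        }
    }
}

If it prints the page title, Jaunt is on your classpath and able to fetch pages.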
Basic Web Scraping
Let's start with a basic example of using Jaunt to scrape a static website – the Wikipedia page on web scraping. We'll scrape the page title and the list of links in the References section.
Here's the full code:
import com.jaunt.*;

public class WikiScraper {
    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        try {
            userAgent.visit("https://en.wikipedia.org/wiki/Web_scraping");

            String title = userAgent.doc.findFirst("<h1 id=\"firstHeading\">").getTextContent();
            System.out.println("Page title: " + title);

            Element referencesHeading = userAgent.doc.findFirst("<span class=\"mw-headline\" id=\"References\">");
            Element referencesList = referencesHeading.findNextSibling("<ul>");
            Elements links = referencesList.findEvery("<a>");

            System.out.println("References:");
            for(Element link : links) {
                System.out.println(" - " + link.getAt("href"));
            }
        } catch(JauntException e) {
            System.err.println(e);
        }
    }
}
First we create a new UserAgent object. This is the core class in Jaunt that allows us to interact with web pages.

We then use the visit() method to navigate to the Wikipedia URL. This fetches the page content, which is stored in userAgent.doc.

To extract the page title, we use the findFirst() method to search for the first h1 tag with the ID "firstHeading". This returns an Element, and we retrieve its text content using getTextContent().

Next, to get the references, we first locate the "References" heading by using findFirst() to search for a span with a specific class and ID. To get the actual list of references, we find the next sibling ul element using findNextSibling().

Finally, we extract all the a elements from the references list using findEvery(). This returns an Elements collection, which we iterate over, printing the "href" attribute of each link.
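One behavior worth knowing: when a query matches nothing, Jaunt's find methods throw an exception rather than returning null (the NotFound type, a subclass of JauntException, as I understand the API). If a section may or may not exist on a page, you can guard that particular lookup inside the existing try block, along these lines (the External_links ID below is just an illustrative guess at the page's markup):

try {
    // Guard a lookup for a section that may not exist on every article
    Element extLinks = userAgent.doc.findFirst("<span class=\"mw-headline\" id=\"External_links\">");
    System.out.println("Found section: " + extLinks.getTextContent());
} catch(NotFound e) {
    // findFirst() throws instead of returning null when nothing matches
    System.out.println("No External links section on this page");
}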
The output looks like:
Page title: Web scraping
References:
- https://en.wikipedia.org/wiki/Web_crawler
- https://en.wikipedia.org/wiki/Robert_Baumgartner
- https://en.wikipedia.org/wiki/Bytescout
- https://web.archive.org/web/20220216162945/http://cdn.aiindex.org/2021/AI-Index-2021-Chapter-6.pdf
- https://en.wikipedia.org/wiki/File:AIIndex_Survey_Web_Scraping.jpg
- https://en.wikipedia.org/wiki/ISBN_(identifier)
...
As you can see, with just a few lines of code we were able to extract some key pieces of data from the Wikipedia page. Jaunt makes web scraping in Java straightforward by providing a convenient API to find and extract elements from the page DOM.
Now let's look at a more advanced example that involves interacting with dynamic page elements.
Advanced Web Scraping
For this example, we'll scrape hockey team stats from this demo page. The page has a search form that allows filtering the teams table, as well as pagination links to navigate through the search results.
Here's the code to scrape the first two pages of "New" teams:
import com.jaunt.*;
import java.io.FileWriter;

public class HockeyStats {
    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        try {
            // Navigate to search page
            userAgent.visit("https://scrapethissite.com");
            userAgent.doc.findFirst("<a href=\"/pages/forms/\">").click();

            // Submit search form
            userAgent.doc.filloutField("q", "New");
            userAgent.doc.findFirst("<input type='submit' value='Search'>").click();

            // Scrape first page
            scrapeTable(userAgent.doc, "new-teams-page1.csv");

            // Find the "2" pagination link, navigate to it, and scrape the second page
            Element nextLink = null;
            for(Element link : userAgent.doc.findEvery("<a>")) {
                if(link.getTextContent().trim().equals("2")) {
                    nextLink = link;
                    break;
                }
            }
            if(nextLink != null) {
                nextLink.click();
                scrapeTable(userAgent.doc, "new-teams-page2.csv");
            }
        } catch(Exception e) {
            System.err.println(e);
        }
    }

    private static void scrapeTable(Document doc, String filepath) throws Exception {
        Element table = doc.findFirst("<table>");
        FileWriter csv = new FileWriter(filepath);
        Elements rows = table.findEach("<tr>");
        for(Element row : rows) {
            String line = "";
            Elements cells = row.findEach("<td>, <th>");
            for(Element cell : cells) {
                line += cell.getTextContent() + ",";
            }
            line = line.replaceAll(",$", "");
            csv.write(line + "\n");
        }
        csv.close();
    }
}
This example builds on the concepts from the basic scraper, with a few new techniques:

First, we navigate through a series of pages to get from the home page to the search form, using a combination of the visit() and click() methods.
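Clicking through from the home page mirrors what a user would do, but if you already know the target URL you can skip a request and visit the form page directly. The two navigation lines above could be replaced with a single call (same URL as the link we clicked):

// Shortcut: go straight to the search form instead of clicking through from the home page
userAgent.visit("https://scrapethissite.com/pages/forms/");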
To submit the search form, we use the filloutField() method to enter "New" in the search box. We then find the submit button using an attribute selector, <input type='submit' value='Search'>, and click() it.
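Before handing the page off to the scrapeTable helper described next, you could sanity-check that the filter actually applied by counting the rows in the results table. A small sketch that reuses the same findFirst()/findEvery() calls from earlier:

// Quick sanity check: count <tr> elements in the filtered results table
int rowCount = 0;
for(Element row : userAgent.doc.findFirst("<table>").findEvery("<tr>")) {
    rowCount++;
}
System.out.println("Rows on this page (including the header row): " + rowCount);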
At this point, the hockey teams table is filtered to only those teams with "New" in the name. So we're ready to scrape the data, which we've broken out into a scrapeTable helper function.

The scrapeTable function takes the current Document object (representing the page DOM) and a filepath to write out a CSV file. It starts by locating the <table> element, then iterates through each <tr>, extracting the text content of the <th> and <td> elements, which it writes to the CSV.
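One caveat with the simple comma join: if a cell's text itself contains a comma or a quote, the CSV columns will shift. A hedged improvement is to quote each field before joining; the inner loop of scrapeTable could be rewritten along these lines:

StringBuilder line = new StringBuilder();
for(Element cell : cells) {
    // Escape embedded quotes and wrap each field so commas inside cells don't break columns
    String text = cell.getTextContent().trim().replace("\"", "\"\"");
    if(line.length() > 0) {
        line.append(",");
    }
    line.append("\"").append(text).append("\"");
}
csv.write(line.toString() + "\n");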
After scraping the first page, we locate the "2" pagination link by iterating over the page's a elements with findEvery() and matching the link whose text content is "2". We then click() this link to navigate to the second page of results. Finally, we call our scrapeTable helper again to extract data from the second page.
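The same idea extends to any number of result pages: instead of hard-coding page 2, you could walk the numbered pagination links in a loop until no higher page number is found. A sketch of that generalization (the two-page version above is what produces the output shown next; the file-naming scheme is made up for illustration):

// Scrape every numbered results page by following the "2", "3", ... pagination links
int page = 2;
while(true) {
    Element pageLink = null;
    for(Element link : userAgent.doc.findEvery("<a>")) {
        if(link.getTextContent().trim().equals(String.valueOf(page))) {
            pageLink = link;
            break;
        }
    }
    if(pageLink == null) {
        break; // no more numbered pages
    }
    pageLink.click();
    scrapeTable(userAgent.doc, "new-teams-page" + page + ".csv");
    page++;
}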
Running this code produces two CSV files with the scraped data:
new-teams-page1.csv:
Team Name,Year,Wins,Losses,Win %,...
New Jersey Devils,2019,28,29,0.4,...
New York Islanders,2019,35,23,0.5,...
New York Rangers,2019,37,28,0.45,...
new-teams-page2.csv:
Team Name,Year,Wins,Losses,Win %,...
New York Rangers,2012,51,24,0.637
New Jersey Devils,2011,48,28,0.6
New York Islanders,2011,34,37,0.425
...
As you can see, Jaunt allows you to scrape even dynamic, multi-page websites with ease. The API provides methods to interact with page elements like form fields and links, and extracting data is straightforward using DOM methods like findFirst and findEvery.
Conclusion
Web scraping is an essential tool for data collection, and Jaunt makes it simple to scrape websites using Java. With an intuitive API, built-in browser emulation, and powerful DOM querying capabilities, Jaunt can handle a wide range of scraping tasks from basic static pages to complex dynamic web apps.
In this tutorial, we covered installing Jaunt and writing a basic Wikipedia page scraper to introduce key concepts like the UserAgent, the Document, and DOM querying with findFirst()/findEvery(). We then built a more advanced hockey stats scraper to demonstrate form handling, pagination, and outputting data to CSV.
There are some cases where Jaunt may not be the best fit, such as websites that are heavily protected against scraping or require logged-in sessions. In those situations, you may need to look at alternatives like Selenium WebDriver or a pre-built web scraping API like ScrapingBee.
But for a wide range of scraping needs, Jaunt is an excellent choice that balances ease of use with the performance and power of Java. I encourage you to try it out on your next web scraping project.
You can find the full code samples from this tutorial on GitHub. Happy scraping!