Have you ever needed to extract data from websites that don't provide an API? Web scraping allows you to programmatically fetch the HTML of webpages and parse out the data you need. While this is possible to do from scratch in Java by crafting HTTP requests and parsing the raw HTML yourself, using a library like JSoup makes the process much easier.
In this in-depth tutorial, we'll cover everything you need to know to master HTML parsing in Java using the popular JSoup library. Whether you're new to web scraping or an experienced developer, this guide will equip you with the knowledge and code samples to extract data from any website. Let's dive in!
What Is HTML Parsing and Why Is It Useful?
HTML parsing refers to the process of analyzing the HTML source code of a webpage to extract structured data from it. This is a core component of web scraping – the automated retrieval of data from websites. HTML parsing allows you to take unstructured webpage data and convert it into a structured format like JSON or save it in a database.
There are many reasons you might need to parse HTML and scrape websites:
- Collecting data for analysis, such as product information, reviews, prices, etc.
- Building datasets for machine learning
- Monitoring websites for changes or new information
- Archiving web content
- Integrating data from sites that don‘t provide APIs
Whenever you need data that is available on websites but not accessible via pre-built APIs, HTML parsing and web scraping is the answer.
Introducing JSoup
JSoup is the most popular Java library for parsing HTML. It provides a convenient API for extracting and manipulating data using DOM traversal and CSS selectors. JSoup offers the following key features:
- Parse HTML from URLs, files, or strings
- Find and extract data using DOM traversal or CSS selectors
- Manipulate HTML elements, attributes, and text
- Clean user-submitted content against a safelist to prevent XSS attacks (a short example follows below)
- Output parsed HTML in pretty-print format
JSoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup. It's an excellent choice for scraping websites – let's see how to use it.
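As a quick taste of the cleaning feature mentioned above, here is a minimal sketch. It assumes JSoup 1.14+, where the safelist class is org.jsoup.safety.Safelist (older versions call it Whitelist); the HTML string is just an illustration:
String untrusted = "<p><a href='https://example.com/' onclick='stealCookies()'>Link</a></p>";
// Safelist.basic() allows simple text formatting and links and strips everything else,
// so the onclick handler is removed and rel="nofollow" is added to the link
String safe = Jsoup.clean(untrusted, Safelist.basic()); // requires org.jsoup.safety.Safelist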
Installing and Setting Up JSoup
JSoup is available on Maven Central. To use it in your project, add the following dependency to your Maven POM file:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
Or if using Gradle:
implementation 'org.jsoup:jsoup:1.14.3'
Then import JSoup into your Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Now you're ready to start parsing HTML!
Connecting to Webpages and Parsing HTML
The first step is connecting to a webpage and fetching the HTML. JSoup provides a convenient connect() method that allows you to simply pass in a URL:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
This sends an HTTP request to the specified URL, parses the response HTML, and returns a Document object that you can then query to extract data.
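In practice, many sites reject requests that look like bots or respond slowly, so it is often worth setting a user agent and timeout on the connection. A minimal sketch (the user-agent string below is just an illustration):
Document doc = Jsoup.connect("https://en.wikipedia.org/")
    .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)") // illustrative user-agent string
    .timeout(10 * 1000)                                   // give up after 10 seconds
    .get();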
The Document object represents the entire HTML of the webpage as a nested data structure. You can also parse HTML from a string or file:
// Parse HTML from a string
Document docFromString = Jsoup.parse("<html><head><title>Example</title></head>"
    + "<body><p>Hello world!</p></body></html>");
// Parse HTML from a local file, specifying its character set (requires java.io.File)
Document docFromFile = Jsoup.parse(new File("example.html"), "UTF-8");
Once you have a Document, you can start selecting elements and extracting data.
Selecting HTML Elements Using CSS Selectors
JSoup selects elements using a CSS selector syntax similar to jQuery's, which should feel familiar if you've used either before.
Element content = doc.getElementById("content");
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.r > a");
The main lookup methods are:
- getElementById(String id) – find an element by its ID
- getElementsByTag(String tag) – find elements by tag name
- getElementsByClass(String className) – find elements by class name
- select(String selector) – find elements matching a CSS selector
Examples:
| Selector | Description |
| --- | --- |
| * | any element |
| tag | elements with this tag name |
| #id | elements with the ID "id" |
| .class | elements with the class "class" |
| [attr] | elements with an attribute named "attr" |
| [attr=val] | elements with attribute "attr" equal to "val" |
| [attr^=val] | elements with attribute "attr" starting with "val" |
| [attr$=val] | elements with attribute "attr" ending with "val" |
| [attr*=val] | elements with attribute "attr" containing "val" |
| parent child | child elements that descend from parent |
| siblingA + siblingB | sibling B immediately preceded by sibling A |
| siblingA ~ siblingX | sibling X preceded by sibling A |
There are many additional CSS selectors that JSoup supports – check out the JSoup selector documentation for a complete list.
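Putting a few of these patterns together, using the doc parsed earlier (the selectors below assume illustrative markup, so adjust them to the site you are actually scraping):
// Links whose href starts with "http" (i.e. absolute URLs)
Elements externalLinks = doc.select("a[href^=http]");
// PNG images anywhere inside a div with class "gallery"
Elements galleryPngs = doc.select("div.gallery img[src$=.png]");
// The first <h2> that immediately follows an <h1>
Element firstSubheading = doc.select("h1 + h2").first();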
Extracting Data from Selected Elements
Once you have selected elements, you can extract data from them using these methods:
String text = el.text(); // get the text content
String htmlText = el.html(); // get the inner HTML
String attr = el.attr("href"); // get an attribute value
String id = el.id(); // get the element ID
Elements children = el.children(); // get child elements
Element parent = el.parent(); // get the parent element
Elements siblings = el.siblingElements(); // get sibling elements
You can use these methods in combination with methods like first(), last(), and eq(int index) to extract text and attributes from elements matching your selectors:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
String pageTitle = doc.title(); // get the page title
String firstParagraph = doc.select("p").first().text(); // get text of first <p>
String imageSource = doc.select("img").first().attr("src"); // get src attr of first <img>
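These methods really shine when you loop over an Elements collection. Continuing with the doc fetched above, the sketch below lists every link on the page; absUrl("href") resolves relative URLs against the page's base URI:
for (Element link : doc.select("a[href]")) {
    String text = link.text();        // the visible link text
    String url = link.absUrl("href"); // the absolute URL, resolved against the base URI
    System.out.println(text + " -> " + url);
}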
Handling Pagination and Iterating Through Pages
Most websites that we want to scrape have content spread across multiple pages. There are two approaches to handle pagination:
- If the data you need exists on pages with predictable URLs (like site.com/page/1, site.com/page/2, etc.), you can use a simple loop to increment the page number and scrape each page:
String baseUrl = "https://example.com/page/";
for (int i = 1; i <= 10; i++) {
    String url = baseUrl + i;
    System.out.println("Scraping page " + i);
    Document doc = Jsoup.connect(url).get();
    // Select and extract data here...
}
- If the pagination URLs are not predictable or are implemented using POST parameters, query strings, or JavaScript, then you will need to select the "next page" link or button from the page itself and load that URL to scrape the next page. This can be done in a while loop until no "next page" link is found:
Document doc = Jsoup.connect(startUrl).get();
while (true) {
    // Select and extract data from the current page here...

    // Check if there is a next page link
    Element nextLink = doc.select("a:contains(Next)").first();
    if (nextLink == null) {
        break; // Exit loop if no more pages
    }
    String nextUrl = nextLink.absUrl("href");
    doc = Jsoup.connect(nextUrl).get(); // Load next page
}
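One caveat with following "next" links: some sites also link back to earlier pages, which can trap the loop in a cycle. A simple guard (a general pattern, not something JSoup provides) is to remember the URLs you have already visited:
Set<String> visited = new HashSet<>(); // requires java.util.Set and java.util.HashSet
Document doc = Jsoup.connect(startUrl).get();
while (true) {
    // Select and extract data from the current page here...
    Element nextLink = doc.select("a:contains(Next)").first();
    if (nextLink == null) {
        break; // no more pages
    }
    String nextUrl = nextLink.absUrl("href");
    if (!visited.add(nextUrl)) {
        break; // already scraped this URL, stop to avoid an infinite loop
    }
    doc = Jsoup.connect(nextUrl).get();
}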
Submitting Forms and Handling Logins
Some pages that you want to scrape may require submitting a form to access, such as performing a search or logging in. JSoup can handle form submission like this:
Document doc = Jsoup.connect("https://example.com/login")
    .data("username", "jsmith")
    .data("password", "secret123")
    .post();
The data() method sets the form data parameters to submit. You can inspect the form in your browser's developer tools to find the names of the input fields. Then just call post() to submit the form and parse the response.
For pages behind logins, look for login forms and submit the credentials programmatically like this before accessing the restricted pages.
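Logging in usually sets session cookies that must be sent with every later request. Here is a minimal sketch of that flow, assuming a hypothetical site with a login form at /login and a protected page at /account:
// Execute the login POST and keep the response so we can read its cookies
// (requires org.jsoup.Connection and java.util.Map)
Connection.Response loginResponse = Jsoup.connect("https://example.com/login")
    .data("username", "jsmith")
    .data("password", "secret123")
    .method(Connection.Method.POST)
    .execute();

// Re-send the session cookies when requesting pages behind the login
Map<String, String> cookies = loginResponse.cookies();
Document accountPage = Jsoup.connect("https://example.com/account")
    .cookies(cookies)
    .get();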
Limitations of JSoup and Alternatives
While JSoup is great for many scraping needs, it does have some limitations:
- It cannot execute JavaScript, so content loaded dynamically via JS will not be accessible
- It is an HTML parser with a basic HTTP client, not a browser – it can set headers and cookies, but it cannot click elements, scroll, or wait for asynchronous content to load
For scraping needs that require JavaScript rendering and fuller browser control, you can look into driving a headless browser with Selenium (which has Java bindings) or Puppeteer (a Node.js tool). These tools run an actual browser behind the scenes.
Tips and Best Practices
Here are some tips for effective and responsible scraping with JSoup (several of them come together in the sketch after this list):
- Always check a site's robots.txt and terms of service before scraping
- Insert delays between requests to avoid overloading servers
- Set the user-agent header to something descriptive
- Cache pages locally if you need to re-run parsing
- Use minimal selectors for faster parsing
- Handle errors gracefully and log failures
- Monitor scrapers and adapt to changes in site structures
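A rough sketch of a fetch loop that combines several of these tips; the URL, user-agent string, and two-second delay are all assumptions to adapt to your own project:
String baseUrl = "https://example.com/page/"; // illustrative site
for (int i = 1; i <= 10; i++) {
    try {
        Document doc = Jsoup.connect(baseUrl + i)
            .userAgent("MyScraperBot/1.0 (+https://example.com/bot-info)") // descriptive user agent
            .timeout(10 * 1000)
            .get();
        // Select and extract data here...
        Thread.sleep(2000); // polite delay between requests (tune to the site)
    } catch (IOException | InterruptedException e) { // requires java.io.IOException
        System.err.println("Failed on page " + i + ": " + e.getMessage()); // log the failure and keep going
    }
}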
Conclusion
You should now be fully equipped to tackle HTML parsing and web scraping using JSoup in your Java projects! We've covered all the fundamentals:
- Connecting to URLs and parsing HTML responses
- Using CSS selectors to find the elements you need
- Extracting text, attributes, and child elements
- Paginating across pages
- Submitting forms and handling logins
- Limitations and best practices
So what are you waiting for? Go forth and scrape the web! Let me know in the comments what awesome projects you build with your newfound knowledge.