Have you ever needed to extract data from websites that don't provide an API? Web scraping allows you to programmatically fetch the HTML of webpages and parse out the data you need. While this is possible to do from scratch in Java by crafting HTTP requests and parsing the raw HTML yourself, using a library like JSoup makes the process much easier.
In this in-depth tutorial, we'll cover everything you need to know to master HTML parsing in Java using the popular JSoup library. Whether you're new to web scraping or an experienced developer, this guide will equip you with the knowledge and code samples to extract data from any website. Let's dive in!
What Is HTML Parsing and Why Is It Useful?
HTML parsing refers to the process of analyzing the HTML source code of a webpage to extract structured data from it. This is a core component of web scraping – the automated retrieval of data from websites. HTML parsing allows you to take unstructured webpage data and convert it into a structured format like JSON or save it in a database.
There are many reasons you might need to parse HTML and scrape websites:
- Collecting data for analysis, such as product information, reviews, prices, etc.
- Building datasets for machine learning
- Monitoring websites for changes or new information
- Archiving web content
- Integrating data from sites that don‘t provide APIs
Whenever you need data that is available on websites but not accessible via pre-built APIs, HTML parsing and web scraping is the answer.
Introducing JSoup
JSoup is the most popular Java library for parsing HTML. It provides a convenient API for extracting and manipulating data using DOM traversal and CSS selectors. JSoup offers the following key features:
- Parse HTML from URLs, files, or strings
- Find and extract data using DOM traversal or CSS selectors
- Manipulate HTML elements, attributes, and text
- Clean user-submitted content against a safelist to prevent XSS attacks (a short example follows below)
- Output parsed HTML in pretty-print format
JSoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup. It's an excellent choice for scraping websites – let's see how to use it.
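As a quick taste of the cleaning feature mentioned above, here is a minimal sketch. It assumes JSoup 1.14+, where the safelist class is org.jsoup.safety.Safelist (older versions call it Whitelist); the HTML string is just an illustration:
String untrusted = "<p><a href='https://example.com/' onclick='stealCookies()'>Link</a></p>";
// Safelist.basic() allows simple text formatting and links and strips everything else,
// so the onclick handler is removed and rel="nofollow" is added to the link
String safe = Jsoup.clean(untrusted, Safelist.basic()); // requires org.jsoup.safety.Safelist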
Installing and Setting Up JSoup
JSoup is available on Maven Central. To use it in your project, add the following dependency to your Maven POM file:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
Or if using Gradle:
implementation 'org.jsoup:jsoup:1.14.3'
Then import JSoup into your Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Now you're ready to start parsing HTML!
Connecting to Webpages and Parsing HTML
The first step is connecting to a webpage and fetching the HTML. JSoup provides a convenient connect() method that allows you to simply pass in a URL:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
This sends an HTTP request to the specified URL, parses the response HTML, and returns a Document object that you can then query to extract data.
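In practice, many sites reject requests that look like bots or respond slowly, so it is often worth setting a user agent and timeout on the connection. A minimal sketch (the user-agent string below is just an illustration):
Document doc = Jsoup.connect("https://en.wikipedia.org/")
    .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)") // illustrative user-agent string
    .timeout(10 * 1000)                                   // give up after 10 seconds
    .get();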
The Document object represents the entire HTML of the webpage as a nested data structure. You can also parse HTML from a string or file:
// Parse HTML from a string
Document docFromString = Jsoup.parse("<html><head><title>Example</title></head>"
    + "<body><p>Hello world!</p></body></html>");
// Parse HTML from a local file, specifying its character set (requires java.io.File)
Document docFromFile = Jsoup.parse(new File("example.html"), "UTF-8");
Once you have a Document, you can start selecting elements and extracting data.
Selecting HTML Elements Using CSS Selectors
JSoup selects elements using a CSS selector syntax similar to jQuery's, which should feel familiar if you've used either before.
Element content = doc.getElementById("content");
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.r > a");
The main lookup methods are:
- getElementById(String id) – find an element by its ID
- getElementsByTag(String tag) – find elements by tag name
- getElementsByClass(String className) – find elements by class name
- select(String selector) – find elements matching a CSS selector
Examples:
| Selector | Description |
| --- | --- |
| * | any element |
| tag | elements with this tag name |
| #id | elements with the ID "id" |
| .class | elements with the class "class" |
| [attr] | elements with an attribute named "attr" |
| [attr=val] | elements with attribute "attr" equal to "val" |
| [attr^=val] | elements with attribute "attr" starting with "val" |
| [attr$=val] | elements with attribute "attr" ending with "val" |
| [attr*=val] | elements with attribute "attr" containing "val" |
| parent child | child elements that descend from parent |
| siblingA + siblingB | sibling B immediately preceded by sibling A |
| siblingA ~ siblingX | sibling X preceded by sibling A |
There are many additional CSS selectors that JSoup supports – check out the JSoup selector documentation for a complete list.
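Putting a few of these patterns together, using the doc parsed earlier (the selectors below assume illustrative markup, so adjust them to the site you are actually scraping):
// Links whose href starts with "http" (i.e. absolute URLs)
Elements externalLinks = doc.select("a[href^=http]");
// PNG images anywhere inside a div with class "gallery"
Elements galleryPngs = doc.select("div.gallery img[src$=.png]");
// The first <h2> that immediately follows an <h1>
Element firstSubheading = doc.select("h1 + h2").first();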
Extracting Data from Selected Elements
Once you have selected elements, you can extract data from them using these methods:
String text = el.text(); // get the text content
String htmlText = el.html(); // get the inner HTML
String attr = el.attr("href"); // get an attribute value
String id = el.id(); // get the element ID
Elements children = el.children(); // get child elements
Element parent = el.parent(); // get the parent element
Elements siblings = el.siblingElements(); // get sibling elements
You can use these methods in combination with methods like first(), last(), and eq(int index) to extract text and attributes from elements matching your selectors:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
String pageTitle = doc.title(); // get the page title
String firstParagraph = doc.select("p").first().text(); // get text of first <p>
String imageSource = doc.select("img").first().attr("src"); // get src attr of first <img>
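These methods really shine when you loop over an Elements collection. Continuing with the doc fetched above, the sketch below lists every link on the page; absUrl("href") resolves relative URLs against the page's base URI:
for (Element link : doc.select("a[href]")) {
    String text = link.text();        // the visible link text
    String url = link.absUrl("href"); // the absolute URL, resolved against the base URI
    System.out.println(text + " -> " + url);
}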
Handling Pagination and Iterating Through Pages
Most websites that we want to scrape have content spread across multiple pages. There are two approaches to handle pagination:
- If the data you need exists on pages with predictable URLs (like site.com/page/1, site.com/page/2, etc.), you can use a simple loop to increment the page number and scrape each page:
String baseUrl = "https://example.com/page/";
for (int i = 1; i <= 10; i++) {
    String url = baseUrl + i;
    System.out.println("Scraping page " + i);
    Document doc = Jsoup.connect(url).get();
    // Select and extract data here...
}
- If the pagination URLs are not predictable or are implemented using POST parameters, query strings, or JavaScript, then you will need to select the "next page" link or button from the page itself and load that URL to scrape the next page. This can be done in a while loop until no "next page" link is found:
Document doc = Jsoup.connect(startUrl).get();
while (true) {
    // Select and extract data from the current page here...

    // Check if there is a next page link
    Element nextLink = doc.select("a:contains(Next)").first();
    if (nextLink == null) {
        break; // Exit loop if no more pages
    }
    String nextUrl = nextLink.absUrl("href");
    doc = Jsoup.connect(nextUrl).get(); // Load next page
}
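One caveat with following "next" links: some sites also link back to earlier pages, which can trap the loop in a cycle. A simple guard (a general pattern, not something JSoup provides) is to remember the URLs you have already visited:
Set<String> visited = new HashSet<>(); // requires java.util.Set and java.util.HashSet
Document doc = Jsoup.connect(startUrl).get();
while (true) {
    // Select and extract data from the current page here...
    Element nextLink = doc.select("a:contains(Next)").first();
    if (nextLink == null) {
        break; // no more pages
    }
    String nextUrl = nextLink.absUrl("href");
    if (!visited.add(nextUrl)) {
        break; // already scraped this URL, stop to avoid an infinite loop
    }
    doc = Jsoup.connect(nextUrl).get();
}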
Submitting Forms and Handling Logins
Some pages that you want to scrape may require submitting a form to access, such as performing a search or logging in. JSoup can handle form submission like this:
Document doc = Jsoup.connect("https://example.com/login")
    .data("username", "jsmith")
    .data("password", "secret123")
    .post();
The data() method sets the form data parameters to submit. You can inspect the form in your browser's developer tools to find the names of the input fields. Then just call post() to submit the form and parse the response.
For pages behind logins, look for login forms and submit the credentials programmatically like this before accessing the restricted pages.
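Logging in usually sets session cookies that must be sent with every later request. Here is a minimal sketch of that flow, assuming a hypothetical site with a login form at /login and a protected page at /account:
// Execute the login POST and keep the response so we can read its cookies
// (requires org.jsoup.Connection and java.util.Map)
Connection.Response loginResponse = Jsoup.connect("https://example.com/login")
    .data("username", "jsmith")
    .data("password", "secret123")
    .method(Connection.Method.POST)
    .execute();

// Re-send the session cookies when requesting pages behind the login
Map<String, String> cookies = loginResponse.cookies();
Document accountPage = Jsoup.connect("https://example.com/account")
    .cookies(cookies)
    .get();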
Limitations of JSoup and Alternatives
While JSoup is great for many scraping needs, it does have some limitations:
- It cannot execute JavaScript, so content loaded dynamically via JS will not be accessible
- It is an HTML parser with a basic HTTP client, not a browser – it can set headers and cookies, but it cannot click elements, scroll, or wait for asynchronous content to load
For scraping needs that require JavaScript rendering and fuller browser control, you can look into driving a headless browser with Selenium (which has Java bindings) or Puppeteer (a Node.js tool). These tools run an actual browser behind the scenes.
Tips and Best Practices
Here are some tips for effective and responsible scraping with JSoup (several of them come together in the sketch after this list):
- Always check a site's robots.txt and terms of service before scraping
- Insert delays between requests to avoid overloading servers
- Set the user-agent header to something descriptive
- Cache pages locally if you need to re-run parsing
- Use minimal selectors for faster parsing
- Handle errors gracefully and log failures
- Monitor scrapers and adapt to changes in site structures
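A rough sketch of a fetch loop that combines several of these tips; the URL, user-agent string, and two-second delay are all assumptions to adapt to your own project:
String baseUrl = "https://example.com/page/"; // illustrative site
for (int i = 1; i <= 10; i++) {
    try {
        Document doc = Jsoup.connect(baseUrl + i)
            .userAgent("MyScraperBot/1.0 (+https://example.com/bot-info)") // descriptive user agent
            .timeout(10 * 1000)
            .get();
        // Select and extract data here...
        Thread.sleep(2000); // polite delay between requests (tune to the site)
    } catch (IOException | InterruptedException e) { // requires java.io.IOException
        System.err.println("Failed on page " + i + ": " + e.getMessage()); // log the failure and keep going
    }
}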
Conclusion
You should now be fully equipped to tackle HTML parsing and web scraping using JSoup in your Java projects! We've covered all the fundamentals:
- Connecting to URLs and parsing HTML responses
- Using CSS selectors to find the elements you need
- Extracting text, attributes, and child elements
- Paginating across pages
- Submitting forms and handling logins
- Limitations and best practices
So what are you waiting for? Go forth and scrape the web! Let me know in the comments what awesome projects you build with your newfound knowledge.