
Introduction to Web Scraping With Java

Web scraping is the process of extracting data from websites in an automated way. It's an incredibly powerful technique that enables gathering information from sites that don't provide an official API. Whether you need to collect product listings, real estate data, financial information, or any other web-based data at scale, web scraping allows you to do that.

In this guide, we'll walk through how to perform web scraping using Java. Java is a great language choice for web scraping because of its strong typing, multithreading capabilities, and rich ecosystem of open source libraries. We'll cover the basics of web scraping, the tools you'll need, and provide a detailed example of scraping classified ads from Craigslist. Let's jump in!

What You'll Need

To follow along with the code examples, make sure you have the following:

  • Java Development Kit (JDK) version 8+
  • A Java IDE like Eclipse, IntelliJ, or NetBeans
  • HtmlUnit and Jackson libraries added to your project (a sample Maven configuration is shown below)
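If you use Maven, the two libraries can be declared as dependencies like this (the version numbers are illustrative, so check Maven Central for current releases; also note that HtmlUnit 3.x moved from the net.sourceforge.htmlunit group ID shown here to org.htmlunit):

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.70.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.2</version>
</dependency>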

Web Scraping 101

At a high level, web scraping involves the following steps:

  1. Fetching the HTML source of the target web page
  2. Parsing the HTML to extract the desired data
  3. Storing the extracted data in a structured format

Fetching the page source is typically done using an HTTP client to send a GET request to the URL of the page you want to scrape. Once you have the HTML, you need to parse it to locate the specific data you're interested in within the page's structure. This is commonly done using techniques like CSS selectors or XPath to query elements.

Finally, with the raw data extracted, you'll likely want to save it in a useful format like JSON or CSV for further processing and analysis. The scraping process can then be applied across many pages by following links.

Now that we understand the basic steps, let's apply them to a concrete example.

Scraping Craigslist Ads

To demonstrate web scraping in Java, we'll write a program to extract data from Craigslist, a popular classifieds website with sections dedicated to jobs, housing, for sale, services, and more.

The goal will be to scrape the listings for a particular search query in a given city and output key data points like:

  • Posting title
  • Price
  • URL to full listing

We'll break this down step by step.

Finding the Right Page to Scrape

First, we need to determine the URL of the page containing the listings we want. Let's say we're interested in iPhone 13 listings in New York. We can use Craigslist's search and drill down to:

https://newyork.craigslist.org/search/sss?query=iphone+13

This search results page will be the target for our scraper.
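Note that the space in "iphone 13" is encoded as a + in the query string. If you construct search URLs in code, java.net.URLEncoder takes care of this; a minimal sketch:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

String query = "iphone 13";
// URLEncoder encodes the space as "+", matching the URL above
String searchUrl = "https://newyork.craigslist.org/search/sss?query="
        + URLEncoder.encode(query, "UTF-8");

(URLEncoder.encode declares the checked UnsupportedEncodingException, so catch or declare it.)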

Inspecting the Page Structure

Next, we need to analyze the HTML structure of the search results page to determine how to locate the data we want.

Using the browser's developer tools, we can see that each result row is contained in a <li class="result-row"> element. Within those elements, the data we need is structured like:

<a href="https://newyork.craigslist.org/mnh/mob/d/new-york-apple-iphone-13-pro-max-128gb/7494493246.html" class="result-title hdrlnk">Apple iPhone 13 Pro Max - 128GB - Silver (Unlocked)</a>

<span class="result-price">$899</span>

With this knowledge, we're ready to start writing our Java scraper.

Fetching the HTML

To fetch the HTML, we'll use the HtmlUnit library, which provides a simple API for programmatically interacting with web pages. First, define a WebClient:

WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);

This configures an HtmlUnit client with CSS and JavaScript disabled for faster performance since we only need the raw HTML.
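WebClientOptions has other settings worth knowing about. For example, you can set a request timeout and stop HtmlUnit from throwing exceptions on failing status codes or page script errors (the values here are illustrative):

// Give up on requests that take longer than 10 seconds
client.getOptions().setTimeout(10000);
// Don't throw on HTTP error responses or JavaScript errors in the page
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setThrowExceptionOnScriptError(false);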

Next, construct the URL and fetch the page:

String searchUrl = "https://newyork.craigslist.org/search/sss?query=iphone+13";
HtmlPage page = client.getPage(searchUrl);

The getPage method performs an HTTP request and returns an HtmlPage object containing the page source, which we can parse to extract data.
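Note that getPage declares IOException, and a WebClient holds resources that should be released when you're finished. In recent HtmlUnit versions WebClient implements AutoCloseable, so try-with-resources is a tidy pattern; a sketch:

import java.io.IOException;

try (WebClient client = new WebClient()) {
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(false);

    HtmlPage page = client.getPage(searchUrl);
    // ... parse the page here ...
} catch (IOException e) {
    e.printStackTrace();
}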

Parsing and Extracting Data

With the fetched HTML, we can now parse out the data for each result row using XPath:

List<DomNode> itemNodes = page.getByXPath("//li[@class='result-row']");

for (DomNode itemNode : itemNodes) {
    HtmlAnchor titleAnchor = itemNode.getFirstByXPath(".//p[@class='result-info']/a");
    HtmlElement priceSpan = itemNode.getFirstByXPath(".//a/span[@class='result-price']");

    String title = titleAnchor.getTextContent();
    String url = titleAnchor.getHrefAttribute();
    // Not every listing shows a price, so guard against a missing element
    String price = (priceSpan != null) ? priceSpan.getTextContent() : "N/A";

    System.out.printf("Title: %s%nURL: %s%nPrice: %s%n%n", title, url, price);
}

This code grabs all li.result-row elements, then for each one, extracts the title, URL, and price using relative XPath expressions. We store the extracted data in simple String variables for now, but we could create a custom Java object to represent each listing.
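For instance, a small value class keeps the three fields together and plays nicely with Jackson later on; a minimal sketch:

public class Listing {
    private final String title;
    private final String url;
    private final String price;

    public Listing(String title, String url, String price) {
        this.title = title;
        this.url = url;
        this.price = price;
    }

    public String getTitle() { return title; }
    public String getUrl()   { return url; }
    public String getPrice() { return price; }
}

Because Jackson discovers properties through getters, an instance of this class can be serialized directly with mapper.writeValueAsString(listing).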

Outputting Data

For more advanced use cases, you'll probably want to save the extracted data in a structured format like JSON. We can modify the parsing loop to build a JSON object for each listing:

ObjectMapper mapper = new ObjectMapper();

for (DomNode itemNode : itemNodes) {
    // ... extract title, url, and price as before ...

    ObjectNode listing = mapper.createObjectNode();
    listing.put("title", title);
    listing.put("url", url);
    listing.put("price", price);

    String jsonString = mapper.writeValueAsString(listing);
    System.out.println(jsonString);
}

This uses the Jackson library to generate a JSON object node, populate it with our extracted data, and serialize it to a string. Note that writeValueAsString throws the checked JsonProcessingException, so the enclosing method must catch or declare it. You could further customize the output or write it to a file.
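For example, you could collect every listing into a JSON array and write the whole array to disk in one pass; a sketch (the listings.json filename is just an illustration, and writeValue throws IOException):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import java.io.File;

ObjectMapper mapper = new ObjectMapper();
ArrayNode listings = mapper.createArrayNode();

for (DomNode itemNode : itemNodes) {
    // ... build each ObjectNode as before ...
    listings.add(listing);
}

// Pretty-print the collected array to a file in a single write
mapper.writerWithDefaultPrettyPrinter()
      .writeValue(new File("listings.json"), listings);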

Taking it Further

Handling Pagination

So far we've only scraped the first page of Craigslist results, but we can extend our scraper to handle pagination.

The basic approach is to check for a "next page" link and repeat the fetching and parsing process for each page. HtmlUnit makes this easy: look up the "next" anchor with XPath and, if it exists, click it to load the following page in a loop:

HtmlAnchor nextPageLink = null;
do {
    // parse and extract data from the current page
    // ...

    try {
        nextPageLink = page.getFirstByXPath("//a[@class='next']");
        if (nextPageLink != null) {
            page = nextPageLink.click();
        }
    } catch (Exception e) {
        nextPageLink = null;
    }
} while (nextPageLink != null);

This will continue fetching and parsing each page of results until no "next" link is found.

Customizing Search Criteria

Craigslist supports many different search parameters that you can incorporate into your scraper. For example, to limit results to only those posted today:

String searchUrl = "https://newyork.craigslist.org/search/sss?query=iphone+13&postedToday=1";

Or to include results from nearby areas:

String searchUrl = "https://newyork.craigslist.org/search/sss?query=iphone+13&postedToday=1&searchNearby=1";

You can discover the supported query parameters by applying filters in Craigslist's search interface and inspecting the resulting URL. Modifying the search criteria programmatically based on your scraping needs allows for powerful customization.
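To keep this manageable as the number of parameters grows, a small helper can assemble the query string from a map (a sketch; the parameter names come from the URLs above):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

static String buildSearchUrl(String base, Map<String, String> params)
        throws UnsupportedEncodingException {
    StringBuilder url = new StringBuilder(base).append('?');
    for (Map.Entry<String, String> entry : params.entrySet()) {
        if (url.charAt(url.length() - 1) != '?') {
            url.append('&');
        }
        url.append(entry.getKey()).append('=')
           .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
    }
    return url.toString();
}

Map<String, String> params = new LinkedHashMap<>();
params.put("query", "iphone 13");
params.put("postedToday", "1");
String searchUrl = buildSearchUrl("https://newyork.craigslist.org/search/sss", params);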

Additional Tips

Web scraping is a complex task in real-world scenarios. Here are some additional tips to keep in mind:

Set a Reasonable Request Rate

When scraping a site, it's important to limit the frequency of your requests to avoid overloading the server or getting your IP blocked. Add delays of a few seconds between requests and consider using proxies if you need to scrape a large number of pages.
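A minimal way to do this in plain Java is to sleep between fetches, with a little random jitter so requests don't arrive at perfectly regular intervals (the 2-5 second range is an arbitrary choice):

import java.util.concurrent.ThreadLocalRandom;

// Pause 2-5 seconds between requests (illustrative range)
long delayMs = ThreadLocalRandom.current().nextLong(2000, 5000);
try {
    Thread.sleep(delayMs);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}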

Handle Errors Gracefully

Scrapers can encounter many issues like network failures, bot detection, or unexpected page structure changes. Make sure your code handles exceptions and can resume gracefully.
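For transient network failures, a simple retry wrapper around the page fetch often suffices; a sketch (the attempt count and backoff values are arbitrary):

import java.io.IOException;

HtmlPage fetchWithRetry(WebClient client, String url) throws IOException {
    IOException lastError = null;
    for (int attempt = 1; attempt <= 3; attempt++) {
        try {
            return client.getPage(url);
        } catch (IOException e) {
            lastError = e;
            try {
                // Simple linear backoff before the next attempt
                Thread.sleep(1000L * attempt);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
    throw lastError;
}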

Render JavaScript-Heavy Pages

While HtmlUnit works great for static HTML, some modern sites rely heavily on JavaScript to load content dynamically. In those cases, you may need a full browser automation tool like Selenium to fully render pages before scraping.
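A minimal Selenium sketch looks like this (it requires the selenium-java dependency and a matching browser driver installed; the example.com URL is a placeholder):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

WebDriver driver = new ChromeDriver();
try {
    driver.get("https://example.com");
    // getPageSource returns the DOM after JavaScript has executed
    String renderedHtml = driver.getPageSource();
    System.out.println(renderedHtml.length());
} finally {
    driver.quit();
}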

Consider ScrapingBee

The web scraping process can get very complex very quickly, especially when you need to execute JavaScript, solve CAPTCHAs, and avoid detection. If you want to avoid having to worry about all this, you can use the ScrapingBee API, which manages the entire scraping process behind an easy-to-use interface. The first 1,000 API calls are free.

Final Thoughts

Web scraping is an invaluable skill for anyone needing to extract data from the internet. With the Java techniques covered in this tutorial, you're well on your way to scraping data from a wide range of websites. Just remember to always be respectful and use scraped data responsibly.

While we focused on HtmlUnit for simplicity, Java has several other powerful libraries like jsoup and Selenium that are worth exploring for different scraping needs. The core concepts remain the same.

Hopefully, this guide has given you a solid foundation in web scraping with Java. Scraping opens up a world of data possibilities. So choose a site, start small, and see what valuable data you can uncover. Happy scraping!
