As the web continues to grow exponentially, being able to effectively crawl and index websites is becoming increasingly important for a variety of applications – from search engines to data mining to archiving web content for posterity. While there are many web crawling tools available, one of the most powerful and flexible is Apache Nutch.
In this comprehensive guide, we'll dive deep into what makes Nutch uniquely suited for large-scale web crawling projects. You'll learn step-by-step how to install and configure Nutch, define crawlable URLs, optimize your crawling, and integrate with Apache Solr to query and analyze your web data. Whether you're a developer, data scientist, or digital archivist, mastering Apache Nutch will open up a world of possibilities for unlocking insights from the vast troves of information on the web.
What is Apache Nutch?
Apache Nutch is an open source web crawler written in Java. It was originally developed by Doug Cutting and Mike Cafarella in 2002, and became an Apache Software Foundation project in 2005. Nutch began as the crawler behind an Apache Lucene-based open source search engine, but it has since evolved into a powerful, extensible crawler in its own right.
So what makes Nutch stand out from other web crawlers? Here are some of its key features and advantages:
- Highly scalable architecture that can run on a single machine or across a large Hadoop cluster for massive web crawls
- Pluggable architecture that allows easy extension with custom plugins for data extraction, processing, indexing, etc.
- Powerful defaults for resilient, polite crawling – auto-throttling, robots.txt support, etc.
- Tightly integrated with Apache Hadoop, Apache Gora, Apache Solr and other tools in the Apache big data ecosystem
- Active development and strong community support
With this potent combination of scalability, extensibility and built-in smarts, Nutch has been battle-tested in production at massive scale by organizations like Yahoo, Adobe, and the Internet Archive. So let's dig into how you can harness Nutch for your own web crawling needs!
Installing and Configuring Nutch
The first step is getting Nutch up and running on your machine. Nutch requires Java, so make sure you have Java 11 or higher installed.
Download a stable Nutch release from the official Apache downloads page. At the time of this writing, Nutch 1.19 is the latest version:
wget https://dlcdn.apache.org/nutch/1.19/apache-nutch-1.19-bin.zip
unzip apache-nutch-1.19-bin.zip
Test that Nutch is working:
cd apache-nutch-1.19
bin/nutch
This should display the usage info for the nutch command. If you get a JAVA_HOME error, set the JAVA_HOME environment variable to your Java installation directory.
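For example, on a typical Linux setup (the JDK path below is only an illustration; point it at wherever your Java 11+ installation actually lives):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  # example path, adjust for your system
export PATH=$JAVA_HOME/bin:$PATH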
Before you can start crawling, there are a few key configuration files to customize.
In conf/nutch-site.xml:
- Set the http.agent.name property to a name that identifies your crawler (e.g. MyNutchCrawler). Sites match this name against their robots.txt directives, and a clear identity helps you avoid unintentionally overloading web servers.
- If you are running Nutch 2.x with Apache Gora, you can store crawled data directly in Solr by setting the storage.data.store.class property to org.apache.gora.solr.store.SolrStore. (Nutch 1.x, used in this guide, keeps crawl data in its own crawldb and segments on disk and pushes documents to Solr at indexing time.)
In conf/regex-urlfilter.txt:
- Define URL regular expression filters to constrain which pages are crawled. For focused crawls of particular domains, replace the default accept-all pattern (+.) with a specific whitelist, as in the example below.
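A minimal conf/nutch-site.xml for this tutorial might look like:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
And a whitelist-style conf/regex-urlfilter.txt scoped to the two seed sites used later in this guide (an illustrative scope; adapt the patterns to your own targets) could end with:
# accept only pages on the Nutch and Lucene sites
+^https?://(nutch|lucene)\.apache\.org/
# reject everything else
-.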
With those configs in place, you're ready to kick off your first Nutch crawl!
Defining Seed URLs and Crawling
The pages that Nutch should start crawling from are called "seed URLs". You define these in a plain text file, one URL per line.
Create a urls directory containing a seed.txt file that lists a few target URLs to crawl, one per line:
https://nutch.apache.org/
https://lucene.apache.org/
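From the Nutch installation directory, for example:
mkdir -p urls
printf 'https://nutch.apache.org/\nhttps://lucene.apache.org/\n' > urls/seed.txt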
Next, we go through the steps of a Nutch crawling cycle:
Inject the seed URLs into the Nutch crawl database:
bin/nutch inject crawl/crawldb urls
Generate a fetch list of URLs to crawl:
bin/nutch generate crawl/crawldb crawl/segments
Fetch the pages in the segment:
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
Parse the fetched pages:
bin/nutch parse $s1
Update the crawl database with the parsed page data:
bin/nutch updatedb crawl/crawldb $s1
To do multiple rounds of crawling, simply repeat the generate, fetch, parse, and updatedb steps, passing -topN to the generate command to control how many pages are fetched in each round. The crawl will spider out to more pages with each iteration; a scripted version of this loop is sketched below.
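Here is a rough shell sketch of several crawl rounds (the round count and -topN value are illustrative; tune them for your own crawl):
for round in 1 2 3; do
  # generate a fetch list of up to 1000 top-scoring URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # the newest segment is the one generate just created
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
done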
After a few rounds of crawling, your custom crawler is starting to amass some real web data! The next step is to index it so you can start querying and analyzing.
Indexing and Searching with Apache Solr
While Nutch includes basic tools for inspecting the crawl database, to unlock its full potential you'll want to pair it with a search engine like Apache Solr.
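For quick checks without a search engine, the readdb tool can print summary statistics from the crawldb:
bin/nutch readdb crawl/crawldb -stats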
If you don't have Solr already, install it now:
wget https://dlcdn.apache.org/lucene/solr/8.11.2/solr-8.11.2.tgz
tar xzf solr-8.11.2.tgz
Start up Solr and create a core to store your Nutch data:
solr-8.11.2/bin/solr start
solr-8.11.2/bin/solr create -c nutch
Configure the Solr integration in conf/nutch-site.xml:
<property>
<name>solr.server.url</name>
<value>http://localhost:8983/solr/nutch</value>
</property>
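Note that in recent Nutch releases (1.15 and later) the Solr connection is primarily configured through the indexer-solr writer in conf/index-writers.xml. If the property above is not picked up, check that file for the Solr URL parameter, which looks roughly like this (the surrounding layout varies between releases):
<param name="url" value="http://localhost:8983/solr/nutch"/>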
Now you can index your crawled data into Solr:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -filter -normalize -deleteGone
After a minute you should see a message that some number of documents have been indexed. They're now ready to search!
Open up the Solr admin console at http://localhost:8983/solr/#/nutch/query. Here you can run queries to search your crawled pages.
For example, to find all pages that mention "apache" in the title:
title:apache
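You can run the same query against Solr's HTTP API directly, for example with curl:
curl 'http://localhost:8983/solr/nutch/select?q=title:apache'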
Some other handy queries:
- content:wikipedia – pages mentioning "wikipedia" in the body text
- site:org – only results from .org domains
- url:pdf – PDF file URLs
- anchor:nutch – incoming anchor text containing "nutch"
- type:image – image files like JPEGs, GIFs, PNGs
As you can see, Solr exposes a wide variety of fields from the crawled pages, allowing you to slice and dice your web data. Solr's faceting and analytics features are also very handy for visualizing trends across result sets.
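For instance, to see which hosts dominate your crawl, you can facet on the host field (a field populated by Nutch's default indexing plugins; this is a sketch, so check your schema for the exact field names):
curl 'http://localhost:8983/solr/nutch/select?q=*:*&rows=0&facet=true&facet.field=host'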
Advanced Nutch Configuration and Usage
We've walked through the basics of a Nutch crawl, but there are many more knobs you can tune to customize your crawler's behavior (see the configuration example after this list):
- Protocol plugins for handling custom URL schemes (SFTP, etc)
- Parse plugins for extracting structured data from pages (e.g. Apache Tika for binary documents)
- Index plugins for adding metadata to Solr index (geographic locations, language, sentiment, etc)
- Scoring plugins for prioritizing pages by quality metrics
- Data export plugins for saving parsed data to files or databases
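Plugins are switched on through the plugin.includes property in conf/nutch-site.xml. The value below is an illustrative variant of the default, enabling Tika parsing and the index-more metadata plugin alongside the basics (your release's default may differ slightly):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>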
Nutch also has a wealth of options for controlling the scale and politeness of the crawl. Some best practices (see the example settings after this list):
- Respect robots.txt files and META NOINDEX tags to avoid over-hitting servers
- Set appropriate values for fetcher.server.delay and fetcher.threads.per.queue to limit load
- Configure db.max.outlinks.per.page to focus the crawl's scope
- Use URLFilters to prune out irrelevant or low-quality pages
- Enable fetcher.store.robotstxt to cache robots.txt files locally
- Run Nutch via Hadoop MapReduce for large-scale crawls across many machines
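A sketch of what those politeness settings might look like in conf/nutch-site.xml (the values are conservative starting points, not one-size-fits-all recommendations):
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests to the same server.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Allow only one fetcher thread per host at a time.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>Cap the number of outlinks followed from any single page.</description>
</property>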
With the proper tuning and sufficient hardware, Nutch can fetch millions of pages per day and scale to crawls spanning billions of pages, while still being a good web citizen.
Use Cases and Applications
So now that you've got a working Nutch crawler, what can you do with it? Really the possibilities are endless, but here are a few common applications:
- Vertical search engines for specific topics, industries or document types
- Price comparison and product catalog aggregation
- Web archiving and digital preservation
- Online reputation monitoring and brand management
- Academic and scientific research using web data
- Training machine learning models on web-scale corpora
- Investigative reporting and government oversight using OSINT techniques
Whatever your use case, Nutch provides a solid foundation for building web-scale crawling applications. Its pluggable architecture allows you to start simple but later extend it with custom business logic.
Conclusion
We've covered a lot of ground in this guide to Apache Nutch—from installation and configuration to crawling and indexing to querying data with Solr. You should now have a firm grasp of Nutch's core concepts and how to apply them to your own projects.
Of course, running your own web crawler is no cakewalk. There are many challenges to running Nutch at scale—from the computational expense of crawling, to the cat-and-mouse game of blocking and anti-bot countermeasures, to the gnarly edge cases that inevitably crop up when parsing real-world web pages.
If the DIY approach to web scraping seems too daunting, you may want to consider pre-built tools or web scraping services to shoulder the load. For example, ScrapingBee provides a dead-simple API for extracting structured data from web pages, without having to worry about spinning up your own infrastructure.
Whether you build or buy your web crawling stack, one thing's for certain—the web will only keep growing in scope and complexity. Being able to intelligently navigate and extract insights from all those petabytes of data will be an increasingly vital skill. Mastering tools like Apache Nutch is a great way to start wrapping your head around it.
Happy crawling!