Web Scraping With Linux And Bash: The Ultimate Guide for 2024

Web scraping is the process of automatically extracting data and content from websites. It allows you to retrieve information from across the internet and utilize it for your own purposes, whether that's storing it in a database, transforming it into a new format, or incorporating it into an application.

While web scraping is commonly done using high-level programming languages like Python, Java, or JavaScript frameworks, it's also very much possible to create powerful web scrapers using just the Linux command line and some standard Bash tools. In this ultimate guide, we'll take an in-depth look at why you might want to do web scraping on Linux, what tools are available, and walk through several concrete examples to collect different types of data from the web. Let's jump right in!

Why Use Linux and Bash for Web Scraping?

If you're familiar with web scraping using languages like Python or tools like Puppeteer, you may be wondering why you would want to do it with Linux command line utilities instead. After all, those high-level languages provide rich libraries and frameworks specifically designed for crawling and scraping websites. So what are the advantages of the Linux approach?

The main benefits come down to:

  • Lightweight – Linux command line tools are very lightweight and efficient compared to full-fledged programming language runtimes, web browsers, etc. This makes them well-suited for focused scraping tasks that don't require heavy lifting.

  • Already installed – Most Unix-based systems like Linux and macOS come with the command line tools we'll be discussing pre-installed out of the box. No extra setup or configuration is needed to start scraping.

  • Composability – The Unix philosophy emphasizes small, focused tools that do one thing well. These tools are designed to be composed together in pipelines, which is very useful for web scraping workflows involving fetching data, extracting content, processing it, and storing the results.

That said, command line web scraping with Bash does have some limitations compared to using full-featured languages. In particular, it's more difficult to handle websites that make heavy use of JavaScript, require complex session handling, or need to integrate with databases. Tools like Python's Scrapy or headless browsers are better suited for those use cases.

The Bash approach shines for more focused scraping jobs that can benefit from a lightweight, self-contained scripting environment without a lot of overhead. And even for more complex projects, it can be used for certain pieces of the pipeline.

The Linux Scraping Toolkit

Before diving into some concrete examples, let's take a quick tour of the key tools in a Linux scraper's toolkit and what they're used for:

  • curl – This is the workhorse that will send HTTP requests to web servers and retrieve HTML, XML, JSON, and other content for us to parse. It supports all kinds of options for customizing requests.

  • grep, sed, head, tail – These are standard Linux text processing utilities that are useful for searching and filtering content, cleaning up noise, and extracting pieces we care about.

  • jq – This tool makes parsing and manipulating JSON data on the command line a breeze. It has a full query language for filtering, reshaping, and transforming JSON documents.

  • xidel – For parsing and extracting data from HTML and XML documents, Xidel is a great tool. It allows querying content using XPath expressions and CSS selectors.

  • html2text – Handy utility for stripping HTML tags from a document and getting just the plain visible text.

Together, these tools provide a powerful and expressive environment for fetching website data, extracting the pieces you need, and transforming it into a usable format.
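
To get a feel for how these pieces snap together, here is a one-line sketch that fetches a small JSON document from httpbin.org (a public test endpoint used purely for illustration), pulls out a single field with jq, and saves it to a file:

# httpbin.org/json is a public test endpoint; .slideshow.title is one of its fields
curl -s https://httpbin.org/json | jq -r '.slideshow.title' > title.txt

Each tool does one job: curl handles the HTTP, jq handles the JSON, and the shell's redirection handles the storage.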

Examples: Scraping with Linux

Now that we have the lay of the land, let's walk through some actual web scraping examples you can try out on your Linux system. These will give you a hands-on feel for how the various tools fit together to pull data from different types of websites.

Example 1: Getting Your Public IP Address

For our first simple example, let's retrieve our public IP address by scraping a website that returns this information. There are a number of services that offer this, but we'll use https://api.ipify.org, which returns the IP in plain text.

Here's the code:

#!/bin/bash
my_ip=$(curl -s https://api.ipify.org)
echo "My IP address is: $my_ip" > ip.txt

Breaking this down:

  1. We use curl to send an HTTP GET request to api.ipify.org. The -s flag suppresses curl's progress output so only the response body is printed.
  2. The $() syntax captures the output of the curl command into a variable called my_ip.
  3. Finally, we echo the result, which is the IP address, into a file called ip.txt, prefixed with "My IP address is:".

Very simple, but it shows the basic pattern of using curl to fetch data from a URL and capturing it for further processing.
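
In practice you will usually want the script to notice when the request fails. The following variation is a minimal sketch of that idea: the -f flag makes curl exit with a non-zero status on HTTP errors, and --max-time caps how long we wait (the 10-second value here is an arbitrary choice):

#!/bin/bash
# Fail loudly if the request does not succeed or takes too long
# (--max-time 10 is an arbitrary timeout for this sketch)
if my_ip=$(curl -sf --max-time 10 https://api.ipify.org); then
  echo "My IP address is: $my_ip" > ip.txt
else
  echo "Failed to retrieve IP address" >&2
  exit 1
fi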

Example 2: Scraping JSON Data from the Reddit API

Next let's look at a more realistic example – retrieving data from Reddit's JSON API. Many websites offer APIs that return data in a structured JSON format, which is very convenient for scraping.

We'll retrieve the current top posts on the /r/linux subreddit. The URL for this is https://www.reddit.com/r/linux/top.json. But there are a couple of issues we need to deal with. First, Reddit doesn't like requests that don't set a User-Agent header identifying the client. It will return an error if we use curl's default. Second, the API returns results a page at a time (25 by default, up to 100 with the limit parameter), so we need to follow the after cursor if we want to collect more.

Here's a Bash script that handles both of these issues:

#!/bin/bash

user_agent="my-scraper-bot/1.0"
subreddit="r/linux" 
result_file="reddit.json"

after=""
count=0
while [[ $count -lt 100 ]]; do
  # Construct URL with after param for pagination
  url="https://www.reddit.com/$subreddit/top.json?limit=100"
  if [[ -n $after && $after != "null" ]]; then
    url="$url&after=$after"
  fi

  # Make GET request with custom User-Agent
  json=$(curl -s -H "User-Agent: $user_agent" "$url")

  # Extract the list of posts (one JSON object per line) and the pagination cursor
  posts=$(echo "$json" | jq -c '.data.children[]')
  after=$(echo "$json" | jq -r '.data.after')

  # Append this batch to the result file
  echo "$posts" >> "$result_file"

  # Add the size of this batch to the running total
  ((count += $(echo "$json" | jq '.data.children | length')))

  # Stop early if there are no more pages
  if [[ -z $after || $after == "null" ]]; then
    break
  fi
done

echo "Scraped $count posts from $subreddit"

This script introduces a few new concepts:

  • We construct the URL query parameters based on the after value to enable pagination. This value is included in the JSON response and tells us where to start the next page of results.
  • We use the -H option to curl to set a custom User-Agent header identifying our scraper.
  • jq is used extensively to parse out the pieces we need from the JSON response – the list of posts, the after value, and the count of results so far.
  • We keep making paginated requests until we've fetched at least 100 posts, or until the after cursor comes back empty, which means there are no more pages.
  • Each batch of results is appended to an output file.

With a few tweaks, this same pattern can be used to scrape all kinds of data from API endpoints that return JSON.
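
As a quick illustration of that post-processing, here is a small one-liner that assumes reddit.json holds one post object per line (as produced by the script above) and that each post carries the usual score and title fields; it prints the ten highest-scoring titles:

# Assumes one Reddit post object per line in reddit.json (as written above)
jq -r '[.data.score, .data.title] | @tsv' reddit.json | sort -rn | head -n 10

jq reads each object in turn, @tsv turns the score/title pair into a tab-separated line, and sort plus head do the ranking, with no extra tooling required.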

Example 3: Scraping Structured Data from Web Pages

Not all websites have clean JSON APIs. Often we need to scrape data directly out of HTML pages. For this example, we'll collect some structured data from Wikipedia – specifically, the list of the world's highest mountains.

The data we want is in a table on this page: https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth

Here's the scraper script:

#!/bin/bash

url="https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth"
html=$(curl -s $url)

# Extract the rank, name, height, and range of each mountain,
# joining the cells of each table row into one comma-separated line
results=$(echo "$html" | xidel - \
  -e '(//table[contains(@class,"wikitable")])[1]//tr[position()>1]/string-join((
        normalize-space(td[1]),
        normalize-space((td[2]//a)[1]),
        translate(normalize-space(td[4]), " ,", ""),
        normalize-space(td[5])
      ), ",")')

# Write the header row and the extracted data to the output CSV
echo "rank,mountain,height_meters,range" > mountains.csv
echo "$results" >> mountains.csv

echo "Scraped data on $(echo "$results" | wc -l) mountains"

The key points are:

  • We use curl to fetch the raw HTML of the page
  • xidel is used to navigate the HTML DOM and extract the specific elements we want. A single XPath expression selects the data rows of the table and, for each row, uses string-join to combine the rank, name, height, and range cells into one comma-separated line.
  • To clean up the scraped text, normalize-space trims stray whitespace and translate strips the spaces and thousands separators out of the height values so they become plain numbers.
  • The extracted data is written out to mountains.csv

This example shows how to locate and extract a pattern of structured data from a complex HTML page. The same techniques using xidel and the standard text utilities can be adapted to all kinds of other scraping tasks.
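
Once the CSV exists, the same command line tools can keep slicing it. For example, this small follow-up (assuming the column layout written above, with the height in column 3) lists the ten tallest entries:

# Assumes mountains.csv from the script above, with height_meters in column 3
tail -n +2 mountains.csv | sort -t, -k3,3 -rn | head -n 10

tail -n +2 skips the header row, sort orders the rows numerically by the height column, and head keeps the top ten.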

Scheduling and Automation

Once you have web scrapers written, you'll often want to automate running them on a schedule rather than manually invoking them. On Linux, the cron utility is used for this. It allows you to specify commands to be executed periodically – hourly, daily, weekly, etc.

To set up a cron job for a scraper:

  1. Open the crontab file for editing: crontab -e
  2. Add a line specifying when to run the scraper, e.g.: 0 * * * * /path/to/scraper.sh would run it at the start of every hour
  3. Save the file and exit. Cron will automatically pick up the changes.

Be sure to set appropriate delays between runs and limit the frequency to avoid overloading servers or getting your IP blocked.
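
As a concrete sketch, crontab entries for the scrapers in this guide might look like the following; the paths are placeholders and the schedules are arbitrary:

# Run the Reddit scraper every day at 03:15, appending all output to a log
# (paths below are placeholders)
15 3 * * * /home/user/scrapers/reddit.sh >> /home/user/scrapers/reddit.log 2>&1

# Run the IP check at minute 7 of every hour; picking an off-peak minute
# avoids piling onto the top-of-the-hour traffic spike
7 * * * * /home/user/scrapers/ip.sh

Redirecting both stdout and stderr to a log file makes it much easier to see why a scheduled scraper failed.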

Avoiding Detection

Speaking of getting blocked, one challenge of web scraping is avoiding detection by the websites you are scraping. Many sites monitor for excessive requests coming from individual IPs or user agents and will block traffic they suspect is a bot.

Some tips for staying under the radar:

  • Respect robots.txt files, which specify the rules a site sets for crawlers
  • Don't make too many requests too quickly. Add random delays between requests (see the sketch after this list).
  • Use proxies or rotate user agent strings to distribute requests
  • If available, use APIs instead of page scraping whenever possible
  • Avoid scraping content behind login pages or paywalls without permission
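
Here is a minimal sketch of the delay and rotation ideas above. The URLs and User-Agent strings are placeholders; the point is the pattern of picking a random agent and pausing between requests:

#!/bin/bash
# Placeholder User-Agent pool and URLs, purely for illustration
agents=("my-scraper-bot/1.0" "my-scraper-bot/1.1" "my-scraper-bot/1.2")
urls=("https://example.com/page1" "https://example.com/page2")

for url in "${urls[@]}"; do
  # Pick a random User-Agent from the pool
  ua=${agents[RANDOM % ${#agents[@]}]}
  curl -s -H "User-Agent: $ua" "$url" >> pages.html

  # Wait a random 2-6 seconds before the next request
  sleep $((RANDOM % 5 + 2))
done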

Conclusion

While often overlooked in favor of heavier-weight options, web scraping using Linux command line tools is a powerful technique for focused data extraction tasks. Leveraging utilities like curl, jq, and xidel, you can pull structured data from websites and APIs with a minimum of dependencies.

In this guide, we covered the key concepts and tools needed for Bash web scraping, and walked through several examples to give you a practical starting point. You should now be equipped to fetch and process data from the web directly in your Linux terminal.

As you develop your scraping skills, you may find that some tasks are still challenging using Bash alone – especially for sites that heavily rely on JavaScript. In that case, you may want to graduate to a service like ScrapingBee that provides a full managed web scraping platform.

But for many common scraping needs, the combination of Linux and Bash is a simple and powerful option to have in your toolbox. I encourage you to try out the examples here, adapt them to your own use cases, and experience the joy of command line web scraping. Happy data hunting!
