
Curl Web Scraping: A Detailed Guide to Scraping Google SERPs with Curl and Terminal

Curl is a versatile command line tool that allows you to transfer data using various protocols. With curl, you can download or upload files, interact with APIs, scrape web pages, and more. In this comprehensive guide, we'll explore how to leverage curl for web scraping, specifically for scraping Google search engine results pages (SERPs).

What is Curl?

Curl stands for "Client URL". It is a command line tool that transfers data using various protocols such as HTTP, HTTPS, FTP, POP3, and more. Some key features of curl:

  • Open source and free to use on most operating systems, including Linux, Windows, and macOS.
  • Supports multiple protocols, including HTTP, HTTPS, FTP, SFTP, SCP, Telnet, LDAP, MQTT, POP3, IMAP, and SMTP.
  • Can download files from or upload files to remote servers.
  • Works well with REST APIs, sending HTTP requests such as GET, POST, PUT, and DELETE.
  • Can fetch web pages so their content can be extracted.
  • Supports proxies, cookies, authentication, SSL/TLS connections, and more.

Curl is installed by default on most Linux and macOS systems, and recent versions of Windows also ship with it. On older Windows systems you may need to download and install it separately.
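Before moving on, here are a few basic curl invocations that illustrate these features. The hosts example.com and httpbin.org are public test services used purely for illustration:

# Simple GET request, printing the response body to stdout
curl https://example.com

# Fetch only the response headers
curl -I https://example.com

# Send a POST request with a JSON body (httpbin.org echoes it back)
curl -X POST https://httpbin.org/post \
     -H "Content-Type: application/json" \
     -d '{"hello": "world"}'

# Save the response to a file
curl -o page.html https://example.com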

Why Use Curl for Web Scraping?

Here are some key advantages of using curl for web scraping tasks:

  • No Browser Needed: Curl runs from the command line so you don't need to automate an actual browser. This makes it fast and lightweight.

  • Scripting Capabilities: Curl commands can be executed from bash/terminal scripts allowing you to automate complex scraping tasks.

  • Portability: Curl is available on almost all platforms so your scraping scripts can run on different OS without changes.

  • Proxies and Authentication: Curl easily integrates with proxies and supports multiple authentication mechanisms like user/password, API keys etc. This is very useful for web scraping.

  • JSON Handling: Many APIs return JSON, which curl can fetch directly; piping the response into a tool like jq lets you filter and format the data without parsing HTML (see the example after this list).

  • Ease of Use: Curl commands are relatively simple and easy to use for basic scraping tasks.
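For example, here is a minimal sketch that fetches JSON and filters it with jq. The URL points to httpbin.org, a public test service used purely for illustration; swap in your own API and filter:

# Fetch a sample JSON document and extract a single field with jq
curl -s https://httpbin.org/json | jq '.slideshow.title'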

So in summary, curl provides a simple and fast way to scrape web pages programmatically without needing an actual browser. Next, we'll see how to use it for scraping Google SERPs.

Scraping Google SERPs with Curl

Google search results contain a wealth of structured data and scraping them can be very useful. However, Google employs several anti-scraping mechanisms that can detect and block scrapers. To avoid getting blocked while scraping Google, it's best to use proxies.

We'll use the Smartproxy Web Scraping API, which provides robust residential proxies and easy-to-use curl commands for SERP scraping.

Here are the steps:

1. Sign up for Smartproxy API

First, sign up on Smartproxy and subscribe to the Web Scraping API. After logging into the dashboard, go to Proxies -> Authentication and create a username/password for proxy authentication.

2. Encode Credentials

We need to encode the username/password into a base64 encoded string for passing in the request headers:

# Encode credentials 
$ echo -n 'username:password' | openssl base64

# Sample output
dXNlcm5hbWU6cGFzc3dvcmQ=
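If openssl is not available, the base64 utility found on most Linux and macOS systems produces the same result:

# Alternative: use the base64 utility directly
$ echo -n 'username:password' | base64
dXNlcm5hbWU6cGFzc3dvcmQ=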

3. Construct the Curl Request

Here is how we construct a curl request for scraping Google SERPs:

# Base API endpoint 
API_URL=https://webapi.smartproxy.com/api/v1/task

# Setup headers and data
HEADERS=(
  'Content-Type: application/json'
  "Authorization: Basic $(echo -n 'username:password' | openssl base64)"
)

DATA=$(cat << EOF
{
  "url": "https://www.google.com/search?q=curl+scraping",
  "parse_rules": {
    "results": {
      "listItem": ".tF2Cxc",
      "name": "h3",
      "link": "a"
    }
  } 
}
EOF
)

# Construct curl request (one -H flag per header)
curl -X POST "$API_URL" \
     -H "${HEADERS[0]}" \
     -H "${HEADERS[1]}" \
     -d "$DATA"

Let's break this down:

  • We define the API endpoint, headers, and data payload separately for readability.
  • The headers contain the content type and the Basic authorization value built from the encoded username/password; each one is passed to curl with its own -H flag.
  • The data payload defines the target URL and parsing rules for extracting search results.
  • Finally, we construct the curl request with appropriate headers and data.

This will send the scraping request through Smartproxy's residential proxies and return parsed results.
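As a rough sketch, the same request can be wrapped in a bash loop to submit several queries in one go. This reuses the endpoint and credentials from above and assumes the same payload format; the payload here only sets the target URL, so add parse_rules as needed:

#!/bin/bash
# Sketch: submit one scraping task per query using the endpoint shown above

API_URL=https://webapi.smartproxy.com/api/v1/task
AUTH="Authorization: Basic $(echo -n 'username:password' | openssl base64)"

for query in "curl scraping" "web scraping tools" "serp api"; do
  q=$(echo "$query" | sed 's/ /+/g')   # replace spaces with + for the query string

  curl -s -X POST "$API_URL" \
       -H 'Content-Type: application/json' \
       -H "$AUTH" \
       -d "{\"url\": \"https://www.google.com/search?q=${q}\"}" \
       > "serp_${q}.json"
done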

4. Handling Results

By default, the API response will be in JSON format. You can pipe the output to jq to filter and format the data:

curl ... | jq '.tasks[].result.result'

This filters the JSON to only show the scraped search results.

To save results directly into a file, you can redirect the output:

curl ... > results.json
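The two can also be combined, filtering the response with jq and writing only the parsed results to a file in a single step:

# Filter the JSON response and save only the parsed results
curl ... | jq '.tasks[].result.result' > parsed_results.json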

The API has many other options, such as geolocation targeting, custom user agents, and parsing rules. Refer to the documentation for more details.

Scraping Google on Terminal with Curl

Now that we've seen how to leverage Smartproxy's API, let's try scraping Google directly using curl on the terminal.

1. Getting Search Results Page

We'll start by fetching the search results HTML:

curl -s "https://www.google.com/search?q=web+scraping" > results.html

This downloads the first page of search results for "web scraping" into a file called results.html.

The -s flag ensures silent operation without any progress meters.
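A few other curl flags are often useful here: -L follows redirects, -A sets a browser-like user agent, and -o writes the response to a file. The fetch above could also be written as:

# Follow redirects, send a browser-like User-Agent, and write to a file
curl -s -L \
     -A "Mozilla/5.0 (X11; Linux x86_64)" \
     -o results.html \
     "https://www.google.com/search?q=web+scraping"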

2. Extracting Data

Next, we'll extract the title and links from the HTML using grep and sed:

# Extract Title
grep -o '<title>[^<]*</title>' results.html | sed -e 's/<title>//' -e 's/<\/title>//'

# Extract Links
grep -o 'href="[^"]*"' results.html | sed 's/href="\([^"]*\)"/\1/'

These commands use simple regular expressions to match the page title and pull out the href attribute of every link on the page.

3. Scraping Additional Pages

To scrape beyond the first page, we need to programmatically generate the next page URLs and scrape them.

Here's a simple script to scrape and extract data from the first 3 search pages:

#!/bin/bash

query="web+scraping"

for i in {1..3}
do
   # Construct page URL (start=0, 10, 20 for pages 1-3)
   url="https://www.google.com/search?q=${query}&start=$(( (i - 1) * 10 ))"

   # Fetch page
   curl -s "$url" > "page_${i}.html"

   # Extract title
   grep -o '<title>[^<]*</title>' "page_${i}.html" | sed -e 's/<title>//' -e 's/<\/title>//'

   # Extract links
   grep -o 'href="[^"]*"' "page_${i}.html" | sed 's/href="\([^"]*\)"/\1/'

   echo
done

This generates the start parameter (start=0, start=10, start=20) to page through the results and extracts data from each page.

While this works, scraping beyond a few pages will likely get blocked by Google. To do large scale scraping, we need proxies, which we'll see next.

4. Using Proxies

To leverage proxies with curl, we can use the --proxy option:

curl --proxy http://USERNAME:PASSWORD@IP:PORT https://www.google.com

This will route the request via the proxy server.

To avoid getting blocked, we need a large, diverse, and frequently rotating pool of residential proxies, like Smartproxy's 40M+ IPs.

Smartproxy provides curl-compatible proxy URLs that can be used directly. The placeholders below (USER, PASSWORD, HOST, PORT) stand in for your own credentials and the gateway address shown in your Smartproxy dashboard:

# Get a curl-formatted proxy URL
curl "http://PATTERN-USERNAME:PASSWORD@HOST/getProxy?format=curl"

# Sample output
http://USER:PASSWORD@HOST:30000

# Use with curl
curl --proxy http://USER:PASSWORD@HOST:30000 https://www.google.com

By integrating Smartproxy's proxies into your scripts, you can scrape Google at scale without worrying about blocks.

Some other tips for effective scraping, combined in the sketch after this list, include:

  • Use random delays between requests
  • Rotate user agents using the --user-agent flag
  • Handle captchas and follow robots.txt directives
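Here is a minimal sketch that combines these tips. The proxy URL and the user-agent strings are placeholders you would replace with your own values:

#!/bin/bash
# Sketch: random delays, rotated user agents, and a proxy (placeholders)

PROXY="http://USER:PASSWORD@HOST:PORT"
AGENTS=(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  "Mozilla/5.0 (X11; Linux x86_64)"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
)

for query in "web scraping" "curl tutorial"; do
  ua=${AGENTS[RANDOM % ${#AGENTS[@]}]}    # pick a random user agent
  q=$(echo "$query" | sed 's/ /+/g')      # replace spaces with +

  curl -s --proxy "$PROXY" \
       --user-agent "$ua" \
       "https://www.google.com/search?q=${q}" > "serp_${q}.html"

  sleep $(( (RANDOM % 5) + 2 ))           # wait a random 2-6 seconds
done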

This covers the basics of scraping Google search with curl. With a few lines of code, you can easily extract titles, links, and other data from SERPs without needing an automated browser.

For advanced use cases, also consider a dedicated web scraping API like Smartproxy which handles proxy rotation, browsers, and parsing automatically.

Scraping Other Sites with Curl

The techniques we discussed can be adapted to scrape any site or web page with curl.

Here are some quick examples:

Scrape Hacker News Headlines

# Note: adjust the class name if Hacker News has changed its markup
curl -s https://news.ycombinator.com |
  grep -E -o '<a href="[^"]+" class="storylink">' |
  sed 's/.*href="\([^"]*\)".*/\1/'

Scrape Product Listing from Amazon:

curl -s "https://www.amazon.com/s?k=laptops" |
  grep -E -o ‘<h2.*?data-attribute.*?<a .*?>‘ | 
  sed ‘s/^.*data-attribute.*?>//‘ |
  sed ‘s/<\/a>.*//‘  

Scrape User Profiles from Twitter:

curl -s "https://twitter.com/explore" |
  grep -E -o ‘<span.*? class="username.*?>.*?<b>‘ |
  sed ‘s|^.*>||‘ |
  sed ‘s|</b>||‘

As you can see, with curl, grep, and sed you can scrape and parse content from many sites, though selectors like these break whenever a site changes its markup.

Scraping JavaScript Rendered Sites

A major limitation of curl is that it cannot render JavaScript. So scraping sites like Google Maps, Facebook, or YouTube will only get you the initial HTML without the dynamically loaded content.

To scrape JS sites, you need to execute the JavaScript first. Some options are:

  • Browser Automation – Use Python + Selenium, Puppeteer etc. to drive a real browser.

  • Headless Browsers – PhantomJS, HtmlUnit etc. are lightweight headless browsers that can render JS.

  • Proxy Service – Smartproxy and other vendors provide proxy services that can render JS behind the scenes.

For basic scraping, curl is usually enough. But for heavy JS sites, you will need to incorporate a JS rendering solution.

Scraping Best Practices

When scraping with curl or any other tool, make sure to follow ethical scraping best practices:

  • Scrape publicly accessible data only. Do not hack into restricted areas.
  • Check site terms to avoid violating policies. Follow their directives like robots.txt.
  • Avoid aggressive scraping that can overload the target servers.
  • Use proxies and random delays to spread out requests.
  • Identify yourself properly with a meaningful user agent.
  • Do not share or resell scraped data without permission.
  • Scrape data that you have the rights to use and store.

Adhering to best practices ensures that your scraping causes no harm while acquiring useful data.

Conclusion

Curl is a simple yet powerful command line tool for web scraping. With curl, proxies, and basic UNIX utilities like grep and sed, you can extract large amounts of data from the web easily without needing an actual browser.

Some key highlights:

  • Curl lets you transfer data using multiple protocols with ease.
  • It works well for basic scraping tasks without requiring heavy browser automation.
  • You can write curl scripts to automate complex scraping workflows.
  • Using proxies helps avoid blocks when scraping at scale.
  • Curl has some limitations when sites rely heavily on JavaScript.

Overall, curl offers a straightforward way to get started with web scraping using basic Linux/Unix utilities. With robust proxies and practices, you can leverage curl to extract data from almost any public website.
