Web scraping is the process of programmatically extracting data from websites. It allows you to gather information from across the internet and use it for analysis, archiving, app building, machine learning, or any number of other purposes.
While there are many languages and tools that can be used for web scraping, in this article we'll focus on how to scrape data in Go using the fantastic Colly framework. Go's simple syntax, strong typing, and built-in concurrency features make it an excellent choice for writing scrapers.
We'll walk through a practical example of using Go and Colly to extract data from a Hacker News comment thread. By the end of this tutorial, you'll know how to:
- Install and configure Colly
- Navigate through websites and follow links
- Extract text, attributes, and HTML from elements on a page
- Deal with pagination
- Output the scraped data
Let's dive in!
Setting Up Your Go Environment
First, make sure you have a recent version of Go installed (1.13+). You can check your current version with:
$ go version
Create a new directory for your project and initialize a Go module in it:
$ mkdir hackernews-scraper
$ cd hackernews-scraper
$ go mod init github.com/yourusername/hackernews-scraper
Next, install the Colly package:
$ go get -u github.com/gocolly/colly/...
Now create a file named main.go – this is where we'll write the code for our scraper.
Introducing Colly – A Powerful Scraping Framework for Go
Colly provides a clean and expressive API for writing web scrapers in Go. It handles many low-level details for you, such as managing concurrent requests, cookies and sessions, redirects, and robots.txt rules.
The core concept in Colly is the "Collector", which represents a scraping job. You create a Collector, configure it with settings and callbacks, then start the job by providing one or more URLs to visit.
As the Collector crawls pages, it executes callbacks in response to certain events – for example, you can register callbacks (see the sketch after this list) to run when Colly:
- Visits a new URL
- Finds a particular HTML element
- Receives an HTTP response
- Encounters an error
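Before we get to the real example, here is a minimal, self-contained sketch of that Collector-plus-callbacks pattern. The a[href] selector and the example.com URL are placeholders for illustration, not part of the Hacker News scraper we build below:
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// A Collector represents a single scraping job.
	c := colly.NewCollector()

	// Runs before every request is sent.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Runs once for every element that matches the CSS selector.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Found link:", e.Attr("href"))
	})

	// Runs after every successful HTTP response.
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got response:", r.StatusCode, "from", r.Request.URL)
	})

	// Runs when a request fails.
	c.OnError(func(r *colly.Response, err error) {
		log.Println("Request failed:", err)
	})

	// Providing a URL starts the job and triggers the callbacks above.
	c.Visit("https://example.com/")
}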
This event-driven architecture makes it easy to extract just the data you need without getting lost in parsing logic. Let's see it in action!
Extracting Data from Hacker News with Go and Colly
As an example, let's write a scraper that extracts the comments from a Hacker News post and prints them to the console. We'll scrape this post: https://news.ycombinator.com/item?id=12345
Here's the complete code for our scraper:
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/gocolly/colly"
)

type comment struct {
	Author  string `selector:"a.hnuser"`
	URL     string `selector:".age a[href]" attr:"href"`
	Comment string `selector:".comment"`
}

type post struct {
	Title    string `selector:"a.storylink"`
	URL      string `selector:"a.storylink" attr:"href"`
	Comments []comment
}

func main() {
	// Create a new collector
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)

	p := post{}

	// Extract post title and URL
	c.OnHTML(".fatitem", func(e *colly.HTMLElement) {
		e.Unmarshal(&p)
	})

	// Extract comments
	c.OnHTML(".comment-tree tr.athing", func(e *colly.HTMLElement) {
		comment := comment{}
		e.Unmarshal(&comment)

		// Remove extra spacing from comments
		comment.Comment = strings.TrimSpace(comment.Comment)

		// Build absolute URL from relative URL
		comment.URL = e.Request.AbsoluteURL(comment.URL)

		p.Comments = append(p.Comments, comment)
	})

	// Set max depth to 1 so we only visit the post and its comments
	c.MaxDepth = 1

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(r *colly.Response, e error) {
		log.Printf("Error scraping %s: %s", r.Request.URL, e)
	})

	// Dump results as JSON once the page has been scraped
	c.OnScraped(func(r *colly.Response) {
		enc := json.NewEncoder(os.Stdout)
		enc.SetIndent("", "  ")
		enc.Encode(p)
	})

	c.Visit("https://news.ycombinator.com/item?id=12345")
}
Let's break it down:
- We define two struct types, post and comment, to store the scraped data. Notice the "selector" and "attr" struct tags – these tell Colly how to locate each field's data in the DOM.
- In main, we create a new Collector, specifying that it's only allowed to visit pages on news.ycombinator.com.
- We register two OnHTML callbacks:
  - The first one looks for the .fatitem element, which contains the post title and URL, and uses Unmarshal to extract them into our post struct.
  - The second one looks for .comment-tree tr.athing elements and extracts the author, relative URL, and text of each comment, cleaning them up a bit and appending them to the Comments slice of our post struct.
- We register a few other callbacks: OnRequest logs each URL we visit, OnError logs any errors that occur during scraping, and OnScraped runs after the entire scrape job is done and dumps our scraped data as formatted JSON.
- Finally, we call c.Visit with the URL of the post we want to scrape, which kicks off the actual scraping job. Colly fetches the page and invokes our callbacks as it parses the response.
Run the code with:
$ go run main.go
You should see output like:
Visiting https://news.ycombinator.com/item?id=12345
{
  "Title": "Example HN Post",
  "URL": "http://example.com",
  "Comments": [
    {
      "Author": "user1",
      "Comment": "First comment!",
      "URL": "https://news.ycombinator.com/item?id=12346"
    },
    ...
  ]
}
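One item from the list at the top – dealing with pagination – deserves a quick note. Hacker News splits very long comment threads across multiple pages behind a "More" link, and you can extend the scraper to follow it. This is only a hedged sketch: the a.morelink selector is an assumption about HN's current markup, so verify it against the live page.
// Inside main(), alongside the other callbacks: follow HN's "More" link
// to the next page of comments (a.morelink is an assumed selector).
c.OnHTML("a.morelink", func(e *colly.HTMLElement) {
	// Request.Visit resolves the relative href against the current page URL
	// and re-runs all registered callbacks on the next page.
	e.Request.Visit(e.Attr("href"))
})

// The example sets c.MaxDepth = 1, which blocks this follow-up visit;
// loosen the limit when paginating (0 means no depth limit – AllowedDomains
// still keeps the crawl on news.ycombinator.com).
c.MaxDepth = 0
Because every visited page re-runs the registered callbacks, the comments from each additional page are appended to the same post struct.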
Tada! With just a few dozen lines of code, we've extracted structured data from a web page. Of course, there's a lot more you can do – this just scratches the surface of Colly's API. Be sure to check out the Colly docs and examples to learn about features like the following (several of them appear in the sketch after this list):
- Automatic cookie and session handling
- Parallel scraping with goroutines
- Caching responses
- Setting request headers, proxies, and timeouts
- Restricting crawl rate and depth
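To give a flavor of how those options fit together, here is a short, hedged sketch. The cache directory, delays, User-Agent string, and start URL are arbitrary illustration values, not anything prescribed by Colly:
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
		colly.MaxDepth(2),               // restrict crawl depth
		colly.Async(true),               // parallel scraping with goroutines
		colly.CacheDir("./colly_cache"), // cache responses on disk (illustrative path)
	)

	// Rate limiting: at most 2 parallel requests per domain, with a random delay.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 2 * time.Second,
	})

	// Request timeout and custom headers.
	c.SetRequestTimeout(30 * time.Second)
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "my-scraper/1.0") // illustrative UA string
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Fetched", r.Request.URL, "-", len(r.Body), "bytes")
	})

	c.Visit("https://news.ycombinator.com/newest")

	// With Async(true), Visit only queues the request; Wait blocks until done.
	c.Wait()
}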
Beyond Basic Scraping – Challenges at Scale
Although Colly and Go make it simple to get started with web scraping, things get trickier when you try to scrape large amounts of data from major websites. You'll quickly run into anti-bot countermeasures like:
- Rate limiting and CAPTCHAs
- IP bans when crawling too fast or too much from one IP
- Dynamic, JavaScript-rendered pages that are difficult to scrape
- Frequent changes to a site's structure that break your scrapers
To get around these, you'll need to distribute your scrapers across many IPs, rotate proxies, solve CAPTCHAs, render JavaScript, and constantly monitor and update your code. It's doable, but it takes significant time and infrastructure to get right.
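If you do tackle this with Colly itself, the framework ships a proxy helper for rotating requests across a pool of proxies. A minimal sketch, assuming you already have proxy endpoints to plug in (the addresses below are placeholders):
package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate between proxies round-robin; these addresses are placeholders.
	rp, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"http://127.0.0.1:8080",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	// Each request now goes out through one of the proxies above.
	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL)
	})

	c.Visit("https://news.ycombinator.com/")
}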
If you'd rather focus on actually using the data you scrape instead of fighting anti-bot measures, check out ScrapingBee. It's an API that handles all of this headache for you – just send it a URL and get back structured JSON data, without worrying about proxies, CAPTCHAs, or IP bans. It will even render JavaScript pages for you. The first 1000 API calls are free, so give it a try for your next scraping project!
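For comparison, calling an API like ScrapingBee from Go is just an ordinary HTTP request. The sketch below assumes the v1 endpoint and the api_key, url, and render_js query parameters – check the ScrapingBee documentation for the current details before relying on it:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Assumed endpoint and parameters – consult the ScrapingBee docs for specifics.
	params := url.Values{}
	params.Set("api_key", "YOUR_API_KEY")
	params.Set("url", "https://news.ycombinator.com/item?id=12345")
	params.Set("render_js", "true") // have the service render JavaScript for you

	resp, err := http.Get("https://app.scrapingbee.com/api/v1/?" + params.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Status:", resp.StatusCode, "- received", len(body), "bytes of page content")
}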
Conclusion
Web scraping is an incredibly useful skill to have in your developer toolbelt, and the combination of Go and Colly makes it accessible and enjoyable. In this article, we've walked through a complete example of scraping a Hacker News comment thread – hopefully this real-world use case has given you a taste of what's possible and inspired you to try scraping for yourself!
Remember, always be respectful when scraping and follow websites' robots.txt and terms of service. Now get out there and liberate some data!