Web scraping is an essential tool in every developer's toolkit. Whether you need to gather data for analysis, generate leads for marketing, or build price comparison tools, being able to systematically extract data from websites is an invaluable skill.
While web scraping can be done manually, it is typically an automated process using a bot or web crawler to retrieve specific data from the web and store it in a local database or file for further analysis.
The Go programming language, created at Google and released in 2009, is particularly well-suited for web scraping tasks. Its simplicity, fast performance, and built-in concurrency features make it an ideal choice for building scrapers that can process large amounts of data efficiently.
In this guide, we'll walk through how to build a web scraper in Go using the popular Colly library. We'll start with a basic example of scraping links from a Wikipedia page, and then move on to a more advanced use case of extracting table data into a structured format like CSV.
By the end of this tutorial, you'll have a solid understanding of the fundamentals of web scraping with Go and Colly, as well as the knowledge to extend these techniques to scrape data for your own projects. Let's get started!
Why Go for Web Scraping?
Before diving into the technical details, let's take a moment to discuss what makes Go a great language for web scraping. Here are a few of its key advantages:
- Simplicity – Go has a clean and easy-to-understand syntax, making it quick to learn for both new and experienced programmers. Its minimalism also makes code easier to read and maintain.
- Fast performance – Compiled to machine code, Go executes very quickly and has excellent memory management. Scrapers written in Go can process large volumes of data efficiently.
- Concurrency support – Go has built-in features like goroutines and channels that make it easy to write concurrent programs. This allows scrapers to perform multiple tasks simultaneously, such as sending many HTTP requests in parallel (see the short sketch after this list).
- Strong standard library – Go includes a robust standard library with many useful packages for web scraping, such as net/http for making HTTP requests and regexp for parsing data using regular expressions.
- Static typing – Go is statically typed, meaning variables have a predefined type. This makes code less error-prone and allows better optimization at compile time.
- Easy deployment – Go programs compile to a single binary, making them simple to deploy to a server or run in a container. Go also has a small runtime footprint.
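To make the concurrency point concrete, here is a minimal sketch, using only the standard library, that fetches a handful of pages in parallel with goroutines and a sync.WaitGroup. The URLs are placeholders:

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    // Placeholder URLs; swap in the pages you actually care about.
    urls := []string{
        "https://example.com",
        "https://example.org",
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        // Each fetch runs in its own goroutine, so the requests overlap.
        go func(u string) {
            defer wg.Done()
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println(u, "failed:", err)
                return
            }
            defer resp.Body.Close()
            fmt.Println(u, resp.Status)
        }(url)
    }
    // Block until every goroutine has finished.
    wg.Wait()
}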
With these benefits in mind, let's now turn our attention to Colly, the star of the show.
Introducing Colly
Colly is an open-source web scraping and crawling framework for Go. It provides a clean and expressive API for extracting structured data from websites, with features like automatic cookie and session handling, parallel scraping, and caching.
The core of Colly is the Collector object, which manages the network communication and dispatches the HTTP requests. A Collector maintains a set of request callbacks, which define the scraping logic and are executed whenever the collector visits a URL.
Here are some of the key features of Colly:
- Clean API inspired by jQuery
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Caching and request delays
- Automatic encoding detection
- robots.txt support
Colly makes it easy to get started with web scraping in Go, abstracting away much of the boilerplate code required for common scraping tasks. With its simple and expressive API, you can write concise and maintainable scrapers.
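As a small taste of that API, here is a minimal sketch of the callback lifecycle on a Collector: the callbacks fire as each request is sent, answered, or fails. The URL is only an example:

c := colly.NewCollector()

// Called before each request is sent.
c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
})

// Called after each response is received.
c.OnResponse(func(r *colly.Response) {
    fmt.Println("Received", len(r.Body), "bytes from", r.Request.URL)
})

// Called if a request fails.
c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request failed:", err)
})

c.Visit("https://example.com")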
Let's see how to use Colly to scrape some real-world data.
Building a Basic Web Scraper with Colly
For our first example, we'll build a simple program that scrapes all the links from a Wikipedia article. We'll be using the article on web scraping itself, located at https://en.wikipedia.org/wiki/Web_scraping.
Before writing any code, we need to install Colly. Assuming you have Go set up, simply run:
go get -u github.com/gocolly/colly/...
This will download the latest version of Colly and its dependencies.
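If you're working inside a Go module (the default with current Go releases), you'll generally want a go.mod first; something along these lines should do it, with the module path being just an example:

go mod init example.com/scraper
go get github.com/gocolly/colly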
Next, create a new Go source file and import the required packages:
package main
import (
    "fmt"
    "github.com/gocolly/colly"
)
We import fmt for printing output and the colly package.
Now let's create a Collector instance:
c := colly.NewCollector(
    colly.AllowedDomains("en.wikipedia.org"),
)
Here we initialize a new Collector and specify the allowed domain, "en.wikipedia.org". This restricts the Collector to only visit URLs within this domain.
Next, we'll define a callback to be executed whenever the collector finds an HTML element matching a certain CSS selector:
c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
links := e.ChildAttrs("a", "href")
fmt.Println(links)
})
The OnHTML method registers a callback function that will be called whenever the Collector encounters an element matching the CSS selector string. In this case, we're looking for elements with the class "mw-parser-output", which is the main content div on Wikipedia articles.
Inside the callback, we extract all the link URLs by finding the child elements and retrieving their "href" attributes using ChildAttrs. We then print out the slice of URLs.
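Note that many of the hrefs Wikipedia serves are relative paths (starting with /wiki/) rather than full URLs. If you would rather collect absolute URLs, one hedged variant of the callback resolves each href against the page URL via the element's Request:

c.OnHTML(".mw-parser-output a[href]", func(e *colly.HTMLElement) {
    // AbsoluteURL resolves a relative href against the URL of the page being scraped.
    link := e.Request.AbsoluteURL(e.Attr("href"))
    fmt.Println(link)
})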
Finally, we kick off the scraping process by calling the Visit method:
c.Visit("https://en.wikipedia.org/wiki/Web_scraping")
And that's it! Here's the full code for our basic link scraper:
package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("en.wikipedia.org"),
    )

    c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
        links := e.ChildAttrs("a", "href")
        fmt.Println(links)
    })

    c.Visit("https://en.wikipedia.org/wiki/Web_scraping")
}
Running this program should output a long list of all the URLs found on the Wikipedia page.
Not too complicated, right? With just a few lines of code, we were able to quickly extract some useful data from a web page. However, this is quite a basic example. Let's look at a slightly more complex scraping task.
Scraping Table Data into CSV
For our second example, we‘ll scrape an HTML table and save the extracted data into a CSV file. We‘ll be scraping the example table found on the W3Schools HTML Tables page: https://www.w3schools.com/html/html_tables.asp
The process will be similar to the previous example, with the addition of writing to a CSV file. First, let's create a new Collector:
c := colly.NewCollector()
Next, we'll define a callback to handle the table rows:
c.OnHTML("table#customers", func(e colly.HTMLElement) {
e.ForEach("tr", func(_ int, el colly.HTMLElement) {
writer.Write([]string{
el.ChildText("td:nth-child(1)"),
el.ChildText("td:nth-child(2)"),
el.ChildText("td:nth-child(3)"),
})
})
log.Println("Scraping finished")
})
This looks similar to the previous example, but with a few differences:
- We're using the ForEach method to iterate over each table row (tr element).
- Inside the loop, we select the text of the first, second, and third td elements using the :nth-child(n) CSS selector (a note on the table's header row follows this list).
- We write a slice of these three strings to the CSV writer.
- After the loop, we log a message indicating the scraping is complete.
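About that header row: the example table's first row uses th cells rather than td, so the callback above would write a row of empty strings for it. A small, hedged refinement is to skip any row that has no td cells:

c.OnHTML("table#customers", func(e *colly.HTMLElement) {
    e.ForEach("tr", func(_ int, el *colly.HTMLElement) {
        // Rows without td cells (such as the th header row) produce empty strings; skip them.
        if el.ChildText("td:nth-child(1)") == "" {
            return
        }
        writer.Write([]string{
            el.ChildText("td:nth-child(1)"),
            el.ChildText("td:nth-child(2)"),
            el.ChildText("td:nth-child(3)"),
        })
    })
    log.Println("Scraping finished")
})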
Before visiting the page, we need to set up the CSV writer:
fName := "data.csv"
file, err := os.Create(fName)
if err != nil {
    log.Fatalf("Could not create file %q: %s\n", fName, err)
    return
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
We create a file called "data.csv" and handle any potential errors. We then create a new CSV writer that will write to this file. The defer statements ensure the file is properly closed and the writer is flushed after main() finishes.
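One caveat: csv.Writer buffers its writes, and neither Write nor Flush stops the program on failure, so a disk error would pass silently here. If you want to be stricter, a hedged alternative to the plain defer writer.Flush() checks writer.Error() after flushing:

defer func() {
    writer.Flush()
    // Error reports any error that occurred during a previous Write or Flush.
    if err := writer.Error(); err != nil {
        log.Fatal(err)
    }
}()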
Here's the full code for our table scraper:
package main

import (
    "encoding/csv"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    fName := "data.csv"
    file, err := os.Create(fName)
    if err != nil {
        log.Fatalf("Could not create file %q: %s\n", fName, err)
        return
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    c := colly.NewCollector()

    c.OnHTML("table#customers", func(e *colly.HTMLElement) {
        e.ForEach("tr", func(_ int, el *colly.HTMLElement) {
            writer.Write([]string{
                el.ChildText("td:nth-child(1)"),
                el.ChildText("td:nth-child(2)"),
                el.ChildText("td:nth-child(3)"),
            })
        })
        log.Println("Scraping finished")
    })

    c.Visit("https://www.w3schools.com/html/html_tables.asp")
}
After running this program, you should have a "data.csv" file containing the extracted table data.
Best Practices for Web Scraping
While web scraping opens up many possibilities, it's important to keep in mind some best practices:
- Read the website's terms of service and robots.txt – Make sure you're allowed to scrape the target website and follow any rules they have in place.
- Be gentle and limit concurrent requests – Sending too many requests too quickly can put unnecessary load on the website's servers. Use delays between requests and don't be overly aggressive (a short Colly sketch follows below).
- Cache pages when possible – Save pages locally to avoid having to re-scrape them and reduce the number of requests you make.
- Rotate user agents and IP addresses – Some websites may block scrapers. Using different user agents and rotating IP addresses can help avoid detection.
- Handle errors and unexpected responses gracefully – Websites change over time, so make sure your scrapers can handle errors and adapt as needed.
- Respect websites and don't steal content – Use scraped data responsibly and give credit to the original sources.
By following these guidelines, you can scrape data in an ethical and sustainable way.
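To make the rate-limiting and user-agent points concrete, here is a minimal sketch of how they look in Colly. The delay values and the user-agent string are arbitrary examples, so tune them to the site you're scraping (the snippet assumes the time package is imported):

c := colly.NewCollector(
    // Identify your scraper; some sites block the default Go user agent.
    colly.UserAgent("my-scraper/1.0 (+https://example.com/contact)"),
)

// Cap concurrency and add delays between requests to the same domain.
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
    RandomDelay: 2 * time.Second,
})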
Next Steps
Colly provides many more features and options than we covered in this guide, so be sure to check out the official documentation to learn more: http://go-colly.org/docs/
Some potential next steps and ideas to extend your web scraping skills:
- Explore Colly's other callbacks and settings, such as request delays, caching, and regular expression matching
- Try scraping data from APIs and handling authentication
- Use Go's concurrency features to speed up your scrapers (a short sketch of Colly's async mode follows this list)
- Store scraped data in a database rather than a CSV
- Combine Go with other tools like headless browsers for scraping single-page apps and JavaScript-heavy websites
- Build a full-fledged web scraping pipeline to automate data collection workflows
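For the concurrency point in particular, Colly ships with an async mode that queues visits and processes them in parallel; a rough sketch, reusing the limit rule idea from above, looks like this:

c := colly.NewCollector(
    colly.Async(true), // Visit calls return immediately and requests run concurrently
)

c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4})

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})

for _, url := range []string{"https://example.com", "https://example.org"} {
    c.Visit(url)
}

// Wait blocks until all queued requests have finished.
c.Wait()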
The applications of web scraping are nearly endless, limited only by your creativity and the websites you want to scrape. By leveraging Go and Colly, you now have a powerful toolset to gather the data you need quickly and efficiently.
Conclusion
Web scraping is an invaluable skill for any developer, opening the door to gathering a wealth of data for analysis, research, and application-building. The Go programming language, with its simplicity, speed, and concurrency support, is a fantastic choice for writing scrapers.
The Colly library builds on Go's strengths, providing a clean and productive framework for scraping websites. With its simple API and useful feature set, you can build robust and maintainable scrapers.
In this guide, we covered the basics of web scraping with Go and Colly, demonstrating how to scrape links from a Wikipedia article and extract table data into CSV format. We also touched on some best practices to follow when scraping.
With this foundation, you're now equipped to tackle your own scraping projects. Whether you're a data scientist looking to gather training data, a marketer analyzing competitors, or a hobbyist building the next big app, web scraping in Go is an essential tool to have in your tool belt.
So go out there and start scraping – there's a world of data waiting to be collected and put to use!