
How to Scrape Data from the Web Using Go and Colly

Web scraping is the process of programmatically extracting data from websites. It allows you to gather information from across the internet and use it for analysis, archiving, app building, machine learning, or any number of other purposes.

While there are many languages and tools that can be used for web scraping, in this article we'll focus on how to scrape data in Go using the fantastic Colly framework. Go's simple syntax, strong typing, and built-in concurrency features make it an excellent choice for writing scrapers.

We'll walk through a practical example of using Go and Colly to extract data from a Hacker News comment thread. By the end of this tutorial, you'll know how to:

  • Install and configure Colly
  • Navigate through websites and follow links
  • Extract text, attributes, and HTML from elements on a page
  • Deal with pagination
  • Output the scraped data

Let's dive in!

Setting Up Your Go Environment

First, make sure you have a recent version of Go installed (1.13+). You can check your current version with:

$ go version

Create a new directory for your project and initialize a Go module in it:

$ mkdir hackernews-scraper
$ cd hackernews-scraper 
$ go mod init github.com/yourusername/hackernews-scraper

Next, install the Colly package:

$ go get -u github.com/gocolly/colly/...

Now create a file named main.go – this is where we'll write the code for our scraper.

Introducing Colly – A Powerful Scraping Framework for Go

Colly provides a clean and expressive API for writing web scrapers in Go. It handles many low-level details like managing concurrent requests, handling cookies and sessions, following redirects, respecting robots.txt, etc.

The core concept in Colly is the "Collector", which represents a scraping job. You create a Collector, configure it with settings and callbacks, then start the job by providing one or more URLs to visit.
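
For example, a bare-bones Collector that just prints each page title might look something like this (a minimal sketch – example.com is only a placeholder URL):

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // Create a Collector restricted to a single domain
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    // Run a callback for every <title> element Colly finds
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page title:", e.Text)
    })

    // Log each request as it goes out
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    // Start the scraping job
    if err := c.Visit("https://example.com/"); err != nil {
        log.Fatal(err)
    }
}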

As the Collector crawls pages, it executes callbacks in response to certain events – for example, you can register callbacks to run when Colly:

  • Visits a new URL
  • Finds a particular HTML element
  • Receives an HTTP response
  • Encounters an error

This event-driven architecture makes it easy to extract just the data you need without getting lost in parsing logic. Let's see it in action!

Extracting Data from Hacker News with Go and Colly

As an example, let's write a scraper that extracts the top-level comments from a Hacker News post and prints them to the console. We'll scrape this post: https://news.ycombinator.com/item?id=12345

Here's the complete code for our scraper:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/gocolly/colly"
)

// comment holds a single top-level comment. The selector/attr struct tags
// tell Colly's Unmarshal which element and attribute each field comes from.
type comment struct {
    Author  string `selector:"a.hnuser"`
    URL     string `selector:".age a[href]" attr:"href"`
    Comment string `selector:".comment"`
}

// post holds the post's title and URL plus all of its scraped comments.
type post struct {
    Title    string `selector:"a.storylink"`
    URL      string `selector:"a.storylink" attr:"href"`
    Comments []comment
}

func main() {
    // Create a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("news.ycombinator.com"),
    )

    p := post{}
    // Extract post title and URL
    c.OnHTML(".fatitem", func(e *colly.HTMLElement) {
        e.Unmarshal(&p)
    })

    // Extract comments
    c.OnHTML(".comment-tree tr.athing", func(e *colly.HTMLElement) {
        comment := comment{}
        e.Unmarshal(&comment)

        // Remove extra spacing from comments
        comment.Comment = strings.TrimSpace(comment.Comment)
        // Build absolute URL from relative URL
        comment.URL = e.Request.AbsoluteURL(comment.URL)  

        p.Comments = append(p.Comments, comment)
    })

    // Set max depth to 1 so we only visit the post and its comments
    c.MaxDepth = 1 

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnError(func(r *colly.Response, e error) {
        log.Printf("Error scraping %s: %s", r.Request.URL, e)
    })

    c.OnScraped(func(r *colly.Response) {
        // Dump results as JSON
        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(p)
    })

    c.Visit("https://news.ycombinator.com/item?id=12345")
}

Let's break it down:

  1. We define two struct types, post and comment, to store the scraped data. Notice the "selector" and "attr" tags – these tell Colly how to locate the data in the DOM.

  2. In main, we create a new Collector, specifying that it's only allowed to visit pages on news.ycombinator.com.

  3. We register two OnHTML callbacks:

    • The first one looks for the .fatitem element, which contains the post title and URL, and uses Unmarshal to extract them into our post struct.
    • The second callback looks for .comment-tree elements and extracts the author, relative URL, and text of each comment, cleaning them up a bit and appending them to the Comments slice of our post struct.
  4. We register a few other callbacks:

    • OnRequest logs each URL we visit
    • OnError logs any errors that occur during scraping
    • OnScraped runs after the entire scrape job is done and dumps our scraped data as formatted JSON
  5. Finally, we call c.Visit with the URL of the post we want to scrape, which kicks off the actual scraping job. Colly will follow links and invoke our callbacks as it crawls.

Run the code with:

$ go run main.go

You should see output like:

Visiting https://news.ycombinator.com/item?id=12345
{
  "Title": "Example HN Post", 
  "URL":  "http://example.com",
  "Comments": [
    {
      "Author": "user1", 
      "Comment": "First comment!",
      "URL": "https://news.ycombinator.com/item?id=12346"
    },
    ...
  ]
}

Tada! With just a few dozen lines of code, we've extracted structured data from a web page. Of course, there's a lot more you can do – this just scratches the surface of Colly's API. Be sure to check out the Colly docs and examples to learn about features like the ones below (a short configuration sketch follows the list):

  • Automatic cookie and session handling
  • Parallel scraping with goroutines
  • Caching responses
  • Setting request headers, proxies, and timeouts
  • Restricting crawl rate and depth
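
For a taste of how a few of these options fit together, here is a rough configuration sketch – the cache directory, limits, timeout, and User-Agent values below are arbitrary examples, not recommendations:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("news.ycombinator.com"),
        colly.Async(true),               // issue requests concurrently in goroutines
        colly.CacheDir("./colly-cache"), // cache responses on disk
        colly.MaxDepth(2),               // don't follow links more than 2 hops deep
    )

    // Throttle requests to the domain: at most 2 in parallel, with a random delay
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*news.ycombinator.*",
        Parallelism: 2,
        RandomDelay: 2 * time.Second,
    })

    // Set a request timeout and a custom User-Agent header
    c.SetRequestTimeout(30 * time.Second)
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "my-scraper/1.0")
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://news.ycombinator.com/")
    c.Wait() // with Async(true), Visit only queues requests; Wait blocks until they finish
}

Note that with colly.Async(true), the final c.Wait() is what actually blocks until all queued requests have completed.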

Beyond Basic Scraping – Challenges at Scale

Although Colly and Go make it simple to get started with web scraping, things get trickier when you try to scrape large amounts of data from major websites. You'll quickly run into anti-bot countermeasures like:

  • Rate limiting and CAPTCHAs
  • IP bans when crawling too fast or too much from one IP
  • Dynamic, JavaScript-rendered pages that are difficult to scrape
  • Frequent changes to a site's structure that break your scrapers

To get around these, you'll need to distribute your scrapers across many IPs, solve CAPTCHAs automatically, rotate proxies, render JavaScript, and constantly monitor and update your code. It's doable, but it requires significant time and infrastructure to get right.
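
Colly itself ships a proxy helper that covers the IP-rotation part. Here's a minimal sketch – the proxy addresses below are placeholders you'd swap for your own endpoints:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func main() {
    c := colly.NewCollector()

    // Rotate outgoing requests across a pool of proxies (placeholder addresses)
    rp, err := proxy.RoundRobinProxySwitcher(
        "socks5://127.0.0.1:1337",
        "socks5://127.0.0.1:1338",
        "http://127.0.0.1:8080",
    )
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(rp)

    // Log each response as it comes back
    c.OnResponse(func(r *colly.Response) {
        log.Println("Fetched", r.Request.URL)
    })

    c.Visit("https://news.ycombinator.com/")
}

This only tackles IP rotation; JavaScript rendering and CAPTCHA solving still require a headless browser or an external service.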

If you'd rather focus on actually using the data you scrape instead of fighting anti-bot measures, check out ScrapingBee. It's an API that handles all of these headaches for you – just send it a URL and get back structured JSON data, without worrying about proxies, CAPTCHAs, or IP bans. It will even render JavaScript pages for you. The first 1000 API calls are free, so give it a try for your next scraping project!

Conclusion

Web scraping is an incredibly useful skill to have in your developer toolbelt, and the combination of Go and Colly makes it accessible and enjoyable. In this article, we've walked through a complete example of scraping a Hacker News comment thread – hopefully this real-world use case has given you a taste of what's possible and inspired you to try scraping yourself!

Remember, always be respectful when scraping and follow websites' robots.txt and terms of service. Now get out there and liberate some data!
