Making Concurrent Requests in Go for High-Performance Web Scraping

Go is an excellent language for web scraping thanks to its built-in concurrency features. With Go, you can easily make concurrent HTTP requests to multiple web pages simultaneously, dramatically speeding up your web scraping tasks compared to making requests sequentially. In this guide, we'll explore how to leverage Go's concurrency primitives to build a fast and efficient concurrent web scraper.

Why Concurrent Requests Are a Game-Changer

In web scraping, you often need to fetch many web pages from one or more websites. If you request each page one after the other, waiting for each response before initiating the next request, the scraping process can be painfully slow. This sequential approach wastes the time spent waiting on network I/O: while one request is in flight, nothing else gets done.

The solution is to make requests concurrently. While waiting on the response to one request, you can initiate additional requests in parallel. This lets your scraper make fuller use of available resources and dramatically reduces overall runtime. Because most of each request's wall-clock time is spent waiting on the network, making just 10 concurrent requests can approach a 10x speedup over sequential requests.

Go makes concurrent requests incredibly easy thanks to goroutines and channels. Let's see how it works.

Goroutines: Lightweight Threads for Concurrency

The key to concurrency in Go is the goroutine. A goroutine is like a lightweight thread managed by the Go runtime. Goroutines enable you to write concurrent code that efficiently utilizes system resources.

To start a new goroutine, you simply use the go keyword before a function or method call. This tells Go to run that function in a new goroutine. The goroutine will execute concurrently with other goroutines in the same program.

Here's a simple example:

package main

import "fmt"

func main() {
    go fmt.Println("I'm in a goroutine!")
    fmt.Println("I'm in the main goroutine")
}

When you run this, you might see the outputs in either order or only see the main output. That's because the main goroutine may exit before the new goroutine has a chance to finish. To ensure the main goroutine waits for other goroutines to complete, we can use a WaitGroup (more on that later).
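
As a quick preview of that fix, here's a minimal sketch using sync.WaitGroup; the mechanics are explained in the next section:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(1) // one goroutine to wait for
    go func() {
        defer wg.Done() // mark this goroutine as finished
        fmt.Println("I'm in a goroutine!")
    }()
    fmt.Println("I'm in the main goroutine")
    wg.Wait() // block until the goroutine calls Done
}

With the Wait call in place, both lines are printed on every run.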

Making Concurrent HTTP Requests

Armed with goroutines, we can now look at how to make concurrent HTTP requests in Go. We'll extend the basic http.Get example to make multiple concurrent requests.

Here‘s the code:

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
)

func fetch(url string, wg *sync.WaitGroup) {
    defer wg.Done() // signal completion even on early returns
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("%s: %d bytes\n", url, len(body))
}

func main() {
    urls := []string{
        "https://example.com",
        "https://golang.org",
        "https://gobyexample.com",
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go fetch(url, &wg)
    }
    wg.Wait()
}

This makes concurrent requests to fetch the contents of three URLs. The fetch function performs the actual HTTP request. In main, we start a new goroutine for each URL using go fetch(url, &wg).

We use a WaitGroup to wait for all goroutines to finish before the program exits. wg.Add(1) increments the WaitGroup counter, wg.Done() decrements it, and wg.Wait() blocks until the counter reaches zero, signalling all goroutines have completed.

When you run this, you'll see output like:

https://golang.org: 8157 bytes
https://gobyexample.com: 6138 bytes 
https://example.com: 1256 bytes

The order of outputs may vary between runs, reflecting the non-deterministic nature of concurrency.

Scaling Up with More Goroutines

The beauty of goroutines is you can start thousands of them with minimal overhead. The Go runtime multiplexes goroutines onto a smaller number of actual operating system threads. This means you can have a large number of concurrent requests without creating a new thread for each one.

Let's modify our example to make 100 concurrent requests:

func main() {
    urls := make([]string, 100) 
    for i := range urls {
        urls[i] = fmt.Sprintf("https://example.com/page%d", i)
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go fetch(url, &wg)
    }
    wg.Wait()
}

Here we generate 100 URLs to fetch and start a goroutine for each one. The program will make all 100 requests concurrently, wait for all of them to complete, and then exit. Making this many concurrent requests can speed up the scraping process by orders of magnitude compared to sequential requests.
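
If you want to measure the speedup on your own machine, here's a minimal timing sketch; it assumes the fetch function from the earlier example is in the same file, and the URLs are the same illustrative example.com pages:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    urls := make([]string, 100)
    for i := range urls {
        urls[i] = fmt.Sprintf("https://example.com/page%d", i)
    }

    start := time.Now() // record when the batch begins

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go fetch(url, &wg)
    }
    wg.Wait()

    // Report total wall-clock time for all 100 requests.
    fmt.Printf("fetched %d pages in %v\n", len(urls), time.Since(start))
}

Running the same loop without the go keyword gives you the sequential baseline to compare against.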

Best Practices for Concurrent Requests

While concurrent requests are a powerful tool, there are some best practices to keep in mind to use them safely and efficiently (a sketch combining several of them follows this list):

  1. Rate Limiting: Many servers will rate limit you or even block your IP if you make too many requests too quickly. Be a good citizen by adding delays between requests or limiting the maximum number of concurrent requests.

  2. Error Handling: With concurrent requests, multiple errors can happen simultaneously. Make sure to handle errors from each goroutine. You can send errors on a shared error channel to handle them from the main goroutine.

  3. Timeouts: Set timeouts on your HTTP client to prevent requests from hanging forever, which can exhaust resources. You can use the context package to set timeouts.

  4. Reusing HTTP Clients: Reuse HTTP clients between requests to take advantage of persistent connections. This can significantly improve performance by reducing the overhead of establishing new connections.

  5. Using a Worker Pool: For very large scraping jobs, you may want to limit the maximum number of concurrent goroutines to prevent exhausting system resources. You can implement a worker pool using a buffered channel that holds tasks and a fixed number of worker goroutines that pull tasks off the channel.
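
Here's a minimal sketch that combines several of these practices: a fixed-size worker pool pulling URLs from a channel, a single shared http.Client with a timeout, and an error channel drained by the main goroutine. The pool size and timeout values are illustrative assumptions, not tuned recommendations:

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

func main() {
    urls := []string{
        "https://example.com",
        "https://golang.org",
        "https://gobyexample.com",
    }

    // Reuse one client across all requests so persistent connections
    // are shared; the timeout stops any single request from hanging
    // forever (10s is an illustrative value).
    client := &http.Client{Timeout: 10 * time.Second}

    jobs := make(chan string)           // URLs waiting to be fetched
    errs := make(chan error, len(urls)) // buffered so workers never block on errors

    const numWorkers = 5 // pool size caps concurrency (illustrative)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range jobs {
                resp, err := client.Get(url)
                if err != nil {
                    errs <- err
                    continue
                }
                body, err := io.ReadAll(resp.Body)
                resp.Body.Close()
                if err != nil {
                    errs <- err
                    continue
                }
                fmt.Printf("%s: %d bytes\n", url, len(body))
            }
        }()
    }

    for _, url := range urls {
        jobs <- url
    }
    close(jobs) // no more work; workers exit their range loops
    wg.Wait()

    // All workers are done, so it is safe to close and drain errs.
    close(errs)
    for err := range errs {
        fmt.Println("error:", err)
    }
}

The pool size doubles as a crude rate limiter; for stricter per-host politeness you could also sleep between jobs or use a token-bucket limiter such as golang.org/x/time/rate.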

Comparison to Other Concurrency Approaches

Making concurrent HTTP requests in Go with goroutines is not the only approach to speeding up web scraping. Let's briefly compare it to two other common approaches:

  1. Sequential Requests: This is the simplest approach, where you make requests one after the other. It's easy to implement, but can be very slow for a large number of pages. Sequential requests don't utilize available resources efficiently, as the CPU sits idle waiting for I/O most of the time.

  2. Thread Pools: Many languages use thread pools to manage a group of pre-allocated threads for concurrent tasks. While this allows for concurrency, the cost of creating and managing threads is much higher than that of goroutines in Go. Thread pools also typically require more configuration and tuning to achieve ideal performance.

Goroutines hit a sweet spot between the simplicity of sequential code and the performance of concurrency, without the overhead of manual thread management. The lightweight nature of goroutines makes them ideal for the kind of I/O-bound workload that web scraping entails.

Conclusion

Concurrent requests are an essential tool for building high-performance web scrapers in Go. By leveraging goroutines to make requests concurrently, you can dramatically speed up your scraping pipeline with minimal complexity.

Remember to be a good citizen by rate limiting your requests and handling errors and timeouts gracefully. With these best practices in mind, go forth and build the fastest scrapers the web has ever seen! Happy scraping!
