If you've ever tried to scrape a modern website, you've likely run into challenges dealing with infinite scroll. More and more sites are using this technique to load content dynamically as the user scrolls down the page. For web scrapers, this can be problematic since the full page HTML is not loaded upfront. Luckily, there are ways to handle this in your Go web scraping projects. In this guide, we'll take an in-depth look at scraping infinite scroll pages in Go.
What is Infinite Scroll?
Infinite scroll (also known as endless scroll or continuous scroll) is a web design technique that loads content continuously as the user scrolls down the page, eliminating the need for pagination. It allows users to view a large amount of content without having to navigate through pages. Social media feeds, search results, and e-commerce product listings commonly use this pattern.
From a technical perspective, infinite scroll is usually implemented with JavaScript. When the user scrolls near the bottom of the page, an event is triggered that fetches more data from the server via an API. The new data is appended to the bottom of the page, allowing the user to continuously scroll.
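To make this concrete, here is a rough sketch in Go of what each "scroll" boils down to behind the scenes. The endpoint URL and response shape are hypothetical, invented purely for illustration; every site structures its API differently:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Hypothetical response shape; real APIs vary widely.
type page struct {
	Results []struct {
		Title string `json:"title"`
	} `json:"results"`
}

func main() {
	// Each "scroll" typically triggers a request like this one,
	// with an incrementing page or offset parameter.
	resp, err := http.Get("https://example.com/api/items?page=2")
	if err != nil {
		fmt.Println("Request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var p page
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
		fmt.Println("Failed to decode JSON:", err)
		os.Exit(1)
	}
	fmt.Println("loaded", len(p.Results), "more results")
}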
Why Infinite Scroll is Challenging for Web Scraping
Since infinite scroll pages load data asynchronously with JavaScript, the full HTML of the page is not available when you initially load the page URL. A basic HTTP request fetches only the initial page HTML, which often contains just a small subset of the data you want to scrape.
To effectively scrape infinite scroll pages, you need a way to interact with the page like a real user would – scrolling the page and waiting for new data to load. Tools like Puppeteer and Selenium can help automate this, but configuring them adds complexity to your scraping projects.
Luckily, there's an easier way! We can use ScrapingBee's powerful js_scenario feature to handle the scrolling interaction for us, letting us focus on parsing the data we need.
Scraping an Infinite Scroll Page with ScrapingBee
To demonstrate scraping an infinite scroll page in Go, we'll use this demo page: https://demo.scrapingbee.com/infinite_scroll.html
This page simulates a common infinite scroll implementation. It initially loads 9 result boxes, then loads 9 more each time you scroll to the bottom of the page. Our goal is to scrape all of the results from this page.
Setting Up
First, make sure you have a ScrapingBee API key (you can get a free API key by signing up on the ScrapingBee website). We'll also need the following Go packages:
import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)
Making the Initial Request
Let's first see what happens if we try to scrape the page without handling the infinite scroll:
package main
import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)
func getRequest(apiKey string, pageURL string) (*http.Response, error) {
	// Encode the target URL so it can be passed as a query parameter
	encodedURL := url.QueryEscape(pageURL)
	// Build the ScrapingBee API request URL
	requestURL := fmt.Sprintf("https://app.scrapingbee.com/api/v1/?api_key=%s&url=%s", apiKey, encodedURL)
	// Make the request
	response, err := http.Get(requestURL)
	if err != nil {
		return nil, err
	}
	return response, nil
}
func saveHTML(filename string, html string) error {
	// Write the HTML to a file, creating it if necessary
	return os.WriteFile(filename, []byte(html), 0644)
}
func main() {
	apiKey := "YOUR_API_KEY"
	pageURL := "https://demo.scrapingbee.com/infinite_scroll.html"
	// Get the page HTML
	response, err := getRequest(apiKey, pageURL)
	if err != nil {
		fmt.Println("Request failed:", err)
		os.Exit(1)
	}
	defer response.Body.Close()
	// Read the response body
	body, err := io.ReadAll(response.Body)
	if err != nil {
		fmt.Println("Failed to read body:", err)
		os.Exit(1)
	}
	// Save the HTML to a file
	err = saveHTML("no_scroll.html", string(body))
	if err != nil {
		fmt.Println("Failed to save HTML:", err)
		os.Exit(1)
	}
}
This sends a GET request to the ScrapingBee API, passing in the URL of the demo infinite scroll page. It then saves the returned HTML to a file named no_scroll.html.
If you open this HTML file in a browser, you'll see that it only contains the first 9 results that are initially loaded. To get the remaining results, we'll need to scroll the page.
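If you'd rather verify this than count boxes by eye, you can parse the saved HTML and count the result elements. Here's a quick sketch using the goquery library (covered in the parsing section below), assuming the boxes carry the .result class used on the demo page:

// Count the result boxes in the fetched HTML.
// Requires the strings and github.com/PuerkitoBio/goquery imports.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
if err != nil {
	fmt.Println("Failed to parse HTML:", err)
	os.Exit(1)
}
fmt.Println("result boxes found:", doc.Find(".result").Length()) // prints 9 at this point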
Scrolling with js_scenario
ScrapingBee provides a js_scenario feature that allows you to interact with pages that require JavaScript rendering. With js_scenario, we can simulate scrolling the page multiple times.
Here's how to update the previous code to use js_scenario:
func getRequest(apiKey string, pageURL string) (*http.Response, error) {
	// Encode the target URL
	encodedURL := url.QueryEscape(pageURL)
	// Define the scroll scenario: scroll, wait for new results, repeat
	scenario := `{
		"instructions": [
			{"scroll_y": 1080},
			{"wait": 500},
			{"scroll_y": 1080},
			{"wait": 500}
		]
	}`
	encodedScenario := url.QueryEscape(scenario)
	// Build the request URL, now including the js_scenario parameter
	requestURL := fmt.Sprintf("https://app.scrapingbee.com/api/v1/?api_key=%s&url=%s&js_scenario=%s",
		apiKey, encodedURL, encodedScenario)
	// Make the request
	response, err := http.Get(requestURL)
	if err != nil {
		return nil, err
	}
	return response, nil
}
func main() {
	// ...
	// Get the page HTML with js_scenario
	response, err := getRequest(apiKey, pageURL)
	// ...
	// Save the HTML to a file
	err = saveHTML("with_scroll.html", string(body))
	// ...
}
Here's what the js_scenario is doing:
- Scroll by 1080 pixels
- Wait 500 ms
- Scroll another 1080 pixels
- Wait 500 ms
After making these changes, run the code again. Now if you open with_scroll.html, you should see 18 result boxes – a big improvement!
The exact scrolling parameters you use will depend on the specific page you're scraping. You may need to scroll more or less distance, wait longer between scrolls, or repeat the scroll more times. It's a good idea to experiment with different values to see what works best.
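If you find yourself experimenting with these values a lot, it can be more convenient to generate the scenario than to hand-edit the JSON string. Here is a minimal sketch of a hypothetical helper (not part of the ScrapingBee API, just local convenience code) that builds a scenario with any number of scroll/wait rounds using the standard encoding/json package:

// buildScrollScenario returns a js_scenario JSON string with n
// scroll/wait rounds, making the values easy to tweak in one place.
func buildScrollScenario(n, scrollY, waitMs int) (string, error) {
	instructions := make([]map[string]int, 0, 2*n)
	for i := 0; i < n; i++ {
		instructions = append(instructions,
			map[string]int{"scroll_y": scrollY},
			map[string]int{"wait": waitMs},
		)
	}
	scenario, err := json.Marshal(map[string]interface{}{"instructions": instructions})
	if err != nil {
		return "", err
	}
	return string(scenario), nil
}

Calling buildScrollScenario(2, 1080, 500) reproduces the scenario above; the returned string then goes through url.QueryEscape exactly as before.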
Parsing the Data
Now that we have the full page HTML, we can parse out the data we need. For this example, let's say we want to extract the title and description from each result box.
To parse HTML in Go, we can use the goquery library. Here's how to select the title and description elements:
import "github.com/PuerkitoBio/goquery"
// ...
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
if err != nil {
fmt.Println("Failed to parse HTML:", err)
os.Exit(1)
}
doc.Find(".result").Each(func(i int, s *goquery.Selection) {
title := s.Find(".title").Text()
description := s.Find(".description").Text()
fmt.Printf("Result %d:\n", i+1)
fmt.Println("Title:", title)
fmt.Println("Description:", description)
fmt.Println()
})
This uses a CSS selector to find all elements with the result class, then selects the child elements with title and description classes. Run the updated code and you should see the titles and descriptions printed out for all 18 results.
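Printing to the console is fine for a demo, but in real projects you will usually want the data in a structured form. Here is one possible variation, assuming the same .result, .title, and .description classes, that collects the results into a slice of structs and serializes them with encoding/json:

// Result holds one scraped box in structured form.
type Result struct {
	Title       string `json:"title"`
	Description string `json:"description"`
}

var results []Result
doc.Find(".result").Each(func(i int, s *goquery.Selection) {
	results = append(results, Result{
		Title:       strings.TrimSpace(s.Find(".title").Text()),
		Description: strings.TrimSpace(s.Find(".description").Text()),
	})
})

// Serialize to JSON, ready to write to a file or database.
data, err := json.MarshalIndent(results, "", "  ")
if err != nil {
	fmt.Println("Failed to encode results:", err)
	os.Exit(1)
}
fmt.Println(string(data))

Structured records like these also set you up nicely for the next section, since they are easy to write out incrementally.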
Scaling Up
So far we've scraped a single page, but what if you need to scrape many pages? Here are a few tips:
- Use goroutines to make multiple requests concurrently. This can speed up your scraping significantly. Just be careful not to overload the target server (see the sketch after this list).
- Store results in a database or write them to a file as you go, rather than keeping everything in memory. This will help prevent memory issues when scraping large amounts of data.
- Be mindful of rate limiting. Many websites will block you if you make too many requests too quickly. Use delays between requests and consider proxying your traffic through ScrapingBee to avoid issues.
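To make the first and third tips concrete, here is a minimal sketch of one common pattern: a buffered channel caps how many requests run at once, and a ticker spaces them out. The worker body is a placeholder; in practice you would call getRequest from earlier and parse each response:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	urls := []string{
		"https://demo.scrapingbee.com/infinite_scroll.html",
		// ... more page URLs
	}

	sem := make(chan struct{}, 3)                    // at most 3 requests in flight
	ticker := time.NewTicker(500 * time.Millisecond) // at most one new request per 500 ms
	defer ticker.Stop()

	var wg sync.WaitGroup
	for _, pageURL := range urls {
		<-ticker.C        // rate limit: wait for the next tick
		sem <- struct{}{} // concurrency limit: acquire a slot
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// Placeholder: call getRequest(apiKey, u) and parse the response here.
			fmt.Println("scraping:", u)
		}(pageURL)
	}
	wg.Wait()
}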
Conclusion
Infinite scroll pages can seem daunting at first, but with the right tools they're actually quite manageable. In this guide, we walked through the process of analyzing an infinite scroll page, using ScrapingBee's js_scenario feature to get the full HTML, and parsing out the data we need in Go.
The same general approach can be used with other languages and libraries as well. For example, in Python you might use the requests library to make the API call and BeautifulSoup to parse the HTML.
With a bit of practice, you'll be able to handle any infinite scroll page that comes your way. Happy scraping!