Web scraping has become an essential skill for developers looking to gather data from websites quickly and efficiently. Whether you need to collect product information, monitor prices, generate leads, or aggregate news and articles, web scraping allows you to automate the process of extracting data from web pages at scale.
However, web scraping comes with its own set of challenges. Many websites employ anti-bot measures like CAPTCHAs, IP blocking, user agent detection, and dynamically rendered content that can make scraping difficult. Building a robust scraper that can reliably handle these obstacles is time-consuming and requires ongoing maintenance.
This is where ScrapingBee comes in. ScrapingBee is an API that handles all the complexities of web scraping for you. It takes care of proxy rotation, CAPTCHAs, JavaScript rendering, and retry mechanisms, allowing you to focus on working with the data you need.
In this tutorial, we'll walk through how to use ScrapingBee with the Go programming language to easily scrape websites in 2024. By the end, you'll have a working web scraper that you can adapt for your own projects. Let's get started!
Setting up ScrapingBee
To use ScrapingBee, you'll first need to sign up for an account. ScrapingBee offers a free plan that includes 1,000 API credits per month, which is a great way to get started.
Once you've created your account, you'll be given an API key. This key is used to authenticate your requests to the ScrapingBee API. Be sure to keep it secret, as anyone with your key will be able to make requests on your behalf.
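A quick tip before writing any scraping code: rather than hardcoding the key, you can read it from an environment variable. Here's a minimal sketch; the variable name `SCRAPINGBEE_API_KEY` is just a convention for this tutorial, not something ScrapingBee requires:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// SCRAPINGBEE_API_KEY is an arbitrary name chosen for this tutorial,
	// not a name the ScrapingBee API itself requires.
	apiKey := os.Getenv("SCRAPINGBEE_API_KEY")
	if apiKey == "" {
		log.Fatal("SCRAPINGBEE_API_KEY is not set")
	}
	log.Printf("Loaded API key (%d characters)", len(apiKey))
}
```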
Installing Go packages
For this tutorial, we'll use the standard `net/http` package to make requests and, later on, the `encoding/csv` package to save results. Both ship with the standard library, so no installation is required. To parse HTML, we'll also pull in the third-party `goquery` package, which is installed with `go get`, as shown below.
Making an API request
To scrape a web page with ScrapingBee, we'll send a GET request to the API with our API key and target URL as parameters. Here's what that looks like in Go:
```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	apiKey := "YOUR_API_KEY"
	targetURL := "https://example.com"

	// Escape the target URL so any query string it contains survives
	// being passed as a query parameter itself.
	apiURL := fmt.Sprintf("https://app.scrapingbee.com/api/v1?api_key=%s&url=%s",
		apiKey, url.QueryEscape(targetURL))

	req, err := http.NewRequest("GET", apiURL, nil)
	if err != nil {
		log.Fatalf("Failed to create request: %v", err)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalf("Request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("API returned status %d", resp.StatusCode)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("Failed to read response body: %v", err)
	}

	fmt.Println(string(body))
}
```
Be sure to replace `YOUR_API_KEY` with your actual ScrapingBee API key.
This code sends a GET request to the ScrapingBee API with your key and the URL you want to scrape (escaped with `url.QueryEscape` so it survives as a query parameter), then checks the status code before reading the body. The API fetches the page, renders any JavaScript, and returns the HTML content in the response body.
Parsing the HTML
Now that we have the HTML, we need to parse it to extract the data we want. For this, we can use the low-level `golang.org/x/net/html` parser or an external library like `goquery`, which builds on it.

Here's an example using `goquery` to extract all the headlines from a news website:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	apiKey := "YOUR_API_KEY"
	targetURL := "https://news.ycombinator.com"

	apiURL := fmt.Sprintf("https://app.scrapingbee.com/api/v1?api_key=%s&url=%s",
		apiKey, url.QueryEscape(targetURL))

	req, err := http.NewRequest("GET", apiURL, nil)
	if err != nil {
		log.Fatalf("Failed to create request: %v", err)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalf("Request failed: %v", err)
	}
	defer resp.Body.Close()

	// Parse the HTML directly from the response body stream.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatalf("Failed to parse HTML: %v", err)
	}

	// Each Hacker News headline is a link inside a span with class "titleline".
	doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
		fmt.Println(s.Text())
	})
}
```
This code uses `goquery` to parse the HTML returned by ScrapingBee. It then finds the headline links on Hacker News, which at the time of writing are anchors inside spans with the class `titleline`, and prints each one.

You can modify the selector to extract other elements from the page, like product names, prices, and URLs. `goquery` provides a familiar jQuery-like syntax for navigating the HTML document; see the goquery docs for more details.
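For instance, headlines are more useful with their URLs attached. Swapping the loop in the program above for the following (everything else unchanged) prints each headline together with its link; note that Hacker News's markup, and therefore the selector, may change over time:

```go
// Print each headline along with the URL it points to.
doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
	href, ok := s.Attr("href")
	if !ok {
		return // skip anchors without an href attribute
	}
	fmt.Printf("%s -> %s\n", s.Text(), href)
})
```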
Saving the data
Once you've extracted the data you need, you'll likely want to save it for further processing and analysis. You have many options depending on your needs:
- Write to a CSV, JSON, or other structured file format
- Save to a SQL or NoSQL database
- Index in a search engine like Elasticsearch
- Queue for further processing (pubsub, task queue, message bus)
- Expose via an API or website
The specifics will depend on your particular use case and architecture. As an example, here's how you might append the scraped headlines to a CSV file:
```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	apiKey := "YOUR_API_KEY"
	targetURL := "https://news.ycombinator.com"

	apiURL := fmt.Sprintf("https://app.scrapingbee.com/api/v1?api_key=%s&url=%s",
		apiKey, url.QueryEscape(targetURL))

	req, err := http.NewRequest("GET", apiURL, nil)
	if err != nil {
		log.Fatalf("Failed to create request: %v", err)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalf("Request failed: %v", err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatalf("Failed to parse HTML: %v", err)
	}

	// Open the CSV file in append mode, creating it if it doesn't exist.
	file, err := os.OpenFile("headlines.csv", os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0644)
	if err != nil {
		log.Fatalf("Failed to open csv file: %v", err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	// Write each headline as its own row.
	doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
		if err := writer.Write([]string{s.Text()}); err != nil {
			log.Printf("Failed to write headline to csv: %v", err)
		}
	})
}
```
This opens (or creates) a file called `headlines.csv`, then appends each headline as a new row using the `encoding/csv` package.
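If you'd rather have JSON, the change is small. Here's a sketch of the same loop rewritten to collect the headlines into a slice and write them with `encoding/json`; it replaces the CSV code in the program above (add `encoding/json` to the imports, drop `encoding/csv`), and the file name `headlines.json` is arbitrary:

```go
// Collect the headlines into a slice, then marshal the slice to JSON.
var headlines []string
doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
	headlines = append(headlines, s.Text())
})

data, err := json.MarshalIndent(headlines, "", "  ")
if err != nil {
	log.Fatalf("Failed to marshal headlines: %v", err)
}
if err := os.WriteFile("headlines.json", data, 0644); err != nil {
	log.Fatalf("Failed to write json file: %v", err)
}
```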
Additional ScrapingBee features
Beyond basic HTML fetching, ScrapingBee provides some powerful features to make scraping even easier:
- JavaScript rendering: Many modern websites rely heavily on JavaScript to load content dynamically. ScrapingBee can automatically render JavaScript before returning the HTML, so you don't need to set up a headless browser yourself.
- Rotating proxies: ScrapingBee maintains a large pool of proxies and automatically rotates them with each request. This helps avoid rate limiting and IP bans.
- CAPTCHA solving: If a website gates its content behind a CAPTCHA, ScrapingBee can automatically solve it for you, saving time and effort.
- Customizable headers: You can set custom headers like the user agent and cookies to make your requests look more like a regular browser's.
See the ScrapingBee docs for how to use these features in your API requests.
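These features are enabled through extra query parameters on the same endpoint. As a sketch, here's how you might build the request URL with `url.Values` instead of `fmt.Sprintf` in the earlier examples; the parameter names `render_js` and `premium_proxy` come from the ScrapingBee docs at the time of writing, so double-check them there before relying on this:

```go
// Build the query string with url.Values so every value is escaped correctly.
// Assumes apiKey and targetURL are defined as in the earlier examples.
params := url.Values{}
params.Set("api_key", apiKey)
params.Set("url", targetURL)
params.Set("render_js", "true")     // render JavaScript before returning HTML
params.Set("premium_proxy", "true") // route the request through premium proxies

apiURL := "https://app.scrapingbee.com/api/v1?" + params.Encode()
```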
Ideas for web scraping projects
Now that you know how to scrape websites with ScrapingBee and Go, what can you build? Here are a few ideas to get you started:
- Price monitoring tool to track prices of products across multiple retailers
- News aggregator to collect articles on specific topics from various sources
- Job board scraper to compile job listings from different websites
- Social media scraper to gather posts and user data for analysis
- Financial data aggregator to collect stock prices, company filings, and market news
- Real estate listing scraper to compare property details across sites
Conclusion
Web scraping is a powerful technique for gathering data from across the internet, and ScrapingBee makes it easy by handling many of the technical challenges for you. By using ScrapingBee with Go, you can quickly build robust and scalable scrapers to collect the data you need.
In this tutorial, we covered how to:
- Set up a ScrapingBee account and get your API key
- Make a request to the ScrapingBee API to fetch a web page
- Parse the HTML response using goquery
- Extract the desired data and save it to a CSV file
We also discussed some of ScrapingBee's more advanced features like JavaScript rendering, proxy rotation, and CAPTCHA solving, which allow you to scrape even the most challenging websites.
Armed with this knowledge, you're ready to start building your own web scraping projects. The possibilities are endless, from price monitoring to social media analysis to financial modeling. So find a website you want to scrape, and start collecting that data!
Of course, be sure to use your new scraping powers responsibly and respect website terms of service. And if you have any questions or issues, the ScrapingBee docs and support team are there to help.
Happy scraping!