Web scraping is the process of programmatically retrieving and parsing data from websites. While many sites offer APIs for accessing their data in a structured format, not all do. And sometimes you may need to extract information that isn't available through an API. In these cases, web scraping can be a powerful tool to gather the data you need.
Rust is a systems programming language that is well-suited for web scraping tasks. Its high performance, safety guarantees, and rich ecosystem of libraries make it a compelling choice. Two of the most popular and full-featured Rust libraries for web scraping are Reqwest for making HTTP requests and Scraper for parsing HTML.
In this guide, we'll walk through how to use Reqwest and Scraper to scrape data from websites in Rust. We'll start with the basics of making requests and extracting data, then explore more advanced techniques and best practices. Whether you're new to Rust or web scraping, this guide will equip you with the knowledge you need to scrape effectively.
Setting Up a Rust Web Scraping Project
First, make sure you have Rust installed by following the official installation guide.
Next, create a new Rust project:
cargo new web-scraper
cd web-scraper
Add the Reqwest and Scraper dependencies to your Cargo.toml file:
[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12.0"
We enable the "blocking" feature of Reqwest which allows us to write straightforward synchronous code. For more advanced use cases, Reqwest also supports asynchronous requests.
Making HTTP Requests with Reqwest
The foundation of web scraping is making HTTP requests to fetch the HTML content of web pages. Reqwest makes this easy in Rust.
To fetch the HTML of a page, we can use a simple GET request:
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Fetch the page and read the response body as a String.
    let response = reqwest::blocking::get("https://www.wikipedia.org")?
        .text()?;
    println!("Response: {}", response);
    Ok(())
}
This sends a GET request to wikipedia.org, retrieves the response body as text, and prints it out. The ? operator is used for error propagation: if the request fails for some reason, the error is returned from main.
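As noted earlier, Reqwest also supports asynchronous requests. For reference, the async equivalent of the example above looks like this (a minimal sketch; it assumes you add tokio = { version = "1", features = ["macros", "rt-multi-thread"] } to your dependencies for the async runtime):

use std::error::Error;

// Assumes the tokio crate is added to Cargo.toml as described above.
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Same request as before, but each step is awaited instead of blocking.
    let response = reqwest::get("https://www.wikipedia.org").await?.text().await?;
    println!("Response: {}", response);
    Ok(())
}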
You can also configure the request by chaining methods:
use std::time::Duration;

let client = reqwest::blocking::Client::new();
let response = client
    .get("https://www.wikipedia.org")
    .header("User-Agent", "my-awesome-scraper/1.0")
    .timeout(Duration::from_secs(10))
    .send()?;
Setting a custom User-Agent header is often a good idea to identify your scraper. And setting a timeout ensures your program doesn't hang if the request takes too long.
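If every request should share the same settings, you can configure them once on the Client instead of per request. A short sketch using Reqwest's ClientBuilder (the user-agent string here is just an example):

use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a client whose settings apply to every request it sends.
    let client = reqwest::blocking::Client::builder()
        .user_agent("my-awesome-scraper/1.0")
        .timeout(Duration::from_secs(10))
        .build()?;

    let response = client.get("https://www.wikipedia.org").send()?;
    println!("Status: {}", response.status());
    Ok(())
}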
Parsing HTML with Scraper
Once you've retrieved the HTML, the next step is parsing it to extract the desired data. This is where Scraper comes in. Scraper provides a convenient and fast way to parse HTML and query it using CSS selectors.
To parse the HTML:
let html = reqwest::blocking::get("https://en.wikipedia.org/wiki/Rust_(programming_language)")
    .expect("Could not make request")
    .text()
    .expect("Could not read response text");

let document = scraper::Html::parse_document(&html);
We can then query elements using CSS selectors:
let title_selector = scraper::Selector::parse("h1").unwrap();
let title = document.select(&title_selector).next().expect("Could not find title");
println!("Title: {}", title.text().collect::<String>());
This finds the first <h1> element and prints its text content. The text() method returns an iterator over the element's text nodes, so we collect them into a String.
You can use more advanced CSS selectors to fine-tune your queries:
// Bind the selector to a variable so it outlives the iterator that borrows it.
let paragraph_selector = scraper::Selector::parse("div.mw-parser-output > p").unwrap();
let paragraphs = document.select(&paragraph_selector);

for paragraph in paragraphs.take(3) {
    println!("{}", paragraph.text().collect::<String>());
}
This selects the first three paragraph elements that are direct children of <div class="mw-parser-output">.

Scraper also provides methods for navigating the DOM tree, such as next_sibling(), parent(), and children().
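Attributes are often just as useful as text. For example, you can pull each link's href through an element's value(). A short sketch reusing the document parsed above (the selector is just an example):

let link_selector = scraper::Selector::parse("div.mw-parser-output a").unwrap();
for link in document.select(&link_selector).take(5) {
    // value() exposes the underlying HTML element, including its attributes.
    if let Some(href) = link.value().attr("href") {
        println!("{} -> {}", link.text().collect::<String>(), href);
    }
}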
Putting It All Together
Let's combine what we've learned to scrape a list of Rust books from Wikipedia:
use scraper::{Html, Selector};

fn main() {
    let html = reqwest::blocking::get("https://en.wikipedia.org/wiki/Rust_(programming_language)")
        .expect("Could not make request")
        .text()
        .expect("Could not read response text");

    let document = Html::parse_document(&html);

    // Match links inside italicized titles within hatnote divs.
    let book_selector = Selector::parse("div.hatnote i a").expect("Could not create selector");

    let books: Vec<String> = document
        .select(&book_selector)
        .map(|book| book.text().collect())
        .collect();

    println!("Rust books mentioned on Wikipedia:");
    for book in &books {
        println!("- {}", book);
    }
}
This queries all book links inside <div class="hatnote"> elements, maps each to its title text, collects the results into a Vec<String>, and prints them out.
Advanced Techniques
There are many ways to expand on this basic scraping functionality:
- Logging in: Many sites require authentication. You can log in by sending a POST request with your credentials and storing the returned cookies for subsequent requests.
- Saving to files: Write scraped data to CSV, JSON, or a database for further analysis and processing.
- Error handling and retrying: Web scraping can be fragile. Build retry logic to handle failed requests and timeouts gracefully; see the sketch after this list.
- Concurrent requests: Speed up your scraper by making multiple requests in parallel with multithreading or async code.
- Browser automation: Some heavily JavaScript-rendered sites are difficult to scrape. Tools like Fantoccini allow automating real browsers.
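To illustrate the retry idea, here is a minimal sketch of a fetch helper that retries failed requests with a short delay (the attempt count and two-second backoff are arbitrary choices):

use std::{thread, time::Duration};

fn fetch_with_retries(url: &str, max_attempts: u32) -> Result<String, reqwest::Error> {
    let mut attempt = 1;
    loop {
        match reqwest::blocking::get(url).and_then(|resp| resp.text()) {
            Ok(body) => return Ok(body),
            Err(err) if attempt < max_attempts => {
                // Back off briefly before trying again.
                eprintln!("Attempt {} failed: {}. Retrying...", attempt, err);
                thread::sleep(Duration::from_secs(2));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}

You could then call fetch_with_retries(url, 3) wherever you currently call reqwest::blocking::get.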
Always be respectful when scraping. Scrapers can put a heavy load on web servers. Add delays between requests, cache when possible, and respect robots.txt. Get permission when scraping non-public data.
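For example, spacing requests out can be as simple as sleeping between iterations (a sketch; the URLs and one-second delay are placeholders):

use std::{thread, time::Duration};

fn main() {
    let urls = ["https://example.com/page/1", "https://example.com/page/2"];
    for url in urls {
        if let Ok(body) = reqwest::blocking::get(url).and_then(|r| r.text()) {
            println!("{}: {} bytes", url, body.len());
        }
        // Pause between requests to keep the load on the server light.
        thread::sleep(Duration::from_secs(1));
    }
}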
Conclusion
Rust is a powerful language for web scraping. Its speed, safety, and expressive libraries like Reqwest and Scraper make it well-suited for building robust and efficient scrapers.
This guide covered the fundamentals of web scraping with Rust, from making HTTP requests to parsing HTML and extracting data with CSS selectors. It also touched on more advanced topics and best practices.
To learn more, check out the Reqwest and Scraper docs, as well as other helpful Rust web scraping tutorials. With its strong ecosystem and active community, Rust web scraping is accessible and rewarding to learn.