The internet is the largest data source in the world. Companies like Google, Facebook and Amazon have built multi-billion dollar businesses by collecting and analyzing web data. With over 4.5 billion indexed web pages, the web contains publicly available data on almost every imaginable topic. Web scraping lets you extract that data into a structured format for further analysis.
In this comprehensive guide, you'll learn how to use R and rvest to extract key data from web pages into analyzable data frames.
The Growing Importance of Web Scraped Data
Web scraping has seen massive adoption over the last decade. Here are some stats that showcase its rise:
- 76% of organizations rely on web scraped data for business intelligence according to Parsehub.
- The web scraping market is projected to grow at over 20% CAGR to reach USD 4 billion by 2026 according to Mordor Intelligence.
- 36% of companies use web data for price monitoring, 29% for market research and 23% for lead generation according to Parsehub.
There are several reasons for this growth:
- Exponential growth of web pages: There are over 1.7 billion websites today based on Internet Live Stats data. The amount of information online is growing exponentially.
- Untapped unstructured data: Much of the web's data is unstructured text, images, videos and other multimedia formats. This data holds immense analytical value if extracted and processed properly.
- Real-time insights: The web provides up-to-date data. Static datasets rapidly go stale. Web scraping allows access to real-time data.
- 360 degree view of entities: By combining data from multiple sites, web scraping can build a comprehensive view of products, companies, industries and more.
Let's look at some examples of how organizations use web scraped data:
| Use Case | Data Sources |
|---|---|
| Pricing Optimization | Scraped prices of competitor products from e-commerce sites |
| Market Sizing | Collect industry news, press releases, job listings across sector sites |
| Lead Generation | Extract business contact info from directories and company sites |
| SEO Monitoring | Track keyword ranks from Google results |
| Social Media Monitoring | Analyze brand mentions, trends from Twitter, Reddit, Facebook |
| Product Research | Scrape Amazon reviews and ratings for sentiment analysis |
This data helps uncover insights to build competitive advantage. The potential use cases are unlimited.
Now that we've seen why web data is so valuable, let's see how to extract it using R and rvest.
Overview of Web Scraping Libraries in R
While Python dominates web scraping, R also provides robust capabilities due to its strong ecosystem of data analysis packages:
- rvest – R's most popular web scraping package, inspired by Python's BeautifulSoup. Provides simple scraping APIs.
- RSelenium – Browser automation for scraping complex JavaScript sites.
- xml2 – Low-level XML parser used internally by rvest.
- httr – Provides a nice R interface for creating HTTP requests and handling proxies.
- Rcrawler – Framework for writing higher-level scraping workflows.
- Rscraping – Helper functions for cleaning and structuring scraped data.
For most use cases, rvest and RSelenium will be sufficient. Let's look at rvest first for scraping static pages.
Getting Started with rvest
rvest provides simple CSS selector and XPath interfaces for scraping HTML and XML. Let's walk through a hands-on example.
First, we'll need to install rvest and httr:
install.packages("rvest")
install.packages("httr")
Now load the libraries:
library(rvest)
library(httr)
Sending GET Requests
To fetch a page, we use the read_html() function:
url <- "http://webcode.me"
page <- read_html(url)
This sends a GET request to the URL and returns an html_document object containing the parsed HTML code.
To use a proxy, set it in httr:
# Set proxy
set_config(use_proxy(url="http://192.168.0.1:8080"))
page <- read_html(url)
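If your proxy needs a separate port or credentials, httr's use_proxy() accepts them as arguments. A minimal sketch; the host, port and login below are placeholders:
# Placeholder proxy details: substitute your own host, port and credentials
set_config(use_proxy(
  url = "http://192.168.0.1",
  port = 8080,
  username = "proxy_user",
  password = "proxy_pass"
))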
rvest has no built-in timeout handling. For timeouts, use httr's GET() function, which accepts a timeout() parameter:
page <- GET(url, timeout(10)) %>% read_html()
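If the site is slow or intermittently unavailable, httr's RETRY() wraps the same request with automatic retries. A minimal sketch using the same url:
# Retry the GET up to 3 times before giving up, still with a 10 second timeout
page <- RETRY("GET", url, timeout(10), times = 3) %>% read_html()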
Extracting HTML Elements
To extract elements, use the html_elements() function. Pass CSS selectors or XPath expressions to it:
# CSS selector
titles <- page %>%
  html_elements(css = ".post h2")
# XPath
author <- page %>%
  html_elements(xpath = "//span[@class='author']")
Extract attributes with html_attr() and text with html_text():
# Get all links on the page
links <- page %>% html_elements(css = "a")
# Extract href attribute from the first link
link <- links[1]
href <- link %>% html_attr("href")
# Extract the link text
text <- link %>% html_text()
For tabular data, use html_table():
tables <- page %>%
  html_elements(css = "table")
df <- tables[[1]] %>%
  html_table()
This provides a quick way to convert HTML tables into analyzable data frames.
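As a quick illustration, the same pattern pulls a table from any page containing a standard table element; the Wikipedia URL here is just one example and can be swapped for any page you care about:
# Scrape the first table on the page straight into a data frame
wiki <- read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
companies_df <- wiki %>%
  html_element("table") %>%
  html_table()
head(companies_df)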
Handling Paginated Sites
For sites that use pagination or "Load More" buttons, extract the URLs from navigation elements and iterate through them:
# purrr provides map() and flatten() for iterating over the pages
library(purrr)
# Get list of pagination links
urls <- page %>%
  html_elements(css = "ul.pagination li a") %>%
  html_attr("href")
# Iterate over URLs
results <- map(urls, function(url) {
  pg <- read_html(url)
  # Extract data from each page
  titles <- pg %>%
    html_elements(css = "h2") %>%
    html_text()
  return(titles)
}) %>% flatten()
This allows you to scrape sites like Wikipedia with multiple pages of content.
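One caveat: the href attributes collected above are often relative paths. Assuming url still holds the page they were scraped from, xml2's url_absolute() resolves them before the loop runs:
# Resolve relative links against the base URL
urls <- xml2::url_absolute(urls, base = url)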
Working with APIs
Many sites provide data through JSON APIs that can be scraped.
Let's scrape product data from a sample API:
# Read API response
api_url <- "https://example.com/products"
prod_data <- jsonlite::fromJSON(api_url)
# Extract fields
prices <- prod_data$prices
product_ids <- prod_data$product_ids
The jsonlite package helps parse JSON into R lists and data frames.
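Assuming the API returns parallel arrays of IDs and prices, as in the hypothetical example above, the parsed fields combine directly into a data frame:
# Combine parsed fields into a data frame for analysis
products_df <- data.frame(
  product_id = product_ids,
  price = prices
)
head(products_df)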
Scraping JavaScript Sites with RSelenium
For JavaScript-generated sites, RSelenium provides a Selenium-based browser automation framework.
Let's see how to scrape data from a dynamic JavaScript site.
First install RSelenium:
install.packages("RSelenium")
Now launch a browser instance:
# Launch a headless Chrome browser
library(RSelenium)
driver <- rsDriver(
  browser = "chrome",
  # Headless mode is enabled via Chrome options passed as extra capabilities
  extraCapabilities = list(chromeOptions = list(args = c("--headless")))
)
remDr <- driver[["client"]]
# Navigate to the target page
remDr$navigate("https://dynamicpage.com")
This launches a Chrome browser instance controlled by Selenium. Passing the --headless Chrome option hides the browser window.
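Dynamic pages often need a moment to render before their elements exist in the DOM. Two common ways to wait, sketched here with arbitrary values of 3 and 10 seconds:
# Simple approach: pause while the JavaScript renders
Sys.sleep(3)
# Or set an implicit wait so element lookups poll for up to 10 seconds
remDr$setTimeout(type = "implicit", milliseconds = 10000)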
Much like rvest, we use CSS selectors and XPath to find elements:
# CSS
titles <- remDr$findElements("css", "h2.title")
# XPath
ratings <- remDr$findElements("xpath", "//span[@class='rating']")
Extract attributes and text accordingly:
# Get text (getElementText() returns a list, so unlist it)
title_text <- unlist(titles[[1]]$getElementText())
rating_text <- unlist(ratings[[1]]$getElementText())
# Get an attribute from an element, here from article nodes located first
articles <- remDr$findElements("css", "article")
author <- unlist(articles[[1]]$getElementAttribute("data-author"))
Finally, once the data has been extracted, export it:
# Construct data frame
pages_df <- data.frame(
title = title_text,
rating = rating_text
)
# Export to CSV
write.csv(pages_df, "data.csv")
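When the scrape is finished, close the browser and stop the Selenium server so they don't keep running in the background:
# Clean up the browser session and Selenium server
remDr$close()
driver$server$stop()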
This allows you to scrape complicated sites built with frameworks like Angular, React and Vue.js.
Tips for Effective Web Scraping
Here are some tips to scrape efficiently and avoid issues:
- Add delays – Don't send requests too fast to avoid getting blocked. Add delays of 5-10 seconds between pages (see the sketch after this list).
- Randomize user agents – Rotate user agent strings so requests seem more organic.
- Check robots.txt – Avoid scraping pages blocked in the site's robots.txt file.
- Use proxies – Proxies make your traffic seem more distributed. Helpful for large scraping projects.
- Scrape during low-traffic hours – Hit servers during off-peak hours to minimize impact.
- Distribute scraping – Spread workload over multiple instances for large projects.
- Cache requests – Use caching to avoid repeat requests for unchanged data.
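As an example of putting several of these tips together, here is a minimal sketch that adds random delays and rotates user agents; the URLs and agent strings are placeholders:
# Placeholder pages to scrape politely
urls <- c("http://example.com/page1", "http://example.com/page2")
# Small pool of user agent strings to rotate through (placeholders)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
)
pages <- lapply(urls, function(u) {
  # Random 5-10 second delay between requests
  Sys.sleep(runif(1, 5, 10))
  # Send each request with a randomly chosen user agent
  resp <- GET(u, user_agent(sample(user_agents, 1)), timeout(10))
  read_html(resp)
})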
Adopting best practices ensures your scraping proceeds smoothly without hindrances.
Comparing rvest to Python's BeautifulSoup
How does rvest compare to BeautifulSoup, the leading Python web scraping library?
Functionality – Both provide similar CSS selector based extraction APIs. rvest has XPath support, while BeautifulSoup has custom methods like find_all().
Ease of use – BeautifulSoup's API is simpler and better documented. rvest requires more R programming knowledge.
Performance – In benchmarks, BeautifulSoup is 10-15% faster at parsing large HTML documents.
Language Suitability – For existing R users, rvest will feel very familiar. Python developers may prefer BeautifulSoup.
So in summary, BeautifulSoup may have a slight edge for web scraping due to its maturity and dedicated, purpose-built API. But for analysts already proficient in R, rvest provides a highly capable scraping solution.
Applying Web Scraping to Build a Competitive Price Monitor
Let's apply what we've learned to build a price monitoring tool. It will:
- Scrape product listings from e-commerce sites
- Extract key data like price, ratings, number of reviews
- Generate a consolidated view across competitors
We can run this periodically to monitor price movements and competitive landscape.
The script will:
- Use rvest to scrape search results pages for a product keyword.
- Extract relevant fields into a data frame.
- Combine results from all sites into a master dataframe.
- Calculate minimum, maximum and average price.
Here is a code snippet:
# Search URLs
urls <- c(
"http://amazon.com/s?k=laptop",
"http://walmart.com/search?q=laptop",
...
)
# Extract data from each search results page
results <- purrr::map(urls, function(url) {
  page <- read_html(url)
  titles <- page %>%
    html_elements(css = "h2") %>%
    html_text()
  prices <- page %>%
    html_elements(css = ".price") %>%
    html_text()
  df <- data.frame(title = titles, price = prices)
  return(df)
})
# Combine all results into one data frame
master_df <- dplyr::bind_rows(results)
# Prices are scraped as text (e.g. "$1,099.99"), so strip symbols before computing stats
master_df$price <- as.numeric(gsub("[^0-9.]", "", master_df$price))
# Calculate stats
print(paste("Min Price:", min(master_df$price)))
print(paste("Max Price:", max(master_df$price)))
print(paste("Avg Price:", mean(master_df$price)))
This gives us a consolidated view of the competitive landscape for pricing and product analysis. We can further enrich it by adding ratings, reviews and other fields.
The entire process from data collection to analysis was performed using R without needing to export to other tools. This demonstrates the end-to-end power of web scraping using rvest and R's tidyverse ecosystem.
Conclusion
In this guide, you learned:
- Why web scraping is invaluable – The web contains endless troves of publicly available data for analysis.
- How to use rvest to scrape static pages – rvest provides a simple API using CSS selectors and XPath.
- Techniques for handling JavaScript sites with RSelenium – RSelenium automates a real browser using Selenium.
- How to build datasets for analysis – Use data frames to collate extracted fields.
- Best practices for robust scraping – Avoid blocks by randomizing user agents, adding delays and using proxies.
Web scraping unlocks the vast potential of web data. Combined with R's strong analytical capabilities, it becomes even more powerful.
This guide gives you the tools to start building domain-specific datasets by extracting data from the websites most relevant to your use case. The possibilities are endless!