Web scraping, the automatic extraction of data from websites, is an increasingly valuable skill for data professionals. The open-source programming language R provides a powerful and flexible environment for harvesting data from the web.
In this in-depth guide, we'll teach you how to scrape the web with R from the ground up. You'll learn the fundamentals of web scraping, how to handle different scenarios you'll encounter, and how to leverage popular R packages like rvest and Rcrawler. We'll also cover best practices for responsible scraping.
By the end, you'll have the knowledge and tools to confidently extract data from websites using R. Let's dive in!
HTML Basics for Web Scraping
To scrape data from websites, you first need to understand a bit about how web pages are structured with HTML.
HTML uses tags to define elements on a page, such as <title>, <p> and <table>. Tags are contained in angle brackets and usually come in pairs, with content in between. For example:
<h1>This is a heading</h1>
Browsers parse this HTML to render the page you see. To scrape the data, we need to parse the underlying HTML.
The first step is to read the HTML from a URL into R. You can use the readLines() function for this:
url <- "http://example.com"
html <- readLines(url)
This stores the raw HTML as a character vector, with one element per line of the page source. To extract data, we need to parse this unstructured text into something R can understand.
Parsing HTML with R
Parsing means analyzing the HTML text to identify the structural elements like headings, paragraphs, links and tables.
The rvest package makes this easy by letting us read HTML directly into a parseable document:
library(rvest)
page <- read_html("http://example.com")
After parsing, we have an XML document that we can extract data from using XPath and CSS selectors.
XPath is a syntax for identifying parts of an HTML document. For example, to find all the links:
html_nodes(page, xpath = "//a")
CSS selectors are an alternative way to select elements. To find paragraphs with a certain class:
html_nodes(page, "p.myclass")
The html_nodes() function takes a parsed document and a selector, and returns all the matching elements.
To get just the text inside an element, use html_text():
html_nodes(page, xpath = "//h1") %>% html_text()
For attributes, use html_attr():
html_nodes(page, xpath = "//a") %>% html_attr("href")
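Putting these pieces together, here is a minimal sketch that collects every link on a page into a data frame of link text and URL (the column names are just illustrative):
library(rvest)
page <- read_html("http://example.com")
links <- html_nodes(page, "a")
# Build a small data frame: one row per link
link_table <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)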
With these fundamentals, you're ready to start applying web scraping to real scenarios.
Scraping Files from FTP Servers
FTP (File Transfer Protocol) is an old but still widely used protocol for sharing files over the internet.
To list the files on an FTP server in R, use getURL() from the RCurl package:
library(RCurl)
url <- "ftp://example.com"
files <- getURL(url, dirlistonly = TRUE)
Setting dirlistonly = TRUE returns just the file names rather than the file contents.
The listing comes back as a single newline-separated string, so split it into individual file names and then filter with a regular expression:
filenames <- strsplit(files, "\r*\n")[[1]]
csvfiles <- grep("\\.csv$", filenames, value = TRUE)
And to download:
outdir <- "data"
if (!file.exists(outdir)) dir.create(outdir)
for (file in csvfiles) {
  download.file(paste0(url, "/", file), destfile = file.path(outdir, file), mode = "wb")
}
This snippet creates a directory to store the files, then loops through the list and downloads each one with download.file().
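From there you could read the downloaded files straight into R, for example (assuming they are plain comma-separated files):
csvpaths <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
datasets <- lapply(csvpaths, read.csv)  # one data frame per file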
Scraping Data From Wikipedia
Wikipedia has a huge amount of semi-structured information that's useful for data science projects. While Wikipedia offers official data dumps, sometimes it's easier to scrape the information you need from specific pages.
First read the page into R:
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
page <- read_html(url)
Next, locate the data you want. The GDP data is in a table, which we can select using CSS notation:
table <- html_nodes(page, "table.wikitable")
Finally, parse the table into a data frame:
gdp <- table %>% html_table() %>% .[[1]]
html_table() does the heavy lifting, and .[[1]] selects the first matching table in case the selector matches more than one. The result is a tidy data frame ready for analysis.
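Wikipedia tables usually need a little cleanup before analysis: column names may contain spaces or footnote markers, and numbers are stored as text with thousands separators. A quick sketch (the column name in the comment is hypothetical, since the table layout changes between revisions of the page):
# Make the column names valid, unique R names
names(gdp) <- make.names(names(gdp), unique = TRUE)
# Strip commas before converting a text column to numeric, e.g. for a
# hypothetical column called "GDP.millions":
# gdp$GDP.millions <- as.numeric(gsub(",", "", gdp$GDP.millions))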
Wikipedia uses consistent HTML markup, which makes it relatively easy to scrape. This isn't always the case, though: some websites go out of their way to prevent scraping.
Handling Website Authentication and CAPTCHAS
Many sites require login credentials to access data. If you have permission to scrape, you'll need to authenticate your requests.
The simplest way is via HTTP authentication:
user <- "username"
pass <- "password"
url <- "http://example.com"
session <- html_session(url, httr::authenticate(user, pass))
page <- session %>% follow_link("protectedpage")
This sends your credentials with every request using HTTP basic authentication, then follows the link to the protected page. (In rvest 1.0 and later, html_session() and follow_link() are called session() and session_follow_link().) Note that the username and password sit in plain text in your script, so be careful not to expose your credentials.
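Many sites use an HTML login form rather than HTTP authentication. In that case you can fill in and submit the form with rvest. Below is a hedged sketch using the rvest 1.0 function names; the login URL and the field names username and password are assumptions you would replace with the site's actual values:
library(rvest)
login_url <- "http://example.com/login"   # hypothetical login page
s <- session(login_url)
# Grab the first form on the page and fill in the credentials
form <- html_form(s)[[1]]
form <- html_form_set(form,
                      username = "your_user",   # field names depend on the site
                      password = "your_pass")
# Submit the form; the returned session carries the login cookies
s <- session_submit(s, form)
page <- s %>% session_follow_link("protectedpage")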
A trickier problem is CAPTCHAs, those distorted images of text you have to decipher to prove you're human. They are designed to stop bots, including scrapers.
Some tactics for dealing with CAPTCHAs:
- See if the site offers an API as an alternative to scraping
- Outsource CAPTCHA solving to a third-party service
- Slow down your request rate and reuse session cookies so you trigger CAPTCHAs less often
- As a last resort, manually solve the CAPTCHAs yourself
CAPTCHAs aren't an issue if you're only scraping a small amount of data, but they can quickly become a major roadblock for large-scale web scraping projects. Try to find a workaround before resorting to brute-force solving.
Scraping Entire Websites with Rcrawler
Want to scrape every page on a site? The Rcrawler package has you covered.
Rcrawler works in two stages:
- Discovering and downloading all pages
- Parsing the downloaded HTML to extract structured data
To download an entire site:
library(Rcrawler)
domain <- "example.com"
sink(paste0(domain, ".log"))  # capture the crawler's progress messages in a log file
Rcrawler(
  Website = paste0("http://", domain),
  no_cores = 4,
  no_conn = 4,
  ExtractCSSPat = c("div", "p"),
  ExtractXpathPat = c("//h1", "//title")
)
sink()  # stop writing to the log file
This will crawl the site starting from the homepage, and download all unique internal pages. You can specify CSS and XPath patterns to extract only certain elements.
Once the crawl is complete, parse the downloaded data:
DATA <- ContentScraper(
  Url = "http://example.com",
  XpathPatterns = c("//h1", "//p", "//img/@src", "//a/@href"),
  encod = "UTF-8"
)
DATA will contain the structured text, links and image sources extracted from the page. You can then subset, clean and analyze this data.
Be careful when setting crawlers loose – a single crawl can quickly generate thousands of requests. Always check the robots.txt for the site and follow any directives about rate limiting or restricted content.
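Rcrawler has options that help here; the sketch below assumes the Obeyrobots and RequestsDelay arguments documented for the package:
Rcrawler(
  Website = "http://example.com",
  no_cores = 2,
  no_conn = 2,
  Obeyrobots = TRUE,     # skip anything robots.txt disallows
  RequestsDelay = 2      # wait two seconds between requests
)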
Responsible Web Scraping
Web scraping is a controversial topic. Just because data is publicly accessible doesn't always mean you have permission to extract and use it however you want.
Some general guidelines for responsible scraping:
- Respect robots.txt. This file specifies which parts of the site are off limits to scrapers.
- Don't hammer servers. Insert delays between your requests to avoid overloading the site (see the sketch after this list).
- Use caching. Store and reuse previously scraped data instead of requesting the same pages repeatedly.
- Identify yourself. Use a custom user agent string so site owners can contact you if needed.
- Don't steal content. Republishing scraped content without permission may violate copyright.
- Use APIs when available. Many websites provide official APIs that are faster and more reliable than scraping.
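To make the delay, caching and user-agent guidelines concrete, here is a minimal sketch using httr; the URLs, the contact address in the user-agent string and the two-second delay are all illustrative:
library(httr)
urls <- c("http://example.com/page1", "http://example.com/page2")  # hypothetical pages
cache_dir <- "cache"
if (!dir.exists(cache_dir)) dir.create(cache_dir)
ua <- user_agent("my-research-scraper (contact: you@example.com)")
for (u in urls) {
  cache_file <- file.path(cache_dir, paste0(make.names(u), ".html"))
  # Use caching: skip pages we have already downloaded
  if (file.exists(cache_file)) next
  resp <- GET(u, ua)
  writeLines(content(resp, as = "text", encoding = "UTF-8"), cache_file)
  # Don't hammer the server: pause between requests
  Sys.sleep(2)
}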
The legality of web scraping is still a gray area, but being respectful and transparent goes a long way. When in doubt, ask for permission before scraping.
Conclusion
We've covered a lot of ground in this guide to web scraping with R. You should now have a solid foundation in:
- The basics of HTML and how to parse it with R
- Downloading files from FTP servers
- Extracting tables from Wikipedia
- Logging in to websites and handling CAPTCHAs
- Crawling entire domains with Rcrawler
- Best practices for responsible web scraping
Armed with these skills, you can gather data from almost any online source. Just remember that with great power comes great responsibility – always scrape ethically.
To learn more, check out these resources:
- rvest documentation: https://rvest.tidyverse.org/
- Rcrawler project: http://www.sciencedirect.com/science/article/pii/S2352711017300110
- httr for authenticated requests: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
Happy scraping!