Web scraping is the process of programmatically extracting data from websites. Instead of manually copying information from web pages, you write code to automate the process, grabbing exactly the data you need, at scale.
Scraped web data powers many business functions, including:
- Market research and competitor monitoring
- Lead generation for sales and marketing
- Aggregating product listings, prices and reviews
- Building databases of company, property or contact info
- Collecting news, articles and social media posts
- Academic and scientific research
The web hosts a vast trove of valuable public data. A 2020 study estimated that web scraping accounts for 10-30% of all web traffic and projected that scraped data will add $1.7 billion to the global economy by 2027.
Challenges of Web Scraping
While web scraping is extremely useful, it's rarely easy, especially at scale. Websites are all structured differently, and many are built using JavaScript frameworks that load data dynamically. This means basic HTTP requests often aren't enough to get the data you need.
There are also many anti-bot measures that can block your scrapers:
- IP rate limits and CAPTCHAs to prevent excessive requests
- User agent, cookie and header checks to detect non-human visitors
- Geoblocking and browser fingerprinting to restrict access
Trying to build your own web scraping system that can handle all of this is a huge technical challenge. It requires setting up distributed proxy networks, building browser automation frameworks, and constantly maintaining your scraping code to deal with website changes. Many developers spend more time wrestling with this infrastructure than actually working with the data they collect.
ScrapingBee Makes Web Scraping Easy
ScrapingBee is an API that handles all the hard parts of web scraping for you. It's a managed scraping service that lets you extract data from any public web page as if you were visiting it in a real browser, with just an API call.
The key features of ScrapingBee are:
- JavaScript rendering: ScrapingBee executes JS and waits for dynamic content to load before returning the HTML, so you get fully rendered pages.
- Rotating proxies and browsers: Every request uses a clean IP address and browser profile, avoiding blocks and CAPTCHAs. You can specify location targeting options.
- Structured data parsing: ScrapingBee provides AI-assisted parsing to extract data from pages into JSON, CSV or XML formats.
- Screenshots: Capture PNG screenshots of pages for visual monitoring and debugging.
ScrapingBee has a simple pricing model based on request volume. The free plan supports 1,000 requests per month, and paid plans scale up from there. Enterprise plans are available for high-volume scraping.
Setting Up ScrapingBee with C#
To start using ScrapingBee from your C# applications, first sign up for a free account at https://www.scrapingbee.com/register/. You'll be given an API key, which you'll use to authenticate your requests.
Next, create a new .NET console application:
```
dotnet new console -o ScrapingBeeDemo
cd ScrapingBeeDemo
```
The only external dependency is `Newtonsoft.Json`, used for serializing and parsing JSON:

```
dotnet add package Newtonsoft.Json
```
Now you're ready to make a request! Add the following code to `Program.cs`:
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    private static readonly HttpClient client = new HttpClient();

    static async Task Main(string[] args)
    {
        var apiKey = "YOUR_API_KEY";
        var baseUrl = "https://app.scrapingbee.com/api/v1/";

        // URL-encode the target URL so any query string it contains survives the trip
        var url = Uri.EscapeDataString("https://www.example.com");
        var endpoint = $"{baseUrl}?api_key={apiKey}&url={url}";

        try
        {
            var response = await client.GetAsync(endpoint);
            response.EnsureSuccessStatusCode();

            // For a basic request, the response body is the rendered page HTML itself
            var content = await response.Content.ReadAsStringAsync();
            Console.WriteLine(content);
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Request failed: {e.Message}");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Error: {e.Message}");
        }
    }
}
```
Remember to replace `YOUR_API_KEY` with your actual key from the ScrapingBee dashboard.

This sends a GET request to the ScrapingBee API, passing your key and the URL to scrape. For a basic request like this, the response body is the HTML of the scraped page, which the program reads as a string and prints.

When you run this with `dotnet run`, you should see the full HTML of https://www.example.com printed to the console. That's it – you just made your first scrape with ScrapingBee!
Extracting Structured Data
Of course, you'll usually want to extract specific structured data from pages, not just dump the raw HTML. ScrapingBee supports this via the `extract_rules` parameter, which accepts CSS or XPath selectors to pull out page elements.

For example, here's how you could scrape just the page title and first paragraph from an article:
```csharp
// Also requires: using System.Collections.Generic; and using Newtonsoft.Json;
var url = "https://en.wikipedia.org/wiki/Web_scraping";

// Map output field names to CSS selectors
var extractRules = new Dictionary<string, string>
{
    ["title"] = "h1",
    ["intro"] = "p:first-of-type"
};

// The rules JSON must be URL-encoded to survive the query string
var rulesJson = Uri.EscapeDataString(JsonConvert.SerializeObject(extractRules));
var endpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&extract_rules={rulesJson}";

var response = await client.GetAsync(endpoint);
var jsonResponse = await response.Content.ReadAsStringAsync();

// With extract_rules, the response is JSON keyed by your field names
var json = JObject.Parse(jsonResponse);
var title = json["title"]?.ToString();
var intro = json["intro"]?.ToString();

Console.WriteLine($"Title: {title}");
Console.WriteLine($"Intro: {intro}");
```
Here the `extract_rules` parameter is a JSON object mapping names to CSS selectors. The extracted data is returned in the response under the corresponding names.
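Extract rules can also return every match instead of just the first. Here's a minimal sketch, assuming the nested rule form with `selector` and `type` fields described in ScrapingBee's data extraction docs (check the API reference for the exact schema):

```csharp
// Assumed rule shape: {"headings": {"selector": "h2", "type": "list"}}
var listRules = new Dictionary<string, object>
{
    ["headings"] = new { selector = "h2", type = "list" }
};
var listRulesJson = Uri.EscapeDataString(JsonConvert.SerializeObject(listRules));
var listEndpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&extract_rules={listRulesJson}";

var listResponse = await client.GetAsync(listEndpoint);
var listJson = JObject.Parse(await listResponse.Content.ReadAsStringAsync());

// Each match comes back as an element of a JSON array
foreach (var heading in listJson["headings"])
{
    Console.WriteLine(heading.ToString());
}
```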
You can also use the `extract_metadata` option to analyze the page with AI and automatically parse out key fields like article titles, descriptions, authors, dates, images, prices and more, without needing to specify selectors.
Paginated Scraping
Many websites spread lists of data over multiple pages for easier browsing. To scrape data from all pages, you'll need to parse the links and make separate requests for each page.

Here's an example of scraping a paginated ecommerce product list:
```csharp
// Requires: using System.Collections.Generic; and using HtmlAgilityPack;
var baseUrl = "https://app.scrapingbee.com/api/v1/";
var apiKey = "YOUR_API_KEY";
var products = new List<Product>();
var pageNum = 1;

while (true)
{
    // Escape the target URL so its own ?page= query survives the trip
    var url = Uri.EscapeDataString($"https://example.com/products?page={pageNum}");
    var endpoint = $"{baseUrl}?api_key={apiKey}&url={url}";

    var response = await client.GetAsync(endpoint);
    response.EnsureSuccessStatusCode();
    var html = await response.Content.ReadAsStringAsync();

    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard against it
    var productNodes = doc.DocumentNode.SelectNodes("//div[@class='product']");
    if (productNodes == null)
        break;

    foreach (var node in productNodes)
    {
        var name = node.SelectSingleNode(".//h3").InnerText;
        var price = node.SelectSingleNode(".//span[@class='price']").InnerText;
        products.Add(new Product { Name = name, Price = price });
    }

    // Stop when the page has no "next" link
    var nextPageLink = doc.DocumentNode.SelectSingleNode("//a[@class='next']");
    if (nextPageLink == null)
        break;

    pageNum++;
}

foreach (var product in products)
{
    Console.WriteLine($"{product.Name}: {product.Price}");
}

class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
}
```
The scraper loops through each page until no "next" link is found, extracts the name and price of products on the page, and adds them to a master list.
Note: This example uses the `HtmlAgilityPack` library for parsing the HTML. You'll need to add it to your project with `dotnet add package HtmlAgilityPack`.
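Once the products are in memory, persisting them is straightforward. Here's a minimal sketch that writes the list to a CSV file using plain .NET (the quoting here is naive; fields containing quotes would need a real CSV writer):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Writes a header row plus one line per product
static void SaveToCsv(List<Product> products, string path)
{
    var lines = new List<string> { "Name,Price" };
    lines.AddRange(products.Select(p => $"\"{p.Name}\",\"{p.Price}\""));
    File.WriteAllLines(path, lines);
}

// Usage: SaveToCsv(products, "products.csv");
```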
JavaScript Rendering
Modern websites make heavy use of JavaScript to load data dynamically and update the UI. If you try to scrape a JS-heavy site using standard HTTP requests, you'll often get back incomplete page contents.
To scrape these sites, you need to use a headless browser that can execute the JavaScript on the page and wait for dynamic content to load before extracting the HTML. ScrapingBee handles this automatically when you set the `render_js` parameter to `true`:
```csharp
var url = "https://example.com";
var endpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&render_js=true";
```
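If content still loads after the initial render, the API also accepts wait options. A short sketch, assuming the `wait` (milliseconds) and `wait_for` (CSS selector) parameters from the ScrapingBee docs:

```csharp
// Wait a fixed 2 seconds after the page loads...
var endpointWait = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&render_js=true&wait=2000";

// ...or wait until a specific element appears
// (#product-list is a hypothetical selector; the value is URL-encoded too)
var selector = Uri.EscapeDataString("#product-list");
var endpointWaitFor = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&render_js=true&wait_for={selector}";
```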
Either way, ScrapingBee renders the page in a full browser environment and returns the final HTML after all scripts have run. It can also capture a PNG screenshot of the rendered page if you set `screenshot=true`.
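With `screenshot=true`, the response body is the image itself rather than HTML, so save the raw bytes to disk. A minimal sketch (the output filename is arbitrary):

```csharp
using System.IO;

// The response body is the PNG image bytes, not HTML
var screenshotEndpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&screenshot=true";
var screenshotResponse = await client.GetAsync(screenshotEndpoint);
screenshotResponse.EnsureSuccessStatusCode();

var imageBytes = await screenshotResponse.Content.ReadAsByteArrayAsync();
await File.WriteAllBytesAsync("page.png", imageBytes);
Console.WriteLine($"Saved screenshot ({imageBytes.Length} bytes)");
```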
Geotargeting
Some websites display different content to visitors from different countries. This is often done for localization, compliance with regional laws, or to restrict access to certain areas.
If you need to scrape a site from a specific location, you can use ScrapingBee's geotargeting options. Set the `country_code` parameter to the two-letter code of the country you want to access the site from:
```csharp
var url = "https://example.com";
var endpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}&country_code=us";
```
ScrapingBee will route the request through a proxy in the specified country, so the target website will see it as originating there.
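A quick way to sanity-check geotargeting is to scrape a service that echoes the caller's IP address. Here's a sketch using httpbin.org (any IP-echo endpoint works); the reported IP should geolocate to the requested country:

```csharp
// httpbin.org/ip returns JSON like {"origin": "203.0.113.42"}
var checkUrl = Uri.EscapeDataString("https://httpbin.org/ip");
var checkEndpoint = $"{baseUrl}?api_key={apiKey}&url={checkUrl}&country_code=us";

var checkResponse = await client.GetAsync(checkEndpoint);
Console.WriteLine(await checkResponse.Content.ReadAsStringAsync());
```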
Stealth Setup
Websites employ various methods to detect and block scrapers. These include checking request headers, cookies, browser properties, and IP reputation.
To avoid triggering bot countermeasures, ScrapingBee offers several "stealth" settings:
- `premium_proxy=true` uses premium proxies optimized for scraping
- `random_user_agent=true` rotates user agents for each request
- `cookies=keep` preserves cookies between requests to simulate sessions
- `window_width` and `window_height` set viewport dimensions to match real devices
For example:
```csharp
var url = "https://example.com";
var endpoint = $"{baseUrl}?api_key={apiKey}&url={Uri.EscapeDataString(url)}" +
               "&premium_proxy=true&random_user_agent=true&cookies=keep" +
               "&window_width=1366&window_height=768";
```
These settings help you blend in as a normal user and avoid many crude anti-bot checks. However, you should still be mindful of your scraping rate and volume to avoid excessive load on target sites.
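As the number of options grows, hand-concatenated query strings get error-prone. One way to keep them manageable (generic C#, not a ScrapingBee-specific API) is to build the endpoint from a dictionary of parameters:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Build an endpoint from a parameter dictionary, escaping every value
// so special characters can't corrupt the query string
static string BuildEndpoint(string baseUrl, Dictionary<string, string> parameters)
{
    var query = string.Join("&", parameters.Select(
        p => $"{p.Key}={Uri.EscapeDataString(p.Value)}"));
    return $"{baseUrl}?{query}";
}

// Usage:
// var endpoint = BuildEndpoint("https://app.scrapingbee.com/api/v1/",
//     new Dictionary<string, string>
//     {
//         ["api_key"] = "YOUR_API_KEY",
//         ["url"] = "https://example.com",
//         ["premium_proxy"] = "true"
//     });
```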
Scraping Best Practices and Ethics
When scraping websites, it's important to do so responsibly and ethically. Some key principles:
- Respect robots.txt: This file specifies rules for which pages can be scraped. ScrapingBee won't scrape pages disallowed by robots.txt unless you explicitly override it.
- Limit request rate: Aggressive scraping can overload websites and disrupt service for real users. Use delays between requests and limit concurrent connections (see the sketch after this list).
- Don't scrape sensitive data: Avoid scraping content that is private, personal or copyrighted without permission.
- Check terms of service: Some sites expressly prohibit scraping in their TOS. Be aware of the legal implications of scraping such sites.
- Use data responsibly: Ensure that your use of scraped data complies with relevant laws and regulations, such as GDPR, CCPA, etc.
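Rate limiting is easy to wire in on the client side. Here's a minimal sketch, assuming a list of ScrapingBee endpoints to fetch, that spaces requests out with a delay and caps concurrency with a semaphore:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class PoliteScraper
{
    private static readonly HttpClient client = new HttpClient();

    // Allow at most 3 requests in flight at once
    private static readonly SemaphoreSlim throttle = new SemaphoreSlim(3);

    static async Task<string> FetchAsync(string endpoint)
    {
        await throttle.WaitAsync();
        try
        {
            // Pause before each request to spread out the load
            await Task.Delay(TimeSpan.FromSeconds(1));
            var response = await client.GetAsync(endpoint);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        finally
        {
            throttle.Release();
        }
    }

    static async Task Main()
    {
        var endpoints = new List<string> { /* ScrapingBee endpoints here */ };
        var pages = await Task.WhenAll(endpoints.Select(FetchAsync));
        Console.WriteLine($"Fetched {pages.Length} pages");
    }
}
```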
While scraping public data is generally legal, some companies increasingly use legal and technical means to restrict scraping. It's important to consult with legal counsel to ensure your scraping activities are compliant.
Conclusion
Web scraping is a powerful tool for extracting data from websites at scale. ScrapingBee makes it easy to build scrapers in C# by handling all the tricky parts like JavaScript rendering, proxy rotation, and CAPTCHAs. Using the techniques covered in this guide, you can collect data from almost any public website with just a few lines of code.
Some key takeaways:
- Use CSS and XPath selectors to pinpoint the data you want to extract
- Set the `render_js` option to scrape sites built with JavaScript frameworks
- Iterate through paginated results by following "next" links
- Geotarget requests for sites with localized content
- Use "stealth" options to avoid bot detection
- Scrape ethically by respecting robots.txt, rate limiting, and terms of service
Here are some further resources to continue learning:
- ScrapingBee API Reference
- ScrapingBee XPath and CSS Selector Cheat Sheet
- C# HTML Parsing With HAP
- OWASP Web Scraper Guide
With tools like ScrapingBee and the power of C#, you can build robust web scrapers to power your business with the data you need. Happy scraping!