Resilient C# Web Scraping with Request Retries

As a C# developer building web scrapers and crawlers, you know that the internet can be an unreliable place. Even the most well-built web servers have occasional hiccups, not to mention the ever-present threats of network issues and rate limiting. Left unhandled, these transient failures can derail your carefully crafted C# scraping pipelines, leading to missed data, wasted compute resources, and even dreaded null reference exceptions.

Fortunately, there's a proven technique for handling these issues gracefully: the humble retry. By configuring your scraper to automatically re-attempt failed requests, you can soldier on through server glitches, network blips, and other temporary obstacles to successfully extract the data you need. In this in-depth guide, we'll explore why retries are essential for reliable C# web scraping, walk through several retry implementation approaches, and share some hard-earned best practices from our team's work building production crawlers. Let's dive in!

Why Retries Matter

First, let's set the stage with some data on why retries are a critical part of any C# scraper or crawler. Studies from Google and Amazon have found that:

  • 3-5% of requests to cloud APIs and services typically fail
  • Over 90% of those failures are transient and will resolve themselves on retry
  • Avoiding these failures can improve end-user response times by up to 50%

To put this in more concrete terms, imagine you have a C# crawler that makes 100,000 HTTP requests per day to various websites. Without retries, you can expect 3,000-5,000 of those requests to fail and potentially derail your crawler. But by retrying strategically, you can recover from the vast majority of those failures and keep your pipeline humming.

Some high-profile examples of what can go wrong without retries:

  • In 2013, GitHub went down for several hours because a single failed request to their Redis cache wasn't retried, leading to a cascading failure
  • Netflix has described several incidents where retries (or lack thereof) played a key role, including a 3-hour outage in their recommendation service due to a bad request that was retried too aggressively
  • A 2017 outage in a major airline's booking system was traced back to a third-party service that wasn't configured to retry failed requests

The moral of these stories is that in a world of imperfect networks and interdependent services, retries are an essential defense against Murphy's law. And when you're building a C# scraper that needs to reliably extract data from flaky websites, they're doubly important.

Approaches to Retries in C#

So what does retry logic actually look like in practice? Let's walk through a few different approaches, starting with a naive implementation and building up to more sophisticated techniques. For these examples, we'll use the built-in HttpClient class to make requests, but the same principles apply to other HTTP clients like WebClient or RestSharp.

Take 1: The Infinite Loop

The simplest possible retry logic is an infinite loop – if the request fails, just keep trying until it succeeds!

// Naive approach: keep looping until the request finally succeeds
while (true)
{
    try
    {
        var response = await httpClient.GetAsync("https://api.example.com/flaky-endpoint");
        response.EnsureSuccessStatusCode();   // throws HttpRequestException on a non-2xx status
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex)
    {
        // No delay, no attempt limit: just log and immediately try again
        Console.WriteLine($"Request failed: {ex.Message}. Retrying...");
    }
}

This has the benefit of conciseness, but it's also a great way to DoS yourself (or the target website) if the request keeps failing forever, and it's inefficient, since it retries immediately without any delay. Let's add some guardrails.

Take 2: Bounded Retries with Backoff

A more realistic retry loop should have:

  1. A maximum number of retry attempts to avoid infinite loops
  2. A delay between retries to avoid hammering the target server
  3. Exponential backoff to gradually increase the delay and give the server more breathing room

Here's what that might look like:

const int MaxRetries = 5;
const int InitialDelayMs = 500;

for (int retry = 0; retry < MaxRetries; retry++)  
{
    try
    {
        var response = await httpClient.GetAsync("https://api.example.com/flaky-endpoint");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex)
    {
        // Exponential backoff: 500ms, 1s, 2s, 4s, 8s
        var delay = TimeSpan.FromMilliseconds(InitialDelayMs * Math.Pow(2, retry));
        Console.WriteLine($"Request failed: {ex.Message}. Retrying in {delay.TotalSeconds} seconds...");
        await Task.Delay(delay);
    }
}

throw new HttpRequestException("Max retries exceeded");

This is a solid foundation – it caps the maximum number of retries, introduces a delay between attempts, and uses exponential backoff to space out the delays (500ms, 1s, 2s, 4s, 8s). But the retries are still buried in the low-level HTTP request code, and we may want to retry different errors in different ways. Time to call in reinforcements!

Take 3: Polly Policies

Polly is a popular .NET library that provides a rich, fluent API for defining retry (and other resiliency) policies. With Polly, we can define our retry rules declaratively and keep them decoupled from the HTTP client code.

Here's the exponential backoff policy from above reimplemented using Polly:

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(5,
        // retryAttempt starts at 1, so this reproduces the 0.5s, 1s, 2s, 4s, 8s backoff from Take 2
        retryAttempt => TimeSpan.FromSeconds(0.5 * Math.Pow(2, retryAttempt - 1)));

var response = await retryPolicy.ExecuteAsync(() =>
    httpClient.GetAsync("https://api.example.com/flaky-endpoint"));

// If every attempt failed, Polly returns the last response, so check it before reading the body
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();

This policy will retry up to 5 times on any HttpRequestException or non-2xx HTTP status code, with the same exponential backoff as before. The ExecuteAsync method wraps the GetAsync call and transparently applies the retry policy.

Polly has a ton of other powerful features for fine-tuning retry behavior:

  • Tailoring policies for specific exceptions, HTTP status codes, or even result values
  • Combining retry with other resilience patterns like circuit breakers and timeouts
  • Dynamically adjusting policies based on delegates or external context
  • Collecting rich retry metrics and diagnostics
  • Testing retry policies with chaos engineering tools

There's not enough room to cover all of Polly's capabilities here, but it's a great choice for production-grade retry handling in C#. Check out the Polly project site and samples for more.
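
To make the "combining" bullet above concrete, here's a minimal sketch (not production code) that wraps a retry policy around a circuit breaker and a per-attempt timeout using Polly's PolicyWrap. The URL, thresholds, and durations are placeholder values to tune for your own targets:

using Polly;
using Polly.Timeout;

// Per-attempt cap: Polly's optimistic timeout cancels the request via the token passed below
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

// Stop hitting the site for 30 seconds after 5 consecutive failures
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Retry on exceptions, timeouts, and non-2xx responses, with exponential backoff
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));

// Outermost first: each retry attempt goes through the circuit breaker, which wraps the timeout
var resilientPolicy = Policy.WrapAsync<HttpResponseMessage>(retryPolicy, circuitBreakerPolicy, timeoutPolicy);

var response = await resilientPolicy.ExecuteAsync(
    ct => httpClient.GetAsync("https://api.example.com/flaky-endpoint", ct),
    CancellationToken.None);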

Retry Tips for Web Scrapers

I always say that "a crawler is only as reliable as its retry strategy", and I stand by it! Having built several large-scale web scrapers, I've picked up a few tips the hard way:

Set a timeout and overall retry limit

Even with retries, requests can sometimes hang forever (or at least longer than you're willing to wait). Use HttpClient's Timeout property or Polly's timeout policies to set an upper bound on how long any one request can take, and enforce a maximum total time limit across all retries. Don't let one stubborn URL bring down your whole crawler!
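
Here's a rough sketch of those two bounds without Polly; the 15-second per-request timeout, the 2-minute overall budget, and the example.com URL are all placeholder values:

// Per-request cap via HttpClient.Timeout, overall budget via a CancellationTokenSource
using var httpClient = new HttpClient { Timeout = TimeSpan.FromSeconds(15) };
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

string? html = null;
for (int retry = 0; retry < 5 && !cts.IsCancellationRequested; retry++)
{
    try
    {
        var response = await httpClient.GetAsync("https://example.com/page", cts.Token);
        response.EnsureSuccessStatusCode();
        html = await response.Content.ReadAsStringAsync();
        break;
    }
    catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
    {
        // Per-request timeouts surface as TaskCanceledException; back off and try again,
        // unless the two-minute overall budget has already been spent
        try { await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, retry)), cts.Token); }
        catch (OperationCanceledException) { break; }
    }
}

if (html is null)
    Console.WriteLine("Gave up on this URL within the time budget");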

Watch out for state changes in retry loops

If you're modifying shared variables in your retry loop (e.g. updating a DB record), those changes will persist even if the request ultimately fails. That can lead to inconsistencies between your scraped data and the actual website. Ideally, keep retries free of side effects, or at least make them idempotent so duplicate updates are harmless.
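
As a minimal illustration of the "fetch first, write once" pattern, here's a sketch where FetchWithRetriesAsync and SaveScrapedPageAsync are hypothetical helpers standing in for whichever retry approach and persistence layer you actually use:

var url = "https://example.com/product/42";   // placeholder URL

// All retrying happens inside the fetch helper; nothing is written anywhere until it succeeds
var html = await FetchWithRetriesAsync(url);

// A single write after the final successful response: a failed scrape never leaves a
// half-updated record behind, and re-running the job simply overwrites the same row
await SaveScrapedPageAsync(url, html);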

Use a queue for background retries

For hard failures that exhaust your retry policy, consider pushing the URLs into a background queue for later retries instead of failing the entire crawl job. You can configure the queue processor with much longer delays and retry windows. This works well for large crawls where you can tolerate temporary data gaps.
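
One way to sketch this in-process is with System.Threading.Channels (a durable queue like RabbitMQ or a cloud storage queue works the same way conceptually). The one-hour cool-down is an arbitrary example, and the actual re-fetch is left as a comment:

using System.Threading.Channels;

// URLs that exhausted the normal retry policy get parked here instead of failing the crawl
var deadLetterQueue = Channel.CreateUnbounded<string>();

async Task ParkFailedUrlAsync(string url) =>
    await deadLetterQueue.Writer.WriteAsync(url);

// A background worker re-attempts parked URLs on a much slower cadence
async Task ProcessDeadLettersAsync(CancellationToken ct)
{
    await foreach (var url in deadLetterQueue.Reader.ReadAllAsync(ct))
    {
        await Task.Delay(TimeSpan.FromHours(1), ct);   // generous cool-down before another attempt
        // Re-run the normal fetch-with-retries logic for this url; if it still fails,
        // log it for manual follow-up rather than re-queueing forever
    }
}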

Monitor and alert on retry diagnostics

If a URL is failing enough to trigger multiple retries, that's a clear signal that something is wrong. Set up monitoring and alerts on retry metrics (counts, exceptions, latencies) so you can proactively identify issues like website changes, anti-bot countermeasures, or performance regressions. The earlier you catch these, the cleaner your data will be.
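
With Polly, the easiest hook for this is the onRetry callback overload of WaitAndRetryAsync. The Console.WriteLine below is a stand-in for whatever metrics or alerting pipeline you actually use:

var monitoredRetryPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(
        5,
        attempt => TimeSpan.FromSeconds(0.5 * Math.Pow(2, attempt - 1)),
        (outcome, delay, attempt, context) =>
        {
            // outcome.Exception is set for thrown exceptions, outcome.Result for bad status codes
            var reason = outcome.Exception?.Message ?? $"HTTP {(int)outcome.Result.StatusCode}";
            Console.WriteLine($"Retry {attempt} after {delay.TotalSeconds}s: {reason}");
            // Emit a counter or structured log event here so dashboards and alerts can pick it up
        });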

Spread retries across domains

When running a crawler that pulls from many different domains, try to spread your retries across domains instead of hammering a single host with tons of retry requests. Most websites will have an informal (or formal) rate limit, even for well-behaved bots. If you retry too aggressively after a failure, you may just be making the problem worse! Cycle through your crawl targets in round-robin fashion to keep any one of them from bearing the brunt of the retry load.
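
Here's a bare-bones sketch of that round-robin idea (assuming a modern .NET project with implicit usings); the hosts and URLs are placeholders, and the fetch itself, with its retry policy, is left as a comment:

// One pending-URL queue per host; retries go to the back of their host's queue
var queuesByHost = new Dictionary<string, Queue<string>>
{
    ["site-a.com"] = new Queue<string>(new[] { "https://site-a.com/1", "https://site-a.com/2" }),
    ["site-b.com"] = new Queue<string>(new[] { "https://site-b.com/1" }),
};

// Cycle through hosts so retries for one flaky site never monopolize the crawler
while (queuesByHost.Values.Any(q => q.Count > 0))
{
    foreach (var (host, queue) in queuesByHost)
    {
        if (queue.Count == 0) continue;
        var url = queue.Dequeue();
        Console.WriteLine($"Fetching {url} from {host}");
        // Fetch url with your retry policy here; on a transient failure, re-enqueue it
        // at the back of this host's queue so other hosts get a turn before the next attempt
    }
}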

Conclusion

Well, we've covered a lot of ground in our deep dive on retries for C# web scrapers! Hopefully you're now equipped to intelligently apply retry logic to keep your crawlers humming along smoothly. Remember, retries are not a magic wand that can fix fundamentally broken scrape jobs or make horribly unreliable websites suddenly perfect – but they WILL help you recover from all those transient, random, inevitable failures that would otherwise disrupt your carefully crafted data pipelines.

Here are the key takeaways:

  • Transient failures are extremely common in large-scale web scraping, and retries are essential for preventing data gaps and wasted work
  • Naive, unbounded retry loops are dangerous – make sure to set limits on max attempts, delay between retries, and overall time
  • Declarative retry policies (e.g. with Polly) keep your code clean and your rules flexible
  • Choose retry parameters thoughtfully based on the characteristics of the sites you're scraping (popularity, uptime, rate limits, etc.)
  • Background queues are your friend for stubborn failures or massive scrape jobs
  • Monitor, alert, and iterate on your retry settings using real-world data

Happy (and resilient) scraping!
