Web scraping and web crawling allow extracting huge amounts of public data from websites – prices, reviews, news articles, research data, and more. This data powers competitive intelligence, market research, monitoring tools, AI datasets, and countless other business use cases.
However, website owners don't always take kindly to automated scraping of their sites. Detecting and blocking scrapers has become standard practice for many sites hoping to avoid large-scale data harvesting.
Getting blocked mid-scrape can completely derail a data collection project. But with the right tools and techniques, it is possible to scrape under the radar without tripping bot protections.
In this comprehensive guide, we'll share proven methods to help you answer the question: how do you crawl a website without getting blocked? Follow these tips, and you can extract the data you need without disruptions.
The Rising Threat of Bot Blocking
Bot blocking is growing increasingly common. According to Imperva research, over 25% of all website traffic now comes from unwanted bots. Many are malicious bots spreading spam and malware, but a portion are well-intentioned scrapers and crawlers.
Sites have responded fiercely to limit automated scraping with a range of bot protection methods:
Blocking Method | % of Sites Using It |
---|---|
IP Blocking | 33% |
CAPTCHAs | 29% |
Access Behavior Analysis | 19% |
Machine Learning Algorithms | 15% |
Device Fingerprinting | 13% |
With billions lost to content scraping, travel fare aggregation, and data harvesting, websites now invest heavily in technical safeguards.
To overcome these defenses, scrapers need to carefully mask their activities and tread lightly. Next, we'll explore proven techniques to avoid blocks while crawling.
Checking the Robots.txt File
The first place to check before scraping any site is the `robots.txt` file. This file contains rules for bots crawling the site, listing any restricted pages and crawl delay requirements.
You can typically find it at `examplesite.com/robots.txt`. For instance, here are some sample rules from Reddit's `robots.txt`:
# All bots should crawl Reddit with care as outlined in Reddit's API Terms of Use
# Bots that don't crawl with care will be banned
User-Agent: *
Crawl-delay: 10
User-Agent: Twitterbot
Disallow: /r/
This tells all bots to wait 10 seconds between requests, and blocks Twitterbot from crawling subreddit pages under /r/.
If your bot is disallowed entirely in `robots.txt`, you will have to scrape the site from a different IP address not already banned. Otherwise, abide by any crawl delay limits or access restrictions stated.
While respecting `robots.txt` is no guarantee against blocks, it shows good faith and helps avoid easy detection.
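Before you start crawling, you can check these rules programmatically with Python's built-in urllib.robotparser module. Here's a minimal sketch (the URLs and user agent name are examples, not taken from any particular site):

import urllib.robotparser

# Fetch and parse the site's robots.txt (example URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://examplesite.com/robots.txt')
rp.read()

user_agent = 'MyCrawler'
page = 'https://examplesite.com/some/page'

if rp.can_fetch(user_agent, page):
    # crawl_delay() returns None if robots.txt sets no Crawl-delay directive
    delay = rp.crawl_delay(user_agent) or 10
    print(f'Allowed to crawl {page}; waiting {delay}s between requests')
else:
    print(f'{page} is disallowed for {user_agent} - skip it')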
Masking Your Identity with Proxies
Proxies add a middleman hop between your scraper IP and the target site. Instead of hitting the site directly, requests route through the proxy server first.
This hides your real IP address from the site, making it far harder to detect and block a single scraper source. Using many rotating proxies gives the appearance of traffic coming from multiple users around the world.
There are two main types of proxies to consider:
Residential proxies use IPs from real desktops and mobile devices in homes and businesses. Because the IP geolocation, ISP, and usage patterns match real humans, residential proxies provide the most stealth when scraping sites aggressively.
Datacenter proxies come from hosted servers in datacenters specifically leased for proxying purposes. They are faster than residential IPs, but act more like machines than natural users.
Here's a comparison between the two proxy types:

| | Residential Proxies | Datacenter Proxies |
|---|---|---|
| IP Sources | Home & business internet connections | Dedicated proxy servers |
| Speed | Medium | Very Fast |
| Bot Detection Resistance | Very high | Medium |
| Cost | $50+ per GB of traffic | $1-$10 per GB |
| Use Case | Heavy scraping of sites with advanced bot defenses | General crawling of many sites |
For basic crawling needs, datacenter proxies offer good performance at lower costs. But for heavily crawled sites, residential proxies are worth the premium for smooth scraping without disruptions.
Popular proxy services include Oxylabs, GeoSurf, Luminati, and Smartproxy. Using robust tools like these is far easier than configuring your own proxy servers.
Here's a Python example using the requests module with a rotating proxy (the proxy_rotator helper stands in for whatever rotation client your proxy provider offers):
import requests
from proxy_rotator import ProxyRotator

# Load credentials and the proxy list, then pull the next proxy from the pool
rotator = ProxyRotator('username', 'password', 'ip_list.txt')
proxy = rotator.get_proxy()  # Get next proxy

proxies = {
    'http': f'http://{proxy}',
    'https': f'https://{proxy}',
}

requests.get('https://example.com', proxies=proxies)
The key is cycling through many quality, dedicated proxies and not reusing them excessively. This makes blocking any single proxy ineffective.
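If you'd rather not depend on a dedicated rotation client, the same idea can be sketched with nothing but itertools and your own proxy list (the proxy addresses below are placeholders):

import itertools
import requests

# Cycle endlessly through your proxy pool (placeholder addresses)
proxy_pool = itertools.cycle([
    '203.0.113.10:8080',
    '203.0.113.11:8080',
    '203.0.113.12:8080',
])

def fetch(url):
    # Each request goes out through the next proxy in the pool
    proxy_url = f'http://{next(proxy_pool)}'
    proxies = {'http': proxy_url, 'https': proxy_url}
    return requests.get(url, proxies=proxies, timeout=10)

response = fetch('https://example.com')
print(response.status_code)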
Realistic User Agents Are a Must
The user agent provides information about the browser, OS, and any custom client. Web servers analyze these strings to identify suspicious values associated with bots and scrapers.
Common signs of a crawler user agent:
- Unusual browser or OS names
- Old browser versions that are rarely used
- Missing browser version and OS info
- "Python", "Scrapy" or other terms that give away the client
- Repeated identical user agent values
Instead, your bot should masquerade using real user agents mimicking popular browsers like Chrome, Firefox, and Safari on Windows, iOS or Android.
Services like WhatIsMyBrowser provide up-to-date user agent strings for common configurations. Rotate randomly between a pool of these real values as you crawl:
import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:107.0) Gecko/20100101 Firefox/107.0',
]

# Pick a different real-browser user agent for each request
user_agent = random.choice(user_agents)
headers = {'User-Agent': user_agent}

requests.get('https://example.com', headers=headers)
Mimicking browser user agents is crucial for any web scraper to act stealthily.
Crawling Like a Human
Bots often exhibit suspiciously fast, robotic crawling patterns. By injecting more human-like behaviors, your scraper can avoid easy detection. Useful techniques include:
Mouse movements – Move the mouse randomly around pages to mimic human reading patterns. Tools like Pyppeteer for Python allow controlling mouse movement.
Scrolling – Programmatically scroll pages during the crawl; don't just interact with fully loaded static content.
Delays – Wait a random interval of 5-15+ seconds between requests, avoiding instant rapid-fire crawling.
Clicks – Realistically click links, buttons, and other elements rather than only requesting target URLs directly.
Form completion – Submit valid but fake data when required to fill out forms.
Organic navigation – Navigate site menus and links naturally; don't just hammer targeted URLs.
Reading patterns – Visit related content like a user browsing naturally, not laser-focused on specific data extraction.
Here's sample Python code using the Humans package to add human actions:
from humans import actions
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')

# Scroll randomly down the page
actions.scroll(driver, down=True)

# Move the mouse around
actions.move_mouse(driver, x=100, y=400, absolute=True, duration=1)

button = driver.find_element(By.CSS_SELECTOR, '.signup-btn')

# Hover over the button before clicking
actions.hover(driver, element=button)

# Click the button
button.click()

# Wait 5-15 seconds
actions.wait(driver, min=5, max=15)
Blending in human movements makes it very difficult for sites to differentiate your bot from an organic visitor.
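If you'd rather stick to Selenium's built-in APIs instead of an extra package, ActionChains plus randomized delays can approximate the same behaviors. A rough sketch (the .signup-btn selector is hypothetical):

import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')

# Scroll a random distance down the page, like a reader skimming
driver.execute_script(f'window.scrollBy(0, {random.randint(300, 800)});')
time.sleep(random.uniform(1, 3))

# Hover over the button briefly before clicking it (hypothetical selector)
button = driver.find_element(By.CSS_SELECTOR, '.signup-btn')
ActionChains(driver).move_to_element(button).pause(random.uniform(0.5, 1.5)).click().perform()

# Wait a human-like interval before the next action
time.sleep(random.uniform(5, 15))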
Control Your Crawling Speed
Bots can crawl exponentially faster than humans, pounding sites with requests. Excessively fast scraping is likely to trigger bot protections and blocks.
Here are some tips to limit your crawl rate:
- Set concurrent request limits (e.g. 10-50) so you don't overload the site.
- Enforce random delays between requests as described above.
- Make requests through a pool of rotating proxies to further distribute load.
- Crawl during off-peak hours when site traffic is lower, such as nights and weekends.
- If you hit rate limits or find your requests getting blocked, pause scraping for a while before resuming more slowly.
- Identify high-value pages and data to scrape selectively vs. crawling the entire site.
As a rule of thumb, tread lightly and pace your bot about the same speed as a human visitor would navigate the site.
Here's sample code that throttles a Scrapy spider using Scrapy's built-in delay and autothrottle settings, avoiding overly aggressive crawling:
import scrapy

class ThrottledSpider(scrapy.Spider):
    name = 'throttled'
    start_urls = ['https://example.com']

    # Built-in Scrapy settings keep the crawl rate modest
    custom_settings = {
        'DOWNLOAD_DELAY': 2,                  # wait ~2 seconds between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,     # vary the delay to look less robotic
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # cap simultaneous requests per site
        'AUTOTHROTTLE_ENABLED': True,         # back off automatically if the site slows down
    }

    def parse(self, response):
        # ... extract data from the response ...
        pass
Get a feel for a site's thresholds and stay safely under the rate limits. This balanced crawling approach raises far fewer red flags.
Headless Browsers Render JavaScript
Many sites now use JavaScript to load key content. Plain HTTP requests won't execute JavaScript, meaning you miss out on dynamic data that is only rendered in the browser.
Headless browsers like Puppeteer, Playwright, and Selenium run a real browser engine like Chrome in the background. This allows executing JavaScript to fully render pages just like a real user's browser.
For example, here is Puppeteer loading an AJAX-populated page:
// Node.js example
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Wait for AJAX content to load
  await page.waitForSelector('.ajax-loaded');

  const html = await page.content(); // Will contain AJAX data
  console.log(html);

  await browser.close();
})();
Headless browsers see the final rendered page like a real browser, not just the initial raw HTML.
Key advantages:
- JavaScript support to extract interactive data
- Can click buttons/links and submit forms
- Render pages closely to a real browser for added stealth
Downsides:
- Slower page load times than raw requests
- Added complexity to set up and interact with
- Higher memory usage
For simple scraping of static content, raw requests are faster and easier. Use headless browsing sparingly when you really need dynamic JavaScript content.
Popular headless tools include:
- Puppeteer – Headless Chrome browser with JavaScript API
- Playwright – Supports Chrome, Firefox and Safari browsers
- Selenium – Browser automation for Chrome, Firefox, Safari & Edge
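Playwright also ships official Python bindings, so the AJAX wait from the Puppeteer example above can be sketched in Python as well (the URL and selector are examples):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

    # Wait for the dynamically loaded element to appear (example selector)
    page.wait_for_selector('.ajax-loaded')

    html = page.content()  # Fully rendered HTML, including AJAX-loaded data
    print(html)

    browser.close()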
The Peril of Honeypots
Some devious sites use hidden honeypots – links and buttons invisible to normal visitors, but present in the page HTML where auto-crawling scrapers will find and follow them.
When a crawler clicks these honeypots, the site knows it's an unwanted bot and can instantly block it.
Unfortunately, there's no surefire way for scrapers to avoid honeypots. You'll have to meticulously analyze pages and use judgment on whether elements look suspicious or invisible. Monitor for any blocks triggered after clicking certain links.
In general, only interact with obvious visible links and content on each page, avoiding anything suspicious. Limit use of overly aggressive crawling patterns that blindly click every element.
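One partial safeguard is to only follow elements the browser actually renders as visible. This won't catch every trap, but it filters out the classic hidden-link honeypot. A sketch using Selenium's is_displayed() check (example URL):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')

# Skip links the browser hides (display:none, zero size, etc.) -
# those are prime honeypot candidates
visible_links = [
    link for link in driver.find_elements(By.TAG_NAME, 'a')
    if link.is_displayed() and link.text.strip()
]

for link in visible_links:
    print(link.get_attribute('href'))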
Outsourcing CAPTCHA Solving
CAPTCHAs present a pesky challenge for bots. The distorted text and images are designed so automated solvers fail.
To seamlessly handle CAPTCHAs during scraping, the best approach is using a CAPTCHA solving service. Services like Anti-Captcha and 2Captcha employ humans to solve CAPTCHAs at massive scale, costing just a few dollars per thousand CAPTCHAs.
When your bot encounters a CAPTCHA, these APIs let you forward it to a human solver and receive the solution back for your scraper to submit.
Scraping tools like ProxiesAPI integrate CAPTCHA solvers so your bot never breaks its stride. For custom scrapers, you can incorporate solving APIs to detect, pass off, and solve CAPTCHAs automatically as they appear.
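As a rough illustration, here is a sketch of 2Captcha's classic HTTP flow for a reCAPTCHA challenge; treat the parameters as indicative and check the service's current documentation (the API key and site key are placeholders):

import time
import requests

API_KEY = 'YOUR_2CAPTCHA_KEY'  # placeholder

def solve_recaptcha(site_key, page_url):
    # Submit the CAPTCHA to the solving service
    submit = requests.get('https://2captcha.com/in.php', params={
        'key': API_KEY,
        'method': 'userrecaptcha',
        'googlekey': site_key,
        'pageurl': page_url,
        'json': 1,
    }).json()
    task_id = submit['request']

    # Poll until a human solver returns the token
    while True:
        time.sleep(5)
        result = requests.get('https://2captcha.com/res.php', params={
            'key': API_KEY,
            'action': 'get',
            'id': task_id,
            'json': 1,
        }).json()
        if result['status'] == 1:
            return result['request']  # token to submit with the page's form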
Advanced Browser Fingerprinting
Browser fingerprinting examines subtle differences between real browsers and bots to identify unique fingerprints. Sites inspect factors like:
- Screen size & resolution
- System fonts
- Installed plugins
- Browser version
- Timezone
- WebGL renderer configurations
- Canvas and WebAssembly benchmarks
When combined, these attributes create a unique fingerprint browsers can be identified by. Unusual or repeated fingerprints signify a bot.
Advanced tools like Luminati Enterprise Proxies and Web Unblocker from Oxylabs can mask your browser fingerprint to avoid fingerprinting blocks. They combine fingerprinting variables dynamically like a real browser.
For custom scrapers, you can also alter configurations to blend in better:
// Puppeteer example
const page = await browser.newPage();

await page.evaluateOnNewDocument(() => {
  // Modify navigator properties
  Object.defineProperty(navigator, 'platform', { get: () => 'MacIntel' });

  // Set fake browser plugins
  Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });

  // Adjust the WebGL config (37445 = UNMASKED_VENDOR_WEBGL)
  const getParameter = WebGLRenderingContext.prototype.getParameter;
  WebGLRenderingContext.prototype.getParameter = function (parameter) {
    if (parameter === 37445) {
      return 'Intel Open Source Technology Center';
    }
    return getParameter.call(this, parameter);
  };
});
Fingerprinting has become highly advanced, so commercial tools may be needed to properly mask it. But some custom adjustments can also help evade fingerprinting defenses.
Final Thoughts
After over a decade of web scraping experience, I've learned firsthand how to overcome blocks and extract the data I need. While sites fight back against scrapers, the methods here form a battle-tested blueprint to stay under the radar.
The key is blending in – mimicking real browsers with realistic configurations, speeds, clicks and navigation. Proxies and fingerprint masking cloak your identity and activity. CAPTCHA solvers handle tedious challenges.
No solution is 100% bulletproof against all blocks. Expect to face some obstacles, but continuously refine your techniques and respect sites' boundaries. With practice, you can master secure, sustainable web scraping without disruptions.
Hopefully this guide provides a strong foundation for smooth scraping. I invite you to share your own tips and experiences in the comments!