
Web Crawling in Python: Building Scalable Crawlers from Scratch to Scrapy

Web crawling is the process of programmatically downloading web pages, extracting data from them, and discovering new pages to crawl. It powers search engines, fuels data science, drives business insights, and more. Python, with its ease of use and rich ecosystem, has become a go-to language for building crawlers.

In this guide, we'll walk through web crawling in Python, from first principles to production-ready tooling. You'll learn:

  • The basic components and flow of a web crawler
  • How to build a simple crawler from scratch using Python stdlib
  • Why you'll likely want to upgrade to third-party crawling libraries
  • How to build robust, scalable crawlers using the popular Scrapy framework
  • Best practices and advanced techniques for crawling efficiently and reliably

Let's get crawling!

Web Crawling Fundamentals

At its core, a web crawler systematically and automatically browses the web by:

  1. Starting with a list of URLs to visit (the "seeds")
  2. Fetching the HTML content at each URL
  3. Extracting data and hyperlinks to new pages from the HTML
  4. Adding new page URLs to the list of pages to crawl
  5. Repeating the process with the new list of URLs

There are many variations and enhancements to this basic logic, but in general a crawler works to obtain a copy of a set of interlinked pages that can be processed later for data extraction, search indexing, and so on.
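
Stripped to its essentials, that loop is just a queue of URLs still to visit plus a set of URLs already seen. Here is a minimal sketch; fetch and extract_links are placeholder callables that the rest of this guide fills in:

from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=100):
    # Generic crawl loop: fetch(url) returns HTML, extract_links(html, url) returns URLs
    frontier = deque(seeds)                      # step 1: start from the seed URLs
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                        # step 2: fetch the HTML
        pages[url] = html
        for link in extract_links(html, url):    # step 3: extract data and links
            if link not in seen:                 # steps 4-5: queue newly discovered URLs
                seen.add(link)
                frontier.append(link)
    return pages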

A Basic Crawler using Python Standard Library

Python comes with batteries included for building a basic web crawler using just the standard library: the urllib package fetches data over HTTP, and html.parser parses HTML:


import logging
from urllib.parse import urljoin
import urllib.request
from html.parser import HTMLParser

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attr, val) in attrs:
                if attr == 'href':
                    # Resolve relative hrefs against the page URL
                    url = urljoin(self.base_url, val)
                    self.links.add(url)

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url
        self.links_to_crawl = [base_url]
        self.crawled_links = set()

    def extract_links(self, url):
        response = urllib.request.urlopen(url)
        html = response.read().decode('utf-8', errors='ignore')

        extractor = LinkExtractor(url)
        extractor.feed(html)

        return extractor.links

    def crawl(self, url):
        self.crawled_links.add(url)
        logging.info(f'Crawling: {url}')

        for link in self.extract_links(url):
            if link not in self.crawled_links:
                self.links_to_crawl.append(link)

    def run(self):
        while self.links_to_crawl:
            link = self.links_to_crawl.pop(0)
            if link in self.crawled_links:
                continue  # skip URLs that were queued more than once
            try:
                self.crawl(link)
            except Exception:
                logging.exception(f'Failed to crawl: {link}')
        logging.info('Crawling complete')

if __name__ == '__main__':
    crawler = Crawler(base_url='https://www.scrapingbee.com/blog/')
    crawler.run()

Here we define a Crawler class that:

  1. Tracks links already crawled and links remaining to crawl
  2. Can extract links from an HTML page
  3. Crawls each link, adding newly found links to the list to crawl
  4. Continues the process until all discovered links are visited

While it works, this DIY approach has notable limitations:

  • Very little of the hard work of crawling is handled (throttling, retries, etc)
  • Concurrency and efficiency are lacking, as requests are issued one at a time
  • URL canonicalization isn't addressed, and the crawl isn't scoped to a single site
  • robots.txt isn't respected (see the sketch below)
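
As one example of closing these gaps, even the standard library can honor robots.txt via urllib.robotparser before a URL is fetched. A minimal sketch (the user agent string is illustrative):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

base_url = 'https://www.scrapingbee.com/blog/'

robots = RobotFileParser()
robots.set_url(urljoin(base_url, '/robots.txt'))  # robots.txt lives at the site root
robots.read()

if robots.can_fetch('my-crawler', base_url):      # check before fetching the page
    print('Allowed to crawl', base_url)
else:
    print('Disallowed by robots.txt:', base_url)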

For any non-trivial crawling, you'll want a bit more firepower. Let's see how third-party libraries help.

Crawling with Third-Party Python Libraries

The Python ecosystem provides some excellent libraries for many of the core crawling tasks. Three of the most useful are:

  • Requests – A simple, powerful library for making HTTP requests
  • Beautiful Soup – A library that makes extracting data from HTML/XML easy
  • lxml – A fast HTML and XML parser

Using these libraries, we can tighten up and expand our crawler:


import logging
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url
        self.links_to_crawl = [base_url]
        self.crawled_links = set()

    def extract_links(self, url):
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')

        for link in soup.find_all('a'):
            path = link.get('href')
            if not path:
                continue
            # urljoin resolves relative paths and leaves absolute URLs untouched
            yield urljoin(url, path)

    def crawl(self, url):
        self.crawled_links.add(url)
        logging.info(f'Crawling: {url}')

        for link in self.extract_links(url):
            if link not in self.crawled_links:
                self.links_to_crawl.append(link)

    def run(self):
        while self.links_to_crawl:
            link = self.links_to_crawl.pop(0)
            if link in self.crawled_links:
                continue  # skip URLs that were queued more than once
            try:
                self.crawl(link)
            except requests.exceptions.RequestException:
                logging.exception(f'Failed to crawl: {link}')
        logging.info('Crawling complete')

if __name__ == '__main__':
    crawler = Crawler(base_url='https://www.scrapingbee.com/blog/')
    crawler.run()

Some notable improvements:

  • Using Requests for simpler HTTP
  • Extracting links via Beautiful Soup for more power/readability
  • Handling relative links and URL construction correctly with urljoin
  • Logging Requests exceptions for better error handling

Nonetheless, handling all the other complexities of crawling from scratch is still a tall task.
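
For example, even basic resilience takes extra plumbing: a shared connection pool, automatic retries with backoff, and timeouts on every request. A minimal sketch using a requests Session (the user agent and retry values are illustrative assumptions):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    retry = Retry(
        total=3,                                     # retry a failed request up to 3 times
        backoff_factor=0.5,                          # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these HTTP status codes
    )
    session = requests.Session()
    session.headers['User-Agent'] = 'my-crawler/0.1 (+https://example.com/bot)'
    session.mount('https://', HTTPAdapter(max_retries=retry))
    session.mount('http://', HTTPAdapter(max_retries=retry))
    return session

session = make_session()
response = session.get('https://www.scrapingbee.com/blog/', timeout=10)  # always set a timeout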

For more serious crawling needs, a dedicated crawling framework is advisable. Let's refactor our crawler to use the popular Scrapy framework.

Scrapy: The Web Crawling Framework for Python

Scrapy is a fast, powerful and extensible web crawling framework. With a complete package of tools for scraping, crawling, processing items, and handling most crawling complexities, Scrapy is the leading Python framework for crawling at scale.

At a high level, Scrapy crawlers:

  • Start with a list of initial URLs to crawl
  • Extract data and new links from downloaded pages
  • Follow links matching certain criteria to crawl further
  • Feed extracted data into an item processing pipeline
  • Output processed items

Here's how we could implement our crawler in Scrapy:


import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = 'blog_spider'
    start_urls = ['https://www.scrapingbee.com/blog/']

    rules = (
        Rule(LinkExtractor(allow='blog/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
            'date': response.css('time::text').get(),
            'author': response.css('.author::text').get()
        }

Much more concise than our previous code! How it works:

  1. The start_urls list defines the initial pages to crawl
  2. Rules are defined with LinkExtractors to extract new page links from page responses
  3. parse_item extracts data from pages into a Python dict
  4. Scrapy handles the crawling "plumbing" – concurrency, throttling, retries, etc

To run it:


scrapy runspider blog_spider.py
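
Alternatively, you can launch the spider from a plain Python script with Scrapy's CrawlerProcess, which is handy when a crawl is embedded in a larger program. A minimal sketch, assuming the spider above lives in blog_spider.py (the file and feed names are assumptions):

from scrapy.crawler import CrawlerProcess
from blog_spider import BlogSpider

process = CrawlerProcess(settings={
    'FEEDS': {'items.json': {'format': 'json'}},  # write extracted items to a JSON file
})
process.crawl(BlogSpider)
process.start()  # blocks until the crawl finishes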

Scrapy handles the major challenges of crawling out of the box:

  • Supports concurrent requests for improved performance
  • Respects robots.txt
  • Provides auto-throttling and user agent spoofing
  • Handles cookies and authentication
  • Enables crawl depth limiting, URL length restrictions, and more
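
All of the above is driven by Scrapy settings, either in a project's settings.py or in a spider's custom_settings dict. A hedged sketch of the kind of values you might start with; the numbers are illustrative, not recommendations:

# settings.py (illustrative values to tune per site)
ROBOTSTXT_OBEY = True                # respect robots.txt
AUTOTHROTTLE_ENABLED = True          # adapt the request rate to server response times
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per site
DOWNLOAD_DELAY = 1.0                 # base delay (seconds) between requests to a domain
DEPTH_LIMIT = 5                      # stop following links beyond this depth
URLLENGTH_LIMIT = 2083               # skip overly long URLs, a common crawler-trap guard
USER_AGENT = 'my-crawler/0.1 (+https://example.com/bot)'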

Of course, for specific cases you may still need to customize and extend Scrapy. But it provides a sane and scalable base for building production-grade crawlers that can tackle even the largest sites on the web.
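
A common extension point is an item pipeline that validates, cleans, or stores items as the spider yields them. A minimal sketch of a hypothetical pipeline that drops items without a title (you would enable it via the ITEM_PIPELINES setting):

from scrapy.exceptions import DropItem

class RequireTitlePipeline:
    # Hypothetical pipeline: discard any item that has no title
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem(f"Missing title: {item.get('url')}")
        return item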

Crawling Best Practices and Advanced Techniques

When crawling, especially at scale, there are a number of best practices to keep in mind:

  • Respect robots.txt – It's good ethical practice and may keep you from getting blocked
  • Set a descriptive user agent string, perhaps even with a URL, so sites can contact you
  • Implement autothrottle to avoid hitting servers too hard
  • Limit concurrent requests to a reasonable level
  • Set a crawl delay between requests to each domain
  • Avoid crawler traps – Limit max URL length, avoid query parameters, set a max crawl depth
  • Handle errors gracefully – Retry intermittent failures and log exceptions for later analysis (see the errback sketch below)
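
On that last point, Scrapy retries transient failures for you (tunable via RETRY_TIMES), and you can attach an errback to requests to log whatever still fails. A minimal sketch with an illustrative spider:

import scrapy

class ResilientSpider(scrapy.Spider):
    name = 'resilient'
    start_urls = ['https://www.scrapingbee.com/blog/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('h1::text').get()}

    def on_error(self, failure):
        # failure is a Twisted Failure; record it for later analysis
        self.logger.error('Request failed: %s', repr(failure))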

Some advanced techniques that may be worth exploring:

  • Distributed crawling – Run multiple crawler instances in parallel for better performance
  • Crawl data enrichment – Enrich crawled data with additional info from APIs, NLP, etc
  • Incremental crawling – Only crawl new/updated pages for more efficient recurring crawls
  • Dynamic rendering – Use a headless browser to crawl JavaScript-heavy sites (sketched below)
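
For the last technique, a headless browser such as Playwright (one option among several; no particular tool is prescribed here) can render a page's JavaScript before you parse it. A minimal sketch:

from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    # Render the page in headless Chromium and return the post-JavaScript HTML
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
        return html

if __name__ == '__main__':
    print(len(fetch_rendered_html('https://www.scrapingbee.com/blog/')))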

Learning More

We've covered a lot in this guide, but web crawling is a deep topic where you never stop learning. I've found one of the best ways to level up your crawling skills is to pick a popular site and try to crawl it, adapting to challenges as they arise.

Some great resources to continue learning:

  • Scrapy Documentation – The authoritative guide to the leading Python crawling framework
  • Web Scraping with Python – A comprehensive book on all things web scraping and crawling in Python
  • Crawling and Scraping the Web – How Google built their production crawling systems

Happy crawling!
