
The Complete Guide to Web Scraping with Scrapy in Python

Welcome to my mega-tutorial on web scraping with Scrapy!

I've been a data extraction specialist for over 10 years. In that time, I've used pretty much every Python web scraping library under the sun. And without a doubt, Scrapy comes out on top for large-scale production scraping.

In this guide, I'm excited to walk you through the key things you need to use Scrapy effectively. I'll be sharing lots of real-world examples, code snippets, visuals and hard-earned advice – all explained in simple terms.

So buckle up, and let's get scraping!

Why Choose Scrapy for Web Scraping?

There are several great web scraping libraries in Python, such as BeautifulSoup and Selenium. But here are some key reasons why I recommend Scrapy as the best choice:

Speed – Scrapy is extremely fast because it is built on the asynchronous Twisted engine, so it sends requests concurrently and scrapes many pages in parallel. On modest hardware it can comfortably handle thousands of requests per minute, which makes it ideal for scraping large sites.

Powerful Extraction Tools – Scrapy Selectors support CSS, XPath and regular expressions, making it easy to pull out text and data. You don't have to hand-parse messy HTML – just describe the patterns you want.

Batteries Included – Scrapy ships with built-in support for following pagination links, throttling requests, handling cookies, using proxies and more. This saves you from reinventing the wheel.

Large Ecosystem – A large ecosystem of extensions and middlewares provides integration with storage backends, headless browsers, caching and more. This enables adding functionality as you need it.

Library       | Speed     | Scalability | JavaScript Support | Learning Curve
------------- | --------- | ----------- | ------------------ | --------------
Scrapy        | Very fast | Excellent   | Via extensions     | Moderate
BeautifulSoup | Slow      | Poor        | No                 | Easy
Selenium      | Slow      | Moderate    | Excellent          | Difficult

As you can see, Scrapy provides the best blend of speed, scalability and ease of use. That's why so many teams rely on Scrapy for large-scale production web scraping.

Next, let me walk you through getting started with Scrapy.

Installing Scrapy

Scrapy is written in pure Python. To install:

pip install scrapy

That's it! pip automatically pulls in the packages Scrapy depends on, such as Twisted and Parsel.

Once installed, you can verify by running:

scrapy version

This will print the currently installed version. At the time of writing, the latest stable version is 2.7.1.

With Scrapy installed, you are ready to create your first project!

Creating a New Scrapy Project

Scrapy organizes scraping code into "projects" made up of spiders, pipelines and related components. To create a new project:

scrapy startproject myproject

This generates a myproject directory with the following structure:

myproject
├── myproject/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       └── __init__.py
└── scrapy.cfg

This layout may look complex at first, but it keeps large projects organized and separates concerns. Here's what each file and folder does:

  • spiders/ – Where our scraping code (spiders) lives
  • items.py – Defines datatypes for scraped data
  • pipelines.py – Handles data processing/storage
  • settings.py – Provides runtime configuration
  • middlewares.py – Implements custom downloader middlewares
  • scrapy.cfg – Deployment configuration for Scrapy

Don't worry about all the files for now. We'll mostly be working with spiders in the spiders/ folder.

Let's learn how Scrapy spiders work!

Anatomy of a Scrapy Spider

Spiders are classes that define the scraping logic for a site (or group of sites). Here are the key components of a spider:

  • name – Unique identifier for the spider
  • allowed_domains – List of allowed domains to scrape
  • start_urls – List of URLs to start crawling from

The main scraping logic lives in two methods:

  • start_requests() – Generates the initial Request objects (by default, one for each URL in start_urls)
  • parse() – Extracts data from each response and yields follow-up requests

When we run a spider, Scrapy downloads each requested URL and calls parse() with the response (parse() is the default callback).

The parse() method analyzes the response using Selectors and yields:

  • The extracted data
  • Additional URLs to scrape by generating Request objects

This allows recursively crawling pages by following links.
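Putting those pieces together, a bare-bones spider skeleton looks like this (the name, domain and URL here are placeholders, not part of any later example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def start_requests(self):
        # Optional: override this to customize how the initial requests are built
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Extract data and/or yield further Requests here
        yield {'url': response.url}

Now let's see this in action!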

Our First Spider

Let's write a simple spider to scrape product listings from the site books.toscrape.com.

First, we'll generate a spider template named books_spider.py using:

scrapy genspider books_spider books.toscrape.com

This creates a spider with the URLs pre-populated:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

Now let's fill out the parse() method to extract the product title and price:

    def parse(self, response):
        for product in response.css('article.product_pod'):

            title = product.css('h3 > a::attr(title)').get()
            price = product.css('.price_color::text').get()

            yield {
                'title': title,
                'price': price
            }

Here we:

  • Loop through the product_pod elements
  • Extract title and price using CSS Selectors
  • Yield a Python dict with the scraped data

And that's it! Our first little web scraper is ready. Let's run it:

scrapy crawl books_spider

This will crawl the pages starting at books.toscrape.com and log the extracted titles and prices as it goes.
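To save the items to a file, pass one of Scrapy's feed-export flags (the filename is up to you):

scrapy crawl books_spider -O books.json

The -O flag overwrites books.json on each run (lowercase -o appends instead), and the file extension tells Scrapy which export format to use.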

Now that we have the basics down, let's learn to scrape across paginated pages.

Handling Pagination in Scrapy

Most websites split content across multiple pages. To scrape them all, we need to follow pagination links recursively.

Here is how our books spider can handle pagination:

class BooksSpider(scrapy.Spider):

    # ...

    def parse(self, response):

        # Scraping logic from before goes here

        next_page = response.css('li.next a::attr(href)').get()

        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

We first extract the next-page URL using a CSS Selector. Then we yield another request via response.follow, passing in:

  • The next_page URL
  • self.parse as the callback again

This makes Scrapy recursively call parse() on each page's response until no next-page link is found. Pretty neat!

Scraping paginated content is a breeze with Scrapy. Next, let's see how to store our scraped data.

Storing Scraped Data

Scrapy spiders simply yield items; nothing is saved unless you tell Scrapy where to put it. There are two main ways to store items in files or databases:

1. Feed exports

The simplest option is the feed export system we already used with the -O flag. For full control you can also drive Scrapy's Item Exporters yourself – for example, to write all scraped items to a JSON file when the spider closes:

from scrapy.exporters import JsonItemExporter

class BooksSpider(scrapy.Spider):

    # ...spider code; parse() should also append each item to self.scraped_items

    def closed(self, reason):
        # Called once when the spider finishes crawling
        with open('books.json', 'wb') as json_file:
            exporter = JsonItemExporter(json_file)
            exporter.start_exporting()

            for item in self.scraped_items:
                exporter.export_item(item)

            exporter.finish_exporting()

Scrapy provides built-in exporters for JSON, JSON Lines, CSV and XML formats.
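In recent Scrapy versions you can also configure exports declaratively with the FEEDS setting in settings.py, with no export code at all (the filename below is just an example):

FEEDS = {
    'books.json': {
        'format': 'json',
        'overwrite': True,
    },
}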

2. Pipelines

For advanced storage and processing, you can write Item Pipelines. Typical uses for pipelines:

  • Validate scraped data
  • Drop duplicate items
  • Store data in databases
  • Upload images to cloud storage

Pipelines let you post-process items as they are scraped, which makes them extremely useful for data cleaning and storage.
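For instance, here is a sketch of a small pipeline that validates and cleans the price field from our books spider (the class name and the pound-sign handling are my own choices, not Scrapy built-ins):

from scrapy.exceptions import DropItem

class PriceValidationPipeline:

    def process_item(self, item, spider):
        # Drop any item that came back without a price
        if not item.get('price'):
            raise DropItem(f'Missing price in {item}')

        # Normalize a price string like '£51.77' to the float 51.77
        item['price'] = float(item['price'].lstrip('£'))
        return item

To activate it, register the class in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.PriceValidationPipeline': 300,
}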

Now that we know the basics of spiders and scraping pages, let's dive a bit deeper.

Configuring Scrapy Settings

Scrapy uses a settings file to control its runtime behavior and functionality. Some key settings include:

USER_AGENT – The User-Agent string sent with every request. Using a realistic browser value (or rotating it) helps avoid detection.

ROBOTSTXT_OBEY – Whether robots.txt rules should be obeyed. Generated projects default to True; set it to False to ignore them.

COOKIES_ENABLED – Whether to enable cookie handling. Disable it when you don't need session state.

CONCURRENT_REQUESTS – The maximum number of requests Scrapy sends concurrently. Lower it to reduce load on the target site.

DOWNLOAD_DELAY – The delay in seconds between consecutive requests to the same domain. Increase it to throttle your crawl rate.

ADDONS – A dict of add-ons and extensions enabled for the project.

There are many more settings for tweaking Scrapy's behavior. I recommend mastering the core ones above to control how your crawls behave.
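As a concrete example, a conservative settings.py might contain something like this (the values are illustrative, not one-size-fits-all):

# settings.py – example values
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 2   # seconds between requests to the same domain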

Now let's look at some handy tools for debugging Scrapy spiders.

Debugging Spiders with Scrapy Shell

Scrapy provides an incredibly useful tool called the Scrapy Shell for testing responses and selectors.

To try it out, run:

scrapy shell "http://books.toscrape.com" 

This drops you into a Python shell with the response loaded and scoped:

In [1]: response.css('title')
Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>All products</title>'>]

In [2]: response.xpath('//h1/text()')
Out[2]: [<Selector xpath='//h1/text()' data='All products'>]

The shell lets you interactively test CSS/XPath selectors and see how Scrapy analyzes responses.
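Two other shell helpers worth knowing are fetch(), which downloads a new URL into the current session, and view(), which opens the downloaded response in your browser so you can see exactly what Scrapy received:

In [3]: fetch('http://books.toscrape.com/catalogue/page-2.html')
In [4]: view(response)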

I use it extensively on almost all my scraping projects to fine-tune extraction without repeatedly re-running the spiders. It catches most selector issues before I even write the parse code!

Learning to leverage the shell will massively boost your productivity. Scrapy just made debugging fun 🙂

Using Proxies and User Agents

While scraping, it's common to get blocked by sites if you:

  • Overwhelm the server with too many requests
  • Don't use browser-like request headers

Here are two simple tweaks to prevent blocks:

1. Rotate User Agents

Start by pointing the USER_AGENT setting at a realistic browser string (it holds a single static value, so on its own it cannot rotate anything):

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'

Varying the User-Agent header per request makes your traffic look more human; for that you need a small downloader middleware.
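A minimal sketch of such a middleware could look like this (the class name and the USER_AGENTS list are my own, not part of Scrapy; remember to register the class under DOWNLOADER_MIDDLEWARES):

import random

# A pool of real browser User-Agent strings to pick from
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

class RandomUserAgentMiddleware:

    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)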

2. Use Proxy Rotation

To prevent IP-based blocks, you can route requests through a pool of proxies:

import random

# PROXIES is a list of proxy URLs you supply, e.g. 'http://user:pass@host:port'

class ProxyMiddleware:

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        request.meta['proxy'] = random.choice(PROXIES)

Here we randomly assign a proxy from the list to each request, which spreads the load across IPs and helps avoid bans.
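To activate the middleware, register it in settings.py. Assuming the class lives in myproject/middlewares.py (adjust the dotted path to your project), a priority below the built-in HttpProxyMiddleware (750) makes it run first:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
}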

And that's barely scratching the surface of what's possible by mixing and matching Scrapy components!

Let's now tackle a common challenge – scraping dynamic, JavaScript-powered websites.

Scraping JavaScript Websites

A major limitation of Scrapy is that it only sees the static HTML a website initially returns.

Modern sites rely heavily on JavaScript to render content. To scrape them, we need something that executes the JavaScript and waits for the page to finish loading.

Scrapy integrates with tools like Splash and Playwright (via the scrapy-splash and scrapy-playwright plugins) to render JavaScript pages.

Here is how you can configure Scrapy to use Splash, assuming the scrapy-splash package is installed and a Splash instance is running:

SPLASH_URL = 'http://localhost:8050'  # Address of the running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

We add the middleware classes required for Splash integration.

To scrape a page, you can now request it via SplashRequest (imported from scrapy_splash):

yield SplashRequest(url, args={'wait': 3})

This renders the page in Splash with a 3 second wait before the HTML is handed back to your spider – far lighter than driving a full Selenium browser.

Crawl Best Practices for Avoiding Bans

Through years of web scraping, I've learned some key best practices the hard way. Here are a few tips:

  • Respect robots.txt and check a site's policies before scraping
  • Set a reasonable DOWNLOAD_DELAY of 2+ seconds
  • Disable cookies unless absolutely necessary
  • Use proxies and rotate User Agents with each request
  • Avoid scraping too fast – limit to <100 requests/min
  • Analyze scraped data and catch regressions quickly
  • Retry transient failures like 500 errors (Scrapy's built-in RetryMiddleware handles this)

Following standards and crawling politely helps avoid targeted blocking; a few settings that put these tips into practice are shown below. Mastering Scrapy while keeping these principles in mind will get you far!
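For example, Scrapy's AutoThrottle extension and retry middleware cover the rate-limiting and retry tips directly from settings.py (these values are a starting point to tune per site):

AUTOTHROTTLE_ENABLED = True            # adapt the crawl rate to server response times
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight per server
RETRY_ENABLED = True
RETRY_TIMES = 3                        # retry transient failures such as 500 errors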

With that, we've reached the end of our Scrapy web scraping journey! Let's wrap up with some key takeaways.

Key Takeaways

  • Scrapy provides a very fast and powerful framework for web scraping at scale.
  • It makes common scraping tasks straightforward with its batteries-included libraries and tools.
  • Spiders abstract away the complexity of recursively crawling through pages.
  • Scraped data can be exported and stored in human-friendly formats like CSV/JSON.
  • The Settings module and Shell help customize behavior and debug scrapers.
  • Using plugins like Splash allows handling JavaScript heavy sites.
  • Following standards and best practices helps avoid scraper failures.

I hope you found this guide useful! Don't hesitate to reach out if you have any other questions. I'm always happy to help fellow web scraping enthusiasts.

Now it's your turn to build something awesome with Scrapy 🙂 Happy Coding!
