
Master Web Scraping with Scrapy in Python

Web scraping is the process of extracting data from websites programmatically. Scrapy is a popular open-source web scraping framework for Python that makes it easy to scrape data from the web at scale. In this comprehensive guide, you'll learn how to use Scrapy to build robust web scrapers.

What is Scrapy?

Scrapy is an open-source web crawling and web scraping framework for Python. It provides a high-level API for extracting data from websites quickly and efficiently.

Some key features of Scrapy:

  • Built-in support for crawling websites recursively and following links.
  • Flexible mechanism for extracting data using CSS selectors and XPath expressions.
  • Built-in support for parsing HTML and XML content.
  • Handy for scraping JavaScript-heavy sites by integrating with browser automation tools like Playwright or Selenium.
  • Asynchronous, non-blocking I/O (built on Twisted) for high-performance scraping.
  • Export scraped data to JSON, CSV, and XML formats.
  • Able to handle large scraping projects involving thousands of requests.
  • Extend functionality using middlewares, extensions, and pipelines.
  • A range of built-in spider classes (CrawlSpider, XMLFeedSpider, SitemapSpider, etc.) covering common crawling patterns.

In a nutshell, Scrapy provides all the functionality needed for building robust web scrapers of any scale and complexity.

Creating Your First Scrapy Spider

The best way to understand Scrapy is to create a simple spider. Let's see how to build a spider that scrapes quotes from the website http://quotes.toscrape.com.

First, install Scrapy:

pip install scrapy

Next, create a new Scrapy project called myquotes:

scrapy startproject myquotes

This will create a myquotes directory with the following contents:

myquotes/
    scrapy.cfg
    myquotes/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

We'll be writing our code inside spiders/. Generate a new spider called quotes_spider by running:

cd myquotes
scrapy genspider quotes_spider quotes.toscrape.com

This will generate the file quotes_spider.py with boilerplate code for our spider. Let's modify it to scrape quotes:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tags .tag::text').getall()

            yield {
                'text': text,
                'author': author,
                'tags': tags
            }

This spider will:

  • Crawl the website starting at start_urls.
  • Use the parse() method to extract data from the response.
  • Find all .quote elements on the page and extract the quote text, author, and tags.
  • Yield a Python dict with the extracted data for each quote.

To run this spider:

scrapy crawl quotes

This will scrape data from the quotes website and output it to the console. You can also save results to a file by passing -o filename.json.

And that's it! You've created your first Scrapy spider. Next let's go over some key concepts in detail.

Scrapy Architecture Overview

Scrapy is built around the following main components:

Engine

The engine is responsible for controlling the data flow between all components of Scrapy. It triggers events when certain actions occur, such as starting a spider or completing a request.

Scheduler

The scheduler receives requests from the engine and enqueues them for the downloader to fetch. It prioritizes requests and filters out duplicates to keep crawling efficient.

Downloader

The downloader handles fetching web pages and feeding responses back to the engine. It manages multiple concurrent requests efficiently.

Spiders

Spiders are the core components where you implement the scraping logic. They start crawling from the defined URLs and parse the responses to extract data and follow-up requests.

Item Pipeline

Pipelines process scraped items. They are used for validating, cleansing, storing, and post-processing data. Multiple pipelines can be enabled to form an item processing chain.

Downloader middlewares

Downloader middlewares sit between the engine and the downloader and modify requests before they are sent and responses before they are returned to the engine. They are used for things like request throttling, caching, and header modification.

Spider middlewares

Spider middlewares are hooks that sit between the engine and the spider and are called before spider methods. They are used to extend and modify spider behavior.

This architecture makes Scrapy highly modular and flexible for building robust crawlers. Next, let's dig deeper into spiders.

Understanding Scrapy Spiders

The spider is the component that controls the scraping process in Scrapy. The main tasks of spiders are:

  • Start crawling from one or more defined URLs.
  • Follow links to scrape content recursively.
  • Parse responses using CSS selectors and XPath expressions.
  • Return scraped data as dicts, Items or other objects.

There are several types of built-in spiders in Scrapy:

CrawlSpider

CrawlSpider is used for crawling and scraping data from multiple web pages within a domain (or group of domains). It comes with built-in support for following links according to a set of rules.

For example:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):

    name = 'crawlspider'

    allowed_domains = ['example.com']

    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Scrape data and return Items here 
        pass

This spider will start crawling from example.com, follow all links matching Items/ and call the parse_item() method to scrape data from each response.

XMLFeedSpider

XMLFeedSpider is designed for scraping data from XML feeds. You point it at the feed URLs and extract data from each node using XPath.

For example:

from scrapy.spiders import XMLFeedSpider


class MySpider(XMLFeedSpider):
    name = 'xmlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # Extract data from <item> nodes using XPath
        pass 

This spider will scrape data from the XML feed located at http://www.example.com/feed.xml.

There are other built-in spider types, such as CSVFeedSpider and SitemapSpider, that cater to different use cases. You can also build your own spider classes.
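
For instance, SitemapSpider discovers pages through a site's sitemap and hands each matched URL to a callback. A minimal sketch, assuming a hypothetical sitemap URL and URL pattern:

from scrapy.spiders import SitemapSpider


class MySitemapSpider(SitemapSpider):
    name = 'sitemapspider'
    # Placeholder sitemap location and URL pattern for illustration
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        # Scrape data from each matched page here
        pass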

Selectors for extracting data

Scrapy provides Selectors for extracting data from HTML and XML responses using CSS selectors or XPath expressions.

For example:

response.css('div.quote').getall() # Get all <div class="quote"> elements
response.xpath('//div') # Get all <div> elements

quote = response.css('div.quote')[0]
quote.css('span.text::text').get() # Extract text from <span> inside first <div>
quote.xpath('./span/text()').get() # Alternative way with XPath

You can also iterate over selected elements and extract both text and attributes:

for link in response.css('ul.links li a'):
    link_name = link.xpath('./text()').get()
    link_url = link.xpath('./@href').get()

This makes it very easy to find and extract the data you need from responses.
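
The same attribute extraction can also be written with CSS pseudo-elements, which is often more compact:

# CSS ::text and ::attr() are equivalent to the XPath calls above
link_names = response.css('ul.links li a::text').getall()
link_urls = response.css('ul.links li a::attr(href)').getall()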

Handling JavaScript Pages

By default, Scrapy spiders cannot scrape JavaScript-heavy pages, since Scrapy only sees the initial HTML returned by the server. To scrape dynamic content loaded via JavaScript, you can integrate Scrapy with a browser automation tool like Playwright or Selenium.

The easiest way is to use the scrapy-playwright extension which integrates Playwright with Scrapy.

First, install it and download the browser binaries Playwright needs:

pip install scrapy-playwright
playwright install

Then enable it in settings.py by registering its download handler and switching to the asyncio-based Twisted reactor:

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Finally, set playwright=True in the Request meta to render pages with Playwright:

yield scrapy.Request(url, meta={
    'playwright': True,
})

Playwright will load the JavaScript on each page before passing the rendered HTML to Scrapy for scraping.

This lets you scrape modern JavaScript-heavy websites seamlessly with Scrapy and Playwright.
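
scrapy-playwright also lets you run page actions (clicks, scrolls, waits) before the HTML is captured, via the playwright_page_methods meta key. A small sketch that waits for the quotes to render (the selector is just an example):

from scrapy_playwright.page import PageMethod

yield scrapy.Request(url, meta={
    'playwright': True,
    'playwright_page_methods': [
        # Wait until at least one quote element is present before returning the response
        PageMethod('wait_for_selector', 'div.quote'),
    ],
})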

Storing Scraped Data

By default, Scrapy prints scraped data to the console. There are several ways to store scraped data:

JSON, CSV, XML feeds

The easiest way is to save scraped items to a file using the -o flag:

scrapy crawl quotes -o quotes.json

This will save all scraped items to a JSON file. CSV, JSONL, XML and other formats are supported too.
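
The same export can be configured permanently through the FEEDS setting in settings.py instead of passing -o on every run; a minimal sketch:

# settings.py
FEEDS = {
    'quotes.json': {'format': 'json', 'overwrite': True},
    'quotes.csv': {'format': 'csv'},
}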

Pipeline to database

For structured data storage, you can write a pipeline to store items in databases like MongoDB, PostgreSQL etc.

For example, a MongoDB pipeline:

import pymongo

class MongoPipeline:

    collection_name = 'quotes'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

This pipeline connects to MongoDB and inserts all scraped items into a collection.
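
To use it, enable the pipeline and provide the connection settings in settings.py. The module path and connection values below are assumptions based on the myquotes project from earlier:

# settings.py (paths and values are placeholders)
ITEM_PIPELINES = {
    'myquotes.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'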

Custom storage backends

You can write custom item exporters (subclasses of scrapy.exporters.BaseItemExporter) or additional pipelines to store data any way you want, such as a custom database or analytics system.

Some popular storage options:

  • Scrapy Cloud – Stores scraped items in the cloud and provides a web interface to access data.
  • Kafka – Stream scraped items to a Kafka cluster.
  • Elasticsearch – Index and query scraped items in Elasticsearch.

Crawling Tips

Here are some tips for crawling effectively with Scrapy:

  • Initialize the allowed_domains attribute to restrict crawling to a single domain (or small group of domains).

  • Set a small DOWNLOAD_DELAY like 1-2 seconds to avoid overwhelming sites.

  • Disable cookies by setting COOKIES_ENABLED = False if not needed to improve performance.

  • Use the CONCURRENT_REQUESTS setting to adjust the number of concurrent requests. Start with a low value like 8-16; see the example settings after this list.

  • Create one spider per website if scraping multiple domains.

  • Use scrapy shell for quick interactive testing of selectors.

  • Monitor scraping status and stats with the Scrapyd web UI.

  • Use a proxy service such as ProxyMesh to route requests through rotating proxies and reduce the risk of IP bans.
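
A settings.py sketch that reflects several of these tips (the values are starting points to tune, not prescriptions):

# settings.py
DOWNLOAD_DELAY = 1.5          # pause between requests to the same site
CONCURRENT_REQUESTS = 16      # overall concurrency cap
COOKIES_ENABLED = False       # skip cookie handling if you don't need sessions
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server latency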

Advanced Features

Some more advanced features of Scrapy:

Dynamic Crawling with FormRequest

Use FormRequest and form data to mimic submitting HTML forms:

from scrapy.http import FormRequest

data = {
    'search_query': 'scraping'
}

yield FormRequest(url='http://quotes.toscrape.com/search', formdata=data, callback=self.parse_search_results)
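
When the form is already present in a page you have fetched, FormRequest.from_response() can build the request from the HTML and pre-fill hidden fields; a minimal sketch (the field name is a placeholder):

def parse(self, response):
    # Build the POST request from the form found in the page
    yield FormRequest.from_response(
        response,
        formdata={'search_query': 'scraping'},  # placeholder field name
        callback=self.parse_search_results,
    )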

Post-Processing with Item Pipeline

Item pipelines allow processing scraped items. Useful for:

  • Data validation and cleansing
  • Deduplication
  • Storing data to databases
  • Sending items to API endpoints

Multiple pipelines can be enabled by ordering them via the ITEM_PIPELINES setting.
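
For example, a chain of pipelines could be ordered like this (the class names are placeholders; lower numbers run first):

ITEM_PIPELINES = {
    'myquotes.pipelines.ValidationPipeline': 100,   # runs first
    'myquotes.pipelines.DuplicatesPipeline': 200,
    'myquotes.pipelines.MongoPipeline': 300,        # runs last
}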

HTTP Caching

Enable built-in caching to avoid re-downloading frequently accessed pages:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # Cache forever
HTTPCACHE_DIR = 'cache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503, 504]  

Custom Middlewares

Spider middlewares hook into spider input and output, while downloader middlewares hook into the request/response cycle. Both let you inject custom code around the crawl, which is useful for things like proxy rotation, retries, and header rotation. For example, a small downloader middleware that sets a random User-Agent header on every request:

import random

USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'Mozilla/5.0 (X11; Linux x86_64)']

class UserAgentMiddleware:
    def process_request(self, request, spider):
        # Rotate the User-Agent header so requests look less uniform
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
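
To activate it, register the class in settings.py; the module path here assumes the myquotes project layout from earlier:

DOWNLOADER_MIDDLEWARES = {
    'myquotes.middlewares.UserAgentMiddleware': 400,
}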

Distributed Crawling with Scrapyd and Scrapy Cloud

Tools like Scrapyd and Scrapy Cloud make it easy to run Scrapy spiders on multiple servers to scale up scraping.
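
With Scrapyd, for instance, you deploy the project to a Scrapyd server and schedule runs over its HTTP API. Assuming a server on localhost:6800 with the myquotes project deployed, a run could be scheduled roughly like this:

curl http://localhost:6800/schedule.json -d project=myquotes -d spider=quotes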

Conclusion

And there you have it – a comprehensive guide to web scraping with Scrapy in Python. Scrapy is a versatile tool that can handle everything from simple single-page scrapers to large distributed crawling projects. The key strengths are its simple but powerful extraction mechanisms, built-in handling of asynchronicity, and extensive options for post-processing and storing scraped data.

Some key topics we covered included:

  • Scrapy's architecture and main components like spiders, pipelines, middlewares etc.
  • Creating basic spiders by subclassing scrapy.Spider.
  • Using selector expressions to extract data from HTML and XML.
  • Crawling multiple pages by using spiders like CrawlSpider.
  • Scraping JavaScript pages by integrating tools like Playwright.
  • Storing scraped items in different formats or databases.
  • Following best practices for effective crawling and avoiding bans.
  • Advanced features like middlewares, caching, distributed crawling etc.

To summarize, Scrapy provides a robust framework for building production-grade web scrapers of any complexity. With a little care and planning, you can leverage Scrapy to extract data from almost any website out there. Happy scraping!
