Web scraping is the process of extracting data from websites programmatically. Scrapy is a popular open-source web scraping framework for Python that makes it easy to scrape data from the web at scale. In this comprehensive guide, you'll learn how to use Scrapy to build robust web scrapers.
What is Scrapy?
Scrapy is an open-source web crawling and web scraping framework for Python. It provides a high-level API for extracting data from websites quickly and efficiently.
Some key features of Scrapy:
- Built-in support for crawling websites recursively and following links.
- Flexible mechanism for extracting data using CSS selectors and XPath expressions.
- Built-in support for parsing HTML and XML content.
- Can scrape JavaScript-heavy sites by integrating with browser automation tools such as Playwright or Selenium.
- Asynchronous request handling (built on Twisted) for high performance.
- Export scraped data to JSON, CSV, XML formats.
- Able to handle large scraping projects involving thousands of requests.
- Extend functionality using middlewares, extensions, and pipelines.
- Generic spider classes (CrawlSpider, XMLFeedSpider, SitemapSpider, and more) for common crawling patterns.
In a nutshell, Scrapy provides all the functionality needed for building robust web scrapers of any scale and complexity.
Creating Your First Scrapy Spider
The best way to understand Scrapy is to create a simple spider. Let's see how to build a spider that scrapes quotes from the website http://quotes.toscrape.com.
First, install Scrapy:
pip install scrapy
Next, create a new Scrapy project called myquotes:
scrapy startproject myquotes
This will create a myquotes directory with the following contents:
myquotes/
    scrapy.cfg
    myquotes/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
We'll be writing our code inside the spiders/ directory. Generate a new spider called quotes_spider.py by running:
cd myquotes
scrapy genspider quotes_spider quotes.toscrape.com
This will generate the file spiders/quotes_spider.py with boilerplate code for our spider. Let's modify it to scrape quotes:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tags .tag::text').getall()
            yield {
                'text': text,
                'author': author,
                'tags': tags
            }
This spider will:
- Crawl the website starting at start_urls.
- Use the parse() method to extract data from each response.
- Find all .quote elements on the page and extract the quote text, author, and tags.
- Yield a Python dict with the extracted data for each quote.
To run this spider:
scrapy crawl quotes
This will scrape data from the quotes website and output it to the console. You can also save results to a file by passing -o filename.json.
And that's it! You've created your first Scrapy spider. Next, let's go over some key concepts in detail.
Scrapy Architecture Overview
Scrapy is built around the following main components:
Engine
The engine is responsible for controlling the data flow between all components of Scrapy. It triggers events when certain actions occur, such as starting a spider or completing a request.
Scheduler
The scheduler receives requests from the engine and enqueues them for the downloader to scrape. It prioritizes requests based on different queues and optimizes scraping efficiency.
Downloader
The downloader handles fetching web pages and feeding responses back to the engine. It manages multiple concurrent requests efficiently.
Spiders
Spiders are the core components where you implement the scraping logic. They start crawling from the defined URLs and parse responses to extract data and follow further links.
Item Pipeline
Pipelines process scraped items. They are used for validating, cleansing, storing, and post-processing data. Multiple pipelines can be enabled to form an item processing chain.
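For instance, a minimal cleansing pipeline sketch might look like this (the 'text' field name is simply the one used in the quotes example later on):

class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Cleanse step: trim surrounding whitespace from the quote text
        if item.get('text'):
            item['text'] = item['text'].strip()
        return item

Each enabled pipeline receives every item through process_item() and either returns it (possibly modified) or drops it.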
Downloader middlewares
Downloader middlewares sit between the engine and the downloader and modify requests before they are sent and responses before they are returned to the engine. They are used for things like request throttling, caching, headers modification etc.
Spider middlewares
Spider middlewares are hooks that sit between the engine and the spider and are called before spider methods. They are used to extend and modify spider behavior.
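For instance, here is a minimal (hypothetical) spider middleware that logs how many objects each spider callback yields:

class ItemCountSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # Wraps the items and follow-up requests returned by a spider callback
        count = 0
        for obj in result:
            count += 1
            yield obj
        spider.logger.debug('Callback for %s yielded %d objects', response.url, count)

It would be enabled through the SPIDER_MIDDLEWARES setting.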
This architecture makes Scrapy highly modular and flexible for building robust crawlers. Next, let's dig deeper into spiders.
Understanding Scrapy Spiders
The spider is the component that controls the scraping process in Scrapy. The main tasks of spiders are:
- Start crawling from one or more defined URLs.
- Follow links to scrape content recursively.
- Parse responses using CSS selectors and XPath expressions.
- Return scraped data as dicts, Items or other objects.
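Besides plain dicts, you can define Item classes (usually in items.py) to give scraped data a fixed schema; a minimal example matching the quotes spider:

import scrapy

class QuoteItem(scrapy.Item):
    # Declared fields; assigning an undeclared field raises a KeyError
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

The parse() method could then yield QuoteItem(text=text, author=author, tags=tags) instead of a plain dict.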
There are several types of built-in spiders in Scrapy:
CrawlSpider
CrawlSpider is used for crawling and scraping data from multiple web pages within a domain (or group of domains). It comes with useful functionality like following links and rules.
For example:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'crawlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Scrape data and return Items here
        pass
This spider will start crawling from example.com, follow all links matching Items/, and call the parse_item() method to scrape data from each response.
XMLFeedSpider
XMLFeedSpider is designed for scraping data from XML feeds. You point it at one or more feed URLs, and it iterates over the feed's nodes so you can extract data from each one using XPath.
For example:
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'xmlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # Extract data from each <item> node using XPath,
        # e.g. node.xpath('title/text()').get() for a typical RSS feed
        pass
This spider will scrape data from the XML feed located at http://www.example.com/feed.xml.
There are other built-in spider types like CSVFeedSpider, SitemapSpider, etc., suited to different use cases. You can also build your own spider classes.
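For instance, SitemapSpider discovers pages through a site's sitemap; a minimal sketch (the URL pattern and callback name below are illustrative):

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemapspider'
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Send sitemap entries whose URL contains /products/ to parse_product
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        # Scrape data from each product page here
        pass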
Selectors for extracting data
Scrapy provides Selectors for extracting data from HTML and XML responses using CSS selectors or XPath expressions.
For example:
response.css('div.quote').getall()  # Get all <div class="quote"> elements
response.xpath('//div')             # Get all <div> elements

quote = response.css('div.quote')[0]
quote.css('span.text::text').get()  # Extract text from the <span> inside the first quote
quote.xpath('./span/text()').get()  # Alternative way with XPath
You can even select data from an element and extract attributes:
for link in response.css('ul.links li a'):
    link_name = link.xpath('./text()').get()
    link_url = link.xpath('./@href').get()
This makes it very easy to find and extract the data you need from responses.
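A convenient way to experiment with selectors before putting them into a spider is the interactive scrapy shell:

scrapy shell 'http://quotes.toscrape.com'
>>> response.css('div.quote span.text::text').get()
>>> response.xpath('//div[@class="quote"]//a[@class="tag"]/text()').getall()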
Handling JavaScript Pages
By default, Scrapy spiders cannot scrape JavaScript-heavy pages, since Scrapy only sees the initial HTML returned by the server. To scrape dynamic content loaded via JavaScript, you can integrate Scrapy with a browser automation tool like Playwright or Selenium.
The easiest way is to use the scrapy-playwright extension, which integrates Playwright with Scrapy.
First install it:
pip install scrapy-playwright
playwright install  # download the browser binaries Playwright needs
Then enable it in settings.py by routing HTTP and HTTPS downloads through the Playwright download handler and switching to the asyncio reactor (the setup documented by scrapy-playwright):

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Finally, set 'playwright': True in the request meta to render pages with Playwright:
yield scrapy.Request(url, meta={
    'playwright': True,
})
Playwright will execute the JavaScript on each page before handing the rendered HTML back to Scrapy for scraping. This lets you scrape modern JavaScript-heavy websites seamlessly with Scrapy and Playwright.
Storing Scraped Data
By default, Scrapy prints scraped data to the console. There are several ways to store scraped data:
JSON, CSV, XML feeds
The easiest way is to save scraped items to a file using the -o flag:
scrapy crawl quotes -o quotes.json
This will save all scraped items to a JSON file. CSV, JSONL, XML and other formats are supported too.
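In recent Scrapy versions you can also configure exports permanently through the FEEDS setting instead of passing -o every time; a minimal sketch:

# settings.py
FEEDS = {
    'quotes.json': {'format': 'json', 'encoding': 'utf8'},
    'quotes.csv': {'format': 'csv'},
}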
Pipeline to database
For structured data storage, you can write a pipeline to store items in databases like MongoDB, PostgreSQL etc.
For example, a MongoDB pipeline:
import pymongo

class MongoPipeline(object):
    collection_name = 'quotes'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
This pipeline connects to MongoDB and inserts all scraped items into a collection.
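For the pipeline to run, enable it in settings.py along with the connection settings it reads in from_crawler (the module path and values below are assumptions for this example project):

# settings.py
ITEM_PIPELINES = {
    'myquotes.pipelines.MongoPipeline': 300,  # lower numbers run earlier in the chain
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'quotes_db'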
Custom storage backends
You can also write custom item exporters (subclasses of scrapy.exporters.BaseItemExporter) or pipelines to store data in any way you want, such as a custom database or analytics system.
Some popular storage options:
- Scrapy Cloud – Stores scraped items in the cloud and provides a web interface to access data.
- Kafka – Stream scraped items to a Kafka cluster.
- Elasticsearch – Index and query scraped items in Elasticsearch.
Crawling Tips
Here are some tips for crawling effectively with Scrapy:
- Set the allowed_domains attribute to restrict crawling to a single domain (or small group of domains).
- Set a small DOWNLOAD_DELAY, like 1-2 seconds, to avoid overwhelming sites (see the settings sketch after this list).
- Disable cookies by setting COOKIES_ENABLED = False if they are not needed, to improve performance.
- Use the CONCURRENT_REQUESTS setting to adjust the number of concurrent requests. Start with a low value like 8-16.
- Create one spider per website if scraping multiple domains.
- Use scrapy shell for quick interactive testing of selectors.
- Monitor scraping status and stats with the Scrapyd web UI.
- Use services like ProxyMesh to route requests through residential proxies and avoid IP bans.
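Several of these tips are just settings; a minimal settings.py sketch combining them (the values are reasonable starting points, not defaults):

# settings.py
DOWNLOAD_DELAY = 1.5        # pause between requests to the same site
CONCURRENT_REQUESTS = 16    # overall concurrency; start low and tune upward
COOKIES_ENABLED = False     # skip cookie handling when the site doesn't need it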
Advanced Features
Some more advanced features of Scrapy:
Dynamic Crawling with FormRequest
Use FormRequest and form data to mimic submitting HTML forms:
from scrapy.http import FormRequest

# Inside a spider callback:
data = {
    'search_query': 'scraping'
}
yield FormRequest(url='http://quotes.toscrape.com/search', formdata=data, callback=self.parse_search_results)
Post-Processing with Item Pipeline
Item pipelines allow processing scraped items. Useful for:
- Data validation and cleansing
- Deduplication
- Storing data to databases
- Sending items to API endpoints
Multiple pipelines can be enabled and ordered via the ITEM_PIPELINES setting.
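For example, a small deduplication pipeline sketch that drops repeated items based on a field (here, the quote text):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        # Drop items whose 'text' field was already seen during this crawl
        if item.get('text') in self.seen_texts:
            raise DropItem('Duplicate item: %r' % item.get('text'))
        self.seen_texts.add(item.get('text'))
        return item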
HTTP Caching
Enable built-in caching to avoid re-downloading frequently accessed pages:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # Cache forever
HTTPCACHE_DIR = 'cache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503, 504]
Custom Middlewares
Spider and downloader middlewares let you inject custom code around spider callbacks and around outgoing requests and incoming responses. They are useful for things like proxy rotation, user-agent rotation, and retries. For example, a downloader middleware that sets a random User-Agent header on every request:
class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Downloader middleware hook, called for every outgoing request.
        # random_user_agent() is a placeholder for your own helper or a UA library.
        request.headers['User-Agent'] = random_user_agent()
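Like pipelines, middlewares have to be switched on in settings.py; the path below assumes the class lives in the project's middlewares.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware and use ours instead
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myquotes.middlewares.RandomUserAgentMiddleware': 400,
}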
Distributed Crawling with Scrapyd and Scrapy Cloud
Tools like Scrapyd and Scrapy Cloud make it easy to run Scrapy spiders on multiple servers to scale up scraping.
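As a rough sketch of the Scrapyd workflow (reusing the project and spider names from earlier), you deploy the project to a Scrapyd server and schedule runs over its HTTP API:

pip install scrapyd scrapyd-client
scrapyd                     # start the Scrapyd server (port 6800 by default)
scrapyd-deploy              # deploy the project; requires a [deploy] target in scrapy.cfg
curl http://localhost:6800/schedule.json -d project=myquotes -d spider=quotes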
Conclusion
And there you have it – a comprehensive guide to web scraping with Scrapy in Python. Scrapy is a versatile tool that can handle everything from simple single-page scrapers to large distributed crawling projects. The key strengths are its simple but powerful extraction mechanisms, built-in handling of asynchronicity, and extensive options for post-processing and storing scraped data.
Some key topics we covered:
- Scrapy's architecture and main components like spiders, pipelines, and middlewares.
- Creating basic spiders by subclassing scrapy.Spider.
- Using selector expressions to extract data from HTML and XML.
- Crawling multiple pages with spiders like CrawlSpider.
- Scraping JavaScript pages by integrating tools like Playwright.
- Storing scraped items in different formats or databases.
- Following best practices for effective crawling and avoiding bans.
- Advanced features like middlewares, caching, and distributed crawling.
To summarize, Scrapy provides a robust framework for building production-grade web scrapers of any complexity. With a little care and planning, you can leverage Scrapy to extract data from almost any website out there. Happy scraping!