Scrapy Cloud is a cloud-based scraping platform developed by Zyte (formerly Scrapinghub) that helps deploy, run and manage Scrapy spiders at scale. In this comprehensive guide, I'll cover everything you need to know about using Scrapy Cloud for large web scraping projects.
Introduction to Scrapy Cloud
Scrapy Cloud removes the complexity of deploying and operating Scrapy spiders by handling the underlying infrastructure for you. Some key benefits:
- No server management – Scrapy Cloud takes care of provisioning and configuring servers
- Centralized monitoring – Track all your scraping jobs from a unified dashboard
- Easy scaling – Add more servers and resources as needed for higher concurrency
- Scheduling – Automatically run recurring scraping jobs
- Data storage – Built-in storage keeps scraped items for 7 days
- Integrations – Export scraped data to databases, APIs, file storage
Over 2000 businesses worldwide use Scrapy Cloud to easily manage large-scale web scraping operations.
Scrapy Cloud Architecture
Scrapy Cloud utilizes a microservices architecture to distribute scraping jobs across servers and scale resource usage. Here's an overview:
- Frontend – Provides the web UI and API for managing projects and spider runs
- Scheduler – Queues jobs and dispatches them to servers with capacity
- Workers – Server pool that executes scraping spiders as directed by the scheduler
- Message bus – Handles communication between the scheduler and workers
- Data storage – Stores scraped items and logs during and after spider runs
- Services – Additional microservices for notifications, billing, user management etc.
This allows Scrapy Cloud to easily scale by adding more workers. The scheduler efficiently distributes load.
Scrapy spiders are sandboxed in isolated containers for security. Network traffic routes through proxy servers to prevent IP blocking.
Setting Up a Scrapy Cloud Account
Let's go through the steps to create a Scrapy Cloud account, set up a project and deploy a spider:
- Go to scrapinghub.com and click Sign Up to create your account. They offer a free plan.
- Once signed up, create a new project from the Projects page. Give it a unique name.
- Install the shub command line tool:
pip install shub
- Retrieve your API key from the Account Settings page and authenticate:
shub login
Enter API key: <your_api_key>
- Deploy your spider from your project folder:
shub deploy <project_id>
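If the deploy succeeds, you can confirm the spider is registered using the python-scrapinghub client. Here's a minimal sketch, assuming your API key is stored in the SH_APIKEY environment variable and that 123456 stands in for your real project ID:
import os
from scrapinghub import ScrapinghubClient

# SH_APIKEY and the project ID below are placeholders for your own credentials and project
client = ScrapinghubClient(os.environ["SH_APIKEY"])
project = client.get_project(123456)

# List the spiders registered in the project to confirm the deploy landed
for spider in project.spiders.list():
    print(spider["id"])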
That's it! Your spider is now deployed on Scrapy Cloud. Next we'll look at how to operate and monitor it.
Scheduling Recurring Spider Runs
To automate scraping, you can schedule recurring runs for your spiders via the web UI.
Navigate to the Periodic Jobs page and click Add Periodic Job. You can configure:
- Run interval – e.g. every 2 hours
- Specific run times – 9am, 1pm and 6pm daily
The scheduler will trigger spider runs per the defined schedule. You can manage and view results for your periodic jobs on the same page.
Here is sample code to schedule a spider run via the Scrapy Cloud API, using the python-scrapinghub client:
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("<your_api_key>")
project = client.get_project(project_id)
job = project.jobs.run(spider_name)  # queues a run and returns a Job object
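From the same client you can also keep an eye on scheduled runs. A small sketch, assuming project and spider_name are defined as above:
# Print the keys and states of the five most recent finished runs for this spider
for summary in project.jobs.iter(spider=spider_name, state="finished", count=5):
    print(summary["key"], summary["state"])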
Optimizing Spider Performance
There are a few techniques you can use to improve scraping speed and efficiency when running spiders on Scrapy Cloud (see the settings sketch after this list):
- Increase concurrency – Scale up CONCURRENT_REQUESTS to saturate your resources
- Enable gzip – Compressed responses improve throughput
- Tune item pipelines – Reduce processing time when exporting data
- Use proxies – Alternate IPs to avoid blocks and bans
- Cache responses – Avoid re-scraping unchanged pages
- Monitor stats – Watch for throughput drops indicating issues
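Most of these knobs map onto standard Scrapy settings. Here's a hedged sketch of what a tuned settings.py might contain, with illustrative values rather than universal recommendations:
# settings.py – illustrative performance-related settings
CONCURRENT_REQUESTS = 64            # raise concurrency to saturate available resources
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # keep per-domain pressure reasonable
COMPRESSION_ENABLED = True          # gzip/deflate response compression (on by default)
HTTPCACHE_ENABLED = True            # cache responses to avoid re-scraping unchanged pages
HTTPCACHE_EXPIRATION_SECS = 86400   # expire cached responses after a day
Proxy rotation is usually handled by a downloader middleware rather than a single setting, and the job stats in the Scrapy Cloud dashboard are the quickest way to spot throughput drops.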
Here's an example of benchmarking different CONCURRENT_REQUESTS values on a simple spider:
| Concurrency | Pages/min | Items/min |
|---|---|---|
| 10 | 450 | 2250 |
| 25 | 960 | 4800 |
| 50 | 1200 | 6000 |
| 100 | 1250 | 6200 |
As expected, throughput scales roughly linearly with concurrency at first, but the gains taper off and plateau once the available resources become the bottleneck.
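To make the plateau concrete, here's a quick calculation of pages per minute per concurrent request using the table above:
# Pages/min per unit of concurrency, taken from the benchmark table
benchmarks = {10: 450, 25: 960, 50: 1200, 100: 1250}
for concurrency, pages_per_min in benchmarks.items():
    print(concurrency, round(pages_per_min / concurrency, 1))
# 10 -> 45.0, 25 -> 38.4, 50 -> 24.0, 100 -> 12.5: each extra request buys less throughput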
Data Storage, Exports and Integrations
Scrapy Cloud provides a few options for dealing with scraped data:
- Built-in storage – Keeps scraped items for 7 days by default
- Exports – Export scraped data to CSV, JSON or XML
- Databases – Push items to databases like PostgreSQL as part of your pipelines
- Cloud storage – Export scraped data to services like S3 or Google Cloud Storage
- Webhooks – Send items to external APIs or custom endpoints
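For the built-in storage option, items from a finished run can be read back with the python-scrapinghub client. A minimal sketch, where the job key is a hypothetical placeholder:
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("<your_api_key>")
# "123456/1/8" is a placeholder job key in project/spider/job form
job = client.get_job("123456/1/8")
for item in job.items.iter():
    print(item)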
Here is sample code for a pipeline exporting data to PostgreSQL:
import psycopg2

# DSN is a placeholder connection string, e.g. "dbname=items user=scrapy host=localhost"
DSN = "dbname=items user=scrapy host=localhost"

class PostgreSQLPipeline(object):
    def open_spider(self, spider):
        self.connection = psycopg2.connect(DSN)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Replace with an INSERT statement matching your items table schema
        self.cur.execute("INSERT INTO items...")
        self.connection.commit()
        return item
With a few lines of pipeline code, you can directly save scraped items to an external database.
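To activate a pipeline like this, register it in your project settings before deploying. A minimal sketch, with a hypothetical module path:
# settings.py – enable the pipeline ("myproject.pipelines" is an illustrative path)
ITEM_PIPELINES = {
    "myproject.pipelines.PostgreSQLPipeline": 300,
}
When deploying to Scrapy Cloud, also make sure third-party packages such as psycopg2 are listed in your project's requirements file so they get installed in the spider's container.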
Scaling Spider Runs on Scrapy Cloud
One of the main benefits of Scrapy Cloud is simplified scaling. If you need higher concurrency or throughput, you can dynamically add more servers.
For example, scaling from 1 to 4 servers would provide:
- 4x CPU cores
- 4x RAM
- 4x network capacity
This allows running many more concurrent spiders to accelerate scraping. Linearly scaling servers is an easy way to increase production capacity.
Pricing starts at $49/month for 1 server on the starter plan. The enterprise plan allows scaling up to 60 servers.
Comparing Scrapy Cloud and Self-Managed Scrapy
For large scale web scraping, you could also deploy Scrapy yourself directly on cloud infrastructure. How does this compare to using a managed solution like Scrapy Cloud?
Self-managed pros:
- More control – Configure servers and tune Scrapy however you want
- Cost savings – No vendor fees and regular cloud instance pricing
- Customization – Integrate any tools needed like browsers or proxies
Self-managed cons:
- Time investment – Requires expertise to set up and manage infrastructure
- No managed scaling – Manual work to scale up or down
- No central platform – Lacking unified dashboard and logs
- No support – You're responsible for all ops and troubleshooting
When to choose self-managed?
If you have the engineering resources available, running Scrapy on your own infrastructure makes sense for advanced use cases requiring deep customization or browser automation.
When to choose Scrapy Cloud?
For most standard Scrapy crawling tasks, Scrapy Cloud reduces devops overhead. The platform handles scaling and the day-to-day operation of your scraping jobs.
Common Issues and How to Troubleshoot
Here are some common issues you may encounter when running spiders on Scrapy Cloud and how to troubleshoot them (a settings sketch follows the list):
- HTTP errors – Enable RETRY_ENABLED and tune RETRY_TIMES to retry failed requests. Identify whether specific URLs or domains are failing.
- Timeouts – Increase DOWNLOAD_TIMEOUT if spiders are timing out fetching certain pages. Slow websites may require a higher timeout.
- Bans – Use rotating proxies (e.g. the scrapy-rotating-proxies middleware's ROTATING_PROXY_LIST setting) to avoid IP blocks. Tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS to look less suspicious.
- Deploy failures – Double-check your project configuration if deploys are failing. Watch deployment logs for errors.
- Data sampling – If not all data is being scraped, ensure CLOSESPIDER_ITEMCOUNT and other termination conditions are set properly.
- Job queue overloading – Tune QUEUE_HARD_RATELIMIT and other queue-related settings to smooth out item throughput and avoid self-throttling.
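Many of these fixes come down to a handful of Scrapy settings. A hedged starting point, with illustrative values rather than recommendations for every site:
# settings.py – illustrative values for the troubleshooting knobs mentioned above
RETRY_ENABLED = True
RETRY_TIMES = 3           # retry failed requests a few times before giving up
DOWNLOAD_TIMEOUT = 60     # give slow sites more time before timing out
DOWNLOAD_DELAY = 0.5      # pause between requests to look less aggressive
CONCURRENT_REQUESTS = 16  # dial this down if a target site starts banning you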
Conclusion
Scrapy Cloud provides a convenient platform for deploying and running Scrapy spiders without infrastructure headaches. It is especially useful for individuals and smaller teams getting started with web scraping. However, for advanced use cases or companies with engineering resources, self-managed Scrapy deployments may provide more control and customization. Make sure to evaluate both managed and DIY approaches when planning production-scale web scraping.