Web scraping is the process of extracting structured data from websites automatically through scripts and tools instead of copying and pasting manually. It has grown enormously in popularity due to the wealth of data available online. However, scraping larger sites manually takes up significant time and effort.
This is where automation comes in – it lets scrapers run unattended, on whatever schedule is needed, without ongoing manual effort. According to recent surveys, over 60% of data professionals use web scraping in their workflows and 85% of them utilize automation. Automated scraping unlocks huge time and cost savings compared to manual scraping.
In this comprehensive guide, we will focus on scheduling Python scrapers using cron jobs to run unattended on Linux and Unix-like systems.
Why Automate Web Scraping?
Here are some key benefits of automating web scraping:
- Eliminates repetitive manual work – With automation, scrapers can run 24/7 without any user input required. This frees up significant time and effort.
- Enables scalability – Automated scrapers make it easy to scale up and scrape thousands of pages from large sites, which is impractical to do by hand.
- Overcomes obstacles – Automation makes it easier to deal with anti-scraping measures like rate limits and CAPTCHAs systematically, with retries and delays built in.
- Better data quality – Automated scraping produces structured, clean data, while manual copy-pasting is error-prone.
- Increased efficiency – Scrapers can run continuously and capture new data as it appears, instead of waiting for someone to start them manually.
According to surveys, over 70% of companies using web scraping employ some form of automation. The time savings let teams run scrapers more frequently and get data faster.
Why Use Python for Web Scraping?
When it comes to coding scrapers, Python stands out as one of the most popular choices due to its simplicity and extensive libraries focused on scraping. Some key advantages:
- Easy to learn – With simple, readable syntax, Python is one of the easiest languages for beginners to pick up compared to Java, Ruby, or PHP. This makes coding scrapers faster.
- Powerful scraping libraries – Python has mature libraries like BeautifulSoup, Scrapy, Selenium, and Requests geared toward fetching and parsing web pages, making it easy to build scrapers rapidly.
- Vibrant community – As one of the most popular languages, Python enjoys great community support. This results in lots of code examples, tutorials, and libraries for scraping.
- Multi-purpose – Python excels at automation beyond web scraping, such as data processing and machine learning, making it useful for end-to-end scraping solutions.
- Portable and platform independent – Python runs on Linux, Windows and macOS. Scrapers work cross-platform without changes.
Python's suitability for automation, along with scraping libraries like Scrapy and Selenium, makes it a versatile choice for automated scraping. Now let's see how we can schedule Python scrapers using cron.
Introducing Cron Jobs
Cron is a utility for scheduling tasks to run automatically at fixed intervals on Unix-like systems. Users specify the timing and command for each job in a cron table (crontab).
The cron daemon then checks the crontab every minute to see if any scheduled commands are due to run. If a match is found, cron executes those commands.
Some key aspects of cron:
- Checks crontab every minute for matching scheduled tasks and runs them
- Can be used to schedule Python scripts, shell scripts, or any other programs
- Timing is customizable – can schedule tasks to run down to the granularity of a minute
- Widely used for automating routine tasks like system maintenance, backups, reports etc
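Before scheduling anything, it is worth confirming the daemon is actually running. On most systemd-based distributions the service is named cron (Debian/Ubuntu) or crond (RHEL/Fedora):

systemctl status cron    # Debian/Ubuntu
systemctl status crond   # RHEL/Fedora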
Next, we'll see how we can schedule Python scrapers using crontab.
Creating Cron Jobs
Cron jobs are created by adding entries to the crontab which specify:
- Time schedule for running the job
- Command to execute
Crontab entries have the following syntax:
# ┌────────────── minute (0 - 59)
# │ ┌────────────── hour (0 - 23)
# │ │ ┌────────────── day of month (1 - 31)
# │ │ │ ┌────────────── month (1 - 12)
# │ │ │ │ ┌────────────── day of week (0 - 6) (Sunday=0 or 7)
# │ │ │ │ │
# * * * * * command to be executed
For example, this cron job runs at 10 am every day:
0 10 * * * /home/user/script.py
The first 0 specifies the minute, 10 is the hour, and * means every value for remaining fields. The last part is the script path.
To create a cron job, run the crontab -e command, which opens the cron table in the default text editor so jobs can be added.
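Alongside crontab -e, a couple of related commands come in handy when managing jobs:

crontab -e   # edit the current user's crontab in the default editor
crontab -l   # list the current user's cron jobs
crontab -r   # remove the current user's crontab entirely (use with care)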
Sample Python Scraper
Let's take an example scraper that fetches book details from a book listing and saves them to a CSV file:
import csv

import requests
from bs4 import BeautifulSoup

URL = 'http://books.toscrape.com/catalogue/the-grand-design_405/index.html'

def scrape_book():
    # Fetch the book page and parse the HTML
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html.parser')

    # Extract the title and price from the page
    title = soup.find('h1').text
    price = soup.find(class_='price_color').text

    # Append the scraped fields to a CSV file
    with open('books.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([title, price])

if __name__ == '__main__':
    scrape_book()
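One caveat before wiring this into cron: if the crontab entry invokes the script directly (rather than via python3 /path/to/script.py), the file needs a shebang line and execute permissions. A quick sketch, assuming the script is saved at /home/user/scrape_books.py with #!/usr/bin/env python3 as its first line:

# Make the script directly executable so cron can invoke it by path
chmod +x /home/user/scrape_books.py

# Run it once by hand to confirm it works before scheduling
/home/user/scrape_books.py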
Now we can schedule this to run automatically using cron.
Scheduling the Python Scraper with Cron
To schedule the above scraper to run every 2 hours we need to:
- Specify the cron schedule
- Add the command to run the script
The schedule to run every 2 hours starting at midnight is:
0 */2 * * *
The */2 in the hour field means run every 2 hours.
Our complete cron job then is:
0 */2 * * * /home/user/scrape_books.py
This will run the scraper at 00:00, 02:00, 04:00 and so on.
We can now simply save this entry in the crontab using crontab -e and the scraper will run automatically!
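One thing to watch: cron runs jobs with a minimal environment and uses the user's home directory as the working directory. Since our script writes books.csv with a relative path, the file will land wherever cron starts the process. A common fix (paths here are illustrative) is to cd into the project directory first:

0 */2 * * * cd /home/user && /home/user/scrape_books.py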
Common Crontab Scheduling Examples
Beyond running jobs every few hours, crontab schedules can also be specified for:
- Daily – Run once at midnight: 0 0 * * * script.py
- Weekly – Run on Sundays: 0 0 * * 0 script.py
- Monthly – Run on the 1st of every month: 0 0 1 * * script.py
- Weekdays – Run Monday to Friday: 0 0 * * 1-5 script.py
- Hourly – Run at the top of every hour: 0 * * * * script.py
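Many cron implementations (including the widely used Vixie cron and cronie) also accept shorthand strings in place of the five time fields:

@hourly script.py    # same as 0 * * * *
@daily script.py     # same as 0 0 * * *
@weekly script.py    # same as 0 0 * * 0
@monthly script.py   # same as 0 0 1 * *
@reboot script.py    # run once at system startup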
There are many handy online cron schedule generators like Crontab Guru that can help build and visualize schedules.
Cron Best Practices
When automating Python scripts with cron, some best practices include:
- Use absolute script paths in crontab instead of relative ones
- Output logs from the scraper to simplify troubleshooting
- Use virtual environments to avoid dependency issues
- Make sure the script has a shebang line and execute permissions if cron invokes it directly
- Restart the cron daemon if jobs stop firing – sudo systemctl restart cron
- Validate crontab entries using online tools before adding jobs
- Use logrotate to keep scraper log files from growing unbounded
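Putting several of these together, a more robust version of our crontab entry might look like the following – a sketch, assuming a virtualenv at /home/user/venv and a logs directory at /home/user/logs (both hypothetical paths):

0 */2 * * * /home/user/venv/bin/python /home/user/scrape_books.py >> /home/user/logs/scrape_books.log 2>&1

The >> appends stdout to the log file and 2>&1 redirects stderr there too, so any Python traceback from a failed run shows up in the log instead of disappearing.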
Properly structuring scripts and having good monitoring and alerting helps minimize cron errors.
Choosing Between Cron and Other Automation Tools
Beyond cron, there are also other automation tools that could be used for scheduling Python scrapers:
Airflow – Airflow has more features for workflow management. Useful for complex data pipelines.
Windows Task Scheduler – Easy to use GUI scheduler on Windows servers. Does not require crontab syntax.
Django – Django apps can schedule jobs using celery beat or APScheduler. Provides monitoring interface.
Luigi – Luigi manages dependencies between long running batch processes and pipelines.
Compared to the above tools, cron provides a simpler way to get started with automation requiring only some basic Linux sysadmin knowledge. For complex scenarios, the other tools may be more suitable.
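For instance, a minimal in-process alternative using APScheduler might look like this – a sketch assuming APScheduler is installed (pip install apscheduler) and that our earlier scraper lives in a module named scrape_books (a hypothetical name):

from apscheduler.schedulers.blocking import BlockingScheduler

from scrape_books import scrape_book  # hypothetical module containing our scraper

sched = BlockingScheduler()

# Mirror the 0 */2 * * * cron schedule: run every 2 hours
sched.add_job(scrape_book, 'interval', hours=2)

sched.start()  # blocks; unlike cron, this process must stay running

The trade-off is visible here: the schedule lives in Python and is easy to test, but the scheduler process itself must be kept alive, which is exactly the job cron does for free.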
Conclusion
Automating scrapers by scheduling cron jobs is a straightforward way to run scrapers efficiently without constant manual intervention. Cron along with Python provides a great toolbox for creating scrapers that can run 24/7 to extract data.
By following best practices around structure, logging, permissions and monitoring, cron can be leveraged to build robust long-running scrapers. The ability to customize schedules unlocks the flexibility to automate scrapers as needed – hourly, daily, weekly etc.
While this covers the basics, there are many additional techniques that can be layered on top, like task queues for distributed scraping, headless browsers for JavaScript-heavy sites, and proxies/APIs for large-scale automation.
Hopefully this guide has provided a good overview of how to leverage Python and cron for creating automated web scrapers. Let me know if you have any other questions!