Deploying a Scrapy crawler on Amazon EC2
Note: These instructions have only been tested with Scrapy 0.9. They probably won't work "as is" on trunk (0.10), but they still provide general guidelines for deploying your Scrapy crawler.
Introduction
This guide describes how to deploy a Scrapy crawler on the Amazon EC2 cloud computing platform, using a simple and flexible architecture for distributed crawling and the facilities Scrapy provides for running it as a service.
Although this guide is about deploying Scrapy on EC2, it can also serve as a reference for deployments on other hosting providers or cloud services.
Requirements
- Scrapy 0.9
- boto
- An Amazon Web Services (AWS) account with your AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
Architecture
The system is simple and consists of just two parts:
- an SQS queue used to schedule spiders
- an SQS poller process that pulls spiders to scrape from the queue. The SQS poller is a service that runs on every EC2 instance used for crawling with Scrapy.
With this approach you can scale as much as you need, automatically, by adding more EC2 instances.
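The polling loop described above can be sketched roughly as follows. This is only an illustration of the idea, not Scrapy's actual poller: an in-memory queue.Queue stands in for the SQS queue, and the names poll_once, drain and the run_spider callback are hypothetical.

```python
import queue


def poll_once(spider_queue, run_spider):
    """Pull at most one spider name from the queue and run it.

    Returns the spider name that was run, or None if the queue was empty.
    """
    try:
        name = spider_queue.get_nowait()
    except queue.Empty:
        return None
    run_spider(name)
    return name


def drain(spider_queue, run_spider):
    """Keep polling until the queue is empty; return the names run, in order."""
    done = []
    while True:
        name = poll_once(spider_queue, run_spider)
        if name is None:
            return done
        done.append(name)


# In-memory stand-in for the shared SQS queue (assumption for illustration).
shared = queue.Queue()
for domain in ["domain1.com", "domain2.com", "domain3.com"]:
    shared.put(domain)

ran = []                            # records which spiders were "run"
result = drain(shared, ran.append)  # ran.append plays the role of run_spider
```

Because every instance runs the same loop against the same queue, adding an instance adds another consumer without changing how spiders are scheduled.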
Deployment Procedure
Launch a new server
Launch a new EC2 instance with an Ubuntu 10.04 AMI. See this site for a list of the Ubuntu AMIs available in each region.
Install Scrapy
After your instance has started, log into it and add the Scrapy APT repos as described in AptRepos.
Then install Scrapy with: apt-get install scrapy
Deploy and configure your project
Deploy your Scrapy project code to /var/lib/scrapy/myproject. Your project settings should reside in /var/lib/scrapy/myproject/myproject/settings.py. That settings.py file should import a local_settings.py file at the end, so that settings can be overridden with local server settings:
try:
    from local_settings import *
except ImportError:
    pass
Then, create a /etc/scrapy/local_settings.py file specifying that you want to use SQS for pulling spiders to scrape:
SERVICE_QUEUE = 'scrapy.contrib.queue.sqs.SQSExecutionQueue'
Configure Scrapy service
Edit /etc/scrapy/environment and add your AWS credentials, and the PYTHONPATH to find your project module:
export AWS_ACCESS_KEY_ID="XXXXX"
export AWS_SECRET_ACCESS_KEY="YYYYY"
export PYTHONPATH=/var/lib/scrapy/myproject
Add your project to the list of projects managed by the Scrapy service by adding the following lines to /etc/scrapy/service_conf.py:
PROJECTS = {
    'myproject.settings': 0,
}
The value 0 tells the service to start as many local crawler processes as there are CPU cores available on the local server.
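The meaning of the 0 value can be illustrated with a small sketch. This is not the service's own code: the helper name resolve_process_count is hypothetical, and using multiprocessing.cpu_count() to detect the number of cores is an assumption.

```python
import multiprocessing


def resolve_process_count(configured):
    """Return how many crawler processes to start for one project.

    0 means "one process per CPU core"; a positive value is used as given.
    """
    if configured < 0:
        raise ValueError("process count must be >= 0")
    if configured == 0:
        return multiprocessing.cpu_count()
    return configured


PROJECTS = {"myproject.settings": 0}

# Map each settings module to the number of processes that would be started.
counts = {name: resolve_process_count(n) for name, n in PROJECTS.items()}
```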
Starting, stopping and monitoring the Scrapy service
Some facts about the Scrapy service:
- it starts one process per project defined in the PROJECTS setting using the scrapy-ctl.py start command
- it re-spawns crawler processes if they die
- it uses upstart for the control scripts
- it runs under the scrapy user
- it logs to /var/log/scrapy/service.log
- it makes each project crawler process log to a separate file in the directory specified by the LOG_DIR setting in /etc/scrapy/service_conf.py
To start the Scrapy service use:
$ sudo start scrapy
scrapy start/running, process 5737
To stop the Scrapy service use:
$ sudo stop scrapy
scrapy stop/waiting
To check the status of the Scrapy service use:
$ sudo status scrapy
scrapy start/running, process 5737
You can also check the logs of the service, or of any crawler process, in /var/log/scrapy:
$ ls -l /var/log/scrapy/
total 1460K
-rw-rw-r-- 1 scrapy nogroup 65125 2010-06-14 12:58 myproject-1.log
-rw-rw-r-- 1 scrapy nogroup 81020 2010-06-14 12:58 myproject.log
-rw-r--r-- 1 root root 213074 2010-06-14 12:58 service.log
-rw-r--r-- 1 root root 1110781 2010-06-10 14:44 service.log.1
Scheduling spiders to run
To schedule spiders to run, you just need to add messages to the SQS queue. You'll typically do this from a script (perhaps a cron job that runs periodically) or from an administration UI, triggered manually.
There is an example script provided in Scrapy (bin/scrapy-sqs.py) to illustrate how to populate the queue with spiders to scrape. If you have deployed Scrapy using the APT repos you'll have a scrapy-sqs command available in your system that you can use like this:
$ scrapy-sqs put domain1.com
$ scrapy-sqs put domain2.com
$ scrapy-sqs put domain3.com
That will schedule three spiders in the SQS queue: domain1.com, domain2.com and domain3.com. The crawler service monitoring the queue will pull them and start running the spiders.
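For a periodic job, the same commands could be driven from a small script. The sketch below only wraps the scrapy-sqs put command shown above; the function names build_commands and schedule and the runner parameter are hypothetical, and the example uses a dry run that records the commands instead of executing them.

```python
import subprocess


def build_commands(domains):
    """Return one `scrapy-sqs put <domain>` argument list per domain."""
    return [["scrapy-sqs", "put", domain] for domain in domains]


def schedule(domains, runner=subprocess.check_call):
    """Pass each command to `runner`; return how many were scheduled.

    With the default runner the commands are actually executed, which
    requires the scrapy-sqs command to be installed on the system.
    """
    commands = build_commands(domains)
    for cmd in commands:
        runner(cmd)
    return len(commands)


# Dry run: record the commands in a list rather than executing them.
calls = []
scheduled = schedule(["domain1.com", "domain2.com"], runner=calls.append)
```

Calling schedule without the runner argument would invoke scrapy-sqs once per domain, so the script could be run from cron as is.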
Multiple projects
The Scrapy service allows you to run more than one Scrapy project. You just need to specify the settings module and how many instances (processes) you want to run for each project. For example, if you have three Scrapy projects that you want to run on a 4-core machine, you could use these lines in service_conf.py:
PROJECTS = {
    'myproject1.settings': 2,
    'myproject2.settings': 1,
    'myproject3.settings': 1,
}
That configuration tells the Scrapy service to start two crawler processes for myproject1, one for myproject2 and one for myproject3.
It is recommended that the total number of process instances not exceed the number of cores available on your system, unless you're certain that some of the projects will be mostly idle.
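This recommendation can be checked before deploying a configuration. The following is a minimal sketch, not part of Scrapy: the helper names total_processes and is_oversubscribed are hypothetical, and a value of 0 is counted as one process per core, as described earlier.

```python
def total_processes(projects, cores):
    """Sum the configured instances, counting 0 as one process per core."""
    return sum(cores if n == 0 else n for n in projects.values())


def is_oversubscribed(projects, cores):
    """Return True if the configuration starts more processes than cores."""
    return total_processes(projects, cores) > cores


PROJECTS = {
    "myproject1.settings": 2,
    "myproject2.settings": 1,
    "myproject3.settings": 1,
}

fits_on_four_cores = not is_oversubscribed(PROJECTS, 4)
```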
Each project uses a different SQS queue, defined in its configuration with the SQS_QUEUE setting.
