Deploying a Scrapy crawler on Amazon EC2

Note: These instructions have only been tested with Scrapy 0.9. They probably won't work "as is" on trunk (0.10), but they still provide general guidelines for deploying your Scrapy crawler.

Introduction

This is a guide for deploying a Scrapy crawler on the Amazon EC2 cloud computing platform, using a simple and flexible architecture for distributed crawling and the facilities Scrapy provides for running it as a service.

Even though this guide is about deploying Scrapy on EC2, it can also be used as a reference for deployments on other hosting providers or cloud services.

Requirements

  • Scrapy 0.9
  • boto
  • An Amazon Web Services (AWS) account with your AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

Architecture

The system is simple and consists of just two parts:

  1. an SQS queue used to schedule spiders
  2. an SQS poller process which pulls spiders to scrape from the SQS queue. The SQS poller is a service that runs on every EC2 instance used for crawling with Scrapy.

With this approach you can scale out as much as you need by adding more EC2 instances: each new instance runs its own SQS poller and pulls spiders from the same queue.
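
The polling pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not Scrapy's actual poller: an in-memory queue.Queue stands in for the SQS queue, and poll_once, drain and run_spider are hypothetical names. The sketch is written for current Python 3 rather than the Python 2 of the Scrapy 0.9 era.

```python
import queue


def poll_once(spider_queue, run_spider):
    """Pull one spider name from the queue and run it.

    Returns the spider name, or None if the queue was empty.
    `run_spider` is a hypothetical callable standing in for the crawler.
    """
    try:
        spider_name = spider_queue.get_nowait()
    except queue.Empty:
        return None
    run_spider(spider_name)
    return spider_name


def drain(spider_queue, run_spider):
    """Poll until the queue is empty; return the names in the order they ran."""
    ran = []
    while True:
        name = poll_once(spider_queue, run_spider)
        if name is None:
            return ran
        ran.append(name)


if __name__ == "__main__":
    # The in-memory queue stands in for SQS in this sketch.
    q = queue.Queue()
    for domain in ("domain1.com", "domain2.com", "domain3.com"):
        q.put(domain)
    drain(q, lambda name: print("crawling", name))
```

Because every poller only ever pulls from the shared queue, several pollers can run side by side without knowing about each other, which is what lets extra instances add capacity.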

Deployment Procedure

Launch a new server

Launch a new EC2 instance with an Ubuntu 10.04 AMI. See this site for a list of available AMIs for Ubuntu in each region.

Install Scrapy

After your instance has started, log into it and add the Scrapy APT repos as described in AptRepos.

Then install Scrapy with: apt-get install scrapy

Deploy and configure your project

Deploy your Scrapy project code to /var/lib/scrapy/myproject. Your project settings should reside in /var/lib/scrapy/myproject/myproject/settings.py. That settings.py file should end by importing a local_settings module, so that server-specific settings can override the project defaults:

try:
    from local_settings import *
except ImportError:
    pass
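
To make the override behaviour concrete, here is a minimal sketch of what the star-import at the end of settings.py achieves. It is an illustration only: apply_local_overrides and the module name demo_local_settings are hypothetical, settings are modelled as a plain dict, and the code targets current Python 3.

```python
import importlib
import os
import sys
import tempfile


def apply_local_overrides(settings, module_name):
    """Mimic `from <module_name> import *` placed at the end of settings.py.

    Public names from the local module replace the defaults; a missing
    module is silently ignored, like the `except ImportError: pass` branch.
    """
    try:
        local = importlib.import_module(module_name)
    except ImportError:
        return settings
    merged = dict(settings)
    for name in dir(local):
        if not name.startswith("_"):
            merged[name] = getattr(local, name)
    return merged


if __name__ == "__main__":
    # Write a throwaway local settings module into a temporary directory.
    tmpdir = tempfile.mkdtemp()
    with open(os.path.join(tmpdir, "demo_local_settings.py"), "w") as f:
        f.write("SERVICE_QUEUE = 'scrapy.contrib.queue.sqs.SQSExecutionQueue'\n")
    sys.path.insert(0, tmpdir)
    importlib.invalidate_caches()

    # 'placeholder.default.Queue' is a made-up default for the demo.
    defaults = {"SERVICE_QUEUE": "placeholder.default.Queue"}
    print(apply_local_overrides(defaults, "demo_local_settings"))
```

The import has to come last in settings.py: a name defined after it would overwrite the local value again.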

Then, create a /etc/scrapy/local_settings.py file specifying that you want to use SQS for pulling spiders to scrape:

SERVICE_QUEUE = 'scrapy.contrib.queue.sqs.SQSExecutionQueue'

Configure Scrapy service

Edit /etc/scrapy/environment to add your AWS credentials and to set PYTHONPATH so that your project module can be found:

export AWS_ACCESS_KEY_ID="XXXXX"
export AWS_SECRET_ACCESS_KEY="YYYYY"

export PYTHONPATH=/var/lib/scrapy/myproject

Add your project to the list of projects managed by the Scrapy service by adding the following lines to /etc/scrapy/service_conf.py:

PROJECTS = {
    'myproject.settings': 0,
}

The value 0 tells the service to start as many local crawler processes as there are CPU cores available on the local server.
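
As a rough sketch of that rule (not the service's actual code), the number of processes for a project could be resolved like this. resolve_process_count is a hypothetical helper, and the use of multiprocessing.cpu_count() to detect the cores is an assumption about how the count is obtained.

```python
import multiprocessing


def resolve_process_count(configured, cpu_count=None):
    """Return how many crawler processes to start for one project.

    A configured value of 0 means "one process per CPU core";
    any other value is used as given.
    """
    if cpu_count is None:
        # Assumption: the core count comes from the standard library.
        cpu_count = multiprocessing.cpu_count()
    return cpu_count if configured == 0 else configured


if __name__ == "__main__":
    print(resolve_process_count(0))  # one per core on this machine
    print(resolve_process_count(2))  # explicit value is kept
```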

Starting, stopping and monitoring the Scrapy service

Some facts about the Scrapy service:

  • it starts one process per project defined in the PROJECTS setting using the scrapy-ctl.py start command
  • it re-spawns crawler processes if they die
  • it uses upstart for the control scripts
  • it runs under the scrapy user
  • it logs to /var/log/scrapy/service.log
  • it has each project's crawler process log to a separate file in the directory specified by the LOG_DIR setting in /etc/scrapy/service_conf.py

To start the Scrapy service use:

$ sudo start scrapy
scrapy start/running, process 5737

To stop the Scrapy service use:

$ sudo stop scrapy
scrapy stop/waiting

To check the status of the Scrapy service use:

$ sudo status scrapy
scrapy start/running, process 5737

You can also check the logs of the service, or of any crawler process, in /var/log/scrapy:

$ ls -l /var/log/scrapy/
total 1460K
-rw-rw-r-- 1 scrapy nogroup   65125 2010-06-14 12:58 myproject-1.log
-rw-rw-r-- 1 scrapy nogroup   81020 2010-06-14 12:58 myproject.log
-rw-r--r-- 1 root   root     213074 2010-06-14 12:58 service.log
-rw-r--r-- 1 root   root    1110781 2010-06-10 14:44 service.log.1

Scheduling spiders to run

To schedule spiders to run you just need to add messages to the SQS queue. You'll typically do this from a script (perhaps a periodically run cron script) or an administration UI, triggered by manual events.

There is an example script provided in Scrapy (bin/scrapy-sqs.py) to illustrate how to populate the queue with spiders to scrape. If you have deployed Scrapy using the APT repos you'll have a scrapy-sqs command available in your system that you can use like this:

$ scrapy-sqs put domain1.com
$ scrapy-sqs put domain2.com
$ scrapy-sqs put domain3.com

That will schedule three spiders in the SQS queue: domain1.com, domain2.com and domain3.com. The crawler service monitoring the queue will pull them and start running the spiders.
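
For periodic scheduling, the same commands could be issued from a small script run by cron. The sketch below assumes the scrapy-sqs command is on the PATH and only wraps the "put" usage shown above; build_put_commands and schedule are hypothetical helper names, and the code targets current Python 3.

```python
import subprocess


def build_put_commands(spiders, executable="scrapy-sqs"):
    """Build one `scrapy-sqs put <spider>` argument list per spider."""
    return [[executable, "put", spider] for spider in spiders]


def schedule(spiders, runner=subprocess.check_call):
    """Run each command; `runner` can be swapped out for a dry run."""
    for command in build_put_commands(spiders):
        runner(command)


if __name__ == "__main__":
    spiders = ["domain1.com", "domain2.com", "domain3.com"]
    # Dry run: print the commands instead of executing them.
    # Call schedule(spiders) with the default runner to really enqueue.
    schedule(spiders, runner=print)
```

Passing the runner as a parameter keeps the script easy to check on a machine where scrapy-sqs is not installed.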

Multiple projects

The Scrapy service allows you to run more than one Scrapy project. You just need to specify the settings module and how many instances (processes) you want to run for each project. So if you have 3 Scrapy projects that you want to run on a 4-core machine, you could use these lines in service_conf.py:

PROJECTS = {
    'myproject1.settings': 2,
    'myproject2.settings': 1,
    'myproject3.settings': 1,
}

That configuration tells the Scrapy service to start two crawler processes for myproject1, one for myproject2 and one for myproject3.

It is recommended that the total number of process instances doesn't exceed the number of cores available in your system, unless you're certain that some of the projects will be mostly idle.
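
A quick way to check a PROJECTS mapping against this recommendation is sketched below. total_processes and is_oversubscribed are hypothetical helpers, not part of the Scrapy service; a value of 0 is counted as one process per core, as described earlier in this guide.

```python
def total_processes(projects, cores):
    """Sum the configured processes, counting 0 as "one per core"."""
    return sum(cores if count == 0 else count for count in projects.values())


def is_oversubscribed(projects, cores):
    """True if the projects would start more processes than there are cores."""
    return total_processes(projects, cores) > cores


if __name__ == "__main__":
    PROJECTS = {
        'myproject1.settings': 2,
        'myproject2.settings': 1,
        'myproject3.settings': 1,
    }
    # Four processes on a 4-core machine stay within the recommendation.
    print(total_processes(PROJECTS, cores=4), is_oversubscribed(PROJECTS, cores=4))
```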

Each project would use a different SQS queue, defined in its configuration with the SQS_QUEUE setting.