Scrapy on Amazon EC2

This page contains some tips on running Scrapy on the Amazon EC2 cloud computing platform.

This guide is a work in progress. Feel free to improve and/or correct it.

Architecture

The system is simple and consists of:

  1. an SQS queue used to schedule domains
  2. an SQS poller process which pulls domains to scrape from the queue. The poller is a service that runs on every EC2 instance used for crawling with Scrapy.

With this approach you can scale out as far as you need by adding more EC2 instances: each new instance runs its own poller and starts taking domains from the shared queue.
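The scaling idea can be tried locally without AWS. The sketch below is only an illustration: it uses the standard library's queue.Queue as a stand-in for SQS and threads as stand-ins for EC2 instances, and the names run_pollers and poller are hypothetical, not part of Scrapy or boto.

```python
import queue
import threading


def run_pollers(domains, num_instances):
    """Simulate num_instances crawler instances draining one shared queue."""
    pending = queue.Queue()  # stand-in for the SQS queue
    for domain in domains:
        pending.put(domain)

    handled = {}  # instance id -> domains that instance picked up
    lock = threading.Lock()

    def poller(instance_id):
        # Each "instance" keeps pulling domains until the queue is empty.
        while True:
            try:
                domain = pending.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled.setdefault(instance_id, []).append(domain)

    workers = [threading.Thread(target=poller, args=(i,))
               for i in range(num_instances)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return handled
```

Calling run_pollers(["a.com", "b.com", "c.com"], 2) returns a mapping of instance ids to the domains each one took. How the domains are split varies from run to run, but every domain is taken exactly once, which is the property the SQS-based setup relies on.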

Code

WARNING: The code below is provided only as a proof of concept, which means it hasn't been tested and can be improved in several ways.

Below you will find the code of a crawler.tac file that can be used with the twistd command (included in Twisted, so you already have it) to set up a service which polls Amazon SQS and runs scraping jobs. It uses boto for communication with SQS. It also runs the SQS queue poller in a separate thread, so that boto's blocking IO does not interfere with Scrapy's non-blocking IO.

See the Twisted documentation for more information on writing services with Twisted and twistd.

crawler.tac

from twisted.application import service, internet
from twisted.internet import reactor, threads

import boto

from scrapy.core.manager import scrapymanager
from scrapy.core.engine import scrapyengine
from scrapy import log

POLL_INTERVAL = 30  # seconds between polls of the queue
QUEUE_NAME = 'scrapy_domains'

class DomainPoller(object):

    def __init__(self):
        conn = boto.connect_sqs()
        self.queue = conn.create_queue(QUEUE_NAME)

    def poll(self):
        # Runs in a worker thread, so calls into Scrapy must be handed to the
        # reactor thread with blockingCallFromThread(reactor, ...).
        if threads.blockingCallFromThread(reactor, scrapyengine.downloader.has_capacity):
            msgs = self.queue.get_messages()
            if msgs:
                # The message is not deleted here, so SQS will make it
                # visible again once its visibility timeout expires.
                domain = msgs[0].get_body()
                return threads.blockingCallFromThread(reactor, scrapymanager.crawl, domain)

def get_application():
    application = service.Application("Crawling bot")
    poller = DomainPoller()
    # Named "timer" so it does not shadow the imported "service" module
    timer = internet.TimerService(POLL_INTERVAL, threads.deferToThread, poller.poll)
    timer.setServiceParent(application)
    return application


application = get_application()
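
The poller above only reads from the queue, so something else has to put domains into it. A minimal sketch of that scheduling side is shown below, assuming boto 2's Queue.new_message() and Queue.write() methods and AWS credentials available to boto. The helper names open_queue and schedule_domains are hypothetical and were not part of the original page.

```python
QUEUE_NAME = 'scrapy_domains'  # same queue name as in crawler.tac


def open_queue(name=QUEUE_NAME):
    """Open (or create) the SQS queue. Requires boto and AWS credentials."""
    import boto  # imported lazily so schedule_domains() works without boto
    return boto.connect_sqs().create_queue(name)


def schedule_domains(queue, domains):
    """Write one message per non-empty domain; return how many were sent."""
    sent = 0
    for domain in domains:
        domain = domain.strip()
        if not domain:
            continue  # skip blank entries, e.g. empty lines in a file
        queue.write(queue.new_message(domain))
        sent += 1
    return sent
```

A typical call would be schedule_domains(open_queue(), ["example.com", "example.org"]). Passing the queue in as an argument keeps the helper easy to try with a stand-in object before pointing it at a real SQS queue.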