Scrapy on Amazon EC2
This page contains some tips on running Scrapy on the Amazon EC2 cloud computing platform.
This guide is a work in progress. Feel free to improve and/or correct it.
Architecture
The system is simple and consists of two parts:
- an SQS queue that holds the domains scheduled for scraping
- an SQS poller process that pulls domains to scrape from the queue. The poller is a service that runs on every EC2 instance used for crawling with Scrapy.
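With this approach you can scale out as much as you need by adding more EC2 instances: each new instance runs its own poller and pulls work from the same queue, so no extra coordination is required.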
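The scheduling side of the queue is not covered on this page. As a minimal sketch, assuming boto 2's Queue interface (new_message and write), domains could be pushed like this. The names schedule_domains and InMemoryQueue are hypothetical and exist only for illustration; InMemoryQueue is a local stand-in so the sketch runs without AWS credentials.

```python
class InMemoryQueue(object):
    """Hypothetical stand-in for the two boto Queue methods used below."""

    def __init__(self):
        self.bodies = []

    def new_message(self, body=''):
        # boto returns a Message object; a plain string is enough here.
        return body

    def write(self, message):
        self.bodies.append(message)
        return message


def schedule_domains(queue, domains):
    """Write one message per domain and return how many were queued."""
    count = 0
    for domain in domains:
        queue.write(queue.new_message(domain))
        count += 1
    return count


if __name__ == '__main__':
    queue = InMemoryQueue()
    schedule_domains(queue, ['example.com', 'example.org'])
    print(queue.bodies)
```

Against real SQS, the queue object would presumably come from boto.connect_sqs().create_queue('scrapy_domains'), the same call the poller uses in the code section.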
Code
WARNING: The code below is provided only as a proof of concept, which means it hasn't been tested and can be improved in several ways.
Below you will find the code of a crawler.tac file that can be used with the twistd command (included in Twisted, so you already have it) to set up a service which polls Amazon SQS and runs scraping jobs. It uses boto to communicate with SQS. The SQS queue poller runs in a separate thread so that boto's blocking I/O does not stall Scrapy's non-blocking I/O.
See this page for more information on writing services with Twisted and twistd.
crawler.tac
import boto
from twisted.application import service, internet
from twisted.internet import reactor, threads

from scrapy.core.manager import scrapymanager
from scrapy.core.engine import scrapyengine

POLL_INTERVAL = 30  # seconds between polls
QUEUE_NAME = 'scrapy_domains'


class DomainPoller(object):

    def __init__(self):
        conn = boto.connect_sqs()
        # Creates the queue if needed; otherwise returns the existing one.
        self.queue = conn.create_queue(QUEUE_NAME)

    def poll(self):
        # Runs in a worker thread, so calls into Scrapy are passed to the
        # reactor thread with blockingCallFromThread.
        if threads.blockingCallFromThread(reactor, scrapyengine.downloader.has_capacity):
            msgs = self.queue.get_messages()
            if msgs:
                # The message is not deleted here, so SQS will make it
                # visible again once its visibility timeout expires.
                domain = msgs[0].get_body()
                return threads.blockingCallFromThread(reactor, scrapymanager.crawl, domain)


def get_application():
    application = service.Application("Crawling bot")
    poller = DomainPoller()
    # Named poller_service so it does not shadow the imported service module.
    poller_service = internet.TimerService(POLL_INTERVAL, threads.deferToThread, poller.poll)
    poller_service.setServiceParent(application)
    return application


application = get_application()
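The decision made by the poll method (check capacity first, then take at most one domain) can be tried out without Twisted, Scrapy or AWS. The following is only a sketch under that assumption: poll_once, FakeQueue and FakeMessage are hypothetical names, and FakeQueue simply removes a message when it is read, which simplifies SQS's visibility timeout.

```python
class FakeMessage(object):
    """Hypothetical stand-in exposing get_body() like a boto message."""

    def __init__(self, body):
        self._body = body

    def get_body(self):
        return self._body


class FakeQueue(object):
    """Hypothetical stand-in for the get_messages() call used by the poller."""

    def __init__(self, domains):
        self._messages = [FakeMessage(d) for d in domains]

    def get_messages(self):
        # Return at most one message, removing it from the fake queue.
        if not self._messages:
            return []
        return [self._messages.pop(0)]


def poll_once(queue, has_capacity, crawl):
    """Run one polling step; return the domain passed to crawl, or None."""
    if not has_capacity():
        # Do not touch the queue, so another instance can take the work.
        return None
    msgs = queue.get_messages()
    if not msgs:
        return None
    domain = msgs[0].get_body()
    crawl(domain)
    return domain


if __name__ == '__main__':
    crawled = []
    queue = FakeQueue(['example.com'])
    poll_once(queue, lambda: True, crawled.append)
    print(crawled)
```

To run the real service in the foreground, the usual invocation is twistd -ny crawler.tac (-n keeps twistd from daemonizing, -y loads the .tac file).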
