Scrapy Recipes

This page contains some code snippets to perform non-trivial tasks with Scrapy.

Keep in mind that most of this code is untested and unsupported. If you find any errors, please feel free to edit this page and fix them.

How to spoof requests to be HTTP 1.1 compliant

You can do this by overriding the Scrapy HTTP Client Factory, with the following (undocumented) setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.HTTPClientFactory'

Here's a possible implementation of the myproject.downloader module:

from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

class PageGetter(ScrapyHTTPPageGetter):

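    def sendCommand(self, command, path):
        # Write the request line with HTTP/1.1 as the protocol version.
        self.transport.write('%s %s HTTP/1.1\r\n' % (command, path))

class HTTPClientFactory(ScrapyHTTPClientFactory):

    # Use the customized page getter for every connection.
    protocol = PageGetter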
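
As a rough illustration of what the overridden sendCommand puts on the wire, the sketch below builds the same request line in plain Python, without Scrapy or Twisted. The function name build_request_head and its host argument are hypothetical and not part of Scrapy; the Host header is included because HTTP/1.1 requires one on every request.

```python
def build_request_head(command, path, host):
    """Return the request line plus Host header of an HTTP/1.1 request.

    Hypothetical helper for illustration only; it is not a Scrapy API.
    """
    lines = [
        # Same format string as in sendCommand above.
        "%s %s HTTP/1.1" % (command, path),
        # HTTP/1.1 requests must carry a Host header.
        "Host: %s" % host,
    ]
    return ("\r\n".join(lines) + "\r\n").encode("ascii")


head = build_request_head("GET", "/index.html", "example.com")
print(head)
```

A server that sees HTTP/1.1 in the request line may answer with HTTP/1.1 features such as chunked transfer encoding, so check that the client side can handle them before relying on this recipe.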

Avoid downloading pages which exceed a certain size

You can do this by overriding the Scrapy HTTP Client Factory, with the following (undocumented) setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.LimitSizeHTTPClientFactory'

Here's a possible implementation of the myproject.downloader module:

MAX_RESPONSE_SIZE = 1048576 # 1 MiB

from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

class LimitSizePageGetter(ScrapyHTTPPageGetter):

    def handleHeader(self, key, value):
        # Delegate to the parent page getter (not the factory) first.
        ScrapyHTTPPageGetter.handleHeader(self, key, value)
        # Content-Length is the HTTP header that announces the body size.
        if key.lower() == 'content-length' and int(value) > MAX_RESPONSE_SIZE:
            self.connectionLost('oversized')

class LimitSizeHTTPClientFactory(ScrapyHTTPClientFactory):

    protocol = LimitSizePageGetter
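
The size check itself can be tried out without Scrapy. The following is a minimal sketch of that check as a standalone function; the name is_oversized is hypothetical, and it assumes the server announces the body size in the standard Content-Length header.

```python
MAX_RESPONSE_SIZE = 1048576  # 1 MiB, the same limit as in the recipe above


def is_oversized(key, value, max_size=MAX_RESPONSE_SIZE):
    """Return True if this header announces a body larger than max_size.

    Hypothetical helper for illustration only; it is not a Scrapy API.
    """
    if key.lower() != "content-length":
        return False
    try:
        return int(value) > max_size
    except ValueError:
        # Malformed Content-Length: treat the size as unknown, not oversized.
        return False


print(is_oversized("Content-Length", "2097152"))
print(is_oversized("Content-Type", "text/html"))
```

Note that a check like this only catches responses that declare their size up front. Responses sent without a Content-Length header (for example, with chunked transfer encoding) pass through unnoticed.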

Scrapy in Buildout

Link: Buildout

Here is a sample buildout.cfg file that you can use to install Scrapy. It installs Scrapy from the tip of the Mercurial repository. To install a release instead, replace http://hg.scrapy.org/scrapy/archive/tip.tar.gz with the URL of a release tarball.

[buildout]
parts =
        scrapy
        commands

[scrapy]
recipe = taras.recipe.distutils
urls = http://hg.scrapy.org/scrapy/archive/tip.tar.gz

[commands]
recipe = zc.recipe.egg:scripts
eggs =
    scrapy
extra-paths =
     ${scrapy:extra-path}
entry-points =
    scrapy-ctl=scrapy.command.cmdline:execute

To try this buildout, follow these steps:

  1. Create a directory and change into it.
  2. Run wget http://svn.zope.org/*checkout*/zc.buildout/trunk/bootstrap/bootstrap.py to download the bootstrap script.
  3. Copy the configuration above into buildout.cfg.
  4. Run python bootstrap.py
  5. Run ./bin/buildout

This generates ./bin/scrapy-ctl, which you can run to create a project.

Log in by submitting a form before crawling starts

This spider extends XMLFeedSpider, which extends InitSpider, which in turn provides the init_request method. For sites that are not feed-based, you can extend CrawlSpider, which also extends InitSpider, to do the same.

from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.http import FormRequest, Request


KICKASS_USER = 'demologin@ghostmail.com'
KICKASS_PASS = 'demologin'


class KickasstorrentsSpider(XMLFeedSpider):

    domain_name = 'kickasstorrents.com'
    start_urls = ['http://www.kickasstorrents.com/movies/?rss=1']

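    def init_request(self):
        # Fetch the login page first; crawling starts only after initialization.
        return Request('http://www.kickasstorrents.com/account/login/', self._submit_login)

    def _submit_login(self, response):
        # Fill in and submit the login form, then signal that initialization is done.
        return FormRequest.from_response(response, formnumber=2, callback=self.initialized,
                formdata={'email': KICKASS_USER, 'password': KICKASS_PASS})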

    def parse_item(self, response, xxs):
        torrentlink = xxs.select('torrentLink/text()').extract()[0]
        seeds = int(xxs.select('seeds/text()').extract()[0])
        if seeds:
            self.log('%s has %i seeds' % (torrentlink, seeds))
        return [] # fill and return torrent item here


SPIDER = KickasstorrentsSpider()

Persist scraped items using shove

The following pipeline can be used to persist scraped items using shove, a "new generation" shelve.

This pipeline uses two settings:

  • SHOVEITEM_STORE_URI - the URI of the Shove store
  • SHOVEITEM_STORE_OPTS - a dict containing options passed to the Shove() constructor

from string import Template

from shove import Shove
from scrapy.xlib.pydispatch import dispatcher

from scrapy import log
from scrapy.core import signals
from scrapy.conf import settings
from scrapy.core.exceptions import NotConfigured

class ShoveItemPipeline(object):

    def __init__(self):
        self.uritpl = settings['SHOVEITEM_STORE_URI']
        if not self.uritpl:
            raise NotConfigured
        self.opts = settings['SHOVEITEM_STORE_OPTS'] or {}
        self.stores = {}

        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def process_item(self, spider, item):
        guid = str(item.guid)

        if guid in self.stores[spider]:
            if self.stores[spider][guid] == item:
                status = 'old'
            else:
                status = 'upd'
        else:
            status = 'new'

        if status != 'old':
            self.stores[spider][guid] = item
        self.log(spider, item, status)
        return item

    def spider_opened(self, spider):
        uri = Template(self.uritpl).substitute(domain=spider.domain_name)
        self.stores[spider] = Shove(uri, **self.opts)

    def spider_closed(self, spider):
        self.stores[spider].sync()

    def log(self, spider, item, status):
        log.msg("Shove (%s): Item guid=%s" % (status, item.guid), level=log.DEBUG,
            spider=spider)
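
To see how the store URI template and the new/upd/old bookkeeping behave, here is a minimal sketch that runs without Scrapy or shove. It makes several assumptions: a plain dict stands in for the Shove store, items are plain dicts, the helper names store_uri and classify are hypothetical, and the SHOVEITEM_STORE_URI value is only an example of a template containing the $domain placeholder that the pipeline substitutes.

```python
from string import Template

# Example setting value (an assumption); "$domain" is filled in per spider.
SHOVEITEM_STORE_URI = "file:///tmp/scraped-$domain"


def store_uri(uritpl, domain):
    """Expand the URI template the same way spider_opened does."""
    return Template(uritpl).substitute(domain=domain)


def classify(store, guid, item):
    """Return 'new', 'upd' or 'old', writing to the store unless unchanged."""
    if guid not in store:
        status = "new"
    elif store[guid] == item:
        status = "old"
    else:
        status = "upd"
    if status != "old":
        store[guid] = item
    return status


store = {}  # a plain dict standing in for a Shove store
uri = store_uri(SHOVEITEM_STORE_URI, "example.com")
first = classify(store, "1", {"title": "a"})    # first time this guid is seen
second = classify(store, "1", {"title": "a"})   # same guid, same content
third = classify(store, "1", {"title": "b"})    # same guid, changed content
print(uri, first, second, third)
```

Only new and updated items are written, so an unchanged item costs one lookup and one comparison against the stored copy.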