Scrapy Recipes
This page contains some code snippets to perform non-trivial tasks with Scrapy.
Keep in mind that most of this code is untested and unsupported. If you find any errors, please feel free to edit this page and fix them.
How to spoof requests to be HTTP 1.1 compliant
You can do this by overriding the Scrapy HTTP Client Factory, with the following (undocumented) setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.HTTPClientFactory'
Here's a possible implementation of the myproject.downloader module:
from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

class PageGetter(ScrapyHTTPPageGetter):

    def sendCommand(self, command, path):
        # Write the request line with HTTP/1.1 as the protocol version.
        self.transport.write('%s %s HTTP/1.1\r\n' % (command, path))

class HTTPClientFactory(ScrapyHTTPClientFactory):
    # Use the page getter above for every connection.
    protocol = PageGetter
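To see what the overridden sendCommand actually puts on the wire, here is a minimal standalone sketch that builds the same request line outside of Scrapy. The function name build_request_line is hypothetical and is not part of Scrapy or Twisted.

```python
def build_request_line(command, path, version="1.1"):
    """Return the first line of an HTTP request as text."""
    return "%s %s HTTP/%s\r\n" % (command, path, version)


# Show the exact characters, including the trailing CRLF.
print(repr(build_request_line("GET", "/index.html")))
```

Keep in mind that HTTP/1.1 also requires a Host header on every request, and servers may answer HTTP/1.1 requests with chunked responses, so changing the version string alone only changes how the request announces itself.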
Avoid downloading pages which exceed a certain size
You can do this by overriding the Scrapy HTTP Client Factory, with the following (undocumented) setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.LimitSizeHTTPClientFactory'
Here's a possible implementation of the myproject.downloader module:
from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

MAX_RESPONSE_SIZE = 1048576  # 1 MiB

class LimitSizePageGetter(ScrapyHTTPPageGetter):

    def handleHeader(self, key, value):
        # Let the base page getter process the header first.
        ScrapyHTTPPageGetter.handleHeader(self, key, value)
        # The body size is announced in the Content-Length header.
        if key.lower() == 'content-length' and int(value) > MAX_RESPONSE_SIZE:
            self.connectionLost('oversized')

class LimitSizeHTTPClientFactory(ScrapyHTTPClientFactory):
    protocol = LimitSizePageGetter
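The size check in the page getter boils down to comparing a declared size against a limit. Here is a minimal standalone sketch of that decision, assuming the size is read from the standard Content-Length header. The helper name is_oversized and the plain dict of headers are illustrative assumptions, not Scrapy API.

```python
MAX_RESPONSE_SIZE = 1048576  # 1 MiB, the same limit used in the recipe


def is_oversized(headers, max_size=MAX_RESPONSE_SIZE):
    """Return True if the declared Content-Length exceeds max_size.

    headers is a plain dict mapping header names to string values.
    """
    for key, value in headers.items():
        if key.lower() == "content-length":
            try:
                return int(value) > max_size
            except ValueError:
                # Malformed value: the headers alone cannot tell us the size.
                return False
    # No Content-Length at all (for example a chunked response).
    return False


print(is_oversized({"Content-Length": "2000000"}))
print(is_oversized({"content-length": "512"}))
```

A check like this only sees what the server declares. Responses that omit Content-Length pass through, so a hard limit would also need to count the bytes actually received.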
Scrapy in Buildout
Link: Buildout
Here is a sample buildout.cfg file that you can use to install Scrapy. It installs the Scrapy source from the tip of the Mercurial repository. To install a release instead, replace http://hg.scrapy.org/scrapy/archive/tip.tar.gz with the URL of a release tarball.
[buildout]
parts = scrapy commands

[scrapy]
recipe = taras.recipe.distutils
urls = http://hg.scrapy.org/scrapy/archive/tip.tar.gz

[commands]
recipe = zc.recipe.egg:scripts
eggs = scrapy
extra-paths = ${scrapy:extra-path}
entry-points = scrapy-ctl=scrapy.command.cmdline:execute
To try this buildout, follow these steps:
- create a directory and change into it
- run wget http://svn.zope.org/*checkout*/zc.buildout/trunk/bootstrap/bootstrap.py to download the bootstrap script
- copy the configuration above into buildout.cfg
- run python bootstrap.py
- run ./bin/buildout
This will generate ./bin/scrapy-ctl that you can run to create a project.
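Before running the bootstrap, it can help to check that every part listed in the [buildout] section has a section of its own. The sketch below does this with Python's standard configparser module. Buildout uses its own parser, so treat this only as a rough sanity check. The function name missing_parts is hypothetical.

```python
import configparser

BUILDOUT_CFG = """\
[buildout]
parts = scrapy commands

[scrapy]
recipe = taras.recipe.distutils
urls = http://hg.scrapy.org/scrapy/archive/tip.tar.gz

[commands]
recipe = zc.recipe.egg:scripts
eggs = scrapy
extra-paths = ${scrapy:extra-path}
entry-points = scrapy-ctl=scrapy.command.cmdline:execute
"""


def missing_parts(cfg_text):
    """Return the names listed in [buildout] parts that have no section."""
    # Disable interpolation so ${section:option} references are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    parts = parser.get("buildout", "parts").split()
    return [name for name in parts if not parser.has_section(name)]


print(missing_parts(BUILDOUT_CFG))
```

An empty list means every part named in the configuration is defined.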
Log in by submitting a form before starting the actual crawl
This spider extends XMLFeedSpider, which in turn extends InitSpider. InitSpider provides the init_request method. To do the same on sites that are not feed-based, you can extend CrawlSpider, which also extends InitSpider.
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.http import FormRequest, Request

KICKASS_USER = 'demologin@ghostmail.com'
KICKASS_PASS = 'demologin'

class KickasstorrentsSpider(XMLFeedSpider):
    domain_name = 'kickasstorrents.com'
    start_urls = ['http://www.kickasstorrents.com/movies/?rss=1']

    def init_request(self):
        # Fetch the login page first.
        return Request('http://www.kickasstorrents.com/account/login/', self._submit_login)

    def _submit_login(self, response):
        # Submit the login form, then signal that initialization is done.
        return FormRequest.from_response(response,
            formnumber=2,
            callback=self.initialized,
            formdata={'email': KICKASS_USER, 'password': KICKASS_PASS})

    def parse_item(self, response, xxs):
        torrentlink = xxs.select('torrentLink/text()').extract()[0]
        seeds = int(xxs.select('seeds/text()').extract()[0])
        if seeds:
            self.log('%s has %i seeds' % (torrentlink, seeds))
        return []  # fill and return torrent item here

SPIDER = KickasstorrentsSpider()
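The per-item extraction in parse_item can be tried without Scrapy or a live login. The following is a minimal sketch using the standard library's ElementTree on a made-up feed item. The sample XML and the function name extract_torrent are assumptions for illustration, and the real feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed item, used only to exercise the extraction logic.
SAMPLE_ITEM = """
<item>
  <title>Example</title>
  <torrentLink>http://example.com/example.torrent</torrentLink>
  <seeds>12</seeds>
</item>
"""


def extract_torrent(item_xml):
    """Return (torrentlink, seeds) for one item, or None if it has no seeds."""
    node = ET.fromstring(item_xml.strip())
    torrentlink = node.findtext("torrentLink")
    seeds = int(node.findtext("seeds", default="0"))
    if not seeds:
        return None
    return torrentlink, seeds


print(extract_torrent(SAMPLE_ITEM))
```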
Persist scraped items using shove
The following pipeline can be used to persist scraped items using shove, a "new generation" shelve.
This pipeline uses two settings:
- SHOVEITEM_STORE_URI - the URI of the Shove store
- SHOVEITEM_STORE_OPTS - a dict containing options passed to the Shove() constructor
from string import Template

from shove import Shove
from scrapy.xlib.pydispatch import dispatcher

from scrapy import log
from scrapy.core import signals
from scrapy.conf import settings
from scrapy.core.exceptions import NotConfigured

class ShoveItemPipeline(object):

    def __init__(self):
        self.uritpl = settings['SHOVEITEM_STORE_URI']
        if not self.uritpl:
            raise NotConfigured
        self.opts = settings['SHOVEITEM_STORE_OPTS'] or {}
        self.stores = {}
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def process_item(self, spider, item):
        guid = str(item.guid)
        # Classify the item as new, updated or unchanged.
        if guid in self.stores[spider]:
            if self.stores[spider][guid] == item:
                status = 'old'
            else:
                status = 'upd'
        else:
            status = 'new'
        # Only write to the store when something changed.
        if not status == 'old':
            self.stores[spider][guid] = item
        self.log(spider, item, status)
        return item

    def spider_opened(self, spider):
        # One store per spider; the URI template receives the spider's domain.
        uri = Template(self.uritpl).substitute(domain=spider.domain_name)
        self.stores[spider] = Shove(uri, **self.opts)

    def spider_closed(self, spider):
        self.stores[spider].sync()

    def log(self, spider, item, status):
        log.msg("Shove (%s): Item guid=%s" % (status, item.guid), level=log.DEBUG,
                spider=spider)
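The core of process_item is a three-way classification (new, updated, unchanged), and spider_opened fills in the store URI from a template. The sketch below reproduces both steps with a plain dict standing in for the Shove store, so it runs without shove or Scrapy installed. The function name classify and the URI value are hypothetical.

```python
from string import Template


def classify(store, guid, item):
    """Return 'new', 'upd' or 'old', writing the item unless it is unchanged."""
    if guid in store:
        status = "old" if store[guid] == item else "upd"
    else:
        status = "new"
    if status != "old":
        store[guid] = item
    return status


# Hypothetical URI template; the pipeline substitutes the spider's domain.
uri = Template("file://data/${domain}").substitute(domain="example.com")
print(uri)

store = {}  # plain dict standing in for Shove(uri)
print(classify(store, "1", {"title": "a"}))
print(classify(store, "1", {"title": "a"}))
print(classify(store, "1", {"title": "b"}))
```

Because the pipeline uses Template.substitute with a domain argument, the SHOVEITEM_STORE_URI setting can contain a $domain or ${domain} placeholder to get a separate store per spider.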
