Using Parsley and Parselets with Scrapy
"Parsley is a simple language for extracting structured data from web pages. It consists of an powerful Selector Language wrapped with a JSON Structure that can represent page-wide formatting."
We can get Parsley language site parsers (parselets) from Parselets site.
"Parselets.com is a central repository for user-created APIs to the web, called Parselets. Parselets are snippets of parsing code written in a language called Parsley, which is a familiar combination of CSS, XPath, Regular Expressions, and JSON."
In this example, we integrate Parsley with Scrapy using a new class of Item, ParsleyItem that defines its fields from a parselet code, and extend the CrawlSpider to create ParsleySpider that provides a method to parse a response with a parselet and return a ParsleyItem.
The ParsleyItem and ParsleySpider code:
from pyparsley import PyParsley from scrapy.contrib.spiders import CrawlSpider from scrapy.item import Item, Field class ParsleyItem(Item): def __init__(self, parslet_code, *args, **kwargs): for name in parslet_code.keys(): self.fields[name] = Field() super(ParsleyItem, self).__init__(*args, **kwargs) class ParsleySpider(CrawlSpider): parslet_code = {} def parse_parsley(self, response): parslet = PyParsley(self.parslet_code, output='python') return ParsleyItem(self.parslet_code, parslet.parse(string=response.body))
To show this classes usage, we will reimplement the Youtube Spider example from Community Spiders.
from scrapy.conf import settings from scrapy.contrib.spiders import Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor YOUTUBE_PARSLET = { "title": "h1", "desc": ".description", "rating": ".ratingL @title", "embed": "#embed_code @value" } class YoutubeSpider(ParsleySpider): query = settings.get('QUERY') domain_name = 'youtube.com' start_urls = ['http://www.youtube.com/results?search_query=%s&page=1' % query] rules = ( Rule(SgmlLinkExtractor(allow=(r'results\?search_query=%s&page=\d+' % query,))), Rule(SgmlLinkExtractor(allow=(r'watch\?v=',), restrict_xpaths=['//div[@id="results-main-content"]']), 'parse_parsley'), ) parslet_code = YOUTUBE_PARSLET SPIDER = YoutubeSpider()
