SEP-001 - New Adaptors API
Introduction
This page presents usage scenarios for the two proposed adaptor APIs (ItemAdaptor, ItemForm) and compares them with the old, deprecated RobustItem. The best one will be chosen as the new mechanism for populating items with scraped data.
Contenders
- RobustItem
- ItemAdaptor
- ItemForm
Usage Scenarios
- Defining adaptors
- Creating an Item
- Using different adaptors per Spider/Site
- Checking the value of an item being extracted
- Adding a value to a list attribute/field
- Passing run-time arguments to adaptors
Defining adaptors
RobustItem
ItemAdaptor
class NewsAdaptor(ItemAdaptor):
    item_class = NewsItem
    url = adaptor(extract, remove_tags(), unquote(), strip)
    headline = adaptor(extract, remove_tags(), unquote(), strip)
    summary = adaptor(extract, remove_tags(), unquote(), strip)
    content = adaptor(extract, remove_tags(), unquote(), strip)
ItemForm
class NewsForm(ItemForm):
    url = adaptor(extract, remove_tags(), unquote(), strip)
    headline = adaptor(extract, remove_tags(), unquote(), strip)
    summary = adaptor(extract, remove_tags(), unquote(), strip)
    content = adaptor(extract, remove_tags(), unquote(), strip)
- ItemForm doesn't know about items, so it could perhaps be called just Form :)
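Both definitions above rely on `adaptor(...)` chaining several processing steps into one callable. A minimal sketch of that idea follows, assuming the steps are applied left to right. The `adaptor`, `remove_tags` and `strip` functions here are hypothetical stand-ins written for illustration, not the proposal's actual implementation, and `html.unescape` stands in for an entity-decoding step.

```python
import re
from functools import reduce
from html import unescape


def adaptor(*funcs):
    """Chain callables left to right into one pipeline (hypothetical stand-in)."""
    def pipeline(value):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, func: func(acc), funcs, value)
    return pipeline


def remove_tags():
    """Factory returning a function that strips HTML tags (illustrative only)."""
    tag_re = re.compile(r"<[^>]+>")
    return lambda text: tag_re.sub("", text)


def strip(text):
    """Trim leading and trailing whitespace."""
    return text.strip()


# Factories such as remove_tags() are called once at definition time,
# while plain functions such as strip are passed as-is.
clean = adaptor(remove_tags(), unescape, strip)
result = clean("  <h1>Fish &amp; chips</h1> ")
```

This mirrors the mix of called factories and bare functions seen in the field definitions, where `remove_tags()` is invoked but `strip` is not.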
Creating an Item
RobustItem
ItemAdaptor
ia = NewsAdaptor(response)
ia.url = response.url
ia.headline = xhs.x('//h1[@class="headline"]')
ia.summary = xhs.x('//div[@class="summary"]')
ia.content = xhs.x('//div[@id="body"]')
return ia.item_instance
ItemForm
nf = NewsForm(response)
nf.url = response.url
nf.headline = xhs.x('//h1[@class="headline"]')
nf.summary = xhs.x('//div[@class="summary"]')
nf.content = xhs.x('//div[@id="body"]')
return NewsItem(nf.as_dict())
- NewsForm can be used to initialize any kind of Item or a portion of an Item
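The note above can be illustrated with a toy form that adapts values on assignment and exposes them through `as_dict()`. This is only a sketch: the `Field` descriptor, the `Form` base class and the `TeaserItem` class are hypothetical names introduced here, and plain dataclasses stand in for real items.

```python
from dataclasses import dataclass


class Field:
    """Descriptor that runs a value through `func` when it is assigned (hypothetical)."""

    def __init__(self, func=lambda value: value):
        self.func = func

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._values.get(self.name)

    def __set__(self, obj, raw):
        obj._values[self.name] = self.func(raw)


class Form:
    """Toy form that only collects adapted values; it knows nothing about items."""

    def __init__(self):
        self._values = {}

    def as_dict(self):
        return dict(self._values)


class NewsForm(Form):
    url = Field(str.strip)
    headline = Field(str.strip)


@dataclass
class NewsItem:
    url: str
    headline: str


@dataclass
class TeaserItem:
    headline: str


nf = NewsForm()
nf.url = " http://example.com/news/1 "
nf.headline = " Hello "

# The same form output initializes a full item or only a portion of one.
news = NewsItem(**nf.as_dict())
teaser = TeaserItem(headline=nf.as_dict()["headline"])
```

Because the form returns a plain dictionary, the choice of item class is left entirely to the caller.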
Using different adaptors per Spider/Site
RobustItem
ItemAdaptor
class SiteNewsAdaptor(NewsAdaptor):
    published = adaptor(HtmlNewsAdaptor.published, to_date('%d.%m.%Y'))
ItemForm
class SiteNewsForm(NewsForm):
    published = adaptor(HtmlNewsAdaptor.published, to_date('%d.%m.%Y'))
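The per-site pattern above extends an inherited pipeline with one extra step. A rough sketch of how that composition could behave is shown below, assuming `to_date(fmt)` returns a parser for the given `strptime` format. All names are re-implemented here as hypothetical stand-ins, and `staticmethod` is used only so the toy pipelines can live on plain classes.

```python
from datetime import date, datetime
from functools import reduce


def adaptor(*funcs):
    """Chain callables left to right (hypothetical stand-in)."""
    def pipeline(value):
        return reduce(lambda acc, func: func(acc), funcs, value)
    return pipeline


def to_date(fmt):
    """Factory returning a parser for dates in the given strptime format."""
    return lambda text: datetime.strptime(text, fmt).date()


class NewsForm:
    # Generic behaviour: only trim whitespace.
    published = staticmethod(adaptor(str.strip))


class SiteNewsForm(NewsForm):
    # Site-specific behaviour: reuse the parent's pipeline, then parse the date.
    published = staticmethod(adaptor(NewsForm.published, to_date("%d.%m.%Y")))


generic = NewsForm.published(" 24.12.2008 ")
parsed = SiteNewsForm.published(" 24.12.2008 ")
```

The subclass never repeats the parent's steps; it wraps them, so a change to the generic pipeline carries over to every site-specific one.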
Checking the value of an item being extracted
RobustItem
ItemAdaptor
ia = NewsAdaptor(response)
ia.headline = xhs.x('//h1[@class="headline"]')
if not ia.headline:
    ia.headline = xhs.x('//h1[@class="title"]')
ItemForm
nf = NewsForm(response)
nf.headline = xhs.x('//h1[@class="headline"]')
if not nf.headline:
    nf.headline = xhs.x('//h1[@class="title"]')
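The check-then-fall-back pattern in both snippets can be tried outside a spider with the standard library alone. The sketch below uses `xml.etree.ElementTree` and its limited XPath support in place of the `xhs` selector, and `first_match` is a hypothetical helper, not part of either proposed API.

```python
import xml.etree.ElementTree as ET


def first_match(root, paths):
    """Return the stripped text of the first path with non-empty text, else None."""
    for path in paths:
        node = root.find(path)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None


doc = ET.fromstring(
    '<html><body><h1 class="title"> Breaking news </h1></body></html>'
)

# Try the preferred location first, then fall back to the alternative.
headline = first_match(doc, [".//h1[@class='headline']", ".//h1[@class='title']"])
missing = first_match(doc, [".//h1[@class='headline']"])
```

Reading the value back before deciding on a fallback is what both proposed APIs need to support, which is why attribute access has to return the adapted value rather than the raw input.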
