SEP-001 - New Adaptors API

Introduction

This page presents usage scenarios for the two proposed adaptor APIs (ItemAdaptor and ItemForm) and compares them with the old, deprecated RobustItem. The best one will be chosen as the new mechanism for populating items with scraped data.

Contenders

  • RobustItem
  • ItemAdaptor
  • ItemForm

Usage Scenarios

Defining adaptors

RobustItem

ItemAdaptor

class NewsAdaptor(ItemAdaptor):
    item_class = NewsItem

    url = adaptor(extract, remove_tags(), unquote(), strip)
    headline = adaptor(extract, remove_tags(), unquote(), strip)
    summary = adaptor(extract, remove_tags(), unquote(), strip)
    content = adaptor(extract, remove_tags(), unquote(), strip)

ItemForm

class NewsForm(ItemForm):
    url = adaptor(extract, remove_tags(), unquote(), strip)
    headline = adaptor(extract, remove_tags(), unquote(), strip)
    summary = adaptor(extract, remove_tags(), unquote(), strip)
    content = adaptor(extract, remove_tags(), unquote(), strip)
  • ItemForm doesn't know about items, so maybe it should just be called Form :)
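
The definitions above rely on adaptor(...) chaining several functions into one. Below is a minimal sketch of that idea, assuming the functions are applied left to right. The helper bodies are simplified stand-ins rather than the proposed implementations (in particular, unquote is assumed to decode HTML entities), and extract is left out because it depends on the selector object.

```python
import re
from html import unescape


def adaptor(*funcs):
    """Chain funcs into one callable, applied left to right (assumed order)."""
    def pipeline(value):
        for func in funcs:
            value = func(value)
        return value
    return pipeline


def remove_tags():
    """Stand-in factory: returns a function that drops anything shaped like a tag."""
    tag_re = re.compile(r"<[^>]+>")
    return lambda text: tag_re.sub("", text)


def unquote():
    """Stand-in factory: assumed here to decode HTML entities such as &amp;."""
    return unescape


# The page passes strip uncalled, so it is treated as a plain function.
strip = str.strip

clean = adaptor(remove_tags(), unquote(), strip)
result = clean("  <b>Breaking &amp; new</b> ")
```

Helpers written as factories (remove_tags(), unquote()) can take configuration when called, while plain functions such as strip are passed as they are.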

Creating an Item

RobustItem

ItemAdaptor

ia = NewsAdaptor(response)
ia.url = response.url
ia.headline = xhs.x('//h1[@class="headline"]')
ia.summary = xhs.x('//div[@class="summary"]')
ia.content = xhs.x('//div[@id="body"]')

return ia.item_instance

ItemForm

nf = NewsForm(response)
nf.url = response.url
nf.headline = xhs.x('//h1[@class="headline"]')
nf.summary = xhs.x('//div[@class="summary"]')
nf.content = xhs.x('//div[@id="body"]')

return NewsItem(nf.as_dict())
  • NewsForm can be used to initialize any kind of Item or a portion of an Item
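
A rough sketch of why a form that only produces a dict can feed any item class. The Form base class, the dict-based item classes, and the TeaserItem name are all hypothetical and exist only for this illustration.

```python
def adaptor(*funcs):
    """Chain funcs into one callable, applied left to right (assumed order)."""
    def pipeline(value):
        for func in funcs:
            value = func(value)
        return value
    return pipeline


class Form:
    """Hypothetical minimal form: assignments run the adaptor declared on the class."""

    def __init__(self):
        # Bypass our own __setattr__ when creating the storage dict.
        object.__setattr__(self, "_values", {})

    def __setattr__(self, name, raw):
        pipeline = getattr(type(self), name)  # adaptor declared for this field
        self._values[name] = pipeline(raw)

    def as_dict(self):
        return dict(self._values)


class NewsForm(Form):
    url = adaptor(str.strip)
    headline = adaptor(str.strip)


class NewsItem(dict):
    """Stand-in item: assumed to accept a dict of field values."""


class TeaserItem(dict):
    """Second stand-in item that only cares about the headline."""


nf = NewsForm()
nf.url = " http://example.com/a "
nf.headline = " Hello "

data = nf.as_dict()
item = NewsItem(data)                               # a whole item
teaser = TeaserItem({"headline": data["headline"]})  # a portion of the data
```

Because the form never references an item class, the same cleaned values can populate a full item or just part of one.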

Using different adaptors per Spider/Site

RobustItem

ItemAdaptor

class SiteNewsAdaptor(NewsAdaptor):
    published = adaptor(HtmlNewsAdaptor.published, to_date('%d.%m.%Y'))

ItemForm

class SiteNewsForm(NewsForm):
    published = adaptor(HtmlNewsAdaptor.published, to_date('%d.%m.%Y'))
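
A small sketch of how a site-specific subclass might extend an inherited pipeline. HtmlNewsAdaptor is not defined on this page, so the example chains the parent form's own published adaptor instead. The to_date body and the base published field are assumptions made for illustration.

```python
from datetime import date, datetime


def adaptor(*funcs):
    """Chain funcs into one callable, applied left to right (assumed order)."""
    def pipeline(value):
        for func in funcs:
            value = func(value)
        return value
    return pipeline


def to_date(fmt):
    """Stand-in factory: returns a function that parses text using fmt."""
    return lambda text: datetime.strptime(text, fmt).date()


class NewsForm:
    # Generic behaviour (assumed): only trim whitespace.
    published = adaptor(str.strip)


class SiteNewsForm(NewsForm):
    # Site-specific behaviour: reuse the parent pipeline, then parse the date.
    published = adaptor(NewsForm.published, to_date('%d.%m.%Y'))


generic = NewsForm.published(" 24.12.2008 ")
specific = SiteNewsForm.published(" 24.12.2008 ")
```

Since an adaptor is itself a callable, it can be passed as the first step of another adaptor, which is what lets the subclass build on the parent instead of repeating it.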

Checking the value of an item being extracted

RobustItem

ItemAdaptor

ia = NewsAdaptor(response)
ia.headline = xhs.x('//h1[@class="headline"]')
if not ia.headline:
    ia.headline = xhs.x('//h1[@class="title"]')

ItemForm

nf = NewsForm(response)
nf.headline = xhs.x('//h1[@class="headline"]')
if not nf.headline:
    nf.headline = xhs.x('//h1[@class="title"]')
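
The fallback pattern above only works if reading a field returns the adapted value rather than the raw input. One possible way to get that behaviour is a descriptor, sketched below. The Field class is hypothetical, and plain strings stand in for the XPath results.

```python
class Field:
    """Hypothetical descriptor: adapts on assignment, returns the adapted value on read."""

    def __init__(self, *funcs):
        self.funcs = funcs

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, raw):
        value = raw
        for func in self.funcs:
            value = func(value)
        obj.__dict__[self.name] = value


class NewsForm:
    headline = Field(str.strip)


nf = NewsForm()
nf.headline = "   "                  # first selector matched only whitespace
first_attempt = nf.headline
if not nf.headline:                  # the adapted value is empty, so fall back
    nf.headline = " Fallback title "
```

The raw string "   " is truthy, but its adapted form is empty, so the check sees the cleaned value and triggers the fallback.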

Adding a value to a list attribute/field

RobustItem

ItemAdaptor

ItemForm

Passing run-time arguments to adaptors

RobustItem

ItemAdaptor

ItemForm