Scrapy 0.8 Changes

This section of the Scrapy wiki documents all new features and backwards-incompatible changes in Scrapy 0.8 since the 0.7 release.

Contents

  1. New features
    1. Added DEFAULT_RESPONSE_ENCODING setting
    2. Added dont_click argument to FormRequest.from_response() method
    3. Added clickdata argument to FormRequest.from_response() method
    4. Added support for HTTP proxies (HttpProxyMiddleware)
    5. Offsite spider middleware now logs messages when filtering out requests
  2. Backwards-incompatible changes
    1. Changed scrapy.utils.response.get_meta_refresh() signature
    2. Removed deprecated scrapy.item.ScrapedItem class
    3. Removed deprecated scrapy.xpath module
    4. Removed deprecated core.signals.domain_open signal
    5. log.msg() now receives a spider argument
    6. Changed core signals domain_opened, domain_closed, …
    7. Changed Item pipeline to use spiders instead of domains
    8. Changed Stats API to use spiders instead of domains
    9. CloseDomain extension moved to …
    10. Removed deprecated SCRAPYSETTINGS_MODULE environment variable
    11. Renamed setting: REQUESTS_PER_DOMAIN to …
    12. Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS
    13. Refactored HTTP Cache middleware
    14. Renamed exception: DontCloseDomain to DontCloseSpider
    15. Renamed extension: DelayedCloseDomain to SpiderCloseDelay
    16. Removed obsolete scrapy.utils.markup.remove_escape_chars function

New features

Added DEFAULT_RESPONSE_ENCODING setting

r1809 | doc

Added dont_click argument to FormRequest.from_response() method

r1813, r1816 | doc

Added clickdata argument to FormRequest.from_response() method

r1802, r1803 | doc

Added support for HTTP proxies (HttpProxyMiddleware)

r1781, r1785 | doc

Offsite spider middleware now logs messages when filtering out requests

r1841 | doc

Backwards-incompatible changes

Changed scrapy.utils.response.get_meta_refresh() signature

scrapy.utils.response.get_meta_refresh() now returns an (interval, absolute_url) tuple, where interval is an int.

r1804

Removed deprecated scrapy.item.ScrapedItem class

Use scrapy.item.Item instead.

r1838

Removed deprecated scrapy.xpath module

Use scrapy.selector instead.

r1836

Removed deprecated core.signals.domain_open signal

Use core.signals.domain_opened instead.

r1822

log.msg() now receives a spider argument

The old domain argument has been deprecated and will be removed in 0.9. For spiders, always use the spider argument and pass a spider reference. If you really want to pass a string, use the component argument instead.

r1822 | new doc

Changed core signals domain_opened, domain_closed, domain_idle

These core signals have been renamed and only pass spider references now. Here's a summary of the changes:

scrapy.core signals (Before)          | scrapy.core signals (Now)
domain_opened(domain, spider)         | spider_opened(spider)
domain_closed(domain, spider, reason) | spider_closed(spider, reason)
domain_idle(domain, spider)           | spider_idle(spider)

To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.

r1822 | #105 | new doc
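As an illustration of the porting tip above, here is a minimal sketch of a handler rewritten for the new spider_closed(spider, reason) signature. ExampleSpider and the closed list are hypothetical stand-ins introduced only for this example, and connecting the handler to the signal dispatcher is deliberately left out.

```python
class ExampleSpider(object):
    """Hypothetical stand-in for a spider; only domain_name is used."""
    domain_name = "example.com"


closed = []


# Before (0.7): def on_domain_closed(domain, spider, reason): ...
def on_spider_closed(spider, reason):
    # The domain string is no longer passed in; derive it from the spider.
    domain = spider.domain_name
    closed.append((domain, reason))


# Simulate the signal being sent with the new arguments.
on_spider_closed(spider=ExampleSpider(), reason="finished")
```

The body of the handler can stay as it was, since domain is rebuilt on its first line.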

Changed Item pipeline to use spiders instead of domains

The domain argument of the process_item() item pipeline method was changed to spider. The new signature is: process_item(spider, item).

To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.

r1827 | #105 | new doc
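The following sketch shows what a pipeline might look like after the change, assuming the process_item(spider, item) signature described above. DomainTaggingPipeline, ExampleSpider, the source_domain key and the plain dict used as an item are all hypothetical, chosen to keep the example self-contained.

```python
class ExampleSpider(object):
    """Hypothetical stand-in for a spider; only domain_name is used."""
    domain_name = "example.com"


class DomainTaggingPipeline(object):
    """Hypothetical pipeline that records which domain an item came from."""

    # Before (0.7): def process_item(self, domain, item): ...
    def process_item(self, spider, item):
        # Recover the old domain string from the spider reference.
        domain = spider.domain_name
        item["source_domain"] = domain
        return item


pipeline = DomainTaggingPipeline()
# A plain dict stands in for a real item here.
result = pipeline.process_item(ExampleSpider(), {"title": "Hello"})
```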

Changed Stats API to use spiders instead of domains

  • StatsCollector methods (set_value, inc_value, etc.) now receive spider references instead of domains.
  • Added the StatsCollector.iter_spider_stats() method.
  • Removed the StatsCollector.list_domains() method.

Also, Stats signals were renamed and now pass around spider references (instead of domains). Here's a summary of the changes:

Stats signals (Before)                    | Stats signals (Now)
stats_domain_opened(domain)               | stats_spider_opened(spider)
stats_domain_closing(domain)              | stats_spider_closing(spider)
stats_domain_closed(domain, domain_stats) | stats_spider_closed(spider, spider_stats)

To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain. The spider_stats argument contains exactly the same data that domain_stats did.

r1849 | #113 | new doc
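A stats handler can be ported the same way. The sketch below assumes the stats_spider_closed(spider, spider_stats) arguments listed above. ExampleSpider, summarize_stats and the item_scraped_count key are hypothetical and used only for illustration.

```python
class ExampleSpider(object):
    """Hypothetical stand-in for a spider; only domain_name is used."""
    domain_name = "example.com"


def summarize_stats(domain, stats):
    # Existing helper keyed by domain string; it needs no changes.
    return "%s: %d items" % (domain, stats.get("item_scraped_count", 0))


# Before (0.7): def on_stats_domain_closed(domain, domain_stats): ...
def on_stats_spider_closed(spider, spider_stats):
    # spider_stats carries the same data that domain_stats did.
    return summarize_stats(spider.domain_name, spider_stats)


summary = on_stats_spider_closed(ExampleSpider(), {"item_scraped_count": 3})
```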

CloseDomain extension moved to scrapy.contrib.closespider.CloseSpider

Its settings were also renamed:

  • CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
  • CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT

r1833 | new doc

Removed deprecated SCRAPYSETTINGS_MODULE environment variable

Use SCRAPY_SETTINGS_MODULE instead.

r1840
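If a deployment still exports the old variable, a small check run before Scrapy starts can ease the transition. This is only a sketch: migrate_settings_env is a hypothetical helper, not part of Scrapy, and it works on any dict-like environment.

```python
OLD_VAR = "SCRAPYSETTINGS_MODULE"
NEW_VAR = "SCRAPY_SETTINGS_MODULE"


def migrate_settings_env(environ):
    """Copy the old variable to the new name if only the old one is set.

    Returns True when a value was copied, False otherwise.
    """
    if OLD_VAR in environ and NEW_VAR not in environ:
        environ[NEW_VAR] = environ[OLD_VAR]
        return True
    return False


# A plain dict stands in for os.environ in this example.
env = {"SCRAPYSETTINGS_MODULE": "myproject.settings"}
migrated = migrate_settings_env(env)
```

In a real launcher script the same function could be called with os.environ.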

Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER

r1830, r1844

Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS

r1830
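To locate settings that need renaming, a project could scan its setting names against the renames listed on this page. The sketch below is a hypothetical helper, not something shipped with Scrapy. The mapping covers only the four renames documented here.

```python
# Old name -> new name, taken from the renames documented on this page.
RENAMED_SETTINGS = {
    "CLOSEDOMAIN_TIMEOUT": "CLOSESPIDER_TIMEOUT",
    "CLOSEDOMAIN_ITEMCOUNT": "CLOSESPIDER_ITEMCOUNT",
    "REQUESTS_PER_DOMAIN": "CONCURRENT_REQUESTS_PER_SPIDER",
    "CONCURRENT_DOMAINS": "CONCURRENT_SPIDERS",
}


def find_old_settings(setting_names):
    """Return {old_name: new_name} for every pre-0.8 name found."""
    return dict(
        (name, RENAMED_SETTINGS[name])
        for name in setting_names
        if name in RENAMED_SETTINGS
    )


found = find_old_settings(["BOT_NAME", "CONCURRENT_DOMAINS", "CLOSEDOMAIN_TIMEOUT"])
```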

Refactored HTTP Cache middleware

The HTTP Cache middleware has been heavily refactored. It retains the same functionality, except for the domain sectorization, which was removed.

r1843 | new doc

Renamed exception: DontCloseDomain to DontCloseSpider

r1859 | #120

Renamed extension: DelayedCloseDomain to SpiderCloseDelay

r1861 | #121

Removed obsolete scrapy.utils.markup.remove_escape_chars function

Use scrapy.utils.markup.replace_escape_chars instead.

r1865