Scrapy 0.8 Changes
This section of the Scrapy wiki documents all new features and backwards-incompatible changes in Scrapy 0.8 since the 0.7 release.
Contents
- New features
- Backwards-incompatible changes
- Changed scrapy.utils.response.get_meta_refresh() signature
- Removed deprecated scrapy.item.ScrapedItem class
- Removed deprecated scrapy.xpath module
- Removed deprecated core.signals.domain_open signal
- log.msg() now receives a spider argument
- Changed core signals domain_opened, domain_closed, …
- Changed Item pipeline to use spiders instead of domains
- Changed Stats API to use spiders instead of domains
- CloseDomain extension moved to …
- Removed deprecated SCRAPYSETTINGS_MODULE environment variable
- Renamed setting: REQUESTS_PER_DOMAIN to …
- Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS
- Refactored HTTP Cache middleware
- Renamed exception: DontCloseDomain to DontCloseSpider
- Renamed extension: DelayedCloseDomain to SpiderCloseDelay
- Removed obsolete scrapy.utils.markup.remove_escape_chars function
New features
Added DEFAULT_RESPONSE_ENCODING setting
Added dont_click argument to FormRequest.from_response() method
Added clickdata argument to FormRequest.from_response() method
Added support for HTTP proxies (HttpProxyMiddleware)
Offsite spider middleware now logs messages when filtering out requests
Backwards-incompatible changes
Changed scrapy.utils.response.get_meta_refresh() signature
scrapy.utils.response.get_meta_refresh() now returns an (interval, absolute_url) tuple, where interval is an int.
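To illustrate the new return shape, here is a minimal sketch that runs without Scrapy. fake_get_meta_refresh is a hypothetical stand-in, not Scrapy's implementation; its arguments, its regular expression, and the (None, None) result when no meta refresh is found are assumptions made for this example only.

```python
import re
from urllib.parse import urljoin

# Hypothetical stand-in, NOT Scrapy's implementation: it only mimics the
# documented return shape, (interval, absolute_url), with interval as an int.
_META_REFRESH_RE = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*(\d+)\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def fake_get_meta_refresh(base_url, html):
    """Return (interval, absolute_url) for the first meta refresh tag found."""
    match = _META_REFRESH_RE.search(html)
    if match is None:
        # Assumption of this sketch: no tag found yields (None, None).
        return None, None
    interval = int(match.group(1))
    # Resolve a relative target against the page URL to get an absolute URL.
    return interval, urljoin(base_url, match.group(2))


# Callers unpack two values: an int interval and an absolute URL.
interval, url = fake_get_meta_refresh(
    "http://example.com/a/",
    '<meta http-equiv="refresh" content="5; url=next.html">',
)
```

Code that treated the return value differently in 0.7 needs to unpack both elements of the tuple.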
Removed deprecated scrapy.item.ScrapedItem class
Use scrapy.item.Item instead.
Removed deprecated scrapy.xpath module
Use scrapy.selector instead.
Removed deprecated core.signals.domain_open signal
Use core.signals.domain_opened instead.
log.msg() now receives a spider argument
The old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the spider argument and pass spider references. If you really want to pass a string, use the component argument instead.
Changed core signals domain_opened, domain_closed, domain_idle
These core signals have been renamed and only pass spider references now. Here's a summary of the changes:
| scrapy.core signals (Before) | scrapy.core signals (Now) |
| domain_opened(domain, spider) | spider_opened(spider) |
| domain_closed(domain, spider, reason) | spider_closed(spider, reason) |
| domain_idle(domain, spider) | spider_idle(spider) |
To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.
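The porting rule above can be sketched with plain functions. FakeSpider is a hypothetical stand-in that only carries a domain_name attribute, and the handler names are made up for illustration; how the handlers get connected to the signals is left out on purpose.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; only domain_name is used here."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


opened = []


# Before (0.7), the handler took both arguments:
#     def on_domain_opened(domain, spider):
#         opened.append(domain)
def on_spider_opened(spider):
    # spider.domain_name takes the place of the old domain argument.
    opened.append(spider.domain_name)


# Before (0.7): def on_domain_closed(domain, spider, reason)
def on_spider_closed(spider, reason):
    return "%s closed (%s)" % (spider.domain_name, reason)


spider = FakeSpider("example.com")
on_spider_opened(spider)
message = on_spider_closed(spider, "finished")
```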
Changed Item pipeline to use spiders instead of domains
The domain argument of the process_item() item pipeline method was changed to spider. The new signature is: process_item(spider, item).
To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.
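As a rough sketch of a ported pipeline, the class below uses the 0.8 argument order. DomainTagPipeline and FakeSpider are hypothetical names, and a plain dict stands in for an item so the example runs without Scrapy.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; only domain_name is used here."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


class DomainTagPipeline:
    """Hypothetical pipeline written against the 0.8 signature."""

    # Before (0.7): def process_item(self, domain, item)
    def process_item(self, spider, item):
        # The old domain string is still reachable through the spider.
        item["domain"] = spider.domain_name
        return item


pipeline = DomainTagPipeline()
result = pipeline.process_item(FakeSpider("example.com"), {"title": "Hello"})
```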
Changed Stats API to use spiders instead of domains
- StatsCollector was changed to receive spider references (instead of domains) in its methods (set_value, inc_value, etc).
- Added StatsCollector.iter_spider_stats() method.
- Removed StatsCollector.list_domains() method.
Also, Stats signals were renamed and now pass around spider references (instead of domains). Here's a summary of the changes:
| Stats signals (Before) | Stats signals (Now) |
| stats_domain_opened(domain) | stats_spider_opened(spider) |
| stats_domain_closing(domain) | stats_spider_closing(spider) |
| stats_domain_closed(domain, domain_stats) | stats_spider_closed(spider, spider_stats) |
To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain. spider_stats contains exactly the same data as domain_stats.
CloseDomain extension moved to scrapy.contrib.closespider.CloseSpider
Its settings were also renamed:
- CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
- CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT
Removed deprecated SCRAPYSETTINGS_MODULE environment variable
Use SCRAPY_SETTINGS_MODULE instead.
Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER
Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS
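The setting renames listed on this page can be applied mechanically. The helper below is a small sketch, not part of Scrapy: port_settings and RENAMED_SETTINGS are hypothetical names, and a plain dict stands in for a project's settings.

```python
# Old-to-new setting names, taken from the renames listed on this page.
RENAMED_SETTINGS = {
    "CLOSEDOMAIN_TIMEOUT": "CLOSESPIDER_TIMEOUT",
    "CLOSEDOMAIN_ITEMCOUNT": "CLOSESPIDER_ITEMCOUNT",
    "REQUESTS_PER_DOMAIN": "CONCURRENT_REQUESTS_PER_SPIDER",
    "CONCURRENT_DOMAINS": "CONCURRENT_SPIDERS",
}


def port_settings(old_settings):
    """Return a copy of old_settings with 0.7 names replaced by 0.8 names."""
    return {
        RENAMED_SETTINGS.get(name, name): value
        for name, value in old_settings.items()
    }


ported = port_settings(
    {"CONCURRENT_DOMAINS": 8, "REQUESTS_PER_DOMAIN": 4, "BOT_NAME": "mybot"}
)
```

Settings that were not renamed, such as BOT_NAME in this example, pass through unchanged.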
Refactored HTTP Cache middleware
The HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization, which was removed.
Renamed exception: DontCloseDomain to DontCloseSpider
Renamed extension: DelayedCloseDomain to SpiderCloseDelay
Removed obsolete scrapy.utils.markup.remove_escape_chars function
Use scrapy.utils.markup.replace_escape_chars instead.
