Changeset 1841:59d784dfbf9a
- Timestamp: 11/12/09 10:17:21 (9 months ago)
- Author: Pablo Hoffman <pablo@…>
- Branch: default
- Message: made offsite middleware log messages when filtering out requests
- Files:
-
-
|
r1785
|
r1841
|
|
| 119 | 119 | scrapy-ctl.py runspider my_spider.py |
| 120 | 120 | |
| | 121 | I get "Filtered offsite request" messages. How can I fix them? |
| | 122 | -------------------------------------------------------------- |
| | 123 | |
| | 124 | Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a |
| | 125 | problem, so you may not need to fix them. |
| | 126 | |
| | 127 | Those messages are logged by the Offsite Spider Middleware, which is a spider |
| | 128 | middleware (enabled by default) whose purpose is to filter out requests to |
| | 129 | domains outside the ones covered by the spider. |
| | 130 | |
| | 131 | For more info see: |
| | 132 | :class:`~scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware`. |
-
|
r1690
|
r1841
|
|
| 211 | 211 | Filters out Requests for URLs outside the domains covered by the spider. |
| 212 | 212 | |
| 213 | | This middleware filters out every request whose host names doesn't match |
| | 213 | This middleware filters out every request whose host names don't match |
| 214 | 214 | :attr:`~scrapy.spider.BaseSpider.domain_name`, or the spider |
| 215 | 215 | :attr:`~scrapy.spider.BaseSpider.domain_name` prefixed by "www.". |
| … |
… |
|
| 217 | 217 | :attr:`~scrapy.spider.BaseSpider.extra_domain_names` attribute. |
| 218 | 218 | |
| | 219 | When your spider returns a request for a domain not belonging to those |
| | 220 | covered by the spider, this middleware will log a debug message similar to |
| | 221 | this one:: |
| | 222 | |
| | 223 | DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html> |
| | 224 | |
| | 225 | To avoid filling the log with too much noise, it will only print one of |
| | 226 | these messages for each new domain filtered. So, for example, if another |
| | 227 | request for ``www.othersite.com`` is filtered, no log message will be |
| | 228 | printed. But if a request for ``someothersite.com`` is filtered, a message |
| | 229 | will be printed (but only for the first request filtered). |
| | 230 | |
| 219 | 231 | RefererMiddleware |
| 220 | 232 | ----------------- |
-
|
r1822
|
r1841
|
|
| 11 | 11 | from scrapy.http import Request |
| 12 | 12 | from scrapy.utils.httpobj import urlparse_cached |
| | 13 | from scrapy import log |
| 13 | 14 | |
| 14 | 15 | class OffsiteMiddleware(object): |
| … |
… |
|
| 16 | 17 | def __init__(self): |
| 17 | 18 | self.host_regexes = {} |
| | 19 | self.domains_seen = {} |
| 18 | 20 | dispatcher.connect(self.spider_opened, signal=signals.spider_opened) |
| 19 | 21 | dispatcher.connect(self.spider_closed, signal=signals.spider_closed) |
| 20 | 22 | |
| 21 | 23 | def process_spider_output(self, response, result, spider): |
| 22 | | return (x for x in result if not isinstance(x, Request) or \ |
| 23 | | self.should_follow(x, spider)) |
| | 24 | for x in result: |
| | 25 | if isinstance(x, Request): |
| | 26 | if self.should_follow(x, spider): |
| | 27 | yield x |
| | 28 | else: |
| | 29 | domain = urlparse_cached(x).hostname |
| | 30 | if domain and domain not in self.domains_seen[spider]: |
| | 31 | log.msg("Filtered offsite request to %r: %s" % (domain, x), |
| | 32 | level=log.DEBUG, spider=spider) |
| | 33 | self.domains_seen[spider].add(domain) |
| | 34 | else: |
| | 35 | yield x |
| 24 | 36 | |
| 25 | 37 | def should_follow(self, request, spider): |
| … |
… |
|
| 38 | 50 | domains = [spider.domain_name] + spider.extra_domain_names |
| 39 | 51 | self.host_regexes[spider] = self.get_host_regex(domains) |
| | 52 | self.domains_seen[spider] = set() |
| 40 | 53 | |
| 41 | 54 | def spider_closed(self, spider): |
| 42 | 55 | del self.host_regexes[spider] |
| | 56 | del self.domains_seen[spider] |
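
The once-per-domain logging added to ``process_spider_output`` in this changeset can be tried in isolation with the sketch below. It is not the Scrapy code: ``OffsiteLogFilter``, its ``allowed_hosts`` argument, and the ``offsite_sketch`` logger name are hypothetical stand-ins, plain URL strings replace ``Request`` objects, a single set replaces the per-spider ``domains_seen`` dictionary, and the host check is a simple exact match rather than the middleware's host regex.

```python
import logging
from urllib.parse import urlparse

# Hypothetical logger name, used only by this sketch.
logger = logging.getLogger("offsite_sketch")


class OffsiteLogFilter:
    """Stand-in for the middleware: drops offsite URLs, logs each new host once."""

    def __init__(self, allowed_hosts):
        # Assumption: exact host matching instead of the middleware's host regex.
        self.allowed_hosts = set(allowed_hosts)
        # Hosts that have already produced a "Filtered offsite request" message.
        self.domains_seen = set()

    def should_follow(self, url):
        return urlparse(url).hostname in self.allowed_hosts

    def filter(self, urls):
        for url in urls:
            if self.should_follow(url):
                yield url
                continue
            domain = urlparse(url).hostname
            # Log only the first filtered request for each host.
            if domain and domain not in self.domains_seen:
                logger.debug("Filtered offsite request to %r: <GET %s>", domain, url)
                self.domains_seen.add(domain)
```

Feeding it two URLs for ``www.othersite.com`` and one for ``someothersite.com`` should produce two debug messages, one per filtered host, which matches the behaviour the updated middleware documentation describes.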