Changeset 1841:59d784dfbf9a

Timestamp: 11/12/09 10:17:21 (9 months ago)
Author: Pablo Hoffman <pablo@…>
Branch: default
Message: made offsite middleware log messages when filtering out requests

Files: 3 modified

  • docs/faq.rst

    --- docs/faq.rst (r1785)
    +++ docs/faq.rst (r1841)
    @@ -119,2 +119,14 @@
         scrapy-ctl.py runspider my_spider.py
     
    +I get "Filtered offsite request" messages. How can I fix them?
    +--------------------------------------------------------------
    +
    +Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
    +problem, so you may not need to fix them.
    +
    +Those messages are logged by the Offsite Spider Middleware, which is a spider
    +middleware (enabled by default) whose purpose is to filter out requests to
    +domains outside the ones covered by the spider.
    +
    +For more info see:
    +:class:`~scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware`.
  • docs/topics/spider-middleware.rst

    --- docs/topics/spider-middleware.rst (r1690)
    +++ docs/topics/spider-middleware.rst (r1841)
    @@ -211,5 +211,5 @@
        Filters out Requests for URLs outside the domains covered by the spider.
     
    -   This middleware filters out every request whose host names doesn't match
    +   This middleware filters out every request whose host names don't match
        :attr:`~scrapy.spider.BaseSpider.domain_name`, or the spider
        :attr:`~scrapy.spider.BaseSpider.domain_name` prefixed by "www.".
    @@ -217,4 +217,16 @@
        :attr:`~scrapy.spider.BaseSpider.extra_domain_names` attribute.
     
    +   When your spider returns a request for a domain not belonging to those
    +   covered by the spider, this middleware will log a debug message similar to
    +   this one::
    +
    +      DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
    +
    +   To avoid filling the log with too much noise, it will only print one of
    +   these messages for each new domain filtered. So, for example, if another
    +   request for ``www.othersite.com`` is filtered, no log message will be
    +   printed. But if a request for ``someothersite.com`` is filtered, a message
    +   will be printed (but only for the first request filtered).
    +
     RefererMiddleware
     -----------------
  • scrapy/contrib/spidermiddleware/offsite.py

    --- scrapy/contrib/spidermiddleware/offsite.py (r1822)
    +++ scrapy/contrib/spidermiddleware/offsite.py (r1841)
    @@ -11,4 +11,5 @@
     from scrapy.http import Request
     from scrapy.utils.httpobj import urlparse_cached
    +from scrapy import log
     
     class OffsiteMiddleware(object):
    @@ -16,10 +17,21 @@
         def __init__(self):
             self.host_regexes = {}
    +        self.domains_seen = {}
             dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
             dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
     
         def process_spider_output(self, response, result, spider):
    -        return (x for x in result if not isinstance(x, Request) or \
    -            self.should_follow(x, spider))
    +        for x in result:
    +            if isinstance(x, Request):
    +                if self.should_follow(x, spider):
    +                    yield x
    +                else:
    +                    domain = urlparse_cached(x).hostname
    +                    if domain and domain not in self.domains_seen[spider]:
    +                        log.msg("Filtered offsite request to %r: %s" % (domain, x),
    +                            level=log.DEBUG, spider=spider)
    +                        self.domains_seen[spider].add(domain)
    +            else:
    +                yield x
     
         def should_follow(self, request, spider):
    @@ -38,5 +50,7 @@
             domains = [spider.domain_name] + spider.extra_domain_names
             self.host_regexes[spider] = self.get_host_regex(domains)
    +        self.domains_seen[spider] = set()
     
         def spider_closed(self, spider):
             del self.host_regexes[spider]
    +        del self.domains_seen[spider]
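
The once-per-domain logging introduced in process_spider_output can be tried outside Scrapy with a rough standalone sketch. Everything in it is an assumption made for illustration: the class name OffsiteLogSketch, the allowed_hosts set (a simplified stand-in for should_follow and its host regex), the messages list (a stand-in for log.msg), and the use of plain URL strings instead of Request objects. It only mirrors the control flow of the changeset and is not the middleware itself.

```python
from urllib.parse import urlparse


class OffsiteLogSketch:
    """Hypothetical stand-in that mimics the once-per-domain log behaviour."""

    def __init__(self, allowed_hosts):
        # Assumption: exact host matching replaces should_follow() and its regex.
        self.allowed_hosts = set(allowed_hosts)
        # Hosts for which a "Filtered offsite request" message was already recorded.
        self.domains_seen = set()
        # Assumption: a plain list replaces log.msg(..., level=log.DEBUG).
        self.messages = []

    def filter_urls(self, urls):
        for url in urls:
            host = urlparse(url).hostname
            if host in self.allowed_hosts:
                yield url
            elif host and host not in self.domains_seen:
                # Record a message for the first filtered URL of each new host only.
                self.messages.append(
                    "Filtered offsite request to %r: <GET %s>" % (host, url)
                )
                self.domains_seen.add(host)
            # Later offsite URLs for an already seen host are dropped silently.


sketch = OffsiteLogSketch(["example.com", "www.example.com"])
kept = list(sketch.filter_urls([
    "http://example.com/a",
    "http://www.othersite.com/some/page.html",
    "http://www.othersite.com/other.html",
    "http://someothersite.com/x",
    "http://www.example.com/b",
]))
```

Here two of the five URLs are kept and two messages are recorded, one for www.othersite.com and one for someothersite.com; the second www.othersite.com URL is filtered without a message. As in the changeset, the generator only does its work while it is being consumed, which is why the sketch wraps the call in list().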