Changeset 1849:1d0ac164cf62
- Timestamp: 11/14/09 20:28:59 (9 months ago)
- Branch: default
- Files: 18 modified
  - docs/topics/stats.rst (modified) (7 diffs)
  - scrapy/contrib/corestats.py (modified) (2 diffs)
  - scrapy/contrib/downloadermiddleware/stats.py (modified) (3 diffs)
  - scrapy/contrib/itemsampler.py (modified) (4 diffs)
  - scrapy/contrib/pipeline/images.py (modified) (3 diffs)
  - scrapy/contrib/spidermiddleware/depth.py (modified) (3 diffs)
  - scrapy/contrib/statsmailer.py (modified) (2 diffs)
  - scrapy/contrib/webconsole/stats.py (modified) (1 diff)
  - scrapy/contrib_exp/spiderprofiler.py (modified) (1 diff)
  - scrapy/core/engine.py (modified) (2 diffs)
  - scrapy/core/scraper.py (modified) (6 diffs)
  - scrapy/stats/collector/__init__.py (modified) (4 diffs)
  - scrapy/stats/collector/mysql.py (modified) (2 diffs)
  - scrapy/stats/collector/simpledb.py (modified) (1 diff)
  - scrapy/stats/signals.py (modified) (1 diff)
  - scrapy/tests/test_downloadermiddleware_stats.py (modified) (3 diffs)
  - scrapy/tests/test_spidermiddleware_depth.py (modified) (4 diffs)
  - scrapy/tests/test_stats.py (modified) (3 diffs)
docs/topics/stats.rst
r1822 r1849

 Scrapy provides a convenient service for collecting stats in the form of
-key/values, both globally and per spider/domain. It's called the Stats
-Collector, and it's a singleton which can be imported and used quickly, as
-illustrated by the examples in the :ref:`topics-stats-usecases` section below.
+key/values, both globally and per spider. It's called the Stats Collector, and
+it's a singleton which can be imported and used quickly, as illustrated by the
+examples in the :ref:`topics-stats-usecases` section below.

 The stats collection is enabled by default but can be disabled through the
…
 enabled) and extremely efficient (almost unnoticeable) when disabled.

-The Stats Collector keeps one stats table per open spider/domain and one global
-stats table. You can't set or get stats from a closed domain, but the
-domain-specific stats table is automatically opened when the spider is opened,
-and closed when the spider is closed.
+The Stats Collector keeps one stats table per open spider and one global stats
+table. You can't set or get stats from a closed spider, but the spider-specific
+stats table is automatically opened when the spider is opened, and closed when
+the spider is closed.

 .. _topics-stats-usecases:
…
     8

-Get all global stats from a given domain::
+Get all global stats (ie. not particular to any spider)::

     >>> stats.get_stats()
     {'hostname': 'localhost', 'spiders_crawled': 8}

-Set domain/spider specific stat value (domains must be opened first, but this
+Set spider specific stat value (spider stats must be opened first, but this
 task is handled automatically by the Scrapy engine)::

-    stats.set_value('start_time', datetime.now(), domain='example.com')
-
-Increment domain-specific stat value::
-
-    stats.inc_value('pages_crawled', domain='example.com')
-
-Set domain-specific stat value only if greater than previous::
-
-    stats.max_value('max_items_scraped', value, domain='example.com')
-
-Set domain-specific stat value only if lower than previous::
-
-    stats.min_value('min_free_memory_percent', value, domain='example.com')
-
-Get domain-specific stat value::
-
-    >>> stats.get_value('pages_crawled', domain='example.com')
+    stats.set_value('start_time', datetime.now(), spider=some_spider)
+
+Where ``some_spider`` is a :class:`~scrapy.spider.BaseSpider` object.
+
+Increment spider-specific stat value::
+
+    stats.inc_value('pages_crawled', spider=some_spider)
+
+Set spider-specific stat value only if greater than previous::
+
+    stats.max_value('max_items_scraped', value, spider=some_spider)
+
+Set spider-specific stat value only if lower than previous::
+
+    stats.min_value('min_free_memory_percent', value, spider=some_spider)
+
+Get spider-specific stat value::
+
+    >>> stats.get_value('pages_crawled', spider=some_spider)
     1238

-Get all stats from a given domain::
-
-    >>> stats.get_stats('pages_crawled', domain='example.com')
+Get all stats from a given spider::
+
+    >>> stats.get_stats('pages_crawled', spider=some_spider)
     {'pages_crawled': 1238, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

…
 .. class:: StatsCollector

-    .. method:: get_value(key, default=None, domain=None)
+    .. method:: get_value(key, default=None, spider=None)

         Return the value for the given stats key or default if it doesn't exist.
-        If domain is ``None`` the global stats table is consulted, other the
-        domain specific one is. If the domain is not yet opened a ``KeyError``
+        If spider is ``None`` the global stats table is consulted, otherwise the
+        spider specific one is. If the spider is not yet opened a ``KeyError``
         exception is raised.

-    .. method:: get_stats(domain=None)
-
-        Get all stats from the given domain/spider (if domain is given) or all
-        global stats otherwise, as a dict. If domain is not opened ``KeyError``
-        is raied.
-
-    .. method:: set_value(key, value, domain=None)
+    .. method:: get_stats(spider=None)
+
+        Get all stats from the given spider (if spider is given) or all global
+        stats otherwise, as a dict. If spider is not opened ``KeyError`` is
+        raied.
+
+    .. method:: set_value(key, value, spider=None)

         Set the given value for the given stats key on the global stats (if
-        domain is not given) or the domain-specific stats (if domain is given),
+        spider is not given) or the spider-specific stats (if spider is given),
         which must be opened or a ``KeyError`` will be raised.

-    .. method:: set_stats(stats, domain=None)
-
-        Set the given stats (as a dict) for the given domain. If the domain is
+    .. method:: set_stats(stats, spider=None)
+
+        Set the given stats (as a dict) for the given spider. If the spider is
         not opened a ``KeyError`` will be raised.

-    .. method:: inc_value(key, count=1, start=0, domain=None)
+    .. method:: inc_value(key, count=1, start=0, spider=None)

         Increment the value of the given stats key, by the given count,
-        assuming the start value given (when it's not set). If domain is not
-        given the global stats table is used, otherwise the domain-specific
+        assuming the start value given (when it's not set). If spider is not
+        given the global stats table is used, otherwise the spider-specific
         stats table is used, which must be opened or a ``KeyError`` will be
         raised.

-    .. method:: max_value(key, value, domain=None)
+    .. method:: max_value(key, value, spider=None)

         Set the given value for the given key only if current value for the
         same key is lower than value. If there is no current value for the
-        given key, the value is always set. If domain is not given the global
-        stats table is used, otherwise the domain-specific stats table is used,
+        given key, the value is always set. If spider is not given the global
+        stats table is used, otherwise the spider-specific stats table is used,
         which must be opened or a KeyError will be raised.

-    .. method:: min_value(key, value, domain=None)
+    .. method:: min_value(key, value, spider=None)

         Set the given value for the given key only if current value for the
         same key is greater than value. If there is no current value for the
-        given key, the value is always set. If domain is not given the global
-        stats table is used, otherwise the domain-specific stats table is used,
+        given key, the value is always set. If spider is not given the global
+        stats table is used, otherwise the spider-specific stats table is used,
         which must be opened or a KeyError will be raised.

-    .. method:: clear_stats(domain=None)
-
-        Clear all global stats (if domain is not given) or all domain-specific
-        stats if domain is given, in which case it must be opened or a
+    .. method:: clear_stats(spider=None)
+
+        Clear all global stats (if spider is not given) or all spider-specific
+        stats if spider is given, in which case it must be opened or a
         ``KeyError`` will be raised.

-    .. method:: list_domains()
-
-        Return a list of all opened domains.
-
-    .. method:: open_domain(domain)
-
-        Open the given domain for stats collection. This method must be called
-        prior to working with any stats specific to that domain, but this task
+    .. method:: iter_spider_stats()
+
+        Return a iterator over ``(spider, spider_stats)`` for each open spider
+        currently tracked by the stats collector, where ``spider_stats`` is the
+        dict containing all spider-specific stats.
+
+        Global stats are not included in the iterator. If you want to get
+        those, use :meth:`get_stats` method.
+
+    .. method:: open_spider(spider)
+
+        Open the given spider for stats collection. This method must be called
+        prior to working with any stats specific to that spider, but this task
         is handled automatically by the Scrapy engine.

-    .. method:: close_domain(domain)
-
-        Close the given domain. After this is called, no more specific stats
-        for this domain can be accessed. This method is called automatically on
+    .. method:: close_spider(spider)
+
+        Close the given spider. After this is called, no more specific stats
+        for this spider can be accessed. This method is called automatically on
         the :signal:`spider_closed` signal.

…

     A simple stats collector that keeps the stats of the last scraping run (for
-    each domain) in memory, which can be accessed through the ``domain_stats``
-    attribute
+    each spider) in memory, after they're closed. The stats can be accessed
+    through the :attr:`domain_stats` attribute, which is a dict keyed by spider
+    domain name.

     This is the default Stats Collector used in Scrapy.
…
     .. attribute:: domain_stats

-        A dict of dicts (keyed by domain) containing the stats of the last
-        scraping run for each domain.
+        A dict of dicts (keyed by spider domain name) containing the stats of
+        the last scraping run for each domain.

 DummyStatsCollector
…
    :synopsis: Stats Collector signals

-.. signal:: stats_domain_opened
-.. function:: stats_domain_opened(domain)
-
-    Sent right after the stats domain is opened. You can use this signal to add
-    startup stats for domain (example: start time).
-
-    :param domain: the stats domain just opened
-    :type domain: str
-
-.. signal:: stats_domain_closing
-.. function:: stats_domain_closing(domain, reason)
-
-    Sent just before the stats domain is closed. You can use this signal to add
+.. signal:: stats_spider_opened
+.. function:: stats_spider_opened(spider)
+
+    Sent right after the stats spider is opened. You can use this signal to add
+    startup stats for spider (example: start time).
+
+    :param spider: the stats spider just opened
+    :type spider: str
+
+.. signal:: stats_spider_closing
+.. function:: stats_spider_closing(spider, reason)
+
+    Sent just before the stats spider is closed. You can use this signal to add
     some closing stats (example: finish time).

-    :param domain: the stats domain about to be closed
-    :type domain: str
-
-    :param reason: the reason why the domain is being closed. See
+    :param spider: the stats spider about to be closed
+    :type spider: str
+
+    :param reason: the reason why the spider is being closed. See
     :signal:`spider_closed` signal for more info.
     :type reason: str

-.. signal:: stats_domain_closed
-.. function:: stats_domain_closed(domain, reason, domain_stats)
-
-    Sent right after the stats domain is closed. You can use this signal to
-    collect resources, but not to add any more stats as the stats domain has
-    already been close (use :signal:`stats_domain_closing` for that instead).
-
-    :param domain: the stats domain just closed
-    :type domain: str
-
-    :param reason: the reason why the domain was closed. See
+.. signal:: stats_spider_closed
+.. function:: stats_spider_closed(spider, reason, spider_stats)
+
+    Sent right after the stats spider is closed. You can use this signal to
+    collect resources, but not to add any more stats as the stats spider has
+    already been close (use :signal:`stats_spider_closing` for that instead).
+
+    :param spider: the stats spider just closed
+    :type spider: str
+
+    :param reason: the reason why the spider was closed. See
     :signal:`spider_closed` signal for more info.
     :type reason: str

-    :param domain_stats: the stats of the domain just closed.
+    :param spider_stats: the stats of the spider just closed.
     :type reason: dict
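The spider-keyed API documented in this diff can be tried without Scrapy by using a small stand-in. The sketch below is illustrative only: `SketchStatsCollector` and `SketchSpider` are hypothetical names, not Scrapy classes, and the code targets a current Python 3 interpreter rather than the Python 2 of this changeset. It assumes, as the documentation states, that `spider=None` selects the global table and that touching the stats of a spider that is not open raises `KeyError`.

```python
class SketchSpider:
    """Hypothetical stand-in for a spider object; only used as a dict key."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


class SketchStatsCollector:
    """Hypothetical stand-in for the documented API, not Scrapy's own class."""

    def __init__(self):
        # The None key holds the global stats table.
        self._stats = {None: {}}

    def open_spider(self, spider):
        self._stats[spider] = {}

    def close_spider(self, spider):
        # After this, the spider key is gone and lookups raise KeyError.
        return self._stats.pop(spider)

    def get_value(self, key, default=None, spider=None):
        return self._stats[spider].get(key, default)

    def inc_value(self, key, count=1, start=0, spider=None):
        table = self._stats[spider]
        table[key] = table.setdefault(key, start) + count

    def max_value(self, key, value, spider=None):
        table = self._stats[spider]
        table[key] = max(table.setdefault(key, value), value)


stats = SketchStatsCollector()
some_spider = SketchSpider("example.com")
stats.open_spider(some_spider)

stats.inc_value("pages_crawled", spider=some_spider)
stats.inc_value("pages_crawled", spider=some_spider)
stats.inc_value("pages_crawled")  # no spider given: global table
stats.max_value("max_items_scraped", 5, spider=some_spider)
stats.max_value("max_items_scraped", 3, spider=some_spider)  # lower, ignored

pages = stats.get_value("pages_crawled", spider=some_spider)
closed_stats = stats.close_spider(some_spider)

try:
    stats.get_value("pages_crawled", spider=some_spider)
    raised = False
except KeyError:
    raised = True
```

Keying the tables by the spider object relies on the object being hashable, which plain Python objects are by identity; the global table keeps working after the spider is closed.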
scrapy/contrib/corestats.py
r1645 r1849
 from scrapy.core import signals
 from scrapy.stats import stats
-from scrapy.stats.signals import stats_domain_opened, stats_domain_closing
+from scrapy.stats.signals import stats_spider_opened, stats_spider_closing
 from scrapy.conf import settings

…
         stats.set_value('envinfo/pid', os.getpid())

-        dispatcher.connect(self.stats_domain_opened, signal=stats_domain_opened)
-        dispatcher.connect(self.stats_domain_closing, signal=stats_domain_closing)
+        dispatcher.connect(self.stats_spider_opened, signal=stats_spider_opened)
+        dispatcher.connect(self.stats_spider_closing, signal=stats_spider_closing)
         dispatcher.connect(self.item_scraped, signal=signals.item_scraped)
         dispatcher.connect(self.item_passed, signal=signals.item_passed)
         dispatcher.connect(self.item_dropped, signal=signals.item_dropped)

-    def stats_domain_opened(self, domain):
-        stats.set_value('start_time', datetime.datetime.utcnow(), domain=domain)
-        stats.set_value('envinfo/host', stats.get_value('envinfo/host'), domain=domain)
-        stats.inc_value('domain_count/opened')
+    def stats_spider_opened(self, spider):
+        stats.set_value('start_time', datetime.datetime.utcnow(), spider=spider)
+        stats.set_value('envinfo/host', stats.get_value('envinfo/host'), spider=spider)
+        stats.inc_value('spider_count/opened')

-    def stats_domain_closing(self, domain, reason):
-        stats.set_value('finish_time', datetime.datetime.utcnow(), domain=domain)
-        stats.set_value('finish_status', 'OK' if reason == 'finished' else reason, domain=domain)
-        stats.inc_value('domain_count/%s' % reason, domain=domain)
+    def stats_spider_closing(self, spider, reason):
+        stats.set_value('finish_time', datetime.datetime.utcnow(), spider=spider)
+        stats.set_value('finish_status', 'OK' if reason == 'finished' else reason, spider=spider)
+        stats.inc_value('spider_count/%s' % reason, spider=spider)

     def item_scraped(self, item, spider):
-        stats.inc_value('item_scraped_count', domain=spider.domain_name)
+        stats.inc_value('item_scraped_count', spider=spider)
         stats.inc_value('item_scraped_count')

     def item_passed(self, item, spider):
-        stats.inc_value('item_passed_count', domain=spider.domain_name)
+        stats.inc_value('item_passed_count', spider=spider)
         stats.inc_value('item_passed_count')

     def item_dropped(self, item, spider, exception):
         reason = exception.__class__.__name__
-        stats.inc_value('item_dropped_count', domain=spider.domain_name)
-        stats.inc_value('item_dropped_reasons_count/%s' % reason, domain=spider.domain_name)
+        stats.inc_value('item_dropped_count', spider=spider)
+        stats.inc_value('item_dropped_reasons_count/%s' % reason, spider=spider)
         stats.inc_value('item_dropped_count')
scrapy/contrib/downloadermiddleware/stats.py
r1345 r1849

 class DownloaderStats(object):
-    """DownloaderStats store stats of all requests, responses and
-    exceptions that pass through it.
-
-    To use this middleware you must enable the DOWNLOADER_STATS setting.
-    """

     def __init__(self):
…

     def process_request(self, request, spider):
-        domain = spider.domain_name
         stats.inc_value('downloader/request_count')
-        stats.inc_value('downloader/request_count', domain=domain)
-        stats.inc_value('downloader/request_method_count/%s' % request.method, domain=domain)
+        stats.inc_value('downloader/request_count', spider=spider)
+        stats.inc_value('downloader/request_method_count/%s' % request.method, spider=spider)
         reqlen = len(request_httprepr(request))
-        stats.inc_value('downloader/request_bytes', reqlen, domain=domain)
+        stats.inc_value('downloader/request_bytes', reqlen, spider=spider)
         stats.inc_value('downloader/request_bytes', reqlen)

     def process_response(self, request, response, spider):
-        domain = spider.domain_name
         stats.inc_value('downloader/response_count')
-        stats.inc_value('downloader/response_count', domain=domain)
-        stats.inc_value('downloader/response_status_count/%s' % response.status, domain=domain)
+        stats.inc_value('downloader/response_count', spider=spider)
+        stats.inc_value('downloader/response_status_count/%s' % response.status, spider=spider)
         reslen = len(response_httprepr(response))
-        stats.inc_value('downloader/response_bytes', reslen, domain=domain)
+        stats.inc_value('downloader/response_bytes', reslen, spider=spider)
         stats.inc_value('downloader/response_bytes', reslen)
         return response
…
         ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
         stats.inc_value('downloader/exception_count')
-        stats.inc_value('downloader/exception_count', domain=spider.domain_name)
-        stats.inc_value('downloader/exception_type_count/%s' % ex_class, domain=spider.domain_name)
+        stats.inc_value('downloader/exception_count', spider=spider)
+        stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
scrapy/contrib/itemsampler.py
r1828 r1849

     def process_item(self, spider, item):
-        sampled = stats.get_value("items_sampled", 0, domain=spider.domain_name)
+        sampled = stats.get_value("items_sampled", 0, spider=spider)
         if sampled < items_per_spider:
             self.items[item.guid] = item
             sampled += 1
-            stats.set_value("items_sampled", sampled, domain=spider.domain_name)
+            stats.set_value("items_sampled", sampled, spider=spider)
             log.msg("Sampled %s" % item, spider=spider, level=log.INFO)
             if close_spider and sampled == items_per_spider:
…

     def spider_closed(self, spider, reason):
-        if reason == 'finished' and not stats.get_value("items_sampled", domain=spider.domain_name):
+        if reason == 'finished' and not stats.get_value("items_sampled", spider=spider):
             self.empty_domains.add(spider.domain_name)
         self.spiders_count += 1
…

     def process_spider_input(self, response, spider):
-        if stats.get_value("items_sampled", domain=spider.domain_name) >= items_per_spider:
+        if stats.get_value("items_sampled", spider=spider) >= items_per_spider:
             return []
         elif max_response_size and max_response_size > len(response_httprepr(response)):
…
                 items.append(r)

-        if stats.get_value("items_sampled", domain=spider.domain_name) >= items_per_spider:
+        if stats.get_value("items_sampled", spider=spider) >= items_per_spider:
             return []
         else:
scrapy/contrib/pipeline/images.py
r1829 r1849
                 (status, request, referer)
         log.msg(msg, level=log.DEBUG, spider=info.spider)
-        self.inc_stats(info.spider.domain_name, status)
+        self.inc_stats(info.spider, status)

         try:
…
         log.msg('Image (uptodate): Downloaded %s from <%s> referred in <%s>' % \
                 (self.MEDIA_NAME, request.url, referer), level=log.DEBUG, spider=info.spider)
-        self.inc_stats(info.spider.domain_name, 'uptodate')
+        self.inc_stats(info.spider, 'uptodate')

         checksum = result.get('checksum', None)
…
             yield thumb_key, thumb_image, thumb_buf

-    def inc_stats(self, domain, status):
-        stats.inc_value('image_count', domain=domain)
-        stats.inc_value('image_status_count/%s' % status, domain=domain)
+    def inc_stats(self, spider, status):
+        stats.inc_value('image_count', spider=spider)
+        stats.inc_value('image_status_count/%s' % status, spider=spider)

     def convert_image(self, image, size=None):
scrapy/contrib/spidermiddleware/depth.py
r1835 r1849

     def process_spider_output(self, response, result, spider):
-        domain = spider.domain_name
         def _filter(request):
             if isinstance(request, Request):
…
                     return False
                 elif self.stats:
-                    stats.inc_value('request_depth_count/%s' % depth, domain=domain)
-                    if depth > stats.get_value('request_depth_max', 0, domain=domain):
-                        stats.set_value('request_depth_max', depth, domain=domain)
+                    stats.inc_value('request_depth_count/%s' % depth, spider=spider)
+                    if depth > stats.get_value('request_depth_max', 0, spider=spider):
+                        stats.set_value('request_depth_max', depth, spider=spider)
             return True

…
         if self.stats and 'depth' not in response.request.meta:
             response.request.meta['depth'] = 0
-            stats.inc_value('request_depth_count/0', domain=domain)
+            stats.inc_value('request_depth_count/0', spider=spider)

         return (r for r in result or () if _filter(r))
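The depth bookkeeping in this middleware (one counter per depth, plus a running maximum) can be illustrated on a plain dict. This is a hedged sketch, not the middleware itself: `record_depth` is a hypothetical helper, and the dict stands in for the spider-specific stats table.

```python
def record_depth(spider_stats, depth):
    """Count one request at `depth` and track the deepest level seen so far."""
    key = "request_depth_count/%s" % depth
    spider_stats[key] = spider_stats.get(key, 0) + 1
    # Same compare-then-set pattern as in the diff above.
    if depth > spider_stats.get("request_depth_max", 0):
        spider_stats["request_depth_max"] = depth


spider_stats = {}
for depth in (0, 1, 1, 2):
    record_depth(spider_stats, depth)
```

The compare-then-set step has the same effect as the collector's `max_value` method when the stored value starts at 0 or above.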
scrapy/contrib/statsmailer.py
r1364 r1849
 """
-StatsMailer extension sends an email when a domain finishes scraping.
+StatsMailer extension sends an email when a spider finishes scraping.

 Use STATSMAILER_RCPTS setting to enable and give the recipient mail address
…
         if not self.recipients:
             raise NotConfigured
-        dispatcher.connect(self.stats_domain_closed, signal=signals.stats_domain_closed)
+        dispatcher.connect(self.stats_spider_closed, signal=signals.stats_spider_closed)

-    def stats_domain_closed(self, domain, domain_stats):
+    def stats_spider_closed(self, spider, spider_stats):
         mail = MailSender()
         body = "Global stats\n\n"
         body += "\n".join("%-50s : %s" % i for i in stats.get_stats().items())
-        body += "\n\n%s stats\n\n" % domain
-        body += "\n".join("%-50s : %s" % i for i in domain_stats.items())
-        mail.send(self.recipients, "Scrapy stats for: %s" % domain, body)
+        body += "\n\n%s stats\n\n" % spider.domain_name
+        body += "\n".join("%-50s : %s" % i for i in spider_stats.items())
+        mail.send(self.recipients, "Scrapy stats for: %s" % spider.domain_name, body)
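The mail body built by this extension is plain `%`-formatting over two dicts. A minimal sketch of that formatting, detached from `MailSender` and the stats singleton, might look like the following. `format_stats_report` is a hypothetical helper name, and sorting the keys is an addition for stable output that the changeset does not do.

```python
def format_stats_report(global_stats, spider_name, spider_stats):
    """Build a two-section text report in the style of the StatsMailer body."""
    body = "Global stats\n\n"
    # "%-50s" left-justifies each key in a 50-character column.
    body += "\n".join("%-50s : %s" % item for item in sorted(global_stats.items()))
    body += "\n\n%s stats\n\n" % spider_name
    body += "\n".join("%-50s : %s" % item for item in sorted(spider_stats.items()))
    return body


report = format_stats_report(
    {"item_scraped_count": 3},
    "example.com",
    {"pages_crawled": 12},
)
lines = report.splitlines()
```

Because the handler now receives a spider object, the section title and subject line use `spider.domain_name` rather than the object itself.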
scrapy/contrib/webconsole/stats.py
r1358 r1849
         s += "<h3>Global stats</h3>\n"
         s += stats_html_table(stats.get_stats())
-        for domain in stats.list_domains():
-            s += "<h3>%s</h3>\n" % domain
-            s += stats_html_table(stats.get_stats(domain))
+        for spider, spider_stats in stats.iter_spider_stats():
+            s += "<h3>%s</h3>\n" % spider.domain_name
+            s += stats_html_table(spider_stats)
         s += "</body>\n"
         s += "</html>\n"
scrapy/contrib_exp/spiderprofiler.py
r1641 r1849
             mafter = self._memusage()
             ct = time() - tbefore
-            domain = spider.domain_name
-            tcc = stats.get_value('profiling/total_callback_time', 0, domain=domain)
-            sct = stats.get_value('profiling/slowest_callback_time', 0, domain=domain)
-            stats.set_value('profiling/total_callback_time' % spider.domain_name, \
-                tcc+ct, domain=domain)
+            tcc = stats.get_value('profiling/total_callback_time', 0, spider=spider)
+            sct = stats.get_value('profiling/slowest_callback_time', 0, spider=spider)
+            stats.set_value('profiling/total_callback_time', tcc+ct, spider=spider)
             if ct > sct:
-                stats.set_value('profiling/slowest_callback_time', ct, domain=domain)
+                stats.set_value('profiling/slowest_callback_time', ct, spider=spider)
                 stats.set_value('profiling/slowest_callback_name', function.__name__, \
-                    domain=domain)
+                    spider=spider)
                 stats.set_value('profiling/slowest_callback_url', args[0].url, \
-                    domain=domain)
+                    spider=spider)
             if self._memusage:
                 stats.inc_value('profiling/total_mem_allocated_in_callbacks', \
-                    count=mafter-mbefore, domain=domain)
+                    count=mafter-mbefore, spider=spider)
             return r
         return new_callback
scrapy/core/engine.py
r1822 r1849
         self.downloader.open_spider(spider)
         self.scraper.open_spider(spider)
-        stats.open_domain(spider.domain_name)
+        stats.open_spider(spider)

         send_catch_log(signals.spider_opened, sender=self.__class__, spider=spider)
…
         send_catch_log(signal=signals.spider_closed, sender=self.__class__, \
             spider=spider, reason=reason)
-        stats.close_domain(spider.domain_name, reason=reason)
+        stats.close_spider(spider, reason=reason)
         dfd = defer.maybeDeferred(spiders.close_spider, spider)
         dfd.addBoth(log.msg, "Spider closed (%s)" % reason, spider=spider)
scrapy/core/scraper.py
r1835 r1849
         site = self.sites[spider]
         dfd = site.add_response_request(response, request)
-        # FIXME: this can't be called here because the stats domain may be
+        # FIXME: this can't be called here because the stats spider may be
         # already closed
         #stats.max_value('scraper/max_active_size', site.active_size, \
-        #    domain=spider.domain_name)
+        #    spider=spider)
         def finish_scraping(_):
             site.finish_response(response)
…
         dfd.addBoth(finish_scraping)
         dfd.addErrback(log.err, 'Scraper bug processing %s' % request, \
-            domain=spider.domain_name)
+            spider=spider)
         self._scrape_next(spider, site)
         return dfd
…
             log.msg(msg, log.ERROR, spider=spider)
             stats.inc_value("spider_exceptions/%s" % _failure.value.__class__.__name__, \
-                domain=spider.domain_name)
+                spider=spider)

     def handle_spider_output(self, result, request, response, spider):
…
         """
         # TODO: keep closing state internally instead of checking engine
-        domain = spider.domain_name
         if spider in self.engine.closing:
             return
…
                 item=output, spider=spider, response=response)
             self.sites[spider].itemproc_size += 1
-            # FIXME: this can't be called here because the stats domain may be
+            # FIXME: this can't be called here because the stats spider may be
             # already closed
             #stats.max_value('scraper/max_itemproc_size', \
-            #        self.sites[domain].itemproc_size, domain=domain)
+            #        self.sites[spider].itemproc_size, spider=spider)
             dfd = self.itemproc.process_item(output, spider)
             dfd.addBoth(self._itemproc_finished, output, spider)
…
         """ItemProcessor finished for the given ``item`` and returned ``output``
         """
-        domain = spider.domain_name
         self.sites[spider].itemproc_size -= 1
         if isinstance(output, Failure):
scrapy/stats/collector/__init__.py
r1613 r1849
 from scrapy.xlib.pydispatch import dispatcher

-from scrapy.stats.signals import stats_domain_opened, stats_domain_closing, \
-    stats_domain_closed
+from scrapy.stats.signals import stats_spider_opened, stats_spider_closing, \
+    stats_spider_closed
 from scrapy.utils.signal import send_catch_log
 from scrapy.core import signals
…
         dispatcher.connect(self._engine_stopped, signal=signals.engine_stopped)

-    def get_value(self, key, default=None, domain=None):
-        return self._stats[domain].get(key, default)
+    def get_value(self, key, default=None, spider=None):
+        return self._stats[spider].get(key, default)

-    def get_stats(self, domain=None):
-        return self._stats[domain]
+    def get_stats(self, spider=None):
+        return self._stats[spider]

-    def set_value(self, key, value, domain=None):
-        self._stats[domain][key] = value
+    def set_value(self, key, value, spider=None):
+        self._stats[spider][key] = value

-    def set_stats(self, stats, domain=None):
-        self._stats[domain] = stats
+    def set_stats(self, stats, spider=None):
+        self._stats[spider] = stats

-    def inc_value(self, key, count=1, start=0, domain=None):
-        d = self._stats[domain]
+    def inc_value(self, key, count=1, start=0, spider=None):
+        d = self._stats[spider]
         d[key] = d.setdefault(key, start) + count

-    def max_value(self, key, value, domain=None):
-        d = self._stats[domain]
+    def max_value(self, key, value, spider=None):
+        d = self._stats[spider]
         d[key] = max(d.setdefault(key, value), value)

-    def min_value(self, key, value, domain=None):
-        d = self._stats[domain]
+    def min_value(self, key, value, spider=None):
+        d = self._stats[spider]
         d[key] = min(d.setdefault(key, value), value)

-    def clear_stats(self, domain=None):
-        self._stats[domain].clear()
+    def clear_stats(self, spider=None):
+        self._stats[spider].clear()

-    def list_domains(self):
-        return [d for d in self._stats.keys() if d is not None]
+    def iter_spider_stats(self):
+        return [x for x in self._stats.iteritems() if x[0]]

-    def open_domain(self, domain):
-        self._stats[domain] = {}
-        send_catch_log(stats_domain_opened, domain=domain)
+    def open_spider(self, spider):
+        self._stats[spider] = {}
+        send_catch_log(stats_spider_opened, spider=spider)

-    def close_domain(self, domain, reason):
-        send_catch_log(stats_domain_closing, domain=domain, reason=reason)
-        stats = self._stats.pop(domain)
-        send_catch_log(stats_domain_closed, domain=domain, reason=reason, \
-            domain_stats=stats)
+    def close_spider(self, spider, reason):
+        send_catch_log(stats_spider_closing, spider=spider, reason=reason)
+        stats = self._stats.pop(spider)
+        send_catch_log(stats_spider_closed, spider=spider, reason=reason, \
+            spider_stats=stats)
         if self._dump:
-            log.msg("Dumping domain stats:\n" + pprint.pformat(stats), \
-                domain=domain)
-        self._persist_stats(stats, domain)
+            log.msg("Dumping spider stats:\n" + pprint.pformat(stats), \
+                spider=spider)
+        self._persist_stats(stats, spider)

     def _engine_stopped(self):
…
         if self._dump:
             log.msg("Dumping global stats:\n" + pprint.pformat(stats))
-        self._persist_stats(stats, domain=None)
+        self._persist_stats(stats, spider=None)

-    def _persist_stats(self, stats, domain=None):
+    def _persist_stats(self, stats, spider=None):
         pass

…
         super(MemoryStatsCollector, self).__init__()
         self.domain_stats = {}
-
-    def _persist_stats(self, stats, domain=None):
-        self.domain_stats[domain] = stats
+
+    def _persist_stats(self, stats, spider=None):
+        if spider is not None:
+            self.domain_stats[spider.domain_name] = stats


 class DummyStatsCollector(StatsCollector):

-    def get_value(self, key, default=None, domain=None):
+    def get_value(self, key, default=None, spider=None):
         return default

-    def set_value(self, key, value, domain=None):
+    def set_value(self, key, value, spider=None):
         pass

-    def set_stats(self, stats, domain=None):
+    def set_stats(self, stats, spider=None):
         pass

-    def inc_value(self, key, count=1, start=0, domain=None):
+    def inc_value(self, key, count=1, start=0, spider=None):
         pass

-    def max_value(self, key, value, domain=None):
+    def max_value(self, key, value, spider=None):
         pass

-    def min_value(self, key, value, domain=None):
+    def min_value(self, key, value, spider=None):
         pass

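The close sequence in `close_spider` (closing signal, pop the table, closed signal, persist) is what lets a `stats_spider_closing` handler still write stats while a `stats_spider_closed` handler cannot. The sketch below imitates that ordering under stated assumptions: plain callback lists replace pydispatch signals, `SketchMemoryCollector` and `SketchSpider` are hypothetical names, and `iter_spider_stats` filters with `is not None` on Python 3 instead of the truthiness test and `iteritems()` used in the changeset.

```python
class SketchSpider:
    """Hypothetical spider stand-in carrying only a domain name."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


class SketchMemoryCollector:
    """Hypothetical imitation of the close sequence, not Scrapy's class."""

    def __init__(self):
        self._stats = {None: {}}   # None key: global stats table
        self.domain_stats = {}     # persisted stats, keyed by domain name
        self.on_closing = []       # callbacks taking (spider, reason)
        self.on_closed = []        # callbacks taking (spider, reason, spider_stats)

    def open_spider(self, spider):
        self._stats[spider] = {}

    def set_value(self, key, value, spider=None):
        self._stats[spider][key] = value

    def iter_spider_stats(self):
        # Yield (spider, stats) pairs, skipping the global table.
        return ((s, st) for s, st in self._stats.items() if s is not None)

    def close_spider(self, spider, reason):
        for callback in self.on_closing:
            callback(spider, reason)          # table still open here
        stats = self._stats.pop(spider)       # table closed from now on
        for callback in self.on_closed:
            callback(spider, reason, stats)
        self._persist_stats(stats, spider)

    def _persist_stats(self, stats, spider=None):
        if spider is not None:
            self.domain_stats[spider.domain_name] = stats


collector = SketchMemoryCollector()
spider = SketchSpider("example.com")
collector.open_spider(spider)
collector.on_closing.append(
    lambda s, reason: collector.set_value(
        "finish_status", "OK" if reason == "finished" else reason, spider=s
    )
)

open_before = [s.domain_name for s, _ in collector.iter_spider_stats()]
collector.close_spider(spider, "finished")
open_after = list(collector.iter_spider_stats())
```

After the close, the stats written by the closing callback are found under the spider's domain name in `domain_stats`, which mirrors how `MemoryStatsCollector` keeps its persisted dict keyed by domain name even though live tables are keyed by spider.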
scrapy/stats/collector/mysql.py
r1680 r1849
         self._mysql_conn = mysql_connect(mysqluri, use_unicode=False) if mysqluri else None

-    def _persist_stats(self, stats, domain=None):
-        if domain is None: # only store domain-specific stats
+    def _persist_stats(self, stats, spider=None):
+        if spider is None: # only store spider-specific stats
             return
         if self._mysql_conn is None:
…
         c = self._mysql_conn.cursor()
         c.execute("INSERT INTO %s (domain,stored,data) VALUES (%%s,%%s,%%s)" % table, \
-            (domain, stored, datas))
+            (spider.domain_name, stored, datas))
         self._mysql_conn.commit()
scrapy/stats/collector/simpledb.py
r1624 r1849
         connect_sdb().create_domain(self._sdbdomain)

-    def _persist_stats(self, stats, domain=None):
-        if domain is None: # only store domain-specific stats
+    def _persist_stats(self, stats, spider=None):
+        if spider is None: # only store spider-specific stats
             return
         if not self._sdbdomain:
             return
         if self._async:
-            dfd = threads.deferToThread(self._persist_to_sdb, domain, stats.copy())
+            dfd = threads.deferToThread(self._persist_to_sdb, spider, stats.copy())
             dfd.addErrback(log.err, 'Error uploading stats to SimpleDB', \
-                domain=domain)
+                spider=spider)
         else:
-            self._persist_to_sdb(domain, stats)
+            self._persist_to_sdb(spider, stats)

-    def _persist_to_sdb(self, domain, stats):
-        ts = self._get_timestamp(domain).isoformat()
-        sdb_item_id = "%s_%s" % (domain, ts)
+    def _persist_to_sdb(self, spider, stats):
+        ts = self._get_timestamp(spider).isoformat()
+        sdb_item_id = "%s_%s" % (spider.domain_name, ts)
         sdb_item = dict((k, self._to_sdb_value(v, k)) for k, v in stats.iteritems())
-        sdb_item['domain'] = domain
+        sdb_item['domain'] = spider.domain_name
         sdb_item['timestamp'] = self._to_sdb_value(ts)
         connect_sdb().put_attributes(self._sdbdomain, sdb_item_id, sdb_item)

-    def _get_timestamp(self, domain):
+    def _get_timestamp(self, spider):
         return datetime.utcnow()


-
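In the SimpleDB collector the method signatures now take a spider, but the stored item is still identified by the domain name string. A rough sketch of how such an item could be assembled is shown here. `Spider` and `build_sdb_item` are hypothetical names, `str()` stands in for the collector's `_to_sdb_value` (whose behavior is not shown in this changeset), and nothing is sent to SimpleDB.

```python
from datetime import datetime


class Spider(object):
    """Hypothetical stand-in for a spider; only `domain_name` is assumed."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


def build_sdb_item(spider, stats, now):
    """Return (item_id, item) keyed by the spider's domain name and a timestamp."""
    ts = now.isoformat()
    item_id = "%s_%s" % (spider.domain_name, ts)
    # str() is an assumption standing in for _to_sdb_value
    item = dict((k, str(v)) for k, v in stats.items())
    item['domain'] = spider.domain_name
    item['timestamp'] = ts
    return item_id, item


item_id, item = build_sdb_item(
    Spider('example.com'),
    {'item_scraped_count': 8},
    datetime(2009, 11, 14, 20, 28, 59),
)
```

Deriving the id from `spider.domain_name` rather than from the spider object keeps the stored key a plain string.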
scrapy/stats/signals.py
r1297 r1849
-stats_domain_opened = object()
-stats_domain_closing = object()
-stats_domain_closed = object()
+stats_spider_opened = object()
+stats_spider_closing = object()
+stats_spider_closed = object()

-
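Each signal in this module is a bare `object()` instance, so signals are told apart by identity rather than by name. The sketch below illustrates the pattern with a tiny registry. `connect` and `send` are hypothetical helpers written for this example and are not the pydispatch API used by Scrapy.

```python
from collections import defaultdict

# Unique sentinels: two object() instances never compare equal to each other.
stats_spider_opened = object()
stats_spider_closed = object()

_receivers = defaultdict(list)  # hypothetical registry, not pydispatch


def connect(receiver, signal):
    """Register a callable to be invoked when `signal` is sent."""
    _receivers[signal].append(receiver)


def send(signal, **kwargs):
    """Call every receiver of `signal` with the given keyword arguments."""
    return [receiver(**kwargs) for receiver in _receivers[signal]]


seen = []
connect(lambda spider: seen.append(('opened', spider)), stats_spider_opened)
connect(
    lambda spider, reason, spider_stats: seen.append(
        ('closed', spider, reason, spider_stats)),
    stats_spider_closed,
)

send(stats_spider_opened, spider='example-spider')
send(stats_spider_closed, spider='example-spider', reason='testing',
     spider_stats={'test': 1})
```

The keyword names used here (`spider`, `reason`, `spider_stats`) follow the arguments passed to `send_catch_log` in the collector diff.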
scrapy/tests/test_downloadermiddleware_stats.py
r1684 r1849
 from unittest import TestCase

-from scrapy.conf import settings
 from scrapy.contrib.downloadermiddleware.stats import DownloaderStats
 from scrapy.http import Request, Response
…

     def setUp(self):
-        self.spider = BaseSpider()
-        self.spider.domain_name = 'scrapytest.org'
+        self.spider = BaseSpider('scrapytest.org')
         self.mw = DownloaderStats()

-        stats.open_domain(self.spider.domain_name)
+        stats.open_spider(self.spider)

         self.req = Request('scrapytest.org')
…
         self.mw.process_request(self.req, self.spider)
         self.assertEqual(stats.get_value('downloader/request_count', \
-            domain=self.spider.domain_name), 1)
+            spider=self.spider), 1)

     def test_process_response(self):
         self.mw.process_response(self.req, self.res, self.spider)
         self.assertEqual(stats.get_value('downloader/response_count', \
-            domain=self.spider.domain_name), 1)
+            spider=self.spider), 1)

     def test_process_exception(self):
         self.mw.process_exception(self.req, Exception(), self.spider)
         self.assertEqual(stats.get_value('downloader/exception_count', \
-            domain=self.spider.domain_name), 1)
+            spider=self.spider), 1)

     def tearDown(self):
-        stats.close_domain(self.spider.domain_name, '')
+        stats.close_spider(self.spider, '')


-
scrapy/tests/test_spidermiddleware_depth.py
r1685 r1849
         settings.overrides['DEPTH_STATS'] = True

-        self.spider = BaseSpider()
-        self.spider.domain_name = 'scrapytest.org'
+        self.spider = BaseSpider('scrapytest.org')

-        stats.open_domain(self.spider.domain_name)
+        stats.open_spider(self.spider)

         self.mw = DepthMiddleware()
…
         self.assertEquals(out, result)

-        rdc = stats.get_value('request_depth_count/1',
-            domain=self.spider.domain_name)
+        rdc = stats.get_value('request_depth_count/1', spider=self.spider)
         self.assertEquals(rdc, 1)

…
         self.assertEquals(out2, [])

-        rdm = stats.get_value('request_depth_max',
-            domain=self.spider.domain_name)
+        rdm = stats.get_value('request_depth_max', spider=self.spider)
         self.assertEquals(rdm, 1)

…
         settings.disabled = True

-        stats.close_domain(self.spider.domain_name, '')
+        stats.close_spider(self.spider, '')


-
scrapy/tests/test_stats.py
r1821 r1849
 import unittest

+from scrapy.spider import BaseSpider
 from scrapy.xlib.pydispatch import dispatcher
 from scrapy.stats.collector import StatsCollector, DummyStatsCollector
-from scrapy.stats.signals import stats_domain_opened, stats_domain_closing, \
-    stats_domain_closed
+from scrapy.stats.signals import stats_spider_opened, stats_spider_closing, \
+    stats_spider_closed

 class StatsCollectorTest(unittest.TestCase):
+
+    def setUp(self):
+        self.spider = BaseSpider()

     def test_collector(self):
…
         stats.max_value('v2', 100)
         stats.min_value('v3', 100)
-        stats.open_domain('a')
-        stats.set_value('test', 'value', domain='a')
+        stats.open_spider('a')
+        stats.set_value('test', 'value', spider=self.spider)
         self.assertEqual(stats.get_stats(), {})
         self.assertEqual(stats.get_stats('a'), {})
…
         signals_catched = set()

-        def domain_opened(domain):
-            assert domain == 'example.com'
-            signals_catched.add(stats_domain_opened)
+        def spider_opened(spider):
+            assert spider is self.spider
+            signals_catched.add(stats_spider_opened)

-        def domain_closing(domain, reason):
-            assert domain == 'example.com'
+        def spider_closing(spider, reason):
+            assert spider is self.spider
             assert reason == 'testing'
-            signals_catched.add(stats_domain_closing)
+            signals_catched.add(stats_spider_closing)

-        def domain_closed(domain, reason, domain_stats):
-            assert domain == 'example.com'
+        def spider_closed(spider, reason, spider_stats):
+            assert spider is self.spider
             assert reason == 'testing'
-            assert domain_stats == {'test': 1}
-            signals_catched.add(stats_domain_closed)
+            assert spider_stats == {'test': 1}
+            signals_catched.add(stats_spider_closed)

-        dispatcher.connect(domain_opened, signal=stats_domain_opened)
-        dispatcher.connect(domain_closing, signal=stats_domain_closing)
-        dispatcher.connect(domain_closed, signal=stats_domain_closed)
+        dispatcher.connect(spider_opened, signal=stats_spider_opened)
+        dispatcher.connect(spider_closing, signal=stats_spider_closing)
+        dispatcher.connect(spider_closed, signal=stats_spider_closed)

         stats = StatsCollector()
-        stats.open_domain('example.com')
-        stats.set_value('test', 1, domain='example.com')
-        stats.close_domain('example.com', 'testing')
-        assert stats_domain_opened in signals_catched
-        assert stats_domain_closing in signals_catched
-        assert stats_domain_closed in signals_catched
+        stats.open_spider(self.spider)
+        stats.set_value('test', 1, spider=self.spider)
+        self.assertEqual([(self.spider, {'test': 1})], list(stats.iter_spider_stats()))
+        stats.close_spider(self.spider, 'testing')
+        assert stats_spider_opened in signals_catched
+        assert stats_spider_closing in signals_catched
+        assert stats_spider_closed in signals_catched

-        dispatcher.disconnect(domain_opened, signal=stats_domain_opened)
-        dispatcher.disconnect(domain_closing, signal=stats_domain_closing)
-        dispatcher.disconnect(domain_closed, signal=stats_domain_closed)
+        dispatcher.disconnect(spider_opened, signal=stats_spider_opened)
+        dispatcher.disconnect(spider_closing, signal=stats_spider_closing)
+        dispatcher.disconnect(spider_closed, signal=stats_spider_closed)

 if __name__ == "__main__":
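The updated tests compare spiders with `is` instead of comparing domain strings with `==`. One consequence of keying a dict by spider objects, sketched below in plain Python, is that two spiders sharing a domain name still get separate entries. This assumes the spider class defines neither `__eq__` nor `__hash__`; the `Spider` class here is a hypothetical stand-in, not Scrapy's `BaseSpider`.

```python
class Spider(object):
    """Hypothetical stand-in; relies on default identity-based hashing."""

    def __init__(self, domain_name):
        self.domain_name = domain_name


first = Spider('example.com')
second = Spider('example.com')

tables = {}
tables[first] = {'test': 1}
tables[second] = {'test': 2}

# Same domain_name, but two distinct keys because lookup is by identity.
distinct_tables = len(tables)
```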
