Changeset 1843:133d5e60dded

Timestamp:
11/13/09 14:25:47 (9 months ago)
Author:
Pablo Hoffman <pablo@…>
Branch:
default
Message:

Refactored HttpCache middleware:

* simplified code
* performance improvements
* removed awkward/unused domain sectorization
* it can now receive Settings on constructor
* added unittests
* added documentation about filesystem storage structure

Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.

Files:
1 added
5 modified

  • docs/topics/downloader-middleware.rst

    r1785 r1843  
    215215 
    216216    This middleware provides low-level cache to all HTTP requests and responses. 
    217     Every request and its corresponding response are cached and then, when that 
    218     same request is seen again, the response is returned without transferring 
     217    Every request and its corresponding response are cached. When the same 
     218    request is seen again, the response is returned without transferring 
    219219    anything from the Internet. 
    220220 
     
    223223    an Internet connection. 
    224224 
    225     The :class:`HttpCacheMiddleware` can be configured through the following 
    226     settings (see the settings documentation for more info): 
    227  
    228     * :setting:`HTTPCACHE_DIR` - this one actually enables the cache besides 
    229       settings the cache dir 
    230     * :setting:`HTTPCACHE_IGNORE_MISSING` - ignoring missing requests instead 
    231       of downloading them 
    232     * :setting:`HTTPCACHE_SECTORIZE` - split HTTP cache in several directories 
    233       (for performance reasons) 
    234     * :setting:`HTTPCACHE_EXPIRATION_SECS` - how many secs until the cache is 
    235       considered out of date 
     225File system storage 
     226~~~~~~~~~~~~~~~~~~~ 
     227 
     228By default, the :class:`HttpCacheMiddleware` uses a file system storage  with the following structure: 
     229 
     230Each request/response pair is stored in a different directory containing 
     231the following files: 
     232 
     233 * ``request_body`` - the plain request body 
     234 * ``request_headers`` - the request headers (in raw HTTP format) 
     235 * ``response_body`` - the plain response body 
     236 * ``response_headers`` - the response headers (in raw HTTP format) 
     237 * ``meta`` - some metadata of this cache resource in Python ``repr()`` format 
     238   (for easy grepping) 
     239 * ``pickled_meta`` - the same metadata in ``meta`` but pickled for more 
     240   efficient deserialization 
     241 
     242The directory name is made from the request fingerprint (see 
     243``scrapy.utils.request.fingerprint``), and one level of subdirectories is 
     244used to avoid creating too many files into the same directory (which is 
     245inefficient in many file systems). An example directory could be:: 
     246 
     247   /path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7 
     248 
     249The cache storage backend can be changed with the :setting:`HTTPCACHE_STORAGE` 
     250setting, but no other backend is provided with Scrapy yet. 
     251 
     252Settings 
     253~~~~~~~~ 
     254 
     255The :class:`HttpCacheMiddleware` can be configured through the following 
     256settings: 
     257 
     258.. setting:: HTTPCACHE_DIR 
     259 
     260HTTPCACHE_DIR 
     261^^^^^^^^^^^^^ 
     262 
     263Default: ``''`` (empty string) 
     264 
     265The directory to use for storing the (low-level) HTTP cache. If empty the HTTP 
     266cache will be disabled. 
     267 
     268.. setting:: HTTPCACHE_EXPIRATION_SECS 
     269 
     270HTTPCACHE_EXPIRATION_SECS 
     271^^^^^^^^^^^^^^^^^^^^^^^^^ 
     272 
     273Default: ``0`` 
     274 
     275Number of seconds to use for HTTP cache expiration. Requests that were cached 
     276before this time will be re-downloaded. If zero, cached requests will always 
     277expire. A negative number means requests will never expire. 
     278 
     279.. setting:: HTTPCACHE_IGNORE_MISSING 
     280 
     281HTTPCACHE_IGNORE_MISSING 
     282^^^^^^^^^^^^^^^^^^^^^^^^ 
     283 
     284Default: ``False`` 
     285 
     286If enabled, requests not found in the cache will be ignored instead of downloaded.  
     287 
     288.. setting:: HTTPCACHE_STORAGE 
     289 
     290HTTPCACHE_STORAGE 
     291^^^^^^^^^^^^^^^^^ 
     292 
     293Default: ``'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'`` 
     294 
     295The class which implements the cache storage backend. 
     296 
    236297 
    237298.. _topics-dlmw-robots: 
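
The directory layout described in the new documentation can be illustrated with a small sketch. It only mirrors the path rule shown in this changeset (cache dir, then domain, then the first two characters of the fingerprint, then the full fingerprint). The helper name `cache_entry_path` and the way the fingerprint is computed here are assumptions made for illustration; the real request fingerprint may hash more fields.

```python
import hashlib
from os.path import join


def cache_entry_path(cachedir, domain, fingerprint):
    """Build the directory for one cached request/response pair.

    Layout: <cachedir>/<domain>/<first two chars of fingerprint>/<fingerprint>
    """
    return join(cachedir, domain, fingerprint[0:2], fingerprint)


# Hypothetical stand-in for a request fingerprint: a SHA-1 hex digest of the
# method and URL. This is NOT claimed to match Scrapy's own fingerprint.
fingerprint = hashlib.sha1(b"GET http://example.com/").hexdigest()
path = cache_entry_path("/path/to/cache/dir", "example.com", fingerprint)
print(path)
```

The two-character subdirectory caps the fan-out of each domain directory at 256 entries, which is the point of the extra level.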
  • docs/topics/settings.rst

    r1831 r1843  
    220220The version of the bot implemented by this Scrapy project. This will be used to 
    221221construct the User-Agent by default. 
    222  
    223 .. setting:: HTTPCACHE_DIR 
    224  
    225 HTTPCACHE_DIR 
    226 ------------- 
    227  
    228 Default: ``''`` (empty string) 
    229  
    230 The directory to use for storing the (low-level) HTTP cache. If empty the HTTP 
    231 cache will be disabled. 
    232  
    233 .. setting:: HTTPCACHE_EXPIRATION_SECS 
    234  
    235 HTTPCACHE_EXPIRATION_SECS 
    236 ------------------------- 
    237  
    238 Default: ``0`` 
    239  
    240 Number of seconds to use for HTTP cache expiration. Requests that were cached 
    241 before this time will be re-downloaded. If zero, cached requests will always 
    242 expire. Negative numbers means requests will never expire. 
    243  
    244 .. setting:: HTTPCACHE_IGNORE_MISSING 
    245  
    246 HTTPCACHE_IGNORE_MISSING 
    247 ------------------------ 
    248  
    249 Default: ``False`` 
    250  
    251 If enabled, requests not found in the cache will be ignored instead of downloaded.  
    252  
    253 .. setting:: HTTPCACHE_SECTORIZE 
    254  
    255 HTTPCACHE_SECTORIZE 
    256 ------------------- 
    257  
    258 Default: ``True`` 
    259  
    260 Whether to split HTTP cache storage in several dirs for performance. 
    261222 
    262223.. setting:: COMMANDS_MODULE 
  • scrapy/conf/__init__.py

    r1840 r1843  
    1414class Settings(object): 
    1515 
    16     def __init__(self): 
     16    def __init__(self, overrides=None): 
    1717        self.defaults = {} 
    1818        self.global_defaults = default_settings 
     
    2525        pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE") 
    2626        self.overrides = pickle.loads(pickled_settings) if pickled_settings else {} 
     27        if overrides: 
     28            self.overrides.update(overrides) 
    2729 
    2830    def __getitem__(self, opt_name): 
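
A rough sketch of what the new `overrides` argument enables, written against a simplified stand-in class rather than the real `scrapy.conf.Settings`. The class name `SettingsSketch` and its two-level lookup (overrides first, then defaults) are assumptions for illustration; the real class consults more sources than this.

```python
class SettingsSketch(object):
    """Simplified stand-in for a settings object with per-instance overrides."""

    def __init__(self, defaults, overrides=None):
        self.defaults = dict(defaults)
        self.overrides = {}
        # Same shape as the change above: only merge when a dict was passed.
        if overrides:
            self.overrides.update(overrides)

    def __getitem__(self, name):
        # Assumed lookup order: overrides win over defaults.
        if name in self.overrides:
            return self.overrides[name]
        return self.defaults.get(name)


defaults = {"HTTPCACHE_DIR": "", "HTTPCACHE_EXPIRATION_SECS": 0}
settings = SettingsSketch(defaults, {"HTTPCACHE_DIR": "/tmp/httpcache"})
print(settings["HTTPCACHE_DIR"])
print(settings["HTTPCACHE_EXPIRATION_SECS"])
```

This is what makes the middleware easier to unit test: a test can build its own settings object and pass it to the constructor instead of mutating global state.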
  • scrapy/conf/default_settings.py

    r1833 r1843  
    9595HTTPCACHE_DIR = '' 
    9696HTTPCACHE_IGNORE_MISSING = False 
    97 HTTPCACHE_SECTORIZE = True 
     97HTTPCACHE_STORAGE = 'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage' 
    9898HTTPCACHE_EXPIRATION_SECS = 0 
    9999 
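
Since `HTTPCACHE_STORAGE` names a class to load, an alternative backend would presumably need the same four methods the middleware calls in this changeset: `open_spider`, `close_spider`, `retrieve_response` and `store_response`. The following is a minimal in-memory sketch of that interface. The class `InMemoryCacheStorage`, the `Request`/`Response` namedtuples and the cache key are all hypothetical and are not part of Scrapy.

```python
from collections import namedtuple

# Hypothetical stand-ins for Scrapy's request and response objects.
Request = namedtuple("Request", "method url")
Response = namedtuple("Response", "url status body")


class InMemoryCacheStorage(object):
    """Illustrative backend that keeps cached responses in a dict."""

    def __init__(self, settings=None):
        self._entries = {}

    def open_spider(self, spider):
        pass  # nothing to set up for an in-memory store

    def close_spider(self, spider):
        # Drop every entry that belongs to the spider being closed.
        name = self._spider_name(spider)
        for key in [k for k in self._entries if k[0] == name]:
            del self._entries[key]

    def retrieve_response(self, spider, request):
        """Return the cached response, or None when nothing is cached."""
        return self._entries.get(self._key(spider, request))

    def store_response(self, spider, request, response):
        self._entries[self._key(spider, request)] = response

    @staticmethod
    def _spider_name(spider):
        # Accept either an object with a domain_name attribute or a plain string.
        return getattr(spider, "domain_name", spider)

    def _key(self, spider, request):
        # Assumed key: spider name plus method and URL (not a real fingerprint).
        return (self._spider_name(spider), request.method, request.url)


storage = InMemoryCacheStorage()
req = Request("GET", "http://example.com/")
storage.store_response("example.com", req, Response(req.url, 200, b"hello"))
print(storage.retrieve_response("example.com", req).status)
```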
  • scrapy/contrib/downloadermiddleware/httpcache.py

    r1835 r1843  
    11from __future__ import with_statement 
    22 
    3 import errno 
    43import os 
    5 import hashlib 
    6 import datetime 
     4from os.path import join, exists 
     5from time import time 
    76import cPickle as pickle 
     7 
    88from scrapy.xlib.pydispatch import dispatcher 
    9  
    109from scrapy.core import signals 
    11 from scrapy import log 
    1210from scrapy.http import Headers 
    1311from scrapy.core.exceptions import NotConfigured, IgnoreRequest 
     
    1614from scrapy.utils.http import headers_dict_to_raw, headers_raw_to_dict 
    1715from scrapy.utils.httpobj import urlparse_cached 
    18 from scrapy.conf import settings 
     16from scrapy.utils.misc import load_object 
     17from scrapy import conf 
    1918 
    2019 
    2120class HttpCacheMiddleware(object): 
    22     def __init__(self): 
    23         if not settings['HTTPCACHE_DIR']: 
    24             raise NotConfigured 
    25         self.cache = Cache(settings['HTTPCACHE_DIR'], sectorize=settings.getbool('HTTPCACHE_SECTORIZE')) 
     21 
     22    def __init__(self, settings=conf.settings): 
     23        self.storage = load_object(settings['HTTPCACHE_STORAGE'])(settings) 
    2624        self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING') 
    27         dispatcher.connect(self.open_domain, signal=signals.spider_opened) 
     25        dispatcher.connect(self.spider_opened, signal=signals.spider_opened) 
     26        dispatcher.connect(self.spider_closed, signal=signals.spider_closed) 
    2827 
    29     def open_domain(self, spider): 
    30         self.cache.open_domain(spider.domain_name) 
     28    def spider_opened(self, spider): 
     29        self.storage.open_spider(spider) 
     30 
     31    def spider_closed(self, spider): 
     32        self.storage.close_spider(spider) 
    3133 
    3234    def process_request(self, request, spider): 
    33         if not is_cacheable(request): 
     35        if not self.is_cacheable(request): 
    3436            return 
    35  
    36         key = request_fingerprint(request) 
    37         domain = spider.domain_name 
    38  
    39         try: 
    40             response = self.cache.retrieve_response(domain, key) 
    41         except: 
    42             log.msg("Corrupt cache for %s" % request.url, log.WARNING) 
    43             response = False 
    44  
     37        response = self.storage.retrieve_response(spider, request) 
    4538        if response: 
     39            response.flags.append('cached') 
    4640            return response 
    4741        elif self.ignore_missing: 
     
    4943 
    5044    def process_response(self, request, response, spider): 
    51         if is_cacheable(request): 
    52             key = request_fingerprint(request) 
    53             self.cache.store(spider.domain_name, key, request, response) 
    54  
     45        if self.is_cacheable(request): 
     46            self.storage.store_response(spider, request, response) 
    5547        return response 
    5648 
    57  
    58 def is_cacheable(request): 
    59     return urlparse_cached(request).scheme in ['http', 'https'] 
     49    def is_cacheable(self, request): 
     50        return urlparse_cached(request).scheme in ['http', 'https'] 
    6051 
    6152 
    62 class Cache(object): 
    63     DOMAIN_SECTORDIR = 'data' 
    64     DOMAIN_LINKDIR = 'domains' 
     53class FilesystemCacheStorage(object): 
    6554 
    66     def __init__(self, cachedir, sectorize=False): 
     55    def __init__(self, settings=conf.settings): 
     56        cachedir = settings['HTTPCACHE_DIR'] 
     57        if not cachedir: 
     58            raise NotConfigured 
    6759        self.cachedir = cachedir 
    68         self.sectorize = sectorize 
     60        self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS') 
    6961 
    70         self.baselinkpath = os.path.join(self.cachedir, self.DOMAIN_LINKDIR) 
    71         if not os.path.exists(self.baselinkpath): 
    72             os.makedirs(self.baselinkpath) 
     62    def open_spider(self, spider): 
     63        pass 
    7364 
    74         self.basesectorpath = os.path.join(self.cachedir, self.DOMAIN_SECTORDIR) 
    75         if not os.path.exists(self.basesectorpath): 
    76             os.makedirs(self.basesectorpath) 
     65    def close_spider(self, spider): 
     66        pass 
    7767 
    78     def domainsectorpath(self, domain): 
    79         sector = hashlib.sha1(domain).hexdigest()[0] 
    80         return os.path.join(self.basesectorpath, sector, domain) 
    81  
    82     def domainlinkpath(self, domain): 
    83         return os.path.join(self.baselinkpath, domain) 
    84  
    85     def requestpath(self, domain, key): 
    86         linkpath = self.domainlinkpath(domain) 
    87         return os.path.join(linkpath, key[0:2], key) 
    88  
    89     def open_domain(self, domain): 
    90         if domain: 
    91             linkpath = self.domainlinkpath(domain) 
    92             if self.sectorize: 
    93                 sectorpath = self.domainsectorpath(domain) 
    94                 if not os.path.exists(sectorpath): 
    95                     os.makedirs(sectorpath) 
    96                 if not os.path.exists(linkpath): 
    97                     try: 
    98                         os.symlink(sectorpath, linkpath) 
    99                     except: 
    100                         os.makedirs(linkpath) # windows filesystem 
    101             else: 
    102                 if not os.path.exists(linkpath): 
    103                     os.makedirs(linkpath) 
    104  
    105     def read_meta(self, domain, key): 
    106         """Return the metadata dictionary (possibly empty) if the entry is 
    107         cached, None otherwise. 
    108         """ 
    109         requestpath = self.requestpath(domain, key) 
    110         try: 
    111             with open(os.path.join(requestpath, 'pickled_meta'), 'r') as f: 
    112                 metadata = pickle.load(f) 
    113         except IOError, e: 
    114             if e.errno != errno.ENOENT: 
    115                 raise 
    116             return None 
    117         expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS') 
    118         if expiration_secs >= 0: 
    119             expiration_date = metadata['timestamp'] + datetime.timedelta(seconds=expiration_secs) 
    120             if datetime.datetime.utcnow() > expiration_date: 
    121                 log.msg('dropping old cached response from %s' % metadata['timestamp'], \ 
    122                     level=log.DEBUG, domain=domain) 
    123                 return None 
    124         return metadata 
    125  
    126     def retrieve_response(self, domain, key): 
    127         """ 
    128         Return response dictionary if request has correspondent cache record; 
    129         return None if not. 
    130         """ 
    131         metadata = self.read_meta(domain, key) 
     68    def retrieve_response(self, spider, request): 
     69        """Return response if present in cache, or None otherwise.""" 
     70        metadata = self._read_meta(spider, request) 
    13271        if metadata is None: 
    133             return None # not cached 
    134  
    135         requestpath = self.requestpath(domain, key) 
    136         responsebody = responseheaders = None 
    137         with open(os.path.join(requestpath, 'response_body')) as f: 
    138             responsebody = f.read() 
    139         with open(os.path.join(requestpath, 'response_headers')) as f: 
    140             responseheaders = f.read() 
    141  
     72            return # not cached 
     73        rpath = self._get_request_path(spider, request) 
     74        with open(join(rpath, 'response_body'), 'rb') as f: 
     75            body = f.read() 
     76        with open(join(rpath, 'response_headers'), 'rb') as f: 
     77            rawheaders = f.read() 
    14278        url = metadata['url'] 
    143         headers = Headers(headers_raw_to_dict(responseheaders)) 
    14479        status = metadata['status'] 
    145  
     80        headers = Headers(headers_raw_to_dict(rawheaders)) 
    14681        respcls = responsetypes.from_args(headers=headers, url=url) 
    147         response = respcls(url=url, headers=headers, status=status, body=responsebody) 
    148         response.meta['cached'] = True 
    149         response.flags.append('cached') 
     82        response = respcls(url=url, headers=headers, status=status, body=body) 
    15083        return response 
    15184 
    152     def store(self, domain, key, request, response): 
    153         requestpath = self.requestpath(domain, key) 
    154         if not os.path.exists(requestpath): 
    155             os.makedirs(requestpath) 
     85    def store_response(self, spider, request, response): 
     86        """Store the given response in the cache.""" 
     87        rpath = self._get_request_path(spider, request) 
     88        if not exists(rpath): 
     89            os.makedirs(rpath) 
     90        metadata = { 
     91            'url': request.url, 
     92            'method': request.method, 
     93            'status': response.status, 
     94            'timestamp': time(), 
     95        } 
     96        with open(join(rpath, 'meta'), 'wb') as f: 
     97            f.write(repr(metadata)) 
     98        with open(join(rpath, 'pickled_meta'), 'wb') as f: 
     99            pickle.dump(metadata, f, protocol=2) 
     100        with open(join(rpath, 'response_headers'), 'wb') as f: 
     101            f.write(headers_dict_to_raw(response.headers)) 
     102        with open(join(rpath, 'response_body'), 'wb') as f: 
     103            f.write(response.body) 
     104        with open(join(rpath, 'request_headers'), 'wb') as f: 
     105            f.write(headers_dict_to_raw(request.headers)) 
     106        with open(join(rpath, 'request_body'), 'wb') as f: 
     107            f.write(request.body) 
    156108 
    157         metadata = { 
    158                 'url':request.url, 
    159                 'method': request.method, 
    160                 'status': response.status, 
    161                 'domain': domain, 
    162                 'timestamp': datetime.datetime.utcnow(), 
    163             } 
     109    def _get_request_path(self, spider, request): 
     110        key = request_fingerprint(request) 
     111        return join(self.cachedir, spider.domain_name, key[0:2], key) 
    164112 
    165         # metadata 
    166         with open(os.path.join(requestpath, 'meta_data'), 'w') as f: 
    167             f.write(repr(metadata)) 
    168         # pickled metadata (to recover without using eval) 
    169         with open(os.path.join(requestpath, 'pickled_meta'), 'w') as f: 
    170             pickle.dump(metadata, f) 
    171         # response 
    172         with open(os.path.join(requestpath, 'response_headers'), 'w') as f: 
    173             f.write(headers_dict_to_raw(response.headers)) 
    174         with open(os.path.join(requestpath, 'response_body'), 'w') as f: 
    175             f.write(response.body) 
    176         # request 
    177         with open(os.path.join(requestpath, 'request_headers'), 'w') as f: 
    178             f.write(headers_dict_to_raw(request.headers)) 
    179         if request.body: 
    180             with open(os.path.join(requestpath, 'request_body'), 'w') as f: 
    181                 f.write(request.body) 
     113    def _read_meta(self, spider, request): 
     114        rpath = self._get_request_path(spider, request) 
     115        metapath = join(rpath, 'pickled_meta') 
     116        if not exists(metapath): 
     117            return # not found 
     118        mtime = os.stat(rpath).st_mtime 
     119        if 0 <= self.expiration_secs < time() - mtime: 
     120            return # expired 
     121        with open(metapath, 'rb') as f: 
     122            return pickle.load(f)
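
The expiration test in `_read_meta` packs three cases into one chained comparison. The sketch below isolates it in a standalone function so each case can be checked. The function name `is_expired` and its explicit `now` parameter are assumptions added for illustration; the code above calls `time()` directly and takes `mtime` from the entry's directory.

```python
def is_expired(expiration_secs, mtime, now):
    """Return True when a cache entry should be treated as out of date.

    Mirrors the chained comparison `0 <= expiration_secs < now - mtime`:
    a negative expiration_secs disables expiry, otherwise the entry is
    expired once its age exceeds expiration_secs.
    """
    return 0 <= expiration_secs < now - mtime


print(is_expired(60, mtime=100.0, now=130.0))    # age 30s, limit 60s
print(is_expired(60, mtime=100.0, now=161.0))    # age 61s, limit 60s
print(is_expired(0, mtime=100.0, now=100.5))     # zero limit, any positive age
print(is_expired(-1, mtime=0.0, now=1e9))        # negative limit never expires
```

Note that with a limit of zero, an entry whose age is exactly zero is not considered expired, because the comparison is strict.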