Changeset 1843:133d5e60dded
Timestamp: 11/13/09 14:25:47 (9 months ago)
Branch: default
Files: 1 added, 5 modified

- docs/topics/downloader-middleware.rst (modified) (2 diffs)
- docs/topics/settings.rst (modified) (1 diff)
- scrapy/conf/__init__.py (modified) (2 diffs)
- scrapy/conf/default_settings.py (modified) (1 diff)
- scrapy/contrib/downloadermiddleware/httpcache.py (modified) (3 diffs)
- scrapy/tests/test_downloadermiddleware_httpcache.py (added)

docs/topics/downloader-middleware.rst
--- docs/topics/downloader-middleware.rst (r1785)
+++ docs/topics/downloader-middleware.rst (r1843)
@@ -215,6 +215,6 @@
 
 This middleware provides low-level cache to all HTTP requests and responses.
-Every request and its corresponding response are cached and then, when that
-same request is seen again, the response is returned without transferring
+Every request and its corresponding response are cached. When the same
+request is seen again, the response is returned without transferring
 anything from the Internet.
 
@@ -223,15 +223,76 @@
 an Internet connection.
 
-The :class:`HttpCacheMiddleware` can be configured through the following
-settings (see the settings documentation for more info):
-
-* :setting:`HTTPCACHE_DIR` - this one actually enables the cache besides
-  settings the cache dir
-* :setting:`HTTPCACHE_IGNORE_MISSING` - ignoring missing requests instead
-  of downloading them
-* :setting:`HTTPCACHE_SECTORIZE` - split HTTP cache in several directories
-  (for performance reasons)
-* :setting:`HTTPCACHE_EXPIRATION_SECS` - how many secs until the cache is
-  considered out of date
+File system storage
+~~~~~~~~~~~~~~~~~~~
+
+By default, the :class:`HttpCacheMiddleware` uses a file system storage with the following structure:
+
+Each request/response pair is stored in a different directory containing with
+the following files:
+
+* ``request_body`` - the plain request body
+* ``request_headers`` - the request headers (in raw HTTP format)
+* ``response_body`` - the plain response body
+* ``response_headers`` - the request headers (in raw HTTP format)
+* ``meta`` - some metadata of this cache resource in Python ``repr()`` format
+  (for easy grepeability)
+* ``pickled_meta`` - the same metadata in ``meta`` but pickled for more
+  efficient deserialization
+
+The directory name is made from the request fingerprint (see
+``scrapy.utils.request.fingerprint``), and one level of subdirectories is
+used to avoid creating too many files into the same directory (which is
+inefficient in many file systems). An example directory could be::
+
+    /path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
+
+The cache storage backend can be changed with the :setting:`HTTPCACHE_STORAGE`
+setting, but no other backend is provided with Scrapy yet.
+
+Settings
+~~~~~~~~
+
+The :class:`HttpCacheMiddleware` can be configured through the following
+settings:
+
+.. setting:: HTTPCACHE_DIR
+
+HTTPCACHE_DIR
+^^^^^^^^^^^^^
+
+Default: ``''`` (empty string)
+
+The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
+cache will be disabled.
+
+.. setting:: HTTPCACHE_EXPIRATION_SECS
+
+HTTPCACHE_EXPIRATION_SECS
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Default: ``0``
+
+Number of seconds to use for HTTP cache expiration. Requests that were cached
+before this time will be re-downloaded. If zero, cached requests will always
+expire. Negative numbers means requests will never expire.
+
+.. setting:: HTTPCACHE_IGNORE_MISSING
+
+HTTPCACHE_IGNORE_MISSING
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Default: ``False``
+
+If enabled, requests not found in the cache will be ignored instead of downloaded.
+
+.. setting:: HTTPCACHE_STORAGE
+
+HTTPCACHE_STORAGE
+^^^^^^^^^^^^^^^^^
+
+Default: ``'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'``
+
+The class which implements the cache storage backend.
+
 
 .. _topics-dlmw-robots:

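Note on the directory layout documented above: the following is a minimal sketch, not code from this changeset, of how an entry path of that shape could be assembled. The helper name `cache_entry_path` is hypothetical, and the fingerprint is passed in as a ready-made hex string rather than computed from a request.

```python
import posixpath


def cache_entry_path(cachedir, domain, fingerprint):
    """Build <cachedir>/<domain>/<first two hex chars>/<fingerprint>.

    Hypothetical helper for illustration only; the fingerprint is assumed
    to be a hex digest string that has already been computed.
    """
    # The two-character prefix adds one level of subdirectories, so no
    # single directory has to hold every cached entry for a domain.
    return posixpath.join(cachedir, domain, fingerprint[0:2], fingerprint)


print(cache_entry_path(
    "/path/to/cache/dir",
    "example.com",
    "72811f648e718090f041317756c03adb0ada46c7",
))
```

With the fingerprint from the documentation example, this reproduces the example directory shown in the added docs.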
docs/topics/settings.rst
--- docs/topics/settings.rst (r1831)
+++ docs/topics/settings.rst (r1843)
@@ -220,43 +220,4 @@
 The version of the bot implemented by this Scrapy project. This will be used to
 construct the User-Agent by default.
-
-.. setting:: HTTPCACHE_DIR
-
-HTTPCACHE_DIR
--------------
-
-Default: ``''`` (empty string)
-
-The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
-cache will be disabled.
-
-.. setting:: HTTPCACHE_EXPIRATION_SECS
-
-HTTPCACHE_EXPIRATION_SECS
--------------------------
-
-Default: ``0``
-
-Number of seconds to use for HTTP cache expiration. Requests that were cached
-before this time will be re-downloaded. If zero, cached requests will always
-expire. Negative numbers means requests will never expire.
-
-.. setting:: HTTPCACHE_IGNORE_MISSING
-
-HTTPCACHE_IGNORE_MISSING
-------------------------
-
-Default: ``False``
-
-If enabled, requests not found in the cache will be ignored instead of downloaded.
-
-.. setting:: HTTPCACHE_SECTORIZE
-
-HTTPCACHE_SECTORIZE
--------------------
-
-Default: ``True``
-
-Whether to split HTTP cache storage in several dirs for performance.
 
 .. setting:: COMMANDS_MODULE

scrapy/conf/__init__.py
--- scrapy/conf/__init__.py (r1840)
+++ scrapy/conf/__init__.py (r1843)
@@ -14,5 +14,5 @@
 class Settings(object):
 
-    def __init__(self):
+    def __init__(self, overrides=None):
         self.defaults = {}
         self.global_defaults = default_settings
@@ -25,4 +25,6 @@
         pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
         self.overrides = pickle.loads(pickled_settings) if pickled_settings else {}
+        if overrides:
+            self.overrides.update(overrides)
 
     def __getitem__(self, opt_name):

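Note on the new `overrides` argument: the sketch below is a simplified stand-in, not the real `Settings` class. It only illustrates the merge order visible in the diff, where overrides passed to the constructor are applied on top of those unpickled from the environment. `MiniSettings` and its `env_overrides` parameter are hypothetical names, and the environment lookup is replaced by a plain dict to keep the example self-contained.

```python
class MiniSettings(object):
    """Simplified stand-in for illustration; not scrapy.conf.Settings."""

    def __init__(self, env_overrides=None, overrides=None):
        # In the changeset this dict comes from unpickling an environment
        # variable; here it is passed in directly (an assumption).
        self.overrides = dict(env_overrides or {})
        # Constructor overrides are merged last, so they win on conflicts.
        if overrides:
            self.overrides.update(overrides)


settings = MiniSettings(
    env_overrides={"HTTPCACHE_DIR": "/tmp/env-cache", "HTTPCACHE_EXPIRATION_SECS": 0},
    overrides={"HTTPCACHE_DIR": "/tmp/test-cache"},
)
print(settings.overrides["HTTPCACHE_DIR"])
```

This kind of per-instance override is presumably what lets the new test module build a `Settings` object without touching the global configuration, although the test file itself is not shown in this changeset view.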
scrapy/conf/default_settings.py
--- scrapy/conf/default_settings.py (r1833)
+++ scrapy/conf/default_settings.py (r1843)
@@ -95,5 +95,5 @@
 HTTPCACHE_DIR = ''
 HTTPCACHE_IGNORE_MISSING = False
-HTTPCACHE_SECTORIZE = True
+HTTPCACHE_STORAGE = 'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'
 HTTPCACHE_EXPIRATION_SECS = 0
 

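Note on `HTTPCACHE_STORAGE`: the setting holds a dotted path to a class, which the middleware resolves at start-up. The sketch below shows one way such a path can be resolved with the standard library. It is not Scrapy's `load_object`; `load_by_path` is a hypothetical name used only for illustration.

```python
from importlib import import_module


def load_by_path(path):
    """Resolve 'package.module.Name' to the object it names.

    Hypothetical helper: a stand-in for the idea behind a dotted-path
    setting, not the function the middleware actually calls.
    """
    module_path, _, name = path.rpartition(".")
    if not module_path:
        raise ValueError("expected a dotted path like 'package.module.Name': %r" % path)
    # Import the module part, then fetch the final attribute from it.
    return getattr(import_module(module_path), name)


join = load_by_path("posixpath.join")
print(join("cache", "example.com"))
```

Keeping the backend behind a string setting means an alternative storage class can be swapped in from project settings without editing the middleware.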
scrapy/contrib/downloadermiddleware/httpcache.py
--- scrapy/contrib/downloadermiddleware/httpcache.py (r1835)
+++ scrapy/contrib/downloadermiddleware/httpcache.py (r1843)
@@ -1,13 +1,11 @@
 from __future__ import with_statement
 
-import errno
 import os
-import hashlib
-import datetime
+from os.path import join, exists
+from time import time
 import cPickle as pickle
+
 from scrapy.xlib.pydispatch import dispatcher
-
 from scrapy.core import signals
-from scrapy import log
 from scrapy.http import Headers
 from scrapy.core.exceptions import NotConfigured, IgnoreRequest
@@ -16,32 +14,28 @@
 from scrapy.utils.http import headers_dict_to_raw, headers_raw_to_dict
 from scrapy.utils.httpobj import urlparse_cached
-from scrapy.conf import settings
+from scrapy.utils.misc import load_object
+from scrapy import conf
 
 
 class HttpCacheMiddleware(object):
-    def __init__(self):
-        if not settings['HTTPCACHE_DIR']:
-            raise NotConfigured
-        self.cache = Cache(settings['HTTPCACHE_DIR'], sectorize=settings.getbool('HTTPCACHE_SECTORIZE'))
+
+    def __init__(self, settings=conf.settings):
+        self.storage = load_object(settings['HTTPCACHE_STORAGE'])(settings)
         self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
-        dispatcher.connect(self.open_domain, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
 
-    def open_domain(self, spider):
-        self.cache.open_domain(spider.domain_name)
+    def spider_opened(self, spider):
+        self.storage.open_spider(spider)
+
+    def spider_closed(self, spider):
+        self.storage.close_spider(spider)
 
     def process_request(self, request, spider):
-        if not is_cacheable(request):
+        if not self.is_cacheable(request):
             return
-
-        key = request_fingerprint(request)
-        domain = spider.domain_name
-
-        try:
-            response = self.cache.retrieve_response(domain, key)
-        except:
-            log.msg("Corrupt cache for %s" % request.url, log.WARNING)
-            response = False
-
+        response = self.storage.retrieve_response(spider, request)
         if response:
+            response.flags.append('cached')
             return response
         elif self.ignore_missing:
@@ -49,133 +43,80 @@
 
     def process_response(self, request, response, spider):
-        if is_cacheable(request):
-            key = request_fingerprint(request)
-            self.cache.store(spider.domain_name, key, request, response)
-
+        if self.is_cacheable(request):
+            self.storage.store_response(spider, request, response)
         return response
 
-
-def is_cacheable(request):
-    return urlparse_cached(request).scheme in ['http', 'https']
+    def is_cacheable(self, request):
+        return urlparse_cached(request).scheme in ['http', 'https']
 
 
-class Cache(object):
-    DOMAIN_SECTORDIR = 'data'
-    DOMAIN_LINKDIR = 'domains'
+class FilesystemCacheStorage(object):
 
-    def __init__(self, cachedir, sectorize=False):
+    def __init__(self, settings=conf.settings):
+        cachedir = settings['HTTPCACHE_DIR']
+        if not cachedir:
+            raise NotConfigured
         self.cachedir = cachedir
-        self.sectorize = sectorize
+        self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
 
-        self.baselinkpath = os.path.join(self.cachedir, self.DOMAIN_LINKDIR)
-        if not os.path.exists(self.baselinkpath):
-            os.makedirs(self.baselinkpath)
+    def open_spider(self, spider):
+        pass
 
-        self.basesectorpath = os.path.join(self.cachedir, self.DOMAIN_SECTORDIR)
-        if not os.path.exists(self.basesectorpath):
-            os.makedirs(self.basesectorpath)
+    def close_spider(self, spider):
+        pass
 
-    def domainsectorpath(self, domain):
-        sector = hashlib.sha1(domain).hexdigest()[0]
-        return os.path.join(self.basesectorpath, sector, domain)
-
-    def domainlinkpath(self, domain):
-        return os.path.join(self.baselinkpath, domain)
-
-    def requestpath(self, domain, key):
-        linkpath = self.domainlinkpath(domain)
-        return os.path.join(linkpath, key[0:2], key)
-
-    def open_domain(self, domain):
-        if domain:
-            linkpath = self.domainlinkpath(domain)
-            if self.sectorize:
-                sectorpath = self.domainsectorpath(domain)
-                if not os.path.exists(sectorpath):
-                    os.makedirs(sectorpath)
-                if not os.path.exists(linkpath):
-                    try:
-                        os.symlink(sectorpath, linkpath)
-                    except:
-                        os.makedirs(linkpath) # windows filesystem
-            else:
-                if not os.path.exists(linkpath):
-                    os.makedirs(linkpath)
-
-    def read_meta(self, domain, key):
-        """Return the metadata dictionary (possibly empty) if the entry is
-        cached, None otherwise.
-        """
-        requestpath = self.requestpath(domain, key)
-        try:
-            with open(os.path.join(requestpath, 'pickled_meta'), 'r') as f:
-                metadata = pickle.load(f)
-        except IOError, e:
-            if e.errno != errno.ENOENT:
-                raise
-            return None
-        expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
-        if expiration_secs >= 0:
-            expiration_date = metadata['timestamp'] + datetime.timedelta(seconds=expiration_secs)
-            if datetime.datetime.utcnow() > expiration_date:
-                log.msg('dropping old cached response from %s' % metadata['timestamp'], \
-                        level=log.DEBUG, domain=domain)
-                return None
-        return metadata
-
-    def retrieve_response(self, domain, key):
-        """
-        Return response dictionary if request has correspondent cache record;
-        return None if not.
-        """
-        metadata = self.read_meta(domain, key)
+    def retrieve_response(self, spider, request):
+        """Return response if present in cache, or None otherwise."""
+        metadata = self._read_meta(spider, request)
         if metadata is None:
-            return None # not cached
-
-        requestpath = self.requestpath(domain, key)
-        responsebody = responseheaders = None
-        with open(os.path.join(requestpath, 'response_body')) as f:
-            responsebody = f.read()
-        with open(os.path.join(requestpath, 'response_headers')) as f:
-            responseheaders = f.read()
-
+            return # not cached
+        rpath = self._get_request_path(spider, request)
+        with open(join(rpath, 'response_body'), 'rb') as f:
+            body = f.read()
+        with open(join(rpath, 'response_headers'), 'rb') as f:
+            rawheaders = f.read()
         url = metadata['url']
-        headers = Headers(headers_raw_to_dict(responseheaders))
         status = metadata['status']
-
+        headers = Headers(headers_raw_to_dict(rawheaders))
         respcls = responsetypes.from_args(headers=headers, url=url)
-        response = respcls(url=url, headers=headers, status=status, body=responsebody)
-        response.meta['cached'] = True
-        response.flags.append('cached')
+        response = respcls(url=url, headers=headers, status=status, body=body)
         return response
 
-    def store(self, domain, key, request, response):
-        requestpath = self.requestpath(domain, key)
-        if not os.path.exists(requestpath):
-            os.makedirs(requestpath)
+    def store_response(self, spider, request, response):
+        """Store the given response in the cache."""
+        rpath = self._get_request_path(spider, request)
+        if not exists(rpath):
+            os.makedirs(rpath)
+        metadata = {
+            'url': request.url,
+            'method': request.method,
+            'status': response.status,
+            'timestamp': time(),
+        }
+        with open(join(rpath, 'meta'), 'wb') as f:
+            f.write(repr(metadata))
+        with open(join(rpath, 'pickled_meta'), 'wb') as f:
+            pickle.dump(metadata, f, protocol=2)
+        with open(join(rpath, 'response_headers'), 'wb') as f:
+            f.write(headers_dict_to_raw(response.headers))
+        with open(join(rpath, 'response_body'), 'wb') as f:
+            f.write(response.body)
+        with open(join(rpath, 'request_headers'), 'wb') as f:
+            f.write(headers_dict_to_raw(request.headers))
+        with open(join(rpath, 'request_body'), 'wb') as f:
+            f.write(request.body)
 
-        metadata = {
-            'url':request.url,
-            'method': request.method,
-            'status': response.status,
-            'domain': domain,
-            'timestamp': datetime.datetime.utcnow(),
-        }
+    def _get_request_path(self, spider, request):
+        key = request_fingerprint(request)
+        return join(self.cachedir, spider.domain_name, key[0:2], key)
 
-        # metadata
-        with open(os.path.join(requestpath, 'meta_data'), 'w') as f:
-            f.write(repr(metadata))
-        # pickled metadata (to recover without using eval)
-        with open(os.path.join(requestpath, 'pickled_meta'), 'w') as f:
-            pickle.dump(metadata, f)
-        # response
-        with open(os.path.join(requestpath, 'response_headers'), 'w') as f:
-            f.write(headers_dict_to_raw(response.headers))
-        with open(os.path.join(requestpath, 'response_body'), 'w') as f:
-            f.write(response.body)
-        # request
-        with open(os.path.join(requestpath, 'request_headers'), 'w') as f:
-            f.write(headers_dict_to_raw(request.headers))
-        if request.body:
-            with open(os.path.join(requestpath, 'request_body'), 'w') as f:
-                f.write(request.body)
+    def _read_meta(self, spider, request):
+        rpath = self._get_request_path(spider, request)
+        metapath = join(rpath, 'pickled_meta')
+        if not exists(metapath):
+            return # not found
+        mtime = os.stat(rpath).st_mtime
+        if 0 <= self.expiration_secs < time() - mtime:
+            return # expired
+        with open(metapath, 'rb') as f:
+            return pickle.load(f)
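
Note on the expiration test in `_read_meta`: the chained comparison `0 <= self.expiration_secs < time() - mtime` covers all three documented cases of `HTTPCACHE_EXPIRATION_SECS` in one expression. The sketch below isolates that condition as a pure function so the cases can be checked directly. `is_expired` and its explicit `now` parameter are hypothetical names added for illustration; the changeset reads the clock and the directory mtime inline.

```python
def is_expired(mtime, expiration_secs, now):
    """Return True when a cache entry should be treated as stale.

    Hypothetical helper mirroring the chained comparison in _read_meta:
    a negative expiration_secs never expires, and zero expires any entry
    whose age is greater than zero.
    """
    age = now - mtime
    return 0 <= expiration_secs < age


# Entry written at t=100, checked at t=160 (age 60 seconds).
for secs in (-1, 0, 30, 60, 120):
    print(secs, is_expired(mtime=100, expiration_secs=secs, now=160))
```

An entry whose age equals the limit exactly is still served from the cache, because the comparison is strict.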
