Changeset 1936:a32f3fe4fe6d

Timestamp: 02/24/10 14:01:29 (5 months ago)
Author: Pablo Hoffman <pablo@…>
Branch: default
Message:

Fixed encoding issue (reported in #135) when the encoding declared in the HTTP header is unknown. This is the patch proposed by Rolando, with an update to the Request/Response documentation.

Files: 3 modified

  • docs/topics/request-response.rst

    r1816 r1936  
    @@ -467,10 +467,12 @@
         .. attribute:: TextResponse.encoding
     
    -       A string with the encoding of this response. The encoding is resolved in the
    -       following order:
    +       A string with the encoding of this response. The encoding is resolved by
    +       trying the following mechanisms, in order:
     
            1. the encoding passed in the constructor `encoding` argument
     
    -       2. the encoding declared in the Content-Type HTTP header
    +       2. the encoding declared in the Content-Type HTTP header. If this
    +          encoding is not valid (ie. unknown), it is ignored and the next
    +          resolution mechanism is tried.
     
            3. the encoding declared in the response body. The TextResponse class
  • scrapy/http/response/text.py

    r1809 r1936  
    @@ -6,4 +6,5 @@
     """
     
    +import codecs
     import re
     
    @@ -65,5 +66,10 @@
                 encoding = self._ENCODING_RE.search(content_type)
                 if encoding:
    -                return encoding.group(1)
    +                enc = encoding.group(1)
    +                try:
    +                    codecs.lookup(enc) # check if the encoding is valid
    +                    return enc
    +                except LookupError:
    +                    pass
     
         @memoizemethod_noargs
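
The validity check in this file can be tried outside Scrapy. The following is a minimal standalone sketch in Python 3 syntax, not the actual Scrapy code: `CHARSET_RE` and `header_encoding` are hypothetical names, and the regular expression is an assumption, since the real `_ENCODING_RE` pattern is not shown in this changeset. Only `codecs.lookup` and its `LookupError` come from the patch itself.

```python
import codecs
import re

# Hypothetical stand-in for TextResponse._ENCODING_RE; the real pattern
# is not part of this changeset.
CHARSET_RE = re.compile(r"charset=([\w-]+)", re.I)


def header_encoding(content_type):
    """Return the charset declared in a Content-Type value, or None when
    it is missing or names a codec Python does not know."""
    match = CHARSET_RE.search(content_type or "")
    if not match:
        return None
    enc = match.group(1)
    try:
        codecs.lookup(enc)  # raises LookupError for unknown encodings
    except LookupError:
        return None  # treat an unknown charset as "not declared"
    return enc
```

With this sketch, a header such as `text/html; charset=None` is treated as if no charset had been declared, which is the behaviour the commit message describes.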
  • scrapy/tests/test_http_response.py

    r1809 r1936  
    @@ -176,4 +176,6 @@
             r3 = self.response_class("http://www.example.com", headers={"Content-type": ["text/html; charset=iso-8859-1"]}, body="\xa3")
             r4 = self.response_class("http://www.example.com", body="\xa2\xa3")
    +        r5 = self.response_class("http://www.example.com",
    +        headers={"Content-type": ["text/html; charset=None"]}, body="\xc2\xa3")
     
             self.assertEqual(r1.headers_encoding(), "utf-8")
    @@ -183,4 +185,6 @@
             self.assertEqual(r3.encoding, 'iso-8859-1')
             self.assertEqual(r4.headers_encoding(), None)
    +        self.assertEqual(r5.headers_encoding(), None)
    +        self.assertEqual(r5.encoding, "utf-8")
             assert r4.body_encoding() is not None and r4.body_encoding() != 'ascii'
             self._assert_response_values(r1, 'utf-8', u"\xa3")
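
The `r5` case exercises the fall-through from one resolution mechanism to the next. The sketch below illustrates that ordering under stated assumptions: `pick_encoding` is a hypothetical helper, not part of Scrapy, and the `"utf-8"` passed as the body encoding stands in for whatever the response's body detection would return. It is written in Python 3 syntax, whereas the changeset itself targets Python 2.

```python
import codecs


def pick_encoding(constructor_encoding, header_encoding, body_encoding):
    """Try each mechanism in order, skipping a header encoding that
    Python cannot look up (hypothetical helper for illustration)."""
    if constructor_encoding:
        return constructor_encoding
    if header_encoding:
        try:
            codecs.lookup(header_encoding)
            return header_encoding
        except LookupError:
            pass  # unknown encoding: fall through to the next mechanism
    return body_encoding


# Mirrors r5: the header declares charset=None, the body is UTF-8 bytes.
enc = pick_encoding(None, "None", "utf-8")
text = b"\xc2\xa3".decode(enc)
```

Before the patch, the equivalent of the header branch returned `"None"` unchecked, so later decoding had no usable codec to work with.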