| Revision 1936:a32f3fe4fe6d, 21.3 kB (checked in by Pablo Hoffman <pablo@…>, 7 months ago) |
|---|
Requests and Responses
Scrapy uses :class:`Request` and :class:`Response` objects for crawling web sites.
Typically, :class:`Request` objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a :class:`Response` object that travels back to the spider that issued the request.
Both the :class:`Request` and :class:`Response` classes have subclasses which add functionality not required in the base classes. These are described below in :ref:`topics-request-response-ref-request-subclasses` and :ref:`topics-request-response-ref-response-subclasses`.
Request objects
| param url: | the URL of this request |
|---|---|
| type url: | string |
| param callback: | the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see :ref:`topics-request-response-ref-request-callback-arguments` below. |
| type callback: | callable |
| param method: | the HTTP method of this request. Defaults to 'GET'. |
| type method: | string |
| param meta: | the initial values for the :attr:`Request.meta` attribute. If given, the dict passed in this parameter will be shallow copied. |
| type meta: | dict |
| param body: | the request body. If a unicode is passed, it is encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None). |
| type body: | str or unicode |
| param headers: | the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). |
| type headers: | dict |
| param cookies: | the request cookies. Example:
request_with_cookies = Request(url="http://www.example.com",
cookies={'currency': 'USD', 'country': 'UY'})
When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies item in the :attr:`Request.meta`. Example of request without merging cookies:
request_with_cookies = Request(url="http://www.example.com",
cookies={'currency': 'USD', 'country': 'UY'},
meta={'dont_merge_cookies': True})
|
| type cookies: | dict |
| param encoding: | the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to str (if given as unicode). |
| type encoding: | string |
| param priority: | the priority of this request (defaults to 0.0). The priority is used by the scheduler to define the order used to return requests. It can also be used to feed priorities externally, for example, using an offline long-term scheduler. |
| type priority: | int or float |
| param dont_filter: | indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False. |
| type dont_filter: | boolean |
| param errback: | a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter. |
| type errback: | callable |
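
To make the effect of dont_filter more concrete, here is a minimal standard-library sketch of a duplicates filter. It is not Scrapy's implementation: the ``DupeFilterSketch`` class and its ``should_schedule`` method are hypothetical names, and the sketch assumes requests are compared by URL only.

```python
class DupeFilterSketch:
    """Hypothetical, simplified duplicates filter (not Scrapy's own code)."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        """Return True if a request for `url` should be scheduled."""
        if dont_filter:
            # The request asked to bypass the duplicates filter.
            return True
        if url in self.seen:
            # An identical request was already scheduled: drop this one.
            return False
        self.seen.add(url)
        return True


dupefilter = DupeFilterSketch()
first = dupefilter.should_schedule("http://www.example.com/page")
repeat = dupefilter.should_schedule("http://www.example.com/page")
forced = dupefilter.should_schedule("http://www.example.com/page",
                                    dont_filter=True)
print(first, repeat, forced)
```

Because a request with dont_filter set is never dropped, a callback that keeps re-issuing it can loop forever, which is why the parameter should be used with care.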
Caveats with copying Requests and callbacks
When you copy a request using the :meth:`Request.copy` or :meth:`Request.replace` methods the callback of the request is not copied by default. This is because of legacy reasons along with limitations in the underlying network library, which doesn't allow sharing Twisted deferreds.
For example:
request = Request("http://www.example.com", callback=myfunc)
request2 = request.copy() # doesn't copy the callback
request3 = request.replace(callback=request.callback)
In the above example, request2 is a copy of request but it has no callback, while request3 is a copy of request and also contains the callback.
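The behaviour described above can be modelled without Scrapy. The following sketch uses a hypothetical ``RequestSketch`` stand-in (not the real :class:`Request` class) whose ``copy`` and ``replace`` methods drop the callback unless it is passed explicitly, plus an assumed helper, ``copy_keeping_callback``, that wraps the ``replace(callback=...)`` idiom.

```python
class RequestSketch:
    """Hypothetical stand-in mimicking the copy behaviour described above."""

    def __init__(self, url, callback=None, method="GET"):
        self.url = url
        self.callback = callback
        self.method = method

    def replace(self, **kwargs):
        # Carry over the plain attributes, but NOT the callback, unless
        # the caller passes it explicitly.
        for name in ("url", "method"):
            kwargs.setdefault(name, getattr(self, name))
        return RequestSketch(**kwargs)

    def copy(self):
        return self.replace()


def copy_keeping_callback(request):
    """Assumed helper: copy a request and keep its callback."""
    return request.replace(callback=request.callback)


def myfunc(response):
    return response


request = RequestSketch("http://www.example.com", callback=myfunc)
request2 = request.copy()                  # callback is dropped
request3 = copy_keeping_callback(request)  # callback is preserved
print(request2.callback, request3.callback is myfunc)
```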
Passing arguments to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the :class:`Response` object downloaded as its first argument.
Example:
def parse_page1(self, response):
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    return request

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.log("Visited %s" % response.url)
In some cases you may be interested in passing arguments to those callback functions so you can receive those arguments later, when the response is downloaded. There are two ways of doing this:

1. using a lambda function (or any other function/callable)
2. using the :attr:`Request.meta` attribute.
Here's an example of logging the referer URL of each page using each mechanism. Keep in mind, however, that the referer URL could be accessed more easily via response.request.url.
Using a lambda function:
def parse_page1(self, response):
    myarg = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=lambda r: self.parse_page2(r, myarg))
    return request

def parse_page2(self, response, referer_url):
    self.log("Visited page %s from %s" % (response.url, referer_url))
Using Request.meta:
def parse_page1(self, response):
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['referer_url'] = response.url
    return request

def parse_page2(self, response):
    referer_url = response.request.meta['referer_url']
    self.log("Visited page %s from %s" % (response.url, referer_url))
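When callbacks are built inside a loop, a general Python caveat applies to the lambda approach: a lambda looks up the loop variable when it is called, not when it is created. The sketch below shows this with plain functions and no Scrapy. Here ``parse_page2`` is a stand-in that takes a URL string rather than a :class:`Response`, and ``functools.partial`` is offered as one example of "any other function/callable" that binds the argument immediately.

```python
from functools import partial


def parse_page2(response_url, referer_url):
    # Stand-in for a spider callback; takes a URL string, not a Response.
    return "Visited page %s from %s" % (response_url, referer_url)


referers = ["http://www.example.com/a", "http://www.example.com/b"]

# Each lambda reads `ref` when it is *called*, so after the loop has
# finished they all see the last referer.
late_bound = []
for ref in referers:
    late_bound.append(lambda r: parse_page2(r, ref))

# partial() stores the current value of `ref` at creation time.
bound = [partial(parse_page2, referer_url=ref) for ref in referers]

page = "http://www.example.com/some_page.html"
late_results = [callback(page) for callback in late_bound]
bound_results = [callback(page) for callback in bound]
print(late_results)
print(bound_results)
```

Storing the value in :attr:`Request.meta` avoids the problem as well, since the value is attached to each request when it is created.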
Request subclasses
Here is the list of built-in :class:`Request` subclasses. You can also subclass it to implement your own custom functionality.
FormRequest objects
The FormRequest class extends the base :class:`Request` with functionality for dealing with HTML forms. It uses the ClientForm library (bundled with Scrapy) to pre-populate form fields with form data from :class:`Response` objects.
Request usage examples
Using FormRequest to send data via HTTP POST
If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you could return a :class:`FormRequest` object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
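For reference, an HTML form POST conventionally carries its fields as a URL-encoded request body. The following standard-library sketch (Python 3) shows what such a body looks like for the form data used above. It illustrates the wire format only and makes no claim about how :class:`FormRequest` builds its body internally.

```python
from urllib.parse import urlencode

formdata = {'name': 'John Doe', 'age': '27'}

# Encode the fields the way a browser does for a simple form POST.
body = urlencode(formdata)
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

print(body)
```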
Using FormRequest.from_response() to simulate a user login
It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the :meth:`FormRequest.from_response` method for this job. Here's an example spider which uses it:
# import paths assumed for the Scrapy version this page documents
from scrapy import log
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class LoginSpider(BaseSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...
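The idea of pre-populating form fields and overriding only a few of them can be sketched with the Python 3 standard library alone. This is not how :meth:`FormRequest.from_response` is implemented (the text above says it relies on ClientForm). The ``FormFieldCollector`` class, the ``prefilled_formdata`` helper, and the sample login page, including its ``session_token`` field, are all made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Hypothetical helper: collect name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # A missing value attribute is treated as an empty string.
            self.fields[name] = attributes.get("value") or ""


def prefilled_formdata(html, overrides):
    """Return the page's form fields, with `overrides` applied on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    fields.update(overrides)
    return fields


# Made-up login page with one hidden, pre-populated field.
LOGIN_PAGE = """
<form action="/users/login.php" method="post">
  <input type="hidden" name="session_token" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password">
</form>
"""

formdata = prefilled_formdata(
    LOGIN_PAGE, {'username': 'john', 'password': 'secret'})
print(formdata)
```

The hidden ``session_token`` value survives untouched, while the user name and password are replaced by the values supplied by the caller.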
Response objects
| param url: | the URL of this response |
|---|---|
| type url: | string |
| param headers: | the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). |
| type headers: | dict |
| param status: | the HTTP status of the response. Defaults to 200. |
| type status: | integer |
| param body: | the response body. It must be str, not unicode, unless you're using an encoding-aware :ref:`Response subclass <topics-request-response-ref-response-subclasses>`, such as :class:`TextResponse`. |
| type body: | str |
| param meta: | the initial values for the :attr:`Response.meta` attribute. If given, the dict will be shallow copied. |
| type meta: | dict |
| param flags: | a list containing the initial values for the :attr:`Response.flags` attribute. If given, the list will be shallow copied. |
| type flags: | list |
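
"Shallow copied" has a practical consequence worth spelling out: the container itself is duplicated, but any mutable objects inside it are still shared with the original. The sketch below shows this with plain Python dicts and lists. The keys and flag names are hypothetical, and it only assumes that the copy behaves like ``dict(...)`` or ``list(...)``.

```python
# Values a caller might pass as meta and flags (names are made up).
original_meta = {'referer_url': 'http://www.example.com', 'tags': ['a']}
original_flags = ['cached']

# Shallow copies, as the parameter descriptions above state.
stored_meta = dict(original_meta)
stored_flags = list(original_flags)

# Adding a key or an item affects only the copy ...
stored_meta['depth'] = 1
stored_flags.append('redirected')

# ... but mutating a nested object affects both, since it is shared.
stored_meta['tags'].append('b')

print(original_meta)
print(original_flags)
```

If the caller needs full isolation for nested values, ``copy.deepcopy`` can be applied before passing the dict in.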
Response subclasses
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.
