Scraping AJAX sites
As Mark Ellul pointed out on the scrapy-users mailing list, web sites make use of two basic types of AJAX requests: "static" requests, whose parameters (URL, POST data) don't change, and "dynamic" requests, which use variables based on properties of the current page.
The general approach for "static" AJAX requests is to add their URLs to the start_urls attribute, like any "normal" URL. For "dynamic" ones, we try to generate the same requests from Scrapy.
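The difference between the two cases can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not Scrapy code: the example.com URLs, the build_dynamic_url helper, and the id and page parameters are all hypothetical and stand in for whatever the target page actually sends.

```python
from urllib.parse import urlencode

# "Static" request: the URL never changes, so it can be listed as-is
# (for example in start_urls). Hypothetical URL, for illustration only.
STATIC_URL = "http://example.com/ajax/latest.xml"


def build_dynamic_url(base_url, page_values):
    """Rebuild the URL that the page's JavaScript would request.

    page_values holds the variables read from the current page
    (hypothetical names such as "id" and "page").
    """
    return base_url + "?" + urlencode(page_values)


# "Dynamic" request: the parameters depend on values found in the page.
dynamic_url = build_dynamic_url("http://example.com/ajax/item", {"id": 42, "page": 2})
print(dynamic_url)
```

In a spider, a URL like STATIC_URL would simply be listed up front, while a URL like dynamic_url would be built inside a callback from values extracted from the response.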
To help us with this task, we'll use a Firefox add-on called Firebug. This add-on comes with a Net panel that lets us monitor the requests being sent to the server and their responses.
We will scrape the NASA Image of the Day Gallery.
When loading the site, we can see that the page loads the gallery information from another source. To find out which one, we launch Firebug, go to the Net panel and reload the page.
In the Net panel, we see each request (and its response) made to load the entire page contents. Here we can filter the requests and look for XMLHttpRequests in the XHR tab.
In the XHR tab, we see that two requests are made: one to iotdxml.xml and one to image_feature_NUMBER.xml. If we look at the response of the first one (by clicking on it and then going to the Response tab), we see that it holds the gallery slider data.
Now, if we navigate to another photo by clicking on its slider link, we'll see that a new request has been made. This request points to image_feature_NUMBER.xml, which looks suspiciously like the second request we got when loading the page for the first time (that request fetched the first image in the gallery). If we look at iotdxml.xml again, we'll find that the URL used to fetch each image's complete data is stored in an ap element.
So, to scrape this site, we add the iotdxml.xml URL to the start_urls attribute of a spider, parse it, and make a request for each individual image (mimicking the requests made when clicking on images).
Here's a simple spider that illustrates the example:
from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider


class NasaImagesSpider(BaseSpider):
    domain_name = "nasa.gov"
    start_urls = (
        'http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',
    )

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        urls = xxs.select('//ig/ap/text()').extract()
        for url in urls:
            abs_url = urljoin(self.start_urls[0], url) + '.xml'
            yield Request(abs_url, callback=self.parse_image)

    def parse_image(self, response):
        # parse individual images here
        pass


SPIDER = NasaImagesSpider()
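The URL-building step in parse can also be tried outside Scrapy. The following is a rough sketch using only the Python 3 standard library. The SAMPLE_FEED excerpt is an assumption made up for illustration (only the ig/ap nesting is taken from the XPath used in the spider), and the image_urls helper and the image numbers are hypothetical as well.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

FEED_URL = "http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml"

# Hypothetical excerpt: the real feed layout is assumed, apart from the
# ig/ap nesting implied by the '//ig/ap/text()' XPath in the spider.
SAMPLE_FEED = """<gallery>
  <ig><ap>/multimedia/imagegallery/image_feature_1001</ap></ig>
  <ig><ap>/multimedia/imagegallery/image_feature_1002</ap></ig>
</gallery>"""


def image_urls(feed_xml, feed_url=FEED_URL):
    """Return the absolute per-image XML URLs listed in the feed."""
    root = ET.fromstring(feed_xml)
    urls = []
    for ap in root.findall(".//ig/ap"):
        if ap.text:
            # Same steps as the spider: resolve against the feed URL,
            # then append the '.xml' extension.
            urls.append(urljoin(feed_url, ap.text.strip()) + ".xml")
    return urls


urls = image_urls(SAMPLE_FEED)
for url in urls:
    print(url)
```

Checking this step in isolation makes it easy to compare the generated URLs with the image_feature_NUMBER.xml requests seen in the Net panel before running the full spider.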
