Revision 1751:be0228b4d334, 6.1 kB (checked in by Ismael Carnales <icarnales@…>, 12 months ago)

modified doc to reflect the new spider callback return policy (lists not needed)

Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing or historical archival.

Even though Scrapy was originally designed for screen scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.

When you're ready to start a project, you can :ref:`start with the tutorial <intro-tutorial>`.

Pick a website

So you need to extract some information from a website, but the website doesn't provide an API or any other mechanism for accessing that information from a program. Scrapy can help you extract it. Let's say we want to extract information about all torrent files added today on the mininova torrent site.

The list of all torrents added today can be found on this page:

http://www.mininova.org/today

Write a Spider to extract the Items

Now we'll write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links and the rules for extracting the data from pages.

If we look at that page's content, we'll see that all torrent URLs have the form http://www.mininova.org/tor/NUMBER, where NUMBER is an integer. We'll use that to construct the regular expression for the links to follow: /tor/\d+.
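
To check the link pattern before putting it in a spider, it can be tried with Python's standard re module. This is only a sketch: the URL list is made up from addresses that appear in this document, and it assumes the pattern is searched anywhere in the URL rather than anchored at its start.

```python
import re

# The pattern derived above; a raw string keeps the backslash in \d intact.
TORRENT_LINK = re.compile(r'/tor/\d+')

# Illustrative URLs taken from this document (not fetched from the site).
urls = [
    'http://www.mininova.org/tor/2657665',
    'http://www.mininova.org/today',
    'http://www.mininova.org/cat/4',
]

# Keep only the URLs that contain /tor/ followed by at least one digit.
to_follow = [u for u in urls if TORRENT_LINK.search(u)]
print(to_follow)
```

Only the torrent page matches, so the listing and category pages would not be handed to the extraction callback.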

To extract the data, we'll use XPath to select the parts of the document that contain it. Let's take one of those torrent pages:

http://www.mininova.org/tor/2657665

We'll then look at the page's HTML source to construct XPath expressions that select the data we want to extract: the torrent name, description and size.

By looking at the page HTML source we can see that the file name is contained inside a <h1> tag:

<h1>Home[2009][Eng]XviD-ovd</h1>

An XPath expression to extract the name could be:

//h1/text()

And the description is contained inside a <div> tag with id="description":

<h2>Description:</h2>
<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.
...

An XPath expression to select the description could be:

//div[@id='description']

Finally, the file size is contained in the second <p> tag inside the <div> tag with id="specifications":

<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
699.79&nbsp;megabyte</p>

An XPath expression to select the file size could be:

//div[@id='specifications']/p[2]/text()[2]

For more information about XPath see the XPath reference.
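
The three selections can be tried offline with Python's standard xml.etree.ElementTree module. This is a rough sketch and not how Scrapy evaluates XPath: ElementTree supports only a subset of XPath (no text() step), so text nodes are read through the .text and .tail attributes instead. The markup is a trimmed, well-formed copy of the samples above, with &nbsp; written as &#160; because the XML parser does not know HTML entities.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample markup, wrapped so that it parses as XML.
PAGE = """<html><body>
<h1>Home[2009][Eng]XviD-ovd</h1>
<div id="description">"HOME" - a documentary film by Yann Arthus-Bertrand</div>
<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
699.79&#160;megabyte</p>
</div>
</body></html>"""

root = ET.fromstring(PAGE)

# Counterpart of //h1/text()
name = root.find(".//h1").text

# Counterpart of //div[@id='description']
description = root.find(".//div[@id='description']")

# Counterpart of //div[@id='specifications']/p[2]/text()[2]:
# the first text node of the <p> is the whitespace before <strong>,
# the second is the text that follows </strong>, i.e. its "tail".
second_p = root.find(".//div[@id='specifications']/p[2]")
size = second_p.find("strong").tail.strip()

print(name)
print(description.text)
print(size)
```

This also shows why the size expression ends in text()[2]: the first text node of that paragraph is only the line break before the <strong> label.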

Finally, here's the spider code:

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # Follow every link matching /tor/NUMBER and pass the page to parse_torrent
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

For brevity's sake, we intentionally left out the import statements and the definition of the TorrentItem class, which declares the url, name, description and size fields used above.

Write a pipeline to store the items extracted

Now let's write an :ref:`topics-item-pipeline` that serializes and stores the extracted item into a file using pickle:

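import pickle

class StoreItemPipeline(object):
    def process_item(self, domain, response, item):
        # Use the last URL path segment (the torrent number) in the file name
        torrent_id = item['url'].split('/')[-1]
        # pickle writes binary data, so open the file in binary mode
        with open("torrent-%s.pickle" % torrent_id, "wb") as f:
            pickle.dump(item, f)
        # Hand the item on to any later pipeline stage
        return item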
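
The storage step can be exercised without running a crawl. In the sketch below a plain dict stands in for the scraped item and the files go to a temporary directory; both are assumptions made for illustration and are not part of the pipeline above. The name is a one-element list on the assumption that extract() returns a list of strings.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the scraped item (illustrative values only).
item = {
    'url': 'http://www.mininova.org/tor/2657665',
    'name': ['Home[2009][Eng]XviD-ovd'],
}

# Same file-naming rule as the pipeline: the last URL path segment.
torrent_id = item['url'].split('/')[-1]
path = os.path.join(tempfile.mkdtemp(), "torrent-%s.pickle" % torrent_id)

# Write the item, then read it back to confirm the round trip.
with open(path, "wb") as f:
    pickle.dump(item, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(os.path.basename(path))
print(restored == item)
```

Reading the file back with pickle.load returns an object equal to the stored item, which is a quick way to inspect what a crawl has saved.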

What else?

You've seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

What's next?

The next obvious steps are for you to download Scrapy, read :ref:`the tutorial <intro-tutorial>` and join the community. Thanks for your interest!
