Scrapy wiki and bug-report system
Welcome! This is the wiki of Scrapy, an open source web crawling and screen scraping framework for Python.
Due to spam protection measures, you need to register to report bugs or modify the content of this wiki. Registration is very quick (just enter a username and password). Please register and contribute to the Scrapy community!
If you're new to Scrapy, start by reading Scrapy at a glance. The information on this wiki should be considered complementary to the official documentation.
Guides and HOWTOs for Scrapy users
- Scraping AJAX sites
- Using Parsley - Using Parsley and Parselets with Scrapy
- Scrapy on Amazon EC2
- PyQt4 and scrapy connect - Using PyQt4 as a frontend for Scrapy applications
- Run Scrapy crawler in a thread - to avoid blocking, so the crawler can be used from scripts or other software
- APT repositories - for installing Scrapy on Ubuntu-based platforms
Other User Resources
- Companies using Scrapy - list of companies and projects using Scrapy
- Community Spiders - Spiders for different sites contributed by the community (useful examples)
- Scrapy Recipes - Some code snippets for non-trivial tasks
- Scrapy 0.8 Changes - Comprehensive list of changes in Scrapy 0.8 (still in development)
Developer Resources
Project tracking and source code
- Trac Code Browser: Browse the source code using Trac web interface
- Project Timeline: View (and keep track of) recent changes to code, wiki and tickets
- Mercurial Web Interface: Browse the source code and changesets on the official Mercurial repo
- Scrapy repository on Bitbucket (mirror kindly set up by Patrick Mezard)
- Scrapy release procedure
Scrapy Enhancement Proposals
- SEP-001: API for populating item fields (comparison)
- SEP-002: List fields API
- SEP-003: Nested items fields API
- SEP-004: Library-like API
- SEP-005: Detailed ItemBuilder API use
- SEP-006: Extractors
- SEP-007: ItemLoader processors library
- SEP-008: Item Parsers
- SEP-009: Singletons removal
- SEP-010: REST API
- SEP-011: Process models
- SEP-012: Spider name
- SEP-013: Middlewares refactoring
- SEP-014: Crawl Spider v2
Getting the code
Scrapy uses Mercurial (hg) for managing its code.
Assuming you have Mercurial installed, the following command in a terminal will fetch the most recent code for you:
hg clone http://hg.scrapy.org/scrapy
Getting involved
If you want to get involved, feel free to check out the code and start playing with it (you can also fork Scrapy on Bitbucket), and take a look at the community resources if you want to talk with other Scrapy developers. We welcome any kind of feedback and discussion on Scrapy improvements.
