Ticket #99 (closed enhancement: fixed)

Opened 11 months ago

Last modified 7 weeks ago

Refactor link extractors with pluggable URL canonicalizers

Reported by: pablo
Owned by: rolando
Priority: major
Milestone: 0.9
Component: code
Version:
Keywords:
Cc: dan, pablo

Description (last modified by pablo)

We need to refactor link extractors with pluggable URL canonicalizers.

Here are some ideas for URL canonicalizers:
http://www.sugarrae.com/be-a-normalizer-a-c14n-exterminator/

We already follow most of them, but we should check our canonicalization policies against those on that page and make the rules modular, so each user can decide which ones to apply.
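
To make the "modular rules" idea concrete, here is a minimal sketch of what pluggable canonicalizers could look like. It is not the design that was eventually implemented: the rule functions (`lowercase_host`, `strip_fragment`, `sort_query`) and the `make_canonicalizer` helper are hypothetical names chosen for illustration, and the sketch assumes Python 3's standard `urllib.parse` module. Each rule is a plain function from URL string to URL string, and the user picks which rules to chain.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def lowercase_host(url):
    """Lowercase the network location (hypothetical rule).

    Kept simple for illustration: this also lowercases any
    user:password part embedded in the netloc.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def strip_fragment(url):
    """Drop the #fragment, which is never sent to the server."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=""))


def sort_query(url):
    """Sort query parameters so that argument order does not matter."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def make_canonicalizer(*rules):
    """Build one canonicalizer that applies the chosen rules in order."""
    def canonicalize(url):
        for rule in rules:
            url = rule(url)
        return url
    return canonicalize


# Each user decides which rules to use, and in which order.
canonicalize = make_canonicalizer(lowercase_host, strip_fragment, sort_query)
print(canonicalize("http://Example.COM/path?b=2&a=1#top"))
```

A link extractor could then accept such a callable as a parameter instead of hard-coding one canonicalization policy, which is roughly the kind of interface a SEP would need to pin down.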

We need to write a SEP for this.

Change History

Changed 8 months ago by pablo

  • owner changed from pablo to ismael
  • status changed from new to assigned
  • summary changed from Make URL canonicalizers pluggable to Refactor link extractors with pluggable URL canonicalizers
  • description modified
  • milestone set to 0.9

Changed 5 months ago by pablo

  • owner changed from ismael to rolando

This will be done as part of #141 (CrawlSpider v2).

Changed 7 weeks ago by pablo

  • status changed from assigned to closed
  • resolution set to fixed

This was done as part of CrawlSpider v2, but we'll probably revisit it once the new LegSpider (SEP-016) is introduced. We'll reopen this ticket then, or create a new one.
