24 May'11

Scrapy 0.12: Parsing with Python

Based on the Scrapy Tutorial (the original link is dead).

  1. Install Scrapy and its dependencies
sudo apt-get install python-lxml
sudo easy_install -U Scrapy
  1. Create project
scrapy startproject dmoz
  1. Create item models (in projname/items.py)
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
  1. Create spiders (in projname/spiders/)
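from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = ""  # spider name; this is what you pass to "scrapy crawl"
    allowed_domains = [""]  # domains the spider is allowed to crawl
    start_urls = [
        # URLs to start crawling from go here
    ]

    def parse(self, response):
        # Wrap the response so it can be queried with XPath
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # Each XPath is relative to the current <li> element
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items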
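To see what the XPath expressions in parse() pick out without running a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector. The sample markup, the example.com URLs and the extract_items helper are all made up for illustration. ElementTree only accepts well-formed markup, so this is not how Scrapy itself parses pages.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (assumption: real pages look roughly like this)
SAMPLE = """<html><body><ul>
<li><a href="http://example.com/book">Example Book</a> - A made-up description.</li>
<li><a href="http://example.com/guide">Example Guide</a> - Another made-up description.</li>
</ul></body></html>"""

def extract_items(markup):
    """Hypothetical helper mirroring the three XPath queries in parse()."""
    root = ET.fromstring(markup)
    items = []
    for li in root.findall(".//ul/li"):        # roughly '//ul/li'
        links = li.findall("a")
        # Direct text nodes of <li>: its own text plus the tail of each child.
        texts = [li.text] + [child.tail for child in li]
        items.append({
            "title": [a.text for a in links if a.text],               # 'a/text()'
            "link": [a.get("href") for a in links if a.get("href")],  # 'a/@href'
            # 'text()'; unlike XPath, whitespace-only nodes are dropped here
            "desc": [t for t in texts if t and t.strip()],
        })
    return items

items = extract_items(SAMPLE)
for item in items:
    print(item)
```

Every field comes back as a list, which matches what extract() returns in the spider: one entry per matching node.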
  1. Run spiders
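# <spider name> is the value of the spider's name attribute
scrapy crawl <spider name>
scrapy crawl <spider name> --set FEED_URI=items.json --set FEED_FORMAT=json
scrapy crawl <spider name> --set FEED_URI=items.csv --set FEED_FORMAT=csv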
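Once a crawl has written items.json, the exported items can be post-processed with plain Python. The sketch below assumes the file holds a single JSON array of objects whose title, link and desc fields are lists (as returned by extract()). The sample data and the flatten_feed and first_or_empty helpers are hypothetical, not part of Scrapy.

```python
import json

# Made-up sample of what items.json might contain (assumption: one JSON array)
SAMPLE_FEED = """[
  {"title": ["Example Book"], "link": ["http://example.com/book"],
   "desc": [" - A made-up description."]},
  {"title": ["Example Guide"], "link": ["http://example.com/guide"], "desc": []}
]"""

def first_or_empty(values):
    """Return the first extracted string, stripped, or '' if nothing matched."""
    return values[0].strip() if values else ""

def flatten_feed(text):
    """Turn list-valued item fields into plain strings, one dict per item."""
    rows = []
    for item in json.loads(text):
        rows.append({
            "title": first_or_empty(item.get("title", [])),
            "link": first_or_empty(item.get("link", [])),
            "desc": first_or_empty(item.get("desc", [])),
        })
    return rows

rows = flatten_feed(SAMPLE_FEED)
for row in rows:
    print(row["title"], row["link"])
```

For a real export, pass in the contents of the file instead, for example flatten_feed(open("items.json").read()). If your export turns out to be one JSON object per line rather than a single array, parse it line by line instead.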