Project Description

scrapy-boilerplate is a small set of utilities for Scrapy that simplifies writing the low-complexity spiders common in small and one-off projects.

It requires Scrapy (>= 0.16) and has been tested with Python 2.7. Additionally, PyQuery is required to run the scripts in the examples directory.

Note

The code is experimental; it includes some magic under the hood and might be hard to debug. If you are new to Scrapy, don't use this code unless you are ready to debug errors that nobody has seen before.

Usage Guide

Items

Standard item definition:

from scrapy.item import Item, Field

class BaseItem(Item):
    url = Field()
    crawled = Field()

class UserItem(BaseItem):
    name = Field()
    about = Field()
    location = Field()

class StoryItem(BaseItem):
    title = Field()
    body = Field()
    user = Field()

Becomes:

from scrapy_boilerplate import NewItem

BaseItem = NewItem('url crawled')

UserItem = NewItem('name about location', base_cls=BaseItem)

StoryItem = NewItem('title body user', base_cls=BaseItem)
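
Since the NewItem form is equivalent to the standard definition above, the resulting classes behave like ordinary Scrapy items. A minimal usage sketch (the field values are placeholders):

story = StoryItem(title='A title', body='...', user='alice')
story['url'] = 'http://example.com/story/1'
title = story['title']       # dict-style access, as with any Item
# story['undeclared'] = 1    # would raise KeyError, since only declared fields are allowed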

BaseSpider

Standard spider definition:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    start_urls = ['http://example.com/latest']

    def parse(self, response):
        # do stuff

Becomes:

from scrapy_boilerplate import NewSpider

MySpider = NewSpider('my_spider')

@MySpider.scrape('http://example.com/latest')
def parse(spider, response):
    # do stuff
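
A slightly fuller sketch, assuming (as the snippet above suggests) that the decorated function is called like a regular parse() callback with the spider instance passed explicitly; the item definition and extraction logic are placeholders:

from scrapy_boilerplate import NewItem, NewSpider

PageItem = NewItem('url title')

MySpider = NewSpider('my_spider')

@MySpider.scrape('http://example.com/latest')
def parse(spider, response):
    # response is an ordinary Scrapy Response object
    item = PageItem()
    item['url'] = response.url
    item['title'] = response.url.rsplit('/', 1)[-1]  # placeholder extraction
    yield item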

CrawlSpider

Standard crawl-spider definition:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    rules = (
        Rule(SgmlLinkExtractor(r'category\.php'), follow=True),
        Rule(SgmlLinkExtractor(r'item\.php'), callback='parse_item'),
    )

    def parse_item(self, response):
        # do stuff

Becomes:

from scrapy_boilerplate import NewCrawlSpider

MySpider = NewCrawlSpider('my_spider')
MySpider.follow(r'category\.php')

@MySpider.rule(r'item\.php')
def parse_item(spider, response):
    # do stuff
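
For a more complete picture, here is a hedged sketch of the same crawl-spider form with an actual callback body, using the Scrapy 0.16-era HtmlXPathSelector; the XPath expression and field names are placeholders, not part of scrapy-boilerplate itself:

from scrapy.selector import HtmlXPathSelector
from scrapy_boilerplate import NewCrawlSpider, NewItem

ProductItem = NewItem('url name')

MySpider = NewCrawlSpider('my_spider')
MySpider.follow(r'category\.php')

@MySpider.rule(r'item\.php')
def parse_item(spider, response):
    hxs = HtmlXPathSelector(response)
    item = ProductItem()
    item['url'] = response.url
    item['name'] = hxs.select('//h1/text()').extract()  # placeholder XPath
    yield item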

Running Helpers

Single-spider running script:

# file: my-spider.py
# imports omitted ...

class MySpider(BaseSpider):
    # spider code ...

if __name__ == '__main__':
    from scrapy_boilerplate import run_spider
    custom_settings = {
        # ...
    }
    spider = MySpider()

    run_spider(spider, custom_settings)
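
The custom_settings dict presumably holds ordinary Scrapy setting overrides, for example (values are illustrative only):

custom_settings = {
    'USER_AGENT': 'my-spider (+http://example.com)',
    'DOWNLOAD_DELAY': 0.5,
    'LOG_LEVEL': 'INFO',
}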

Multi-spider script with standard crawl command line options:

# file: my-crawler.py
# imports omitted ...


class MySpider(BaseSpider):
    name = 'my_spider'
    # spider code ...


class OtherSpider(CrawlSpider):
    name = 'other_spider'
    # spider code ...


if __name__ == '__main__':
    from scrapy_boilerplate import run_crawler, SpiderManager
    custom_settings = {
        # ...
    }

    SpiderManager.register(MySpider)
    SpiderManager.register(OtherSpider)

    run_crawler(custom_settings)
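
Presumably the script is then invoked with a spider name and the usual crawl options (something like python my-crawler.py my_spider), but the exact command line interface is best checked against the scripts in the examples directory.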

Note

See the examples directory for working code examples.

Release History

0.2.1

0.2

0.1

Download Files

scrapy-boilerplate-0.2.1.tar.gz (5.0 kB), Source, uploaded Feb 4, 2013
