Small set of utilities to simplify writing Scrapy spiders.

scrapy-boilerplate is a small set of utilities for Scrapy to simplify writing low-complexity spiders that are very common in small and one-off projects.

It requires Scrapy (>= 0.16) and has been tested with Python 2.7. Additionally, PyQuery is required to run the scripts in the examples directory.

Usage Guide

Items

Standard item definition:

from scrapy.item import Item, Field

class BaseItem(Item):
    url = Field()
    crawled = Field()

class UserItem(BaseItem):
    name = Field()
    about = Field()
    location = Field()

class StoryItem(BaseItem):
    title = Field()
    body = Field()
    user = Field()

Becomes:

from scrapy_boilerplate import NewItem

BaseItem = NewItem('url crawled')

UserItem = NewItem('name about location', base_cls=BaseItem)

StoryItem = NewItem('title body user', base_cls=BaseItem)
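
Since the generated classes are equivalent to the hand-written Item subclasses above, instances should behave like any other Scrapy item; a minimal sketch using the classes just defined (the field values are placeholders):

story = StoryItem(title='A title', body='Some text')
story['url'] = 'http://example.com/story/1'
story['user'] = UserItem(name='someone', location='somewhere')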

BaseSpider

Standard spider definition:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    start_urls = ['http://example.com/latest']

    def parse(self, response):
        # do stuff

Becomes:

from scrapy_boilerplate import NewSpider

MySpider = NewSpider('my_spider')

@MySpider.scrape('http://example.com/latest')
def parse(spider, response):
    # do stuff
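
The decorated callback receives the spider as its first argument and, presumably, can yield items and requests like a regular parse method. A minimal self-contained sketch combining this with the items from the previous section, using the Scrapy 0.16-era HtmlXPathSelector API (the XPath and URL are placeholders):

from scrapy.selector import HtmlXPathSelector
from scrapy_boilerplate import NewItem, NewSpider

BaseItem = NewItem('url crawled')
UserItem = NewItem('name about location', base_cls=BaseItem)

MySpider = NewSpider('my_spider')

@MySpider.scrape('http://example.com/latest')
def parse(spider, response):
    # placeholder extraction logic
    hxs = HtmlXPathSelector(response)
    for name in hxs.select('//div[@class="user"]/a/text()').extract():
        yield UserItem(name=name, url=response.url)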

CrawlSpider

Standard crawl-spider definition:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    rules = (
        Rule(SgmlLinkExtractor(r'category\.php'), follow=True),
        Rule(SgmlLinkExtractor(r'item\.php'), callback='parse_item'),
    )

    def parse_item(self, response):
        # do stuff

Becomes:

from scrapy_boilerplate import NewCrawlSpider

MySpider = NewCrawlSpider('my_spider')
MySpider.follow(r'category\.php')

@MySpider.rule(r'item\.php')
def parse_item(spider, response):
    # do stuff
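
As in the BaseSpider example, the rule callback takes the spider and the response, so parse_item can be sketched the same way, here yielding the StoryItem from the Items section (imports and XPaths as in the previous sketch, purely illustrative):

@MySpider.rule(r'item\.php')
def parse_item(spider, response):
    hxs = HtmlXPathSelector(response)
    yield StoryItem(
        title=hxs.select('//h1/text()').extract()[0],
        body='\n'.join(hxs.select('//div[@class="body"]//text()').extract()),
        url=response.url,
    )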

Running Helpers

Single-spider running script:

# file: my-spider.py
# imports omitted ...

class MySpider(BaseSpider):
    # spider code ...

if __name__ == '__main__':
    from scrapy_boilerplate import run_spider
    custom_settings = {
        # ...
    }
    spider = MySpider()

    run_spider(spider, custom_settings)
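
custom_settings here appears to be a plain dict of setting overrides; purely as an illustration, it might hold standard Scrapy settings such as:

custom_settings = {
    'USER_AGENT': 'my-crawler (+http://example.com)',
    'DOWNLOAD_DELAY': 0.5,
    'LOG_LEVEL': 'INFO',
}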

Multi-spider script with standard crawl command line options:

# file: my-crawler.py
# imports omitted ...


class MySpider(BaseSpider):
    name = 'my_spider'
    # spider code ...


class OtherSpider(CrawlSpider):
    name = 'other_spider'
    # spider code ...


if __name__ == '__main__':
    from scrapy_boilerplate import run_crawler, SpiderManager
    custom_settings = {
        # ...
    }

    SpiderManager.register(MySpider)
    SpiderManager.register(OtherSpider)

    run_crawler(custom_settings)
