Small set of utilities to simplify writing Scrapy spiders.
Project description
scrapy-boilerplate is a small set of utilities for Scrapy to simplify writing low-complexity spiders that are very common in small and one-off projects.
It requires Scrapy (>= 0.16) and has been tested with Python 2.7. Additionally, PyQuery is required to run the scripts in the examples directory.
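The package is published on PyPI, so it can presumably be installed the usual way with pip install scrapy-boilerplate.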
Usage Guide
Items
Standard item definition:
    from scrapy.item import Item, Field

    class BaseItem(Item):
        url = Field()
        crawled = Field()

    class UserItem(BaseItem):
        name = Field()
        about = Field()
        location = Field()

    class StoryItem(BaseItem):
        title = Field()
        body = Field()
        user = Field()
Becomes:
    from scrapy_boilerplate import NewItem

    BaseItem = NewItem('url crawled')
    UserItem = NewItem('name about location', base_cls=BaseItem)
    StoryItem = NewItem('title body user', base_cls=BaseItem)
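Classes created with NewItem are regular Scrapy Item subclasses, so instances behave like any other item. A quick sketch (the field values are made up):

    item = UserItem(name='John Doe', location='Example City')
    item['about'] = 'Some bio text'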
BaseSpider
Standard spider definition:
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'my_spider'
        start_urls = ['http://example.com/latest']

        def parse(self, response):
            # do stuff
Becomes:
    from scrapy_boilerplate import NewSpider

    MySpider = NewSpider('my_spider')

    @MySpider.scrape('http://example.com/latest')
    def parse(spider, response):
        # do stuff
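Combining NewItem and NewSpider, a complete spider in this style might look like the sketch below. The XPath expression, item fields, and URL are illustrative assumptions, using the Scrapy 0.16-era selector API:

    from scrapy.selector import HtmlXPathSelector
    from scrapy_boilerplate import NewItem, NewSpider

    StoryItem = NewItem('title url')
    MySpider = NewSpider('my_spider')

    @MySpider.scrape('http://example.com/latest')
    def parse(spider, response):
        # yield one item per headline link on the page (XPath is illustrative)
        hxs = HtmlXPathSelector(response)
        for title in hxs.select('//h2/a/text()').extract():
            yield StoryItem(title=title, url=response.url)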
CrawlSpider
Standard crawl-spider definition:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MySpider(CrawlSpider):
        name = 'my_spider'
        start_urls = ['http://example.com']

        rules = (
            Rule(SgmlLinkExtractor(r'category\.php'), follow=True),
            Rule(SgmlLinkExtractor(r'item\.php'), callback='parse_item'),
        )

        def parse_item(self, response):
            # do stuff
Becomes:
    from scrapy_boilerplate import NewCrawlSpider

    MySpider = NewCrawlSpider('my_spider')
    MySpider.follow(r'category\.php')

    @MySpider.rule(r'item\.php')
    def parse_item(spider, response):
        # do stuff
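The decorated callback receives the spider and the response, so its body is ordinary Scrapy extraction code. A sketch of what parse_item might contain, assuming the callback behaves like a standard CrawlSpider callback (the XPath expressions and item fields are made up):

    from scrapy.selector import HtmlXPathSelector
    from scrapy_boilerplate import NewItem

    ProductItem = NewItem('title body url')

    @MySpider.rule(r'item\.php')
    def parse_item(spider, response):
        hxs = HtmlXPathSelector(response)
        yield ProductItem(
            title=''.join(hxs.select('//h1/text()').extract()),
            body=''.join(hxs.select('//div[@id="body"]//text()').extract()),
            url=response.url,
        )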
Running Helpers
Single-spider running script:
    # file: my-spider.py
    # imports omitted ...

    class MySpider(BaseSpider):
        # spider code ...

    if __name__ == '__main__':
        from scrapy_boilerplate import run_spider

        custom_settings = {
            # ...
        }
        spider = MySpider()
        run_spider(spider, custom_settings)
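Here custom_settings is an ordinary Scrapy settings dict; the keys below are standard Scrapy setting names, and the values are only illustrative:

    custom_settings = {
        'DOWNLOAD_DELAY': 2.0,  # seconds to wait between requests
        'USER_AGENT': 'my-crawler (+http://example.com)',
    }

The script is then executed directly, e.g. python my-spider.py.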
Multi-spider script with standard crawl command line options:
    # file: my-crawler.py
    # imports omitted ...

    class MySpider(BaseSpider):
        name = 'my_spider'
        # spider code ...

    class OtherSpider(CrawlSpider):
        name = 'other_spider'
        # spider code ...

    if __name__ == '__main__':
        from scrapy_boilerplate import run_crawler, SpiderManager

        custom_settings = {
            # ...
        }
        SpiderManager.register(MySpider)
        SpiderManager.register(OtherSpider)
        run_crawler(custom_settings)
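Since the script exposes the standard crawl command-line options, a registered spider would be selected by name at invocation, e.g. python my-crawler.py my_spider (the exact invocation is an assumption based on the stock crawl command).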
File details
Details for the file scrapy-boilerplate-0.2.1.tar.gz.
File metadata
- Download URL: scrapy-boilerplate-0.2.1.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 49860bcb31ad8d7516e9f322b89b34ae708b4f373b588477cbb65eee3a273078
MD5 | 9c484b430d39ae2298acfb275f9ebcf7
BLAKE2b-256 | 654f8901d1fc946dc6bc27e8abe472d298db36da18db3e63793938383499dd6a