Scrapy extension to store info in storage service
Project description
A Scrapy extension that stores request and response information in a storage service.
Installation
You can install scrapy-pagestorage using pip:
pip install scrapy-pagestorage
You can then enable the middleware in your settings.py:
SPIDER_MIDDLEWARES = {
    ...
    'scrapy_pagestorage.PageStorageMiddleware': 900
}
How to use it
Enable the extension through settings.py:
PAGE_STORAGE_ENABLED = True
PAGE_STORAGE_ON_ERROR_ENABLED = True
Configure the extension through settings.py:
PAGE_STORAGE_MODE = "VERSIONED_CACHE"
PAGE_STORAGE_LIMIT = 100
PAGE_STORAGE_ON_ERROR_LIMIT = 100
PAGE_STORAGE_TRIM_HTML = True
The extension is auto-enabled for Portia spiders (SHUB_SPIDER_TYPE=portia).
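The enabling rules above can be summarized in a small sketch. This is only an illustration under stated assumptions, not the extension's actual code: the function name `is_page_storage_active` is hypothetical, and settings are modeled as a plain dict rather than a Scrapy settings object.

```python
import os


def is_page_storage_active(settings, environ=os.environ):
    """Hypothetical helper: decide whether page storage would be active.

    `settings` is a plain dict standing in for the project settings;
    `environ` is a mapping of environment variables.
    """
    # Explicit switches from settings.py
    enabled = bool(settings.get("PAGE_STORAGE_ENABLED", False))
    on_error = bool(settings.get("PAGE_STORAGE_ON_ERROR_ENABLED", False))
    # Auto-enable for Portia spiders, as described above
    portia = environ.get("SHUB_SPIDER_TYPE") == "portia"
    return enabled or on_error or portia
```

For example, `is_page_storage_active({}, {"SHUB_SPIDER_TYPE": "portia"})` is true even though neither setting is present, while an empty settings dict with an empty environment gives false.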
Settings
PAGE_STORAGE_MODE
Default: None
A string that selects whether the extension stores information in the cache store or the versioned cache store. Set PAGE_STORAGE_MODE = "VERSIONED_CACHE" to use the versioned store.
PAGE_STORAGE_LIMIT
An integer that limits the number of visited pages to store.
PAGE_STORAGE_ON_ERROR_LIMIT
An integer that limits the number of error pages to store.
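To make the meaning of the two limits concrete, here is a minimal sketch of a counter that stops storing once a limit is reached. The class `PageBudget` and its method are hypothetical names used for illustration only; how the extension enforces its limits internally is an assumption here.

```python
class PageBudget:
    """Hypothetical counter that allows at most `limit` pages to be stored."""

    def __init__(self, limit):
        self.limit = limit
        self.stored = 0

    def try_store(self):
        # Refuse once the limit has been reached
        if self.stored >= self.limit:
            return False
        self.stored += 1
        return True


# With a limit of 2, the third page is not stored
budget = PageBudget(limit=2)
results = [budget.try_store() for _ in range(3)]
```

In this model, PAGE_STORAGE_LIMIT and PAGE_STORAGE_ON_ERROR_LIMIT would each correspond to a separate budget, one for visited pages and one for error pages.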
PAGE_STORAGE_TRIM_HTML
Default: False
Remove whitespace from the start and end of the HTML to reduce file size.
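As a rough illustration of what trimming means, the sketch below strips leading and trailing whitespace from an HTML string. The function name `trim_html` is hypothetical, and using `str.strip()` is an assumption about the behaviour, not a description of the extension's implementation.

```python
def trim_html(html: str) -> str:
    """Hypothetical helper: drop whitespace at the start and end only."""
    return html.strip()


raw = "\n\n  <html><body>Hello</body></html>  \n"
trimmed = trim_html(raw)
```

Whitespace inside the document is left untouched; only the padding around it is removed.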
Download files
Hashes for scrapy_pagestorage-0.3.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 68fe66c2153aa9b2c85d0d26a160a91fc491b8bde19fc9568691e1d94e710e70
MD5 | a7f0e4523af5823b45c072dd3f96a4f6
BLAKE2b-256 | b10803e35ae0a8fb011cea49a01c9a7ddadf0ac7be45796ece5ba511532d7d1d