
Saves Scrapy exceptions in your Database

Project description

scrapy-toolbox

A Python library that extends Scrapy with the following features:

  • Support for Google App Engine (GAE): Bypass Google App Engine's 24-hour execution time limit (https://cloud.google.com/appengine/docs/standard/go/how-instances-are-managed#scaling_types) by dividing the start_urls into x parts
  • Error saving to the database table "errors" for manual error analysis (incl. traceback and response) and automated request reconstruction. The table contains the following columns (see the query sketch after this list):
    • failed_at
    • spider
    • traceback
    • request_method
    • request_url
    • request_meta (json dump that can be loaded with json.loads())
    • request_cookies (json dump that can be loaded with json.loads())
    • request_headers (json dump that can be loaded with json.loads())
    • request_body
    • response_status
    • response_url
    • response_headers (json dump that can be loaded with json.loads())
    • response_body
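
Because the request columns are stored as JSON dumps, a failed request can be reconstructed by hand. Below is a minimal sketch, assuming the "errors" table and columns listed above and a SQLAlchemy connection; the connection URL and the spider name are placeholders, not part of scrapy-toolbox itself.

# errors_query.py -- minimal sketch of manual error analysis; connection URL
# and spider name are placeholders, adjust them to your own setup.
import json

import scrapy
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///scraping.db")  # placeholder, match your DATABASE settings

with engine.connect() as conn:
    rows = conn.execute(text("SELECT * FROM errors WHERE spider = :spider"), {"spider": "test"})
    for row in rows.mappings():
        # The JSON-dumped columns can be restored with json.loads().
        meta = json.loads(row["request_meta"])
        headers = json.loads(row["request_headers"])
        cookies = json.loads(row["request_cookies"])

        # Reconstruct the failed request, e.g. to retry it in a later crawl.
        request = scrapy.Request(
            url=row["request_url"],
            method=row["request_method"],
            meta=meta,
            headers=headers,
            cookies=cookies,
            body=row["request_body"],
        )
        print(row["failed_at"], row["response_status"], request)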

Prerequisites:

  • a settings.py containing a DATABASE_DEV and a DATABASE dict with your database connection settings (see the sketch below)
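
The README does not spell out the exact keys these dicts must contain, so the following is only a sketch; the keys shown are an assumption modeled on typical SQLAlchemy connection parameters.

# settings.py -- minimal sketch; the exact keys expected by scrapy-toolbox are
# an assumption here, modeled on typical SQLAlchemy connection parameters.
DATABASE_DEV = {
    'drivername': 'sqlite',
    'database': 'scraping_dev.db',
}

DATABASE = {
    'drivername': 'postgresql',
    'host': 'localhost',
    'port': '5432',
    'username': 'scraper',
    'password': 'changeme',
    'database': 'scraping',
}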

Installation

pip install scrapy-toolbox

Setup

Add the scrapy_toolbox.error_handling.ErrorSavingMiddleware and scrapy_toolbox.google_app_engine_support.GaePartCalcMiddleware spider middlewares to the SPIDER_MIDDLEWARES setting in your Scrapy project's settings.py.

Example:

# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_toolbox.error_handling.ErrorSavingMiddleware': 1000,
    'scrapy_toolbox.google_app_engine_support.GaePartCalcMiddleware': 1000,
}

Usage

  • The ErrorSavingMiddleware assigns its own errback callback to your requests. If you want to make use of this feature, do not define an errback yourself.
  • To split the start_urls into x parts, start your spider with the two arguments part and number_of_parts, where part is the part to execute during this run and number_of_parts is the total number of parts. For example, to split your start_urls into 3 parts (see also the spider sketch after these commands):
  scrapy crawl test -a part=1 -a number_of_parts=3
  scrapy crawl test -a part=2 -a number_of_parts=3
  scrapy crawl test -a part=3 -a number_of_parts=3
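
For orientation, here is a minimal spider sketch that works with both features. It is an illustration, not part of scrapy-toolbox: the spider simply defines no errback of its own (so the ErrorSavingMiddleware can attach its error-saving callback), and the part and number_of_parts values passed with -a become spider attributes that the GaePartCalcMiddleware can use.

# spiders/test.py -- minimal sketch of a spider used with scrapy-toolbox.
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # The GaePartCalcMiddleware splits these start_urls according to the
    # part / number_of_parts arguments passed on the command line above.
    start_urls = ['https://example.com/page/%d' % i for i in range(1, 100)]

    def parse(self, response):
        # No errback is defined anywhere, so exceptions raised during parsing
        # and failed requests end up in the "errors" table.
        yield {'url': response.url, 'title': response.css('title::text').get()}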

Supported versions

This package works with Python 3. It has been tested with Scrapy up to version 1.4.0.

Tasklist

  • [ ] Process errors from the database table "errors" at a later time and re-execute the failed requests, for instance when the website was down or you got an exception during parsing for specific requests and want to crawl them again
  • [ ] Automatic part calculation and saving in the database?

Build Release

python setup.py sdist
twine upload dist/*



Download files


Source Distribution

scrapy-toolbox-0.0.5.tar.gz (4.5 kB)
