Saves Scrapy exceptions in your Database

scrapy-toolbox

A Python library that extends Scrapy with the following features:

  • Error saving to the database table "__errors" for manual error analysis (including traceback and response) and for automated request reconstruction. The table contains the following columns:
    • failed_at
    • spider
    • traceback
    • request_method
    • request_url
    • request_meta (json dump that can be loaded with json.loads())
    • request_cookies (json dump that can be loaded with json.loads())
    • request_headers (json dump that can be loaded with json.loads())
    • request_body
    • response_status
    • response_url
    • response_headers (json dump that can be loaded with json.loads())
    • response_body
  • Error Processing with request reconstruction
  • DatabasePipeline for SQLAlchemy
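The JSON-dump columns listed above decode back into Python structures with json.loads(). A minimal sketch of analysing one error row (the row dict below is illustrative; how you fetch rows from "__errors" depends on your database access):

```python
import json

# Illustrative error row as it might come back from the "__errors" table;
# the values here are made up for this sketch.
error_row = {
    "request_url": "https://example.com/page",
    "request_method": "GET",
    "request_meta": '{"retry_times": 2}',
    "request_headers": '{"User-Agent": ["Mozilla/5.0"]}',
    "request_cookies": '{"session": "abc"}',
}

# Each json-dump column is a string holding a JSON document.
meta = json.loads(error_row["request_meta"])
headers = json.loads(error_row["request_headers"])
cookies = json.loads(error_row["request_cookies"])

print(meta["retry_times"])  # 2
```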

Prerequisites:

  • Set the environment variable "PRODUCTION" to enable production mode, for instance in your Dockerfile.
  • The ErrorSavingMiddleware defines an errback callback for your Requests. If you want to use this feature, do not define your own errback.
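How the PRODUCTION flag presumably selects between the two database configurations can be sketched like this (select_database_settings is a hypothetical helper for illustration, not part of the library's API):

```python
import os

def select_database_settings(settings: dict) -> dict:
    """Return DATABASE when PRODUCTION is set, DATABASE_DEV otherwise (sketch)."""
    key = "DATABASE" if os.environ.get("PRODUCTION") else "DATABASE_DEV"
    return settings[key]

settings = {
    "DATABASE": {"host": "db.internal"},
    "DATABASE_DEV": {"host": "127.0.0.1"},
}
print(select_database_settings(settings)["host"])  # "127.0.0.1" while PRODUCTION is unset
```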

Installation

pip install --upgrade scrapy-toolbox

Setup

Add the scrapy_toolbox middlewares to your Scrapy project's settings.py and set your DATABASE and DATABASE_DEV connection settings.

# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_toolbox.database.DatabasePipeline': 999,
    'scrapy_toolbox.error_handling.ErrorSavingMiddleware': 1000,
    'scrapy_toolbox.error_processing.ErrorProcessingMiddleware': 1000,
}

# Example when using MySQL
DATABASE = {
  'drivername': 'mysql+pymysql', 
  'username': '...',
  'password': '...',
  'database': '...',
  'host': '...',
  'port': '3306'
}

DATABASE_DEV = {
    'drivername': 'mysql+pymysql',
    'username': '...',
    'password': '...',
    'database': '...',
    'host': '127.0.0.1',
    'port': '3306'
}
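Such a settings dict maps one-to-one onto a SQLAlchemy connection URL; the pipeline presumably assembles something equivalent to the following (a plain-string sketch, not the library's actual code):

```python
def database_url(cfg: dict) -> str:
    """Build a SQLAlchemy-style connection URL from a DATABASE dict (sketch)."""
    return (
        f"{cfg['drivername']}://{cfg['username']}:{cfg['password']}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['database']}"
    )

cfg = {
    "drivername": "mysql+pymysql",
    "username": "scraper",
    "password": "secret",
    "database": "scrapydb",
    "host": "127.0.0.1",
    "port": "3306",
}
print(database_url(cfg))  # mysql+pymysql://scraper:secret@127.0.0.1:3306/scrapydb
```

In real code, prefer sqlalchemy.engine.URL.create(), which escapes special characters in usernames and passwords; the f-string above does not.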

Usage

Database Pipeline:

# pipelines.py
from scrapy_toolbox.database import DatabasePipeline

class ScraperXYZPipeline(DatabasePipeline):
  def process_item(self, item, spider):
      ...

# models.py
import scrapy_toolbox.database as db

# then use db.DeclarativeBase as your declarative base
class Car(db.DeclarativeBase):
  ...
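A model built on such a declarative base is an ordinary SQLAlchemy model. A hypothetical Car model, using a plain SQLAlchemy declarative base as a stand-in for db.DeclarativeBase (the table and column names are illustrative):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

# Stand-in for db.DeclarativeBase so this sketch is self-contained;
# in your project, inherit from db.DeclarativeBase instead.
DeclarativeBase = declarative_base()

class Car(DeclarativeBase):
    __tablename__ = "cars"
    id = Column(Integer, primary_key=True)
    make = Column(String(64))
    model_name = Column(String(64))
```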

Query Data:

# spiderXYZ.py
session = self.crawler.database_session
session.query(models.Market.id, models.Market.zip_code).all()

Process Errors:

scrapy crawl spider_xyz -a process_errors=True
#scrapy-toolbox spider_xyz

Supported versions

This package works with Python 3. It has been tested with Scrapy up to version 1.4.0.

Tasklist

  • [ ] Process errors from your database table "__errors" at a later time and re-execute the failed requests: for instance when the website was down, or when specific requests raised an exception during parsing and you want to crawl them again.

Build Release

python setup.py sdist bdist_wheel
twine upload dist/*
