Error Handling and Processing for your Scrapy Exceptions

scrapy-toolbox

A Python library that extends Scrapy with the following features:

  • Error saving to the database table "__errors" for manual error analysis (incl. traceback and response) and automated request reconstruction (see the query sketch after this list). The table contains the following columns:
    • failed_at
    • spider
    • traceback
    • url (original url)
    • request_method
    • request_url
    • request_meta (json dump that can be loaded with json.loads())
    • request_cookies (json dump that can be loaded with json.loads())
    • request_headers (json dump that can be loaded with json.loads())
    • request_body
    • response_status
    • response_url
    • response_headers (json dump that can be loaded with json.loads())
    • response_body
  • Error Processing with request reconstruction
  • DatabasePipeline for SQLAlchemy
  • Mapper to automatically map a scrapy.Item onto a database object
  • Mail notification when an exception occurs (HTTP errors such as 404 or 502 are excluded and only stored in the database)
  • Automatic GitHub issue creation when an exception occurs (HTTP errors such as 404 or 502 are excluded and only stored in the database)
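
For manual analysis, the JSON columns can be parsed with json.loads(). A minimal sketch of querying the table, assuming the column names above; the engine URL, credentials, and database name are illustrative placeholders:

# Sketch: inspecting stored errors (connection details are assumptions)
import json

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@127.0.0.1:3306/scraper")

with engine.connect() as conn:
    rows = conn.execute(text("SELECT spider, request_url, request_meta FROM `__errors`"))
    for spider, request_url, request_meta in rows:
        meta = json.loads(request_meta)  # stored as a JSON dump
        print(spider, request_url, meta)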

Prerequisites:

  • Environment variable "PRODUCTION" to enable production mode, set for instance in your Dockerfile (see the example after this list)
  • The ErrorSavingMiddleware defines an errback callback for your requests. If you want to use this feature, do not define any errback yourself.
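
For instance, in a Dockerfile the variable could be set as follows (the value 1 is an illustrative assumption):

# Dockerfile
ENV PRODUCTION=1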

Installation

pip install --upgrade scrapy-toolbox

Example Project

You can find an example project here.

Setup

Add the scrapy_toolbox middlewares to your Scrapy project's settings.py and configure your DATABASE and DATABASE_DEV connections.

# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_toolbox.database.DatabasePipeline': 999,
    'scrapy_toolbox.error_handling.ErrorSavingMiddleware': 1000,
    'scrapy_toolbox.error_processing.ErrorProcessingMiddleware': 1000,
}

# Example when using MySQL
DATABASE = {
  'drivername': 'mysql+pymysql', 
  'username': '...',
  'password': '...',
  'database': '...',
  'host': '...',
  'port': '3306'
}

DATABASE_DEV = {
    'drivername': 'mysql+pymysql',
    'username': '...',
    'password': '...',
    'database': '...',
    'host': '127.0.0.1',
    'port': '3306'
}

CREATE_GITHUB_ISSUE = True # Toggle GitHub Issue creation
GITHUB_TOKEN = "..."
GITHUB_REPO = "janwendt/scrapy-toolbox" # for instance

SEND_MAILS = True # Toggle Mail Notification
MAIL_HOST = "..."
MAIL_FROM = "..."
MAIL_TO = "..."

Usage

Spider (ErrorCatcher must be imported before scrapy!):

from scrapy_toolbox.error_handling import ErrorCatcher
import scrapy
...

class XyzSpider(scrapy.Spider, metaclass=ErrorCatcher):
...
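
A minimal self-contained spider could look like this (the spider name, start URL, and parse logic are illustrative assumptions):

from scrapy_toolbox.error_handling import ErrorCatcher
import scrapy

class XyzSpider(scrapy.Spider, metaclass=ErrorCatcher):
    name = "xyz"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Exceptions raised here are caught and stored in the "__errors" table.
        yield {"title": response.css("title::text").get()}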

Database Pipeline:

# pipelines.py
from scrapy_toolbox.database import DatabasePipeline
import xy.items as items
import xy.model as model

class ScraperXYZPipeline(DatabasePipeline):
    def __init__(self, settings):
        super().__init__(settings, items, model)

# models.py
import scrapy_toolbox.database as db
from sqlalchemy import Column, Integer, String

# Use db.DeclarativeBase as your declarative base.
class Car(db.DeclarativeBase):
    __tablename__ = "car"  # illustrative table name
    id = Column(Integer, primary_key=True)
    name = Column(String(64))  # illustrative column

Query Data:

# spiderXYZ.py
session = self.crawler.database_session
session.query(models.Market.id, models.Market.zip_code).all()
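
For context, a sketch of the same query inside a spider callback (the Market model and the logging are illustrative assumptions):

# spiderXYZ.py (sketch)
def parse(self, response):
    session = self.crawler.database_session
    for market_id, zip_code in session.query(models.Market.id, models.Market.zip_code).all():
        self.logger.info("Market %s has zip code %s", market_id, zip_code)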

Process Errors:

Reconstruct the failed requests stored in the "__errors" table and schedule them again:

scrapy crawl spider_xyz -a process_errors=True

Limitations

Syntax errors in your settings.py are not handled.

Supported versions

This package works with Python 3. It has been tested with Scrapy up to version 1.4.0.

Tasklist

  • [ ] Error Processing
  • [ ] Scaffolding, for instance an ItemPipeline

Build Release

python setup.py sdist bdist_wheel
cd dist
pip install --upgrade --no-deps --force-reinstall scrapy_toolbox-0.3.3-py3-none-any.whl
cd ..
twine upload dist/*
