scrapy-toolbox
Saves Scrapy exceptions in your database.
A Python library that extends Scrapy with the following features:
- Error saving to the database table "__errors" for manual error analysis (including traceback and response) and automated request reconstruction. The table contains the following columns:
- failed_at
- spider
- traceback
- request_method
- request_url
- request_meta (json dump that can be loaded with json.loads())
- request_cookies (json dump that can be loaded with json.loads())
- request_headers (json dump that can be loaded with json.loads())
- request_body
- response_status
- response_url
- response_headers (json dump that can be loaded with json.loads())
- response_body
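The JSON-dump columns can be decoded back into Python structures with the standard library. A minimal sketch (the stored values shown here are illustrative, not taken from a real "__errors" row):

```python
import json

# Illustrative values as they might appear in the request_meta and
# request_headers columns of the "__errors" table.
request_meta = '{"depth": 2, "download_timeout": 180.0}'
request_headers = '{"User-Agent": ["Mozilla/5.0"], "Accept": ["text/html"]}'

meta = json.loads(request_meta)        # back to a dict
headers = json.loads(request_headers)  # Scrapy stores header values as lists

print(meta["depth"])             # 2
print(headers["User-Agent"][0])  # Mozilla/5.0
```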
- DatabasePipeline for SQLAlchemy
Prerequisites:
- The environment variable "PRODUCTION" must be set for production mode, for instance in your Dockerfile.
- The ErrorSavingMiddleware defines an errback callback for your requests. So if you want to make use of this feature, do not define your own errback.
Installation
pip install scrapy-toolbox
Setup
Add scrapy_toolbox.error_handling.ErrorSavingMiddleware
to the SPIDER_MIDDLEWARES in your Scrapy project's settings.py
and set your DATABASE and DATABASE_DEV.
Example when using a MySQL Database:
# settings.py
SPIDER_MIDDLEWARES = {
'scrapy_toolbox.error_handling.ErrorSavingMiddleware': 1000,
}
DATABASE = {
'drivername': 'mysql+pymysql',
'username': '...',
'password': '...',
'database': '...',
'host': '...',
'port': '3306'
}
DATABASE_DEV = {
'drivername': 'mysql+pymysql',
'username': '...',
'password': '...',
'database': '...',
'host': '127.0.0.1',
'port': '3306'
}
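A settings dict like the ones above maps onto an SQLAlchemy-style connection URL of the form drivername://username:password@host:port/database. A plain-Python sketch of that mapping (an illustrative helper, not part of scrapy-toolbox's actual API):

```python
def to_connection_url(db: dict) -> str:
    """Assemble a settings dict like DATABASE above into an
    SQLAlchemy-style connection URL (illustrative helper only)."""
    return (
        f"{db['drivername']}://{db['username']}:{db['password']}"
        f"@{db['host']}:{db['port']}/{db['database']}"
    )

# Example values; username/password/database are made up.
DATABASE_DEV = {
    'drivername': 'mysql+pymysql',
    'username': 'scraper',
    'password': 'secret',
    'database': 'cars',
    'host': '127.0.0.1',
    'port': '3306',
}

print(to_connection_url(DATABASE_DEV))
# mysql+pymysql://scraper:secret@127.0.0.1:3306/cars
```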
Usage
# pipelines.py
from scrapy_toolbox.database import DatabasePipeline
class ScraperXYZPipeline(DatabasePipeline):
def process_item(self, item, spider):
...
# models.py
import scrapy_toolbox.database as db
from sqlalchemy import Column, Integer, String

# Use db.DeclarativeBase as your declarative base
class Car(db.DeclarativeBase):
    __tablename__ = 'cars'
    # example columns
    id = Column(Integer, primary_key=True)
    name = Column(String(64))
Supported versions
This package works with Python 3. It has been tested with Scrapy up to version 1.4.0.
Tasklist
- [ ] Process errors from the database table "__errors" at a later time and re-execute the failed requests: for instance when the website was down, or when you got an exception during parsing for specific requests and want to crawl them again.
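Since this feature is still open, here is a rough sketch of how a stored error row could be turned back into request keyword arguments. Column names follow the "__errors" schema listed above; the row values and the helper function are illustrative, not part of the library:

```python
import json

# An illustrative row as it might be read from the "__errors" table
# (schema as listed above; the values are made up).
row = {
    "request_method": "GET",
    "request_url": "https://example.com/cars?page=3",
    "request_meta": '{"depth": 3}',
    "request_cookies": '{"session": "abc123"}',
    "request_headers": '{"User-Agent": ["Mozilla/5.0"]}',
    "request_body": "",
}

def request_kwargs(row: dict) -> dict:
    """Decode the JSON columns and collect the keyword arguments
    a scrapy.Request(**kwargs) call would need."""
    return {
        "method": row["request_method"],
        "url": row["request_url"],
        "meta": json.loads(row["request_meta"]),
        "cookies": json.loads(row["request_cookies"]),
        "headers": json.loads(row["request_headers"]),
        "body": row["request_body"],
    }

kwargs = request_kwargs(row)
print(kwargs["url"])  # https://example.com/cars?page=3
```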
Build release
python setup.py sdist bdist_wheel
twine upload dist/*