
Scrapy extension for database ingestion with job/spider tracking


Scrapy Item Ingest


A Scrapy extension for ingesting scraped items, requests, and logs into PostgreSQL, with job and spider tracking. It provides a clean, production-ready way to store and monitor your crawling operations, combining real-time data ingestion with comprehensive logging.

Documentation

Full documentation is available at: https://scrapy-item-ingest.readthedocs.io/en/latest/

Key Features

  • 🔄 Real-time Data Ingestion: Store items, requests, and logs as they're processed
  • 📊 Request Tracking: Track request response times, fingerprints, and parent-child relationships
  • 🔍 Comprehensive Logging: Capture spider events, errors, and custom messages
  • 🏗️ Flexible Schema: Support for both auto-creation and existing table modes
  • ⚙️ Modular Design: Use individual components or the complete pipeline
  • 🛡️ Production Ready: Handles both development and production scenarios
  • 📝 JSONB Storage: Store complex item data as JSONB for flexible querying
  • 🐳 Docker Support: Complete containerization with Docker and Kubernetes
  • 📈 Performance Optimized: Connection pooling and batch processing
  • 🔧 Easy Configuration: Environment-based configuration with validation
  • 📊 Monitoring Ready: Built-in metrics and health checks

Installation

pip install scrapy-item-ingest
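After installing, the extension is enabled from your project's settings.py. The sketch below is illustrative only: the pipeline path (DbInsertPipeline) and the DB_URL setting name are assumptions, not verified against the library's actual API; consult the documentation linked above for the real identifiers.

```python
# settings.py — hypothetical quickstart; the pipeline path and setting
# names below are assumptions, not the library's confirmed API.
ITEM_PIPELINES = {
    # assumed pipeline entry point for database ingestion
    "scrapy_item_ingest.DbInsertPipeline": 300,
}

# assumed PostgreSQL connection setting
DB_URL = "postgresql://user:password@localhost:5432/scrapy_data"
```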

Development

Setting up for Development

git clone https://github.com/fawadss1/scrapy_item_ingest.git
cd scrapy_item_ingest
pip install -e ".[dev]"

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For support and questions, please open an issue on the GitHub repository.

Changelog

v0.2.0 (2025-11-11) — Current

  • Database connection: automatic DSN normalization to safely handle special characters in credentials (e.g., @, $) without modifying your settings
  • Unified DB access across pipelines and extensions via DatabaseConnection (singleton) with connect/execute/commit/rollback/close
  • Logging extension overhaul:
    • Capture Scrapy default (framework) logs in addition to spider logs
    • Attach the DB handler only to the spider logger and the top-level scrapy logger, avoiding duplicate records from log propagation
    • Console-like formatting using LOG_FORMAT and LOG_DATEFORMAT
    • Fine-grained filtering: allowlist by logger namespaces plus exclusions by logger and message substrings
    • Built-in de-duplication to suppress repeated lines within a small time window
    • Error throttling to stop DB logging after the first write failure (prevents spam)
  • Schema consistency: logs table consistently uses level column (not type)
  • Backwards compatibility: DatabaseConnection remains alias to DBConnection
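The DSN normalization described above addresses a common failure: reserved characters such as @ or $ in a password break URL-style connection strings unless they are percent-encoded. The library handles this automatically; the helper below is only a sketch of the underlying idea, not its actual API.

```python
# Sketch of DSN normalization: percent-encode credentials so reserved
# characters ('@', '$', ':') can appear safely in a connection URL.
# Illustrative only — normalize_dsn is a hypothetical helper, not part
# of scrapy-item-ingest's public API.
from urllib.parse import quote

def normalize_dsn(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a PostgreSQL DSN with percent-encoded credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{db}"
    )

print(normalize_dsn("admin", "p@ss$word", "localhost", 5432, "scrapy"))
# p@ss$word is encoded as p%40ss%24word in the resulting URL
```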

New optional settings:

  • LOG_DB_LEVEL (default: DEBUG) — minimum level stored in DB
  • LOG_DB_CAPTURE_LEVEL (default: same as LOG_DB_LEVEL) — capture level for attached loggers (DB only; does not affect console)
  • LOG_DB_LOGGERS — additional allowed logger prefixes (defaults always include [spider.name, 'scrapy'])
  • LOG_DB_EXCLUDE_LOGGERS (default: ['scrapy.core.scraper'])
  • LOG_DB_EXCLUDE_PATTERNS (default: ['Scraped from <'])
  • LOG_DB_BATCH_SIZE — batch size for DB inserts
  • LOG_DB_DEDUP_TTL — seconds to suppress duplicate messages
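Putting the options above together, a settings.py fragment might look like the following. The setting names come from the changelog; the values shown are illustrative, not recommendations.

```python
# settings.py — logging-extension options listed above (values illustrative)
LOG_DB_LEVEL = "INFO"                  # minimum level stored in the DB
LOG_DB_CAPTURE_LEVEL = "DEBUG"         # capture level for attached loggers
LOG_DB_LOGGERS = ["myproject"]         # extra allowed logger prefixes
LOG_DB_EXCLUDE_LOGGERS = ["scrapy.core.scraper"]
LOG_DB_EXCLUDE_PATTERNS = ["Scraped from <"]
LOG_DB_BATCH_SIZE = 50                 # batch size for DB inserts
LOG_DB_DEDUP_TTL = 5                   # seconds to suppress duplicate lines
```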

v0.1.2

  • Initial release
  • Core pipeline functionality for items, requests, and logs
  • PostgreSQL database integration with JSONB storage
  • Comprehensive documentation and examples
  • Production deployment guides
  • Docker and Kubernetes support
