Scrapy extension for database ingestion with job/spider tracking

Scrapy Item Ingest

Scrapy Item Ingest is a Scrapy extension that writes scraped items, requests, and logs to a PostgreSQL database as they are processed, and tracks them per job and spider. It is intended for storing and monitoring Scrapy crawls in both development and production.

Documentation

Full documentation is available at: https://scrapy-item-ingest.readthedocs.io/en/latest/

Key Features

  • 🔄 Real-time Data Ingestion: Store items, requests, and logs as they're processed
  • 📊 Request Tracking: Track request response times, fingerprints, and parent-child relationships
  • 🔍 Comprehensive Logging: Capture spider events, errors, and custom messages
  • 🏗️ Flexible Schema: Support for both auto-creation and existing table modes
  • ⚙️ Modular Design: Use individual components or the complete pipeline
  • 🛡️ Production Ready: Handles both development and production scenarios
  • 📝 JSONB Storage: Store complex item data as JSONB for flexible querying
  • 🐳 Docker Support: Complete containerization with Docker and Kubernetes
  • 📈 Performance Optimized: Connection pooling and batch processing
  • 🔧 Easy Configuration: Environment-based configuration with validation
  • 📊 Monitoring Ready: Built-in metrics and health checks
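To make the JSONB storage and batch processing features more concrete, here is a minimal sketch that uses only the Python standard library. It is not this package's implementation: the table name job_items, its columns, and the helpers build_row and BatchBuffer are hypothetical names chosen for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical table and column names, for illustration only.
INSERT_SQL = (
    "INSERT INTO job_items (job_id, spider, item, created_at) "
    "VALUES (%s, %s, %s::jsonb, %s)"
)


def build_row(job_id, spider_name, item):
    """Turn one scraped item into a parameter tuple for INSERT_SQL."""
    # default=str keeps non-JSON types (dates, Decimals) from raising.
    payload = json.dumps(dict(item), default=str, sort_keys=True)
    return (job_id, spider_name, payload, datetime.now(timezone.utc))


class BatchBuffer:
    """Collect rows and pass them to flush_fn in fixed-size batches."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        # Call this once more when the spider closes to write the remainder.
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
```

With a PostgreSQL driver such as psycopg2, flush_fn could wrap cursor.executemany(INSERT_SQL, rows). Tagging every row with a job ID and spider name is what allows items to be grouped per crawl later.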

Installation

pip install scrapy-item-ingest
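
After installing, one way to confirm that the distribution is visible to the current Python environment is to query its metadata. This sketch uses only the standard library (Python 3.8+); the helper name installed_version is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="scrapy-item-ingest"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "scrapy-item-ingest is not installed")
```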

Development

Setting up for Development

git clone https://github.com/fawadss1/scrapy_item_ingest.git
cd scrapy_item_ingest
pip install -e ".[dev]"

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

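For support and questions, see the documentation linked above or the project's GitHub repository: https://github.com/fawadss1/scrapy_item_ingest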

Changelog

v0.1.1 (Current)

  • Initial release
  • Core pipeline functionality for items, requests, and logs
  • PostgreSQL database integration with JSONB storage
  • Comprehensive documentation and examples
  • Production deployment guides
  • Docker and Kubernetes support
