
Scrapy extension for database ingestion with job/spider tracking


Scrapy Item Ingest

A tiny, straightforward addon for Scrapy that saves your items, requests, and logs to PostgreSQL. No boilerplate, no ceremony.

Install

pip install scrapy-item-ingest

Minimal setup (settings.py)

ITEM_PIPELINES = {
    'scrapy_item_ingest.DbInsertPipeline': 300,
}

EXTENSIONS = {
    'scrapy_item_ingest.LoggingExtension': 500,
}

# Pick ONE of the two database config styles:
DB_URL = "postgresql://user:password@localhost:5432/database"
# Or use discrete fields (avoids URL encoding):
# DB_HOST = "localhost"
# DB_PORT = 5432
# DB_USER = "user"
# DB_PASSWORD = "password"
# DB_NAME = "database"

# Optional
CREATE_TABLES = True     # auto-create tables on first run (default True)
JOB_ID = 1               # or omit; spider name will be used

Run your spider:

scrapy crawl your_spider
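
No changes to your spider code are needed. For reference, a minimal spider like the sketch below (the spider name and target site are placeholders) is all the pipeline needs; each item it yields is handled by DbInsertPipeline and, with CREATE_TABLES = True, written to the auto-created job_items table.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each yielded dict is passed to DbInsertPipeline and stored as an item row.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }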

Troubleshooting

  • Password has special characters like @ or $?
    • In a URL, percent-encode them: @ -> %40, $ -> %24 (see the sketch below).
    • Example: postgresql://user:PAK%40swat1%24@localhost:5432/db
    • Or use the discrete fields (no encoding needed).
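
If you prefer to build the URL programmatically, here is a small sketch (the password is the placeholder from the example above, not a real credential):

from urllib.parse import quote_plus

password = "PAK@swat1$"
DB_URL = f"postgresql://user:{quote_plus(password)}@localhost:5432/db"
# -> postgresql://user:PAK%40swat1%24@localhost:5432/db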

Useful settings (optional)

  • LOG_DB_LEVEL (default: DEBUG) — minimum level stored in DB
  • LOG_DB_CAPTURE_LEVEL — capture level for Scrapy loggers routed to DB (does not affect console)
  • LOG_DB_LOGGERS — allowed logger prefixes (defaults always include [spider.name, 'scrapy'])
  • LOG_DB_EXCLUDE_LOGGERS (default: ['scrapy.core.scraper'])
  • LOG_DB_EXCLUDE_PATTERNS (default: ['Scraped from <'])
  • CREATE_TABLES (default: True) — create job_items, job_requests, job_logs on startup
  • ITEMS_TABLE, REQUESTS_TABLE, LOGS_TABLE — override table names (see the example below)
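
As a rough illustration (values are examples, not recommendations; the setting names are the ones listed above), a settings.py that keeps DB log volume down and renames the tables might look like:

LOG_DB_LEVEL = "WARNING"                          # store only WARNING and above in the DB
LOG_DB_CAPTURE_LEVEL = "INFO"                     # capture level for loggers routed to the DB
LOG_DB_EXCLUDE_LOGGERS = ["scrapy.core.scraper"]  # same as the default
LOG_DB_EXCLUDE_PATTERNS = ["Scraped from <"]      # drop noisy per-item log lines

CREATE_TABLES = True
ITEMS_TABLE = "my_items"
REQUESTS_TABLE = "my_requests"
LOGS_TABLE = "my_logs"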

License

MIT License. See LICENSE.

