Scrapy extension for database ingestion with job/spider tracking

Project description

Scrapy Item Ingest

A tiny, straightforward add-on for Scrapy that saves your items, requests, and logs to PostgreSQL. No boilerplate, no ceremony.

Install

pip install scrapy-item-ingest

Minimal setup (settings.py)

ITEM_PIPELINES = {
    'scrapy_item_ingest.DbInsertPipeline': 300,
}

EXTENSIONS = {
    'scrapy_item_ingest.LoggingExtension': 500,
}

# Pick ONE of the two database config styles:
DB_URL = "postgresql://user:password@localhost:5432/database"
# Or use discrete fields (avoids URL encoding):
# DB_HOST = "localhost"
# DB_PORT = 5432
# DB_USER = "user"
# DB_PASSWORD = "password"
# DB_NAME = "database"

# Optional
CREATE_TABLES = True     # auto-create tables on first run (default True)
JOB_ID = 1               # or omit; the spider name is used instead
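
With the pipeline enabled, spiders need no changes. A minimal sketch (the spider name, URL, and fields here are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "your_spider"                     # matches the crawl command below
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Each yielded dict (or Item) is inserted by DbInsertPipeline
        yield {"url": response.url, "title": response.css("title::text").get()}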

Run your spider:

scrapy crawl your_spider
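
To confirm rows arrived, count them directly — a sketch assuming the psycopg2 driver is installed and the default table names are in use:

import psycopg2

# Use the same connection details as DB_URL in settings.py
conn = psycopg2.connect("postgresql://user:password@localhost:5432/database")
with conn, conn.cursor() as cur:
    # Table names come from a fixed tuple, so f-string interpolation is safe here
    for table in ("job_items", "job_requests", "job_logs"):
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        print(table, cur.fetchone()[0])
conn.close()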

Troubleshooting

  • Password has special characters like @ or $?
    • In a URL, encode them: @ -> %40, $ -> %24 (see the snippet below).
    • Example: a raw password of p@ssw0rd$ becomes postgresql://user:p%40ssw0rd%24@localhost:5432/db
    • Or use the discrete fields (no encoding needed).
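
Encoding can also be done programmatically with the standard library, so the raw password never needs hand-editing:

from urllib.parse import quote

password = "p@ssw0rd$"              # raw password with special characters
encoded = quote(password, safe="")  # -> "p%40ssw0rd%24"
DB_URL = f"postgresql://user:{encoded}@localhost:5432/database"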

Useful settings (optional)

  • LOG_DB_LEVEL (default: DEBUG) — minimum level stored in DB
  • LOG_DB_CAPTURE_LEVEL — capture level for Scrapy loggers routed to DB (does not affect console)
  • LOG_DB_LOGGERS — allowed logger-name prefixes (defaults always include the spider's name and 'scrapy')
  • LOG_DB_EXCLUDE_LOGGERS (default: ['scrapy.core.scraper'])
  • LOG_DB_EXCLUDE_PATTERNS (default: ['Scraped from <'])
  • CREATE_TABLES (default: True) — create job_items, job_requests, job_logs on startup
  • ITEMS_TABLE, REQUESTS_TABLE, LOGS_TABLE — override table names (combined example below)
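
Putting a few of these together in settings.py (values shown are illustrative, not the defaults):

LOG_DB_LEVEL = "INFO"                        # store INFO and above in the logs table
LOG_DB_EXCLUDE_LOGGERS = ["scrapy.core.scraper"]
LOG_DB_EXCLUDE_PATTERNS = ["Scraped from <"]
ITEMS_TABLE = "my_job_items"                 # override the default job_items
CREATE_TABLES = True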

License

MIT License. See LICENSE.

Download files

Download the file for your platform.

Source Distribution

scrapy_item_ingest-0.2.4.tar.gz (15.4 kB)

Built Distribution

scrapy_item_ingest-0.2.4-py3-none-any.whl (19.0 kB)

File details

Details for the file scrapy_item_ingest-0.2.4.tar.gz.

File metadata

  • Download URL: scrapy_item_ingest-0.2.4.tar.gz
  • Size: 15.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for scrapy_item_ingest-0.2.4.tar.gz:

  • SHA256: ffaba06f7a513ef04a99c89bdeaefa38c70eecd3191d8a7e2c0fe23e328057b2
  • MD5: c5c8c5583a511b02125cacebf2d70c45
  • BLAKE2b-256: f0d9a681108542c38f7b5f81b3d167a7eae109fe761d5e7add9b0d520259eb9b
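
To check a downloaded archive against these digests, a short standard-library sketch (assumes the file sits in the current directory):

import hashlib

expected = "ffaba06f7a513ef04a99c89bdeaefa38c70eecd3191d8a7e2c0fe23e328057b2"
with open("scrapy_item_ingest-0.2.4.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "MISMATCH")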

File details

Details for the file scrapy_item_ingest-0.2.4-py3-none-any.whl.

File hashes

Hashes for scrapy_item_ingest-0.2.4-py3-none-any.whl:

  • SHA256: fdc5ea25205777b2ad4aee0fb348837499daa4e0edcc9a6e3c2fd7599db1fa00
  • MD5: a3c40e95430439853aa59e69f2066f1d
  • BLAKE2b-256: 57ad4350ea4afc5d65cd5c8c8aba30fd762f849783f52a38f6d077e1176cda27
