
ScrapyDeltaGuard - data-drift detection for Scrapy


🛡️ Delta Guard

Delta Guard is a lightweight, production-ready plugin for Scrapy projects that detects and manages data deltas (changes) between newly scraped items and existing database records.
It helps maintain data integrity, reduces false updates, and can automatically trigger alerts (e.g., JIRA tickets) when meaningful changes occur.


🚀 Why use Delta Guard?

Large-scale crawlers and ETL pipelines commonly face:

  • layout or markup changes on target sites,
  • formatting differences (phone 123-456 vs 123456),
  • transient incorrect data in sponsored or related blocks,
  • and other causes of noisy updates.

Delta Guard helps you:

  • Detect real content changes, not formatting noise.
  • Avoid cascading bad writes by stopping crawls if many fields drift.
  • Integrate without changing your existing item pipelines.
  • Optionally notify downstream systems (JIRA, Slack, email).

⚙️ Quick Start

Add the following to your Scrapy project's settings.py:

EXTENSIONS = {
    "delta_guard.extension.DeltaGuardExtension": 500,
}

DELTA_GUARD_ENABLED = True
DELTA_GUARD_DB_OBJECT = "profile"  # name of the ORM instance, dict, or variable holding the DB record
DELTA_GUARD_DEFAULT_THRESHOLD = 0.05

DELTA_GUARD_FIELDS_CONFIG = [
    {"name": "phone"},
    {"name": "email"},
    {"name": "address", "threshold": 0.1},
]

DELTA_GUARD_BATCH_SIZE = 100
DELTA_GUARD_DB_NONE_IGNORE = True
DELTA_GUARD_SPIDER_NONE_IGNORE = False

# Optional Jira integration
# DELTA_GUARD_JIRA_FUNC = "myproject.utils.jira_ticket"
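
The settings above give a threshold only for address, which suggests that phone and email fall back to DELTA_GUARD_DEFAULT_THRESHOLD. The sketch below shows that fallback under this assumption. The helper name effective_thresholds is hypothetical and is not part of Delta Guard.

```python
# Values copied from the Quick Start settings.
DELTA_GUARD_DEFAULT_THRESHOLD = 0.05
DELTA_GUARD_FIELDS_CONFIG = [
    {"name": "phone"},
    {"name": "email"},
    {"name": "address", "threshold": 0.1},
]


def effective_thresholds(fields_config, default_threshold):
    """Map each configured field name to the threshold that applies to it.

    Hypothetical helper: assumes a field without a "threshold" key
    uses the default threshold.
    """
    return {
        field["name"]: field.get("threshold", default_threshold)
        for field in fields_config
    }


thresholds = effective_thresholds(
    DELTA_GUARD_FIELDS_CONFIG, DELTA_GUARD_DEFAULT_THRESHOLD
)
print(thresholds)  # {'phone': 0.05, 'email': 0.05, 'address': 0.1}
```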

🧩 How It Works
Delta Guard compares each scraped item with its corresponding database record, which can be a SQLAlchemy ORM instance, a dict, or a variable.
For every field listed in DELTA_GUARD_FIELDS_CONFIG, it measures the difference between the old and new values.
When a field's delta exceeds its threshold, Delta Guard accumulates that difference in memory.
After every DELTA_GUARD_BATCH_SIZE items, it evaluates the overall delta for the batch.
If the drift surpasses acceptable limits:

  • Delta Guard halts the spider to prevent further corruption.
  • Optionally, it triggers a JIRA ticket or alert function.
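
The batch evaluation described above can be sketched roughly as follows. This is not Delta Guard's implementation. It assumes that an item counts as drifted when any field's delta exceeds its threshold, and that the spider should stop when the share of drifted items in a batch exceeds a limit. The class BatchDriftMonitor and the max_drift_ratio parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchDriftMonitor:
    """Hypothetical sketch of batch-level drift evaluation."""

    batch_size: int = 100
    max_drift_ratio: float = 0.5  # assumed limit, not a Delta Guard setting
    _seen: int = 0
    _drifted: int = 0

    def record(self, field_deltas: dict, thresholds: dict) -> bool:
        """Record one item's per-field deltas.

        Returns True when a full batch has been seen and its share of
        drifted items exceeds max_drift_ratio, i.e. the crawl should stop.
        """
        self._seen += 1
        if any(delta > thresholds[name] for name, delta in field_deltas.items()):
            self._drifted += 1
        if self._seen < self.batch_size:
            return False
        ratio = self._drifted / self._seen
        # Start a fresh batch after each evaluation.
        self._seen = 0
        self._drifted = 0
        return ratio > self.max_drift_ratio


monitor = BatchDriftMonitor(batch_size=4, max_drift_ratio=0.5)
thresholds = {"phone": 0.05, "email": 0.05}
item_deltas = [
    {"phone": 0.0, "email": 0.0},  # clean
    {"phone": 0.0, "email": 0.4},  # drifted
    {"phone": 0.2, "email": 0.3},  # drifted
    {"phone": 0.0, "email": 0.6},  # drifted
]
decisions = [monitor.record(deltas, thresholds) for deltas in item_deltas]
print(decisions)  # [False, False, False, True]
```

In a real Scrapy extension, a True result could be acted on by closing the spider, for example through crawler.engine.close_spider(spider, reason).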

Configuration Reference

Setting | Type | Default | Description
DELTA_GUARD_ENABLED | bool | False | Enables or disables Delta Guard.
DELTA_GUARD_DEFAULT_THRESHOLD | float | 0.05 | Default allowed delta (5%).
DELTA_GUARD_FIELDS_CONFIG | list[dict] | | Field-level delta config.
DELTA_GUARD_BATCH_SIZE | int | 100 | Number of items per evaluation batch.
DELTA_GUARD_DB_NONE_IGNORE | bool | True | Ignore if DB value is None.
DELTA_GUARD_SPIDER_NONE_IGNORE | bool | False | Ignore if spider provides None.
DELTA_GUARD_JIRA_FUNC | str | None | Optional dotted path to alert handler.
DELTA_GUARD_DB_OBJECT | str/obj | None | ORM, dict, or variable name for DB object.

🧠 Example Behavior

Field | DB Value | Spider Value | Result
phone | 1234567890 | 123-456-7890 | ✅ Ignored (minor format change)
email | user@site.com | user@fake.com | ⚠️ Delta exceeds threshold (alert)
address | None | 123 Main St | ✅ Ignored if DELTA_GUARD_DB_NONE_IGNORE=True
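
One way to reproduce these three outcomes is sketched below. How Delta Guard actually measures a delta is not described here, so the sketch makes its own assumptions: values are normalized by lowercasing and dropping non-alphanumeric characters, and the delta is one minus the similarity ratio from the standard library's difflib.SequenceMatcher. The names normalize and field_delta are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Lowercase and drop non-alphanumeric characters (assumed normalization)."""
    return re.sub(r"[^0-9a-z]", "", str(value).lower())


def field_delta(db_value, spider_value, *, db_none_ignore=True,
                spider_none_ignore=False) -> float:
    """Return a delta in [0, 1], where 0.0 means no meaningful change.

    Hypothetical sketch; the two flags mirror DELTA_GUARD_DB_NONE_IGNORE
    and DELTA_GUARD_SPIDER_NONE_IGNORE.
    """
    if db_value is None and spider_value is None:
        return 0.0
    if db_value is None:
        return 0.0 if db_none_ignore else 1.0
    if spider_value is None:
        return 0.0 if spider_none_ignore else 1.0
    similarity = SequenceMatcher(
        None, normalize(db_value), normalize(spider_value)
    ).ratio()
    return 1.0 - similarity


phone_delta = field_delta("1234567890", "123-456-7890")
email_delta = field_delta("user@site.com", "user@fake.com")
address_delta = field_delta(None, "123 Main St")

print(phone_delta)          # 0.0, formatting noise only
print(email_delta > 0.05)   # True, exceeds the default threshold
print(address_delta)        # 0.0, DB value is None and is ignored
```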

📦 Installation

pip install scrapy-delta-guard

Then enable it via your Scrapy project’s settings.py.

🧰 Optional Integration
If you wish to auto-create JIRA tickets (or alerts), define a handler:

# myproject/utils.py
def jira_ticket(title: str, description: str):
    print(f"[JIRA] {title}: {description}")

and set:
DELTA_GUARD_JIRA_FUNC = "myproject.utils.jira_ticket"
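
Because DELTA_GUARD_JIRA_FUNC is a dotted path, the handler has to be importable from that location. The sketch below shows one common way such a path can be resolved with importlib. Whether Delta Guard resolves it this way is an assumption, and load_handler is a hypothetical helper (Scrapy ships a similar utility, scrapy.utils.misc.load_object). The example resolves json.dumps from the standard library as a stand-in so that it runs without a project.

```python
from importlib import import_module


def load_handler(dotted_path: str):
    """Resolve "package.module.attr" to a callable (hypothetical helper)."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Not a dotted path: {dotted_path!r}")
    handler = getattr(import_module(module_path), attr_name)
    if not callable(handler):
        raise TypeError(f"{dotted_path!r} does not point to a callable")
    return handler


# Stand-in for "myproject.utils.jira_ticket", which needs a real project.
handler = load_handler("json.dumps")
print(handler({"title": "drift detected"}))  # {"title": "drift detected"}
```

A misspelled module or function name would surface here as an ImportError or AttributeError, so it is worth checking that the path imports cleanly before relying on alerts.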

