ScrapyDeltaGuard - data-drift detection for Scrapy

🛡️ Delta Guard

Delta Guard is a lightweight, production-ready plugin for Scrapy projects that detects and manages data deltas (changes) between newly scraped items and existing database records.
It helps maintain data integrity, reduces false updates, and can automatically trigger alerts (e.g., JIRA tickets) when meaningful changes occur.


🚀 Why use Delta Guard?

Large-scale crawlers and ETL pipelines commonly face:

  • layout or markup changes on target sites,
  • formatting differences (phone 123-456 vs 123456),
  • transient incorrect data in sponsored or related blocks,
  • and other causes of noisy updates.

Delta Guard helps you:

  • Detect real content changes, not formatting noise.
  • Avoid cascading bad writes by stopping crawls if many fields drift.
  • Integrate without changing your existing item pipelines.
  • Optionally notify downstream systems (JIRA, Slack, email).

⚙️ Quick Start

Add the following to your Scrapy project's settings.py:

DELTA_GUARD_ENABLED = True
DELTA_GUARD_DEFAULT_THRESHOLD = 0.05   # 5% default tolerance
DELTA_GUARD_FIELDS_CONFIG = [
    {"name": "phone"},                 # uses default threshold
    {"name": "email", "threshold": 0.08},  # override threshold for email
]
# Optional: a runtime object (ORM instance, dict or variable name)
DELTA_GUARD_DB_OBJECT = None           # set at runtime if desired
DELTA_GUARD_BATCH_SIZE = 100
DELTA_GUARD_DB_NONE_IGNORE = True
DELTA_GUARD_SPIDER_NONE_IGNORE = False
# Optional: dotted path to an alert handler, called as handler(title, description) unless you use a custom signature.
DELTA_GUARD_JIRA_FUNC = None

# Single pipeline entry that wraps user pipelines automatically
ITEM_PIPELINES = {
    "delta_guard.pipeline.DeltaGuardAdapterPipeline": 500,
}

🧩 How It Works
Delta Guard compares each scraped item with its corresponding database record, supplied as a SQLAlchemy ORM object, a dict, or a variable name.
For every field listed in DELTA_GUARD_FIELDS_CONFIG, it measures the difference between the stored value and the newly scraped value.
When a field's delta exceeds its threshold, Delta Guard accumulates that difference in memory.
After every DELTA_GUARD_BATCH_SIZE items, it evaluates the overall delta for the batch.
If the batch drift exceeds the acceptable limit:

  • Delta Guard halts the spider to prevent further corruption.
  • Optionally, it triggers a JIRA ticket or alert function.
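
The accumulate-then-evaluate flow described above can be sketched roughly as follows. This is an illustration only, not Delta Guard's actual code: the names field_delta and BatchDriftTracker are hypothetical, the similarity measure (difflib.SequenceMatcher) is an assumption, and so is the batch rule (mean accumulated delta per field compared against that field's threshold). The sketch only reports drifting fields and leaves halting the spider and alerting to the caller.

```python
from difflib import SequenceMatcher


def field_delta(old, new):
    """Return a dissimilarity in [0, 1]; 0.0 means the strings are identical.

    Assumption: Delta Guard's real metric is not documented here, so this
    sketch uses difflib's similarity ratio as a stand-in.
    """
    return 1.0 - SequenceMatcher(None, str(old), str(new)).ratio()


class BatchDriftTracker:
    """Hypothetical helper: accumulates over-threshold deltas per batch."""

    def __init__(self, thresholds, batch_size=100):
        self.thresholds = thresholds      # e.g. {"phone": 0.05, "email": 0.08}
        self.batch_size = batch_size
        self._reset()

    def _reset(self):
        self.seen = 0
        self.accumulated = {name: 0.0 for name in self.thresholds}

    def add(self, db_record, item):
        """Record one item.

        Returns None while the batch is still filling up, otherwise the list
        of fields whose mean accumulated delta exceeds their threshold.
        """
        for name, threshold in self.thresholds.items():
            delta = field_delta(db_record.get(name), item.get(name))
            if delta > threshold:
                # Only deltas above the threshold are accumulated.
                self.accumulated[name] += delta
        self.seen += 1
        if self.seen < self.batch_size:
            return None
        drifting = [
            name
            for name, total in self.accumulated.items()
            if total / self.seen > self.thresholds[name]
        ]
        self._reset()
        return drifting


tracker = BatchDriftTracker({"email": 0.08}, batch_size=2)
tracker.add({"email": "user@site.com"}, {"email": "user@site.com"})
drifting = tracker.add({"email": "user@site.com"}, {"email": "user@fake.com"})
if drifting:
    print("drifting fields:", drifting)
```

A caller would stop the crawl or invoke the alert handler whenever add returns a non-empty list.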

Configuration Reference

| Setting | Type | Default | Description |
|---|---|---|---|
| DELTA_GUARD_ENABLED | bool | False | Enables or disables Delta Guard. |
| DELTA_GUARD_DEFAULT_THRESHOLD | float | 0.05 | Default allowed delta (5%). |
| DELTA_GUARD_FIELDS_CONFIG | list[dict] | | Field-level delta config. |
| DELTA_GUARD_BATCH_SIZE | int | 100 | Number of items per evaluation batch. |
| DELTA_GUARD_DB_NONE_IGNORE | bool | True | Ignore if DB value is None. |
| DELTA_GUARD_SPIDER_NONE_IGNORE | bool | False | Ignore if spider provides None. |
| DELTA_GUARD_JIRA_FUNC | str | None | Optional dotted path to alert handler. |
| DELTA_GUARD_DB_OBJECT | str/obj | None | ORM, dict, or variable name for DB object. |

🧠 Example Behavior

| Field | DB Value | Spider Value | Result |
|---|---|---|---|
| phone | 1234567890 | 123-456-7890 | ✅ Ignored (minor format change) |
| email | user@site.com | user@fake.com | ⚠️ Delta exceeds threshold (alert) |
| address | None | 123 Main St | ✅ Ignored if DELTA_GUARD_DB_NONE_IGNORE=True |
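
One way to reproduce the three outcomes above is sketched below. It is a guess at the general approach, not the library's implementation: the functions normalize and classify are hypothetical, stripping punctuation before comparing is an assumption about how formatting noise might be discounted, and difflib.SequenceMatcher stands in for whatever metric Delta Guard really uses.

```python
import re
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and keep only letters and digits; None becomes ''.

    Assumption: formatting noise (dashes, spaces, punctuation) is removed
    before comparison. Delta Guard may normalize differently.
    """
    text = "" if value is None else str(value)
    return re.sub(r"[^0-9a-z]", "", text.lower())


def classify(db_value, spider_value, threshold=0.05,
             db_none_ignore=True, spider_none_ignore=False):
    """Return 'ignored' or 'drift' for a single field comparison."""
    # Mirrors DELTA_GUARD_DB_NONE_IGNORE / DELTA_GUARD_SPIDER_NONE_IGNORE.
    if db_value is None and db_none_ignore:
        return "ignored"
    if spider_value is None and spider_none_ignore:
        return "ignored"
    similarity = SequenceMatcher(
        None, normalize(db_value), normalize(spider_value)
    ).ratio()
    return "drift" if (1.0 - similarity) > threshold else "ignored"


rows = [
    ("phone", "1234567890", "123-456-7890", 0.05),
    ("email", "user@site.com", "user@fake.com", 0.08),
    ("address", None, "123 Main St", 0.05),
]
for field, db_value, spider_value, threshold in rows:
    print(field, classify(db_value, spider_value, threshold))
```

Without the normalization step, the raw phone strings would differ by more than 5% under this metric, which is why the sketch strips punctuation first.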

📦 Installation

pip install scrapy-delta-guard

Then enable it via your Scrapy project’s settings.py.

🧰 Optional Integration
If you wish to auto-create JIRA tickets (or alerts), define a handler:

# myproject/utils.py
def jira_ticket(title: str, description: str):
    print(f"[JIRA] {title}: {description}")

and set:
DELTA_GUARD_JIRA_FUNC = "myproject.utils.jira_ticket"
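
For readers unfamiliar with dotted-path settings, the sketch below shows how such a string is typically turned into a callable and invoked. It is an assumption about the mechanism, not Delta Guard's code: load_handler and send_alert are hypothetical names, and the (title, description) call signature is taken from the example handler above. Scrapy ships a comparable helper, scrapy.utils.misc.load_object, though this page does not say whether Delta Guard uses it.

```python
from importlib import import_module


def load_handler(dotted_path):
    """Resolve 'package.module.func' to the callable it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a dotted path: {dotted_path!r}")
    handler = getattr(import_module(module_path), attr)
    if not callable(handler):
        raise TypeError(f"{dotted_path!r} does not name a callable")
    return handler


def send_alert(dotted_path, title, description):
    """Call the configured handler; return False when none is configured."""
    if dotted_path is None:
        # Matches the documented default DELTA_GUARD_JIRA_FUNC = None.
        return False
    load_handler(dotted_path)(title, description)
    return True


# Usage: builtins.print stands in for a real ticket-creating handler.
send_alert("builtins.print", "Delta Guard alert", "email drift exceeded 8%")
```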

