ScrapyDeltaGuard - data-drift detection for Scrapy
🛡️ Delta Guard
Delta Guard is a lightweight, production-ready plugin for Scrapy projects that detects and manages data deltas (changes) between newly scraped items and existing database records.
It helps maintain data integrity, reduces false updates, and can automatically trigger alerts (e.g., JIRA tickets) when meaningful changes occur.
🚀 Why use Delta Guard?
Large-scale crawlers and ETL pipelines commonly face:
- layout or markup changes on target sites,
- formatting differences (phone 123-456 vs 123456),
- transient incorrect data in sponsored or related blocks,
- and other causes of noisy updates.
Delta Guard helps you:
- Detect real content changes, not formatting noise.
- Avoid cascading bad writes by stopping crawls if many fields drift.
- Integrate without changing your existing item pipelines.
- Optionally notify downstream systems (JIRA, Slack, email).
⚙️ Quick Start
Add the following to your Scrapy project's settings.py:
DELTA_GUARD_ENABLED = True
DELTA_GUARD_DEFAULT_THRESHOLD = 0.05 # 5% default tolerance
DELTA_GUARD_FIELDS_CONFIG = [
    {"name": "phone"},  # uses default threshold
    {"name": "email", "threshold": 0.08},  # override threshold for email
]
# Optional: a runtime object (ORM instance, dict or variable name)
DELTA_GUARD_DB_OBJECT = None # set at runtime if desired
DELTA_GUARD_BATCH_SIZE = 100
DELTA_GUARD_DB_NONE_IGNORE = True
DELTA_GUARD_SPIDER_NONE_IGNORE = False
# Optional: dotted path to function to call on alert (title, description) or custom signature.
DELTA_GUARD_JIRA_FUNC = None
# Single pipeline entry that wraps user pipelines automatically
ITEM_PIPELINES = {
    "delta_guard.pipeline.DeltaGuardAdapterPipeline": 500,
}
🧩 How It Works
Delta Guard compares each scraped item with its corresponding database record, which can be supplied as a SQLAlchemy ORM instance, a dict, or a variable name.
For every field defined in DELTA_GUARD_FIELDS_CONFIG, it measures the difference between the old and new values.
If a field's delta exceeds its threshold, Delta Guard accumulates that difference in memory.
Every DELTA_GUARD_BATCH_SIZE items, it evaluates the overall delta for the batch.
If the drift surpasses acceptable limits:
- Delta Guard halts the spider to prevent further corruption.
- Optionally, it triggers a JIRA ticket or alert function.
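The batch logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Delta Guard's actual implementation: the string-similarity metric (`difflib.SequenceMatcher`), the `BatchDriftTracker` class, and the `batch_limit` parameter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


def field_delta(old: str, new: str) -> float:
    """Return a dissimilarity score in [0, 1]; 0.0 means identical strings."""
    # Assumption: similarity ratio as the delta metric (not confirmed by the docs).
    return 1.0 - SequenceMatcher(None, old, new).ratio()


@dataclass
class BatchDriftTracker:
    """Hypothetical helper: counts drifted items and evaluates them per batch."""

    thresholds: dict          # field name -> allowed delta for that field
    batch_size: int = 100     # mirrors DELTA_GUARD_BATCH_SIZE
    batch_limit: float = 0.5  # assumed: max share of drifted items per batch
    _seen: int = 0
    _drifted: int = 0

    def observe(self, db_row: dict, item: dict) -> bool:
        """Record one item; return True when a full batch drifted too much."""
        drifted = any(
            field_delta(str(db_row[name]), str(item[name])) > limit
            for name, limit in self.thresholds.items()
        )
        self._seen += 1
        self._drifted += int(drifted)
        if self._seen < self.batch_size:
            return False  # batch not full yet, nothing to evaluate
        ratio = self._drifted / self._seen
        self._seen = self._drifted = 0  # start a new batch
        return ratio > self.batch_limit
```

In a real pipeline, a `True` result would be the point at which the crawl is stopped and the alert function is called. Scrapy code commonly stops a crawl with `crawler.engine.close_spider(spider, reason)`; whether Delta Guard uses that call internally is not stated here.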
⚡ Configuration Reference
| Setting | Type | Default | Description |
|---|---|---|---|
| DELTA_GUARD_ENABLED | bool | False | Enables or disables Delta Guard. |
| DELTA_GUARD_DEFAULT_THRESHOLD | float | 0.05 | Default allowed delta (5%). |
| DELTA_GUARD_FIELDS_CONFIG | list[dict] | – | Field-level delta config. |
| DELTA_GUARD_BATCH_SIZE | int | 100 | Number of items per evaluation batch. |
| DELTA_GUARD_DB_NONE_IGNORE | bool | True | Ignore if DB value is None. |
| DELTA_GUARD_SPIDER_NONE_IGNORE | bool | False | Ignore if spider provides None. |
| DELTA_GUARD_JIRA_FUNC | str | None | Optional dotted path to alert handler. |
| DELTA_GUARD_DB_OBJECT | str/obj | None | ORM, dict, or variable name for DB object. |
🧠 Example Behavior
| Field | DB Value | Spider Value | Result |
|---|---|---|---|
| phone | 1234567890 | 123-456-7890 | ✅ Ignored (minor format change) |
| email | user@site.com | user@fake.com | ⚠️ Delta exceeds threshold (alert) |
| address | None | 123 Main St | ✅ Ignored if DELTA_GUARD_DB_NONE_IGNORE=True |
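The three rows above can be reproduced with a small per-field check. This is a sketch under assumptions, not the plugin's real code: the `normalize` and `check_field` helpers, the normalization rule, and the use of `difflib.SequenceMatcher` as the delta metric are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Assumed normalization: lowercase, keep only letters, digits, '@' and '.'."""
    if value is None:
        return ""
    return re.sub(r"[^0-9a-z@.]", "", str(value).lower())


def check_field(db_value, spider_value, threshold=0.05,
                db_none_ignore=True, spider_none_ignore=False) -> str:
    """Classify one field comparison as 'ignored', 'ok', or 'drift'."""
    # The two flags mirror DELTA_GUARD_DB_NONE_IGNORE and
    # DELTA_GUARD_SPIDER_NONE_IGNORE from the settings.
    if db_value is None and db_none_ignore:
        return "ignored"
    if spider_value is None and spider_none_ignore:
        return "ignored"
    old, new = normalize(db_value), normalize(spider_value)
    delta = 1.0 - SequenceMatcher(None, old, new).ratio()
    return "drift" if delta > threshold else "ok"


print(check_field("1234567890", "123-456-7890"))                      # ok
print(check_field("user@site.com", "user@fake.com", threshold=0.08))  # drift
print(check_field(None, "123 Main St"))                               # ignored
```

Normalizing before comparing is what keeps the phone row from counting as a change: once the hyphens are stripped, both values are the same string.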
📦 Installation
pip install scrapy-delta-guard
Then enable it via your Scrapy project’s settings.py.
🧰 Optional Integration
If you wish to auto-create JIRA tickets (or alerts), define a handler:
# myproject/utils.py
def jira_ticket(title: str, description: str):
    # Placeholder handler: replace the print with a real JIRA (or other alert) call.
    print(f"[JIRA] {title}: {description}")
and set:
DELTA_GUARD_JIRA_FUNC = "myproject.utils.jira_ticket"
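Before a long crawl, it can be useful to confirm that the dotted path actually resolves to a callable. The helper below is a minimal sketch using only the standard library; `load_handler` is a hypothetical name, and how Delta Guard itself resolves the setting is not documented here. Scrapy ships `scrapy.utils.misc.load_object` for the same job.

```python
from importlib import import_module


def load_handler(dotted_path: str):
    """Import 'package.module.func' and return the callable it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a dotted path: {dotted_path!r}")
    handler = getattr(import_module(module_path), attr)
    if not callable(handler):
        raise TypeError(f"{dotted_path!r} is not callable")
    return handler
```

For example, `load_handler("myproject.utils.jira_ticket")("Drift detected", "email changed")` would call the handler defined above, provided `myproject` is importable.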
File details
Details for the file scrapy_delta_guard-0.0.2.tar.gz.
File metadata
- Download URL: scrapy_delta_guard-0.0.2.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8746fb9284abeb65d5f70bf7cffcc7b58eef91d548b36836b53a27bb73ec0ab |
| MD5 | 4c80e34077f5b74541720d9e6d00c794 |
| BLAKE2b-256 | 3c396930ae0387ea570267e1d9b4c9307da2fbf39538b781175ab55ca143ddd4 |
File details
Details for the file scrapy_delta_guard-0.0.2-py3-none-any.whl.
File metadata
- Download URL: scrapy_delta_guard-0.0.2-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 126b2d2b2eb68854eab3673ae61396292dff7a31c946480a0a80139c7cade200 |
| MD5 | dd2662c74a3820fce558fdd8e8fb9a49 |
| BLAKE2b-256 | 890077206e9a87bc9c48ddc8a75dae7202679d0abab1071f6fc5f83ec836013b |