🛡️ Scrapy DeltaGuard

Scrapy DeltaGuard is a powerful and easy-to-integrate Scrapy extension that monitors changes (deltas) between the data you scrape and the data already present in your database.

It helps maintain data integrity by monitoring data drift, reduces noisy or incorrect updates, and can automatically trigger alerts (e.g., to Slack or Jira) or stop a spider when significant changes occur.

🚀 Why use DeltaGuard?

Large-scale web scraping projects commonly face issues that corrupt data quality:

  • Silent layout changes on target websites that break selectors.
  • Inconsistent formatting for data like phone numbers (123-456 vs 123456).
  • Transient incorrect data appearing in sponsored or related content blocks.

DeltaGuard helps you:

  • Detect real content changes, not just formatting noise.
  • Avoid cascading bad data writes by automatically halting crawls if too many fields drift.
  • Integrate seamlessly without rewriting your existing item pipeline logic.
  • Notify downstream systems (Jira, Slack, etc.) when data quality issues are found.

Installation

pip install scrapy-delta-guard

⚙️ Quick Start Guide

Follow these four steps to get DeltaGuard running in your project.

1. Configure Your Database Session for Detached Object Handling

To avoid SQLAlchemy’s DetachedInstanceError during delta checking, configure your SQLAlchemy session with:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///example.db')  # replace with your database URL
Session = sessionmaker(bind=engine, expire_on_commit=False)

This keeps loaded database objects readable after commit, so they can still be accessed during Scrapy’s asynchronous pipeline.

2. Configure settings.py

Enable the extension and define field monitoring with flexible thresholds and options.

EXTENSIONS = {
    'deltaguard.extension.DeltaGuard': 500,
}

DELTA_GUARD_ENABLED = True

DELTA_GUARD_BATCH_SIZE = 50  # items per batch evaluation

DELTA_GUARD_DEFAULT_THRESHOLD = '5%'  # fallback threshold for fields without one

DELTA_GUARD_FIELDS_CONFIG = [
    {'name': 'email'},  # simple shorthand for same db/spider field
    {'name': 'phone_number', 'threshold': 10},  # 10% threshold
    {
        'name': 'years_experience',
        'db_var': 'years_exp',  # different db attribute
        'spider_var': 'years_exp_spider',  # different spider field
        'threshold': '15%'
    },
]

DELTA_GUARD_DB_NONE_IS_DELTA = True       # None in the DB vs. a spider value counts as a delta
DELTA_GUARD_SPIDER_NONE_IS_DELTA = False  # None from the spider vs. a DB value does not

DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA = True

DELTA_GUARD_JIRA_FUNC = 'my_project.utils.create_jira_ticket'

DELTA_GUARD_SLACK_WEBHOOK = 'https://hooks.slack.com/services/your/webhook/url'

LOG_LEVEL = 'DEBUG'

3. Update Your Scrapy Item

Ensure your Scrapy Item class includes the db_item field; DeltaGuard stores the matched database record there, and the field must exist to avoid a KeyError:

import scrapy

class YourItem(scrapy.Item):
    # ... your existing fields ...
    db_item = scrapy.Field()

4. Attach DB Items Using the Adapter in Your Pipelines

from deltaguard.adapter import DeltaGuardAdapter

class YourPipeline:
    def process_item(self, item, spider):
        # Look up the existing DB record that corresponds to this item
        db_item = self.session.query(YourModel).filter_by(email=item.get('email')).first()
        # Attach it so DeltaGuard can compare the two during batch evaluation
        DeltaGuardAdapter.attach(item, db_item)
        return item

How Does DeltaGuard Work?

  • The extension compares the fields listed in DELTA_GUARD_FIELDS_CONFIG between each scraped item and its corresponding database record.
  • Differences are accumulated in batches of DELTA_GUARD_BATCH_SIZE items.
  • If the deltas for any field exceed its configured percentage threshold within a batch, alerts are sent.
  • Optionally, the spider is stopped immediately to prevent cascading bad data writes.
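To make the batch evaluation concrete, here is a minimal sketch of the logic, using hypothetical names (delta_counts, batch_size, thresholds); the extension’s actual internals may differ:

def evaluate_batch(delta_counts, batch_size, thresholds, default_threshold=5.0):
    """Hypothetical sketch: return fields whose delta rate exceeded their threshold."""
    exceeded = []
    for field, count in delta_counts.items():
        rate = 100.0 * count / batch_size  # percent of items in this batch that drifted
        limit = thresholds.get(field, default_threshold)
        if rate > limit:
            exceeded.append((field, rate, limit))
    return exceeded

# Example: 8 of 50 items had a changed phone_number -> 16% > 10% threshold
print(evaluate_batch({'phone_number': 8}, 50, {'phone_number': 10.0}))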

Configuration Reference

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| DELTA_GUARD_ENABLED | bool | False | Enables or disables the extension globally. |
| DELTA_GUARD_FIELDS_CONFIG | list[dict] | [] | Fields to monitor, with optional threshold, db_var, and spider_var. |
| DELTA_GUARD_BATCH_SIZE | int | 50 | Number of items processed per batch evaluation. |
| DELTA_GUARD_DEFAULT_THRESHOLD | str or float | '5%' | Default batch delta threshold (percentage) when none is specified per field. |
| DELTA_GUARD_DB_NONE_IS_DELTA | bool | False | Treats None in the DB as a delta if the spider has a value. |
| DELTA_GUARD_SPIDER_NONE_IS_DELTA | bool | False | Treats None from the spider as a delta if the DB has a value. |
| DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA | bool | True | Stops the spider when any field's delta threshold is exceeded. |
| DELTA_GUARD_JIRA_FUNC | str | None | Dotted path to an alert function (e.g., a Jira ticket creator). |
| DELTA_GUARD_SLACK_WEBHOOK | str | None | Slack Incoming Webhook URL for notifications. |

Advanced Field Configuration

The DELTA_GUARD_FIELDS_CONFIG allows flexible definitions.

DELTA_GUARD_FIELDS_CONFIG = [
    {'name': 'email'},                 # Simple shorthand
    {'name': 'phone_number', 'db_var': 'phone', 'spider_var': 'contact_phone'},  # Custom fields
    {'name': 'salary', 'threshold': 15},  # 15% threshold as integer
    {'name': 'location', 'threshold': '25%'},  # 25% threshold as string
]
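Because thresholds may arrive as an int (15), a float, or a percent string ('25%'), it can help to normalize them when reasoning about your config. A hypothetical helper, not part of the package API:

def normalize_threshold(value, default=5.0):
    """Coerce 15, 15.0, or '15%' into a float percentage (hypothetical helper)."""
    if value is None:
        return default
    if isinstance(value, str):
        return float(value.rstrip('%'))
    return float(value)

assert normalize_threshold('25%') == 25.0
assert normalize_threshold(15) == 15.0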

Using safe_commit to Prevent Data Corruption

If DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA is enabled, the spider stops gracefully; however, it still processes the requests already in the queue and may write a few more batches of data. To prevent this, use the safe_commit utility.

The safe_commit utility ensures you only commit your SQLAlchemy session if DeltaGuard has not flagged a high-delta event. If the flag was set, it automatically rolls back the session to prevent potentially bad/partial data saves.

Example usage in your pipeline:

from deltaguard.adapter import safe_commit

class YourDatabasePipeline:
    def close_spider(self, spider):
        # Checks the high-delta flag and decides whether to commit or roll back
        safe_commit(self.session, spider)
        self.session.close()

If you ever need to force a commit even when a high delta was detected:

safe_commit(self.session, spider, force_commit=True)

  • The function returns True if committed, False if rolled back.

This wrapper lets you centralize all your session commit logic and align with how DeltaGuard protects your database from corrupt/incomplete batches.
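Because safe_commit reports what it did, you can log the outcome in one place:

committed = safe_commit(self.session, spider)
if not committed:
    spider.logger.warning("DeltaGuard flagged a high delta; final batch was rolled back")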

CSV Delta Logs on Slack (Optional)

DeltaGuard can automatically generate and send detailed CSV logs to Slack when high delta thresholds are exceeded. This gives you a complete audit trail of what changed.

Configuration

In your settings.py:

# Enable CSV logs on Slack
DELTA_GUARD_LOGS_ON_SLACK = True

# Required: Slack Bot Token (not webhook)
DELTA_GUARD_SLACK_BOT_TOKEN = "xoxb-your-bot-token-here"

# Required: Slack Channel ID (not channel name)
# Right click on the channel >> View channel details
DELTA_GUARD_SLACK_CHANNEL_ID = "C01234ABCDE"

Setting up Slack Bot for File Uploads

To enable CSV file uploads, you need a Slack Bot Token (webhooks don't support file uploads):

  1. Create a Slack App:

    • Go to https://api.slack.com/apps and create a new app for your workspace
  2. Add Bot Permissions:

    • Navigate to "OAuth & Permissions"
    • Under "Bot Token Scopes", add:
      • files:write (to upload files)
      • chat:write (to post messages)
  3. Install the App:

    • Click "Install to Workspace"
    • Copy the "Bot User OAuth Token" (starts with xoxb-)
  4. Get Your Channel ID:

    • In Slack, right-click your channel name
    • Select "View channel details"
    • Scroll to bottom and copy the Channel ID
  5. Invite Bot to Channel:

    • In your Slack channel, type: /invite @DeltaGuard Bot
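For context, webhooks only accept JSON payloads, which is why file uploads need a bot token. With the official slack_sdk, uploading a CSV to a channel looks roughly like this (a general sketch of the approach, not DeltaGuard's actual code):

from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token-here")
client.files_upload_v2(
    channel="C01234ABCDE",  # channel ID, not channel name
    file="deltaguard_my_spider_deltas.csv",
    title="DeltaGuard delta log",
    initial_comment="High delta detected; see the attached CSV.",
)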

CSV Format

The generated CSV includes the following columns (sorted by field name):

  • db_item_id: Primary key of the database record
  • field: Field name that changed
  • old_value: Value in the database
  • new_value: Value from the spider

Note: Only deltas from fields that exceeded the threshold are included in the CSV.

Example Output

When a high delta is detected, you'll receive:

  1. A text alert showing which fields exceeded thresholds
  2. A CSV file attachment with detailed change logs

The CSV filename format: deltaguard_{spider_name}_deltas.csv
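For illustration, a CSV for the phone-number formatting example above might contain rows like these (values are hypothetical):

db_item_id,field,old_value,new_value
1042,phone_number,123-456,123456
2318,phone_number,555-0199,5550199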

License

MIT License
