A detective for your data. Zero-config data quality monitoring.

Scherlok

Your data broke in production. Again.
Scherlok makes sure it doesn't happen next time.

Zero config. Zero YAML. Zero rules to write.
Scherlok learns what "normal" looks like, then tells you when something changes.


The Problem

Every data team has the same nightmare:

A source API silently changes from dollars to cents. Revenue dashboards show wrong numbers for 3 weeks before anyone notices.

A column starts returning NULLs. A table stops updating. Row counts drop 40% on a Tuesday. Nobody knows until the CEO asks why the report looks weird.

Current tools (Great Expectations, Soda, dbt tests) require you to define what "correct" looks like before you can detect what's wrong. Hundreds of rules. Dozens of YAML files. And you still miss things — because you can't write rules for problems you haven't imagined yet.

The Solution

Scherlok takes the opposite approach: learn first, then detect.

scherlok connect postgres://user:pass@host/db   # connect once
scherlok investigate                              # learn your data
scherlok watch                                    # detect anomalies

Three commands. Five minutes. Done.

What It Catches

| Anomaly | What Happened | Severity |
|---|---|---|
| Volume drop | Row count dropped 40% overnight | CRITICAL |
| Volume spike | 3x more rows than normal | WARNING |
| Freshness alert | Table hasn't updated in 12h (normally every 2h) | CRITICAL |
| Schema drift | Column removed or type changed | CRITICAL |
| NULL surge | NULL rate jumped from 2% to 45% | WARNING |
| Distribution shift | Column mean shifted 5+ standard deviations | WARNING |
| Cardinality explosion | Status column went from 5 values to 500 | CRITICAL |

Every anomaly is auto-scored: INFO, WARNING, or CRITICAL. No thresholds to configure.
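
This README does not spell out how scores are assigned. As a rough illustration only, the sketch below maps a metric's distance from its learned baseline, measured in standard deviations, to a severity label. The function name `score_severity` and both thresholds are assumptions made for this example, not Scherlok's actual rules.

```python
from statistics import mean, stdev


def score_severity(history: list[float], current: float,
                   warn_at: float = 5.0, crit_at: float = 10.0) -> str:
    """Label a new observation by how far it sits from the historical mean.

    warn_at and crit_at are hypothetical z-score cutoffs chosen for
    illustration; Scherlok's real scoring may work differently.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values for a baseline")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a large surprise.
        return "INFO" if current == mu else "CRITICAL"
    z = abs(current - mu) / sigma
    if z >= crit_at:
        return "CRITICAL"
    if z >= warn_at:
        return "WARNING"
    return "INFO"
```

Because the cutoffs are expressed relative to each metric's own history, a noisy table tolerates larger swings than a stable one without any per-table configuration.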

How It Works

1. investigate — Learn the patterns

$ scherlok investigate

  Profiling 12 tables...
   users          45,231 rows, 8 columns
   orders         1,203,847 rows, 15 columns
   products       892 rows, 12 columns
  ...
  Done. Profiles saved.

Scherlok profiles every table: row counts, column types, NULL rates, value distributions, freshness cadence, cardinality. Stores everything locally in SQLite.
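
To make the idea of a profile concrete, here is a minimal sketch that computes a row count plus per-column type, NULL rate, and cardinality for one table, then stores the result as JSON in SQLite. It uses SQLite as the source database to stay self-contained. The function names and the `profiles` table layout are hypothetical and are not Scherlok's internal schema.

```python
import json
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Collect basic statistics for one table.

    Identifiers are interpolated into SQL, so `table` must come from a
    trusted source (for example, the database's own catalog).
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    schema = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    columns = {}
    for _, name, col_type, *_ in schema:
        # COUNT(col) and COUNT(DISTINCT col) both skip NULLs.
        non_null, distinct = conn.execute(
            f'SELECT COUNT("{name}"), COUNT(DISTINCT "{name}") FROM "{table}"'
        ).fetchone()
        null_rate = (row_count - non_null) / row_count if row_count else 0.0
        columns[name] = {
            "type": col_type,
            "null_rate": null_rate,
            "cardinality": distinct,
        }
    return {"table": table, "row_count": row_count, "columns": columns}


def save_profile(store: sqlite3.Connection, profile: dict) -> None:
    """Append a profile snapshot to a local SQLite store (hypothetical layout)."""
    store.execute(
        "CREATE TABLE IF NOT EXISTS profiles ("
        "table_name TEXT, "
        "profiled_at TEXT DEFAULT CURRENT_TIMESTAMP, "
        "profile_json TEXT)"
    )
    store.execute(
        "INSERT INTO profiles (table_name, profile_json) VALUES (?, ?)",
        (profile["table"], json.dumps(profile)),
    )
    store.commit()
```

Keeping every snapshot rather than overwriting the last one is what would let a later run compare the current state against a history instead of a single point.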

2. watch — Detect anomalies

$ scherlok watch

  Checking 12 tables against learned profiles...

  🔴 CRITICAL  orders    volume_drop     Row count dropped 52% (1,203,847 → 578,412)
  🟡 WARNING   users     null_increase   Column "email": NULL rate 2.1% → 18.7%
  🔵 INFO      products  distribution    Column "price": mean shifted 3.2σ

  3 anomalies detected. Exit code: 1

3. Alert — Slack, CI/CD, or both

# Slack
scherlok watch --webhook https://hooks.slack.com/services/...

# Discord
scherlok watch --webhook https://discord.com/api/webhooks/...

# Microsoft Teams
scherlok watch --webhook https://outlook.office.com/webhook/...

# Any endpoint (generic JSON payload)
scherlok watch --webhook https://my-api.com/alerts

# CI/CD gate (fails pipeline on CRITICAL)
scherlok watch --exit-code --fail-on critical

Auto-detects Slack, Discord, and Teams from the URL and formats the payload accordingly. Any other URL receives a generic JSON payload.
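
One way such URL-based detection could work is to look at the webhook's hostname and pick a payload shape to match. The sketch below is an assumption about the approach, not Scherlok's code: the function names are hypothetical, and the generic payload fields are invented for the example. The Slack and Teams bodies use a `text` field and the Discord body uses `content`, which are the minimal message fields those services accept.

```python
from urllib.parse import urlparse


def detect_webhook_kind(url: str) -> str:
    """Classify a webhook URL by hostname (hypothetical helper)."""
    host = (urlparse(url).hostname or "").lower()
    if host == "hooks.slack.com":
        return "slack"
    if host in ("discord.com", "discordapp.com"):
        return "discord"
    if host == "outlook.office.com" or host.endswith(".office.com"):
        return "teams"
    return "generic"


def build_payload(url: str, message: str) -> dict:
    """Build a minimal JSON body suited to the detected service."""
    kind = detect_webhook_kind(url)
    if kind == "discord":
        return {"content": message}
    if kind in ("slack", "teams"):
        return {"text": message}
    # Generic shape is made up for illustration only.
    return {"source": "scherlok", "message": message}
```

Matching on the exact hostname, rather than a substring of the whole URL, avoids misclassifying an unrelated endpoint whose path happens to contain a service name.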

CI/CD Integration

Use Scherlok as a data quality gate. The ci command does it in one line:

# GitHub Actions
- name: Data quality check
  run: |
    pip install scherlok
    scherlok config --store s3://my-bucket/scherlok/profiles.db
    scherlok ci "${{ secrets.DATABASE_URL }}" \
      --webhook "${{ secrets.SLACK_WEBHOOK }}" \
      --fail-on critical

If Scherlok detects a critical anomaly, the command exits with a non-zero code and the pipeline fails, so later steps that would publish the data do not run.
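
The gating logic can be pictured as a comparison between each anomaly's severity and the `--fail-on` level. This is a minimal sketch under that assumption; the function name is hypothetical, and only `critical` is shown as a `--fail-on` value in this README, so the other levels are illustrative.

```python
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def gate_exit_code(severities: list[str], fail_on: str = "critical") -> int:
    """Return 1 if any anomaly is at or above the fail_on level, else 0."""
    threshold = SEVERITY_ORDER[fail_on.lower()]
    failed = any(SEVERITY_ORDER[s.lower()] >= threshold for s in severities)
    return 1 if failed else 0


# A CI wrapper would pass this value to sys.exit().
print(gate_exit_code(["CRITICAL", "WARNING", "INFO"]))
```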

Email alerts

export SCHERLOK_SMTP_HOST=smtp.gmail.com
export SCHERLOK_SMTP_USER=alerts@company.com
export SCHERLOK_SMTP_PASSWORD=app-specific-password

scherlok watch --email team@company.com --email cto@company.com
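
For readers who want to send a similar alert from their own scripts, the sketch below builds and sends a message with Python's standard library, reading the same three environment variables. It is not Scherlok's implementation: the function names, the subject line, and the choice of port 587 with STARTTLS are assumptions.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(recipients: list[str], anomalies: list[str],
                      sender: str) -> EmailMessage:
    """Compose a plain-text alert listing one anomaly per line."""
    msg = EmailMessage()
    msg["Subject"] = "Scherlok anomaly alert"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(anomalies))
    return msg


def send_alert(msg: EmailMessage, port: int = 587) -> None:
    """Send via the SMTP server named in the SCHERLOK_SMTP_* variables.

    Port 587 with STARTTLS is an assumption; adjust for your provider.
    """
    host = os.environ["SCHERLOK_SMTP_HOST"]
    user = os.environ["SCHERLOK_SMTP_USER"]
    password = os.environ["SCHERLOK_SMTP_PASSWORD"]
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```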

Connectors

# PostgreSQL
scherlok connect postgres://user:pass@host:5432/db

# BigQuery
pip install scherlok[bigquery]
scherlok connect bigquery://project-id/dataset-name

# Snowflake
pip install scherlok[snowflake]
export SNOWFLAKE_USER=...
export SNOWFLAKE_PASSWORD=...
export SNOWFLAKE_WAREHOUSE=...
scherlok connect snowflake://account/database/schema

| Database | Status |
|---|---|
| PostgreSQL | Available |
| BigQuery | Available |
| Snowflake | Available |
| MySQL | Coming soon |
| DuckDB | Planned |

Remote Storage

Share profiles across CI runs and team members:

# AWS S3
scherlok config --store s3://my-bucket/scherlok/profiles.db

# Google Cloud Storage
scherlok config --store gs://my-bucket/scherlok/profiles.db

# Azure Blob Storage
scherlok config --store az://my-container/scherlok/profiles.db
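
All three store URLs share the same shape: a scheme naming the provider, then a bucket or container, then the object path. The sketch below splits a store URL into those parts. The function name and return format are hypothetical and shown only to clarify the URL layout.

```python
from urllib.parse import urlparse

# Schemes taken from the examples above.
STORE_BACKENDS = {
    "s3": "AWS S3",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def parse_store_url(url: str) -> tuple[str, str, str]:
    """Split a store URL into (backend, bucket_or_container, object_path)."""
    parts = urlparse(url)
    if parts.scheme not in STORE_BACKENDS:
        raise ValueError(f"unsupported store scheme: {parts.scheme!r}")
    container = parts.netloc
    key = parts.path.lstrip("/")
    if not container or not key:
        raise ValueError("store URL needs both a bucket/container and a path")
    return STORE_BACKENDS[parts.scheme], container, key


print(parse_store_url("s3://my-bucket/scherlok/profiles.db"))
```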

Why Not [Other Tool]?

| | Great Expectations | Soda | Monte Carlo | Scherlok |
|---|---|---|---|---|
| Setup time | Hours | 30 min | Weeks | 5 minutes |
| Config required | Hundreds of rules | YAML checks | Dashboard setup | None |
| Anomaly detection | Manual thresholds | Paid feature | Yes | Yes, free |
| Self-hosted | Yes | Limited | No (SaaS) | Yes |
| CI/CD gate | Yes | Yes | No | Yes |
| Price | Free | Freemium | $50-200K/yr | Free, forever |

CLI Reference

scherlok connect <url>                  Connect to a database
scherlok investigate                    Profile all tables (learn patterns)
scherlok watch [-w <url>] [-e <email>]  Detect anomalies and alert
scherlok ci <url> [opts]                All-in-one CI/CD command (connect + watch + exit code)
scherlok status                         Quick health dashboard
scherlok report                         Detailed profile summary
scherlok history [--days N]             Timeline of past anomalies
scherlok config --store <url>           Set remote storage
scherlok version                        Show version

Install

pip install scherlok

# With BigQuery support
pip install scherlok[bigquery]

# With Snowflake support
pip install scherlok[snowflake]

Requires Python 3.10+.

Contributing

Contributions welcome! See CONTRIBUTING.md.

We're especially looking for:

  • New database connectors (MySQL, DuckDB)
  • Anomaly detection improvements
  • Documentation and examples

License

MIT — Developed by Robson Bayer Müller
