Production-grade dataset auditing and ML readiness scoring library

DataWatcher

License: MIT

DataWatcher runs a comprehensive battery of 22+ audits across your dataset — checking structure, data quality, statistical properties, categorical features, and ML-specific risks — then produces an overall ML Readiness Score (0–100) and a prioritized Risk Summary.


Installation

pip install datawatcher-ml

The package is published on PyPI as datawatcher-ml and imported as datawatcher.

For PDF report export support:

pip install "datawatcher-ml[pdf]"

Quick Start

Python API

import datawatcher

# Audit a CSV file
results = datawatcher.audit_csv("train.csv", target="survived")

print(results["ml_readiness"])
# {'score': 84, 'grade': 'GOOD', 'total_penalty': 16.0, ...}

print(results["risk_summary"])
# {'risk_level': 'LOW', 'top_risks': ['missing_value_audit'], ...}

# Access individual audit results
for audit in results["audit_results"]:
    print(audit.audit_name, audit.severity, audit.passed)
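A common next step is to list only the failed audits, most severe first. The sketch below is illustrative and runs without DataWatcher installed: `FakeAuditResult` is a hypothetical stand-in exposing just the three attributes used above (`audit_name`, `severity`, `passed`), and the ordering follows the severity names listed under ML Readiness Score.

```python
from dataclasses import dataclass

# Severity names as listed in the ML Readiness Score section, mildest first.
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


@dataclass
class FakeAuditResult:
    """Hypothetical stand-in for an audit result (not the library's class)."""
    audit_name: str
    severity: str
    passed: bool


def failed_by_severity(audit_results):
    """Return the failed audits, most severe first."""
    failed = [a for a in audit_results if not a.passed]
    return sorted(failed, key=lambda a: SEVERITY_ORDER.index(a.severity), reverse=True)


sample_results = [
    FakeAuditResult("shape_audit", "INFO", True),
    FakeAuditResult("missing_value_audit", "LOW", False),
    FakeAuditResult("leakage_audit", "HIGH", False),
]

for audit in failed_by_severity(sample_results):
    print(f"{audit.severity:<8} {audit.audit_name}")
```

With real output, the same function should work on `results["audit_results"]`, provided each item exposes those three attributes.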

Audit an in-memory DataFrame

import pandas as pd
import datawatcher

df = pd.read_csv("transactions.csv")

results = datawatcher.audit_dataframe(
    df,
    target="churn",
    domain="finance"   # activates finance-specific audits
)

Domain-specific auditing

# Healthcare domain adds: age range, BMI, blood pressure,
# heart rate, lab results, missing diagnosis, medication consistency
results = datawatcher.audit_csv(
    "patients.csv",
    target="readmitted",
    domain="healthcare"
)

# Finance domain adds: negative values, currency consistency,
# interest rate validity, balance consistency
results = datawatcher.audit_csv(
    "loans.csv",
    target="default",
    domain="finance"
)

# Time series domain adds: duplicate timestamp detection
results = datawatcher.audit_csv(
    "sensor_data.csv",
    domain="timeseries"
)

CLI Usage

After installation, the datawatcher command is available in the environment where the package was installed:

# Basic audit
datawatcher audit run data.csv

# With target column
datawatcher audit run data.csv --target label

# With domain plugin
datawatcher audit run data.csv --target label --domain healthcare

# Export reports
datawatcher audit run data.csv --target label --export-html --export-pdf --export-json

Audit Catalog

Structural (4 audits)

Audit                     Checks
shape_audit               Row and column counts
dtype_audit               Data type summary per column
memory_usage_audit        Dataset memory footprint
schema_consistency_audit  Mixed types within columns
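To make "mixed types within columns" concrete, here is a minimal plain-Python sketch. It is not DataWatcher's implementation; the helper name `mixed_type_columns` and the choice to ignore `None` values are assumptions made for illustration.

```python
def mixed_type_columns(columns):
    """Find columns whose non-null values have more than one Python type.

    `columns` maps a column name to its list of values. Hypothetical helper:
    None is treated as missing and ignored (an assumption).
    """
    mixed = {}
    for name, values in columns.items():
        kinds = {type(v).__name__ for v in values if v is not None}
        if len(kinds) > 1:
            mixed[name] = sorted(kinds)
    return mixed


data = {
    "age": [34, 51, "unknown", 29],    # int and str mixed together
    "city": ["NY", "LA", None, "SF"],  # only str once None is ignored
}
mixed = mixed_type_columns(data)
print(mixed)
```

Only the `age` column is reported here, because it mixes integers with a string placeholder.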

Quality (5 audits)

Audit                   Threshold                   Source
missing_value_audit     LOW >3%, MEDIUM >15%        Google TFDV
duplicate_audit         LOW >0.5%, MEDIUM >5%       AWS Deequ
constant_feature_audit  Any constant column
near_constant_audit     >95% single value           scikit-learn
invalid_value_audit     Inf/NaN/unrealistic values
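The two-level thresholds above can be read as simple rate cutoffs. The following is a minimal sketch, assuming each threshold is compared against the fraction of affected values or rows. The helper names and the `INFO` result below the LOW cutoff are assumptions, not part of DataWatcher's API.

```python
def severity_for_rate(rate, low, medium):
    """Map a rate in [0, 1] to a severity label using two cutoffs.

    Hypothetical helper mirroring the 'LOW > x, MEDIUM > y' rows;
    returning INFO below the LOW cutoff is an assumption.
    """
    if rate > medium:
        return "MEDIUM"
    if rate > low:
        return "LOW"
    return "INFO"


def missing_rate(values):
    """Fraction of entries that are None or NaN."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and v != v)  # NaN != NaN
    return sum(is_missing(v) for v in values) / len(values)


def duplicate_rate(rows):
    """Fraction of rows that repeat an earlier row (rows must be hashable)."""
    return (len(rows) - len(set(rows))) / len(rows)


age = [34, None, 51, 29, float("nan"), 40, 38, 45, 27, 60]  # 2 of 10 missing
rows = [(i,) for i in range(98)] + [(0,), (1,)]             # 2 of 100 repeated

missing_severity = severity_for_rate(missing_rate(age), low=0.03, medium=0.15)
duplicate_severity = severity_for_rate(duplicate_rate(rows), low=0.005, medium=0.05)
print(missing_severity, duplicate_severity)
```

A 20% missing rate exceeds the 15% cutoff and maps to MEDIUM, while a 2% duplicate rate falls between 0.5% and 5% and maps to LOW.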

Statistical (5 audits)

Audit                    Threshold                        Source
descriptive_stats_audit  Observational (no penalty)
variance_audit           Variance < 0.001                 scikit-learn VarianceThreshold
skewness_audit           |skew| ≥ 1.0                     Hair et al. (2010)
kurtosis_audit           Excess kurtosis > 7              DeCarlo (1997)
outlier_audit            LOW >0.5% rows, MEDIUM >2% rows  IBM Research / TFDV
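As an illustration of the skewness and kurtosis thresholds, the sketch below computes both statistics from population moments and applies the cutoffs from the table. This is an assumption about the estimator: the library may use bias-corrected formulas (as pandas does), which give slightly different values on small samples. The function names are hypothetical.

```python
def skew_and_excess_kurtosis(xs):
    """Population skewness and excess kurtosis (kurtosis minus 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0, 0.0  # constant column: both undefined, reported as 0 here
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def statistical_flags(xs, skew_limit=1.0, kurtosis_limit=7.0):
    """Apply the table's cutoffs: |skew| >= 1.0 and excess kurtosis > 7."""
    skew, kurt = skew_and_excess_kurtosis(xs)
    return {"skewed": abs(skew) >= skew_limit, "heavy_tailed": kurt > kurtosis_limit}


symmetric = [1, 2, 3, 4, 5]
long_tail = [1] * 19 + [100]  # one extreme value among 20

print(statistical_flags(symmetric))
print(statistical_flags(long_tail))
```

The symmetric sample raises neither flag, while the single extreme value in `long_tail` pushes both statistics past their cutoffs.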

Categorical (3 audits)

Audit                     Threshold
category_frequency_audit  Observational
rare_category_audit       Category < 0.5% frequency
category_imbalance_audit  Dominant category > 70%
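The categorical thresholds reduce to shares of a frequency count. A minimal sketch, assuming both thresholds are measured as a fraction of all rows in the column; `category_report` and its return shape are hypothetical and not DataWatcher's API.

```python
from collections import Counter


def category_report(values, rare_below=0.005, dominant_above=0.70):
    """Summarise rare and dominant categories for one column."""
    counts = Counter(values)
    total = len(values)
    # Categories whose share of rows is under the rare cutoff (0.5%).
    rare = sorted(cat for cat, n in counts.items() if n / total < rare_below)
    top_category, top_count = counts.most_common(1)[0]
    top_share = top_count / total
    return {
        "rare_categories": rare,
        # Report the dominant category only if it exceeds the 70% cutoff.
        "dominant_category": top_category if top_share > dominant_above else None,
        "dominant_share": top_share,
    }


plans = ["basic"] * 800 + ["pro"] * 197 + ["enterprise"] * 3
report = category_report(plans)
print(report)
```

Here `enterprise` covers 0.3% of rows and is reported as rare, and `basic` covers 80% and is reported as dominant.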

ML (5 audits)

Audit                    Threshold                       Source
cardinality_audit        > 30% unique values             Industry ML best practice
identifier_risk_audit    > 90% unique values             GDPR / ML risk
target_validation_audit  Target column validity
class_imbalance_audit    Majority class > 75%            Japkowicz & Stephen (2002)
leakage_audit            |Pearson r| > 0.90 with target  Industry standard
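Two of these checks are easy to reproduce by hand: the unique-value ratio behind the cardinality and identifier audits, and the Pearson correlation behind the leakage audit. The sketch below is an illustration under stated assumptions. The helper names are hypothetical, and treating a constant column as zero correlation is a choice made here, not documented behaviour.

```python
import math


def unique_ratio(values):
    """Share of distinct values in a column."""
    return len(set(values)) / len(values)


def uniqueness_flag(values, high_cardinality=0.30, identifier=0.90):
    """Apply the table's cutoffs: > 90% identifier risk, > 30% high cardinality."""
    ratio = unique_ratio(values)
    if ratio > identifier:
        return "identifier_risk"
    if ratio > high_cardinality:
        return "high_cardinality"
    return None


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for a constant column; treated as 0 (assumption)
    return cov / math.sqrt(vx * vy)


def leaks_target(feature, target, threshold=0.90):
    """Flag a feature whose |Pearson r| with the target exceeds the cutoff."""
    return abs(pearson_r(feature, target)) > threshold


target = [0, 0, 0, 1, 1, 1, 0, 1]
refund_code = [2 * t + 1 for t in target]  # a linear copy of the target
noise = [3, 1, 4, 1, 5, 9, 2, 6]           # unrelated measurements
customer_id = list(range(8))               # every value distinct

print(uniqueness_flag(customer_id))
print(leaks_target(refund_code, target), leaks_target(noise, target))
```

A feature that is a linear copy of the target correlates perfectly with it and is flagged, while the unrelated column stays well under 0.90.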

ML Readiness Score

Score = 100 − Σ(severity_weight × audit_weight)

Severity weights: INFO=0, LOW=3, MEDIUM=7, HIGH=15, CRITICAL=25
Audit weights (examples): leakage=3.0, target_validation=3.0, invalid_values=2.0

Grades:
  ≥ 90 → EXCELLENT
  ≥ 75 → GOOD
  ≥ 60 → FAIR
   < 60 → POOR
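A worked example of the formula and grade bands above, as a sketch rather than the library's implementation. Three points are assumptions: audits without a listed weight default to 1.0, the score is clamped at 0 to stay in the 0–100 range, and the audit keys used in `findings` are hypothetical labels.

```python
SEVERITY_WEIGHTS = {"INFO": 0, "LOW": 3, "MEDIUM": 7, "HIGH": 15, "CRITICAL": 25}

# Only these three audit weights are listed above; every other audit
# defaulting to 1.0 is an assumption made for this example.
AUDIT_WEIGHTS = {"leakage": 3.0, "target_validation": 3.0, "invalid_values": 2.0}


def readiness_score(findings):
    """Return (score, total_penalty) for an iterable of (audit_key, severity)."""
    penalty = sum(
        SEVERITY_WEIGHTS[severity] * AUDIT_WEIGHTS.get(audit_key, 1.0)
        for audit_key, severity in findings
    )
    score = max(0.0, 100.0 - penalty)  # clamp at 0 (assumed from the 0-100 range)
    return score, penalty


def grade(score):
    """Map a score to the grade bands listed above."""
    if score >= 90:
        return "EXCELLENT"
    if score >= 75:
        return "GOOD"
    if score >= 60:
        return "FAIR"
    return "POOR"


findings = [
    ("missing_values", "LOW"),   # 3 x 1.0 = 3
    ("skewness", "MEDIUM"),      # 7 x 1.0 = 7
    ("invalid_values", "LOW"),   # 3 x 2.0 = 6
]
score, penalty = readiness_score(findings)
print(score, penalty, grade(score))
```

The penalties sum to 16, giving a score of 84 and a grade of GOOD, the same figures shown in the Quick Start output.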

Extending with Custom Audits

from datawatcher import AuditEngine, AuditRegistry, AuditResult, BaseAudit
from datawatcher.loaders.factory import load_dataset


class MyCustomAudit(BaseAudit):
    """Example audit; replace the body of run() with your own checks."""

    audit_name = "my_custom_audit"
    category = "custom"

    def run(self, dataset, context=None):
        df = dataset.df  # the data to inspect
        # ... your logic ...
        return AuditResult(
            audit_name=self.audit_name,
            category=self.category,
            passed=True,
            severity="INFO",
            findings={"message": "All good"},
            recommendations=[],
        )


# Register the audit and run it programmatically
registry = AuditRegistry()
registry.register(MyCustomAudit())

dataset = load_dataset("data.csv")
engine = AuditEngine(registry)
results = engine.run(dataset, context={"target": "label"})

Return Value Structure

audit_csv() and audit_dataframe() return:

{
    "audit_results": [AuditResult, ...],   
    "ml_readiness": {
        "score": 84,                        
        "grade": "GOOD",                   
        "total_penalty": 16.0,
        "severity_breakdown": {...}
    },
    "risk_summary": {
        "risk_level": "LOW",              
        "top_risks": ["audit_name", ...],
        "high_risk_audits": [...],
        "medium_risk_audits": [...]
    },
    "metadata": {
        "rows": 10000,
        "columns": 25,
        "memory_usage_mb": 4.2
    },
    "semantic_types": {
        "column_name": "numeric",       
        ...
    }
}
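Because the return value is a plain dictionary, it is straightforward to gate a training pipeline on it. The sketch below uses a hand-written sample shaped like the structure above, so it runs without DataWatcher installed. The `gate` helper, the minimum score of 75, and the blocked risk level names are assumptions; only `LOW` appears as a `risk_level` value in this document.

```python
def gate(results, min_score=75, blocked_levels=("HIGH", "CRITICAL")):
    """Decide whether a dataset may proceed to training.

    Hypothetical helper: reads only the documented keys
    results["ml_readiness"]["score"] and results["risk_summary"]["risk_level"].
    """
    score = results["ml_readiness"]["score"]
    risk_level = results["risk_summary"]["risk_level"]
    reasons = []
    if score < min_score:
        reasons.append(f"score {score} is below {min_score}")
    if risk_level in blocked_levels:
        reasons.append(f"risk level is {risk_level}")
    return not reasons, reasons


# Hand-written sample mirroring the documented structure.
sample = {
    "ml_readiness": {"score": 84, "grade": "GOOD", "total_penalty": 16.0},
    "risk_summary": {"risk_level": "LOW", "top_risks": ["missing_value_audit"]},
}

ok, reasons = gate(sample)
print("proceed" if ok else "blocked: " + "; ".join(reasons))
```

In a CI job, a failed gate would typically end the run with a non-zero exit code, for example via `sys.exit(1)`.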

License

MIT © Ranjeet
