DataWatcher
Production-grade dataset auditing and ML readiness scoring library.
DataWatcher runs 22+ audits across your dataset (structure, data quality, statistical properties, categorical features, and ML-specific risks), then produces an overall ML Readiness Score (0–100) and a prioritized Risk Summary.
Installation
```shell
pip install datawatcher
```
For PDF report export support:
```shell
pip install "datawatcher[pdf]"
```
Quick Start
Python API
```python
import datawatcher

# Audit a CSV file
results = datawatcher.audit_csv("train.csv", target="survived")

print(results["ml_readiness"])
# {'score': 84, 'grade': 'GOOD', 'total_penalty': 16.0, ...}

print(results["risk_summary"])
# {'risk_level': 'LOW', 'top_risks': ['missing_value_audit'], ...}

# Access individual audit results
for audit in results["audit_results"]:
    print(audit.audit_name, audit.severity, audit.passed)
```
Audit an in-memory DataFrame
```python
import pandas as pd
import datawatcher

df = pd.read_csv("transactions.csv")
results = datawatcher.audit_dataframe(
    df,
    target="churn",
    domain="finance",  # activates finance-specific audits
)
```
Domain-specific auditing
```python
import datawatcher

# Healthcare domain adds: age range, BMI, blood pressure,
# heart rate, lab results, missing diagnosis, medication consistency
results = datawatcher.audit_csv(
    "patients.csv",
    target="readmitted",
    domain="healthcare",
)

# Finance domain adds: negative values, currency consistency,
# interest rate validity, balance consistency
results = datawatcher.audit_csv(
    "loans.csv",
    target="default",
    domain="finance",
)

# Time series domain adds: duplicate timestamp detection
results = datawatcher.audit_csv(
    "sensor_data.csv",
    domain="timeseries",
)
```
CLI Usage
After installation, the datawatcher command is available globally:
```shell
# Basic audit
datawatcher audit run data.csv

# With target column
datawatcher audit run data.csv --target label

# With domain plugin
datawatcher audit run data.csv --target label --domain healthcare

# Export reports
datawatcher audit run data.csv --target label --export-html --export-pdf --export-json
```
Audit Catalog
Structural (4 audits)
| Audit | Checks |
|---|---|
| shape_audit | Row and column counts |
| dtype_audit | Data type summary per column |
| memory_usage_audit | Dataset memory footprint |
| schema_consistency_audit | Mixed types within columns |
Quality (5 audits)
| Audit | Threshold | Source |
|---|---|---|
| missing_value_audit | LOW >3%, MEDIUM >15% | Google TFDV |
| duplicate_audit | LOW >0.5%, MEDIUM >5% | AWS Deequ |
| constant_feature_audit | Any constant column | - |
| near_constant_audit | >95% single value | scikit-learn |
| invalid_value_audit | Inf/NaN/unrealistic values | - |
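As a rough illustration of how a tiered threshold such as the one listed for missing_value_audit could be applied, here is a minimal sketch in plain Python. It is not DataWatcher's implementation: the function names are hypothetical, and mapping values at or below the LOW threshold to `INFO` is an assumption.

```python
import math

# Thresholds from the Quality table: LOW above 3%, MEDIUM above 15%.
LOW_THRESHOLD = 0.03
MEDIUM_THRESHOLD = 0.15


def missing_fraction(values):
    """Return the share of entries that are None or NaN (hypothetical helper)."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def missing_severity(fraction):
    """Map a missing-value fraction to a severity label.

    Assumption: anything at or below the LOW threshold is reported as INFO.
    """
    if fraction > MEDIUM_THRESHOLD:
        return "MEDIUM"
    if fraction > LOW_THRESHOLD:
        return "LOW"
    return "INFO"


# A 20-row column with two missing entries (10% missing).
column = [1.0, None, 3.0, float("nan")] + [2.0] * 16
severity = missing_severity(missing_fraction(column))
print(severity)
```

Both comparisons are strict, so a column with exactly 3% missing values stays at `INFO` in this sketch.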
Statistical (5 audits)
| Audit | Threshold | Source |
|---|---|---|
| descriptive_stats_audit | Observational (no penalty) | - |
| variance_audit | Variance < 0.001 | scikit-learn VarianceThreshold |
| skewness_audit | \|skew\| ≥ 1.0 | Hair et al. (2010) |
| kurtosis_audit | Excess kurtosis > 7 | DeCarlo (1997) |
| outlier_audit | LOW >0.5% rows, MEDIUM >2% rows | IBM Research / TFDV |
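To make the skewness and kurtosis thresholds concrete, the sketch below computes both statistics from population central moments and compares them with the limits in the table. This is an illustration only: the helper names are hypothetical, and DataWatcher may use a different (for example bias-corrected) estimator.

```python
# Thresholds from the Statistical table.
SKEW_THRESHOLD = 1.0       # flag when |skew| >= 1.0
KURTOSIS_THRESHOLD = 7.0   # flag when excess kurtosis > 7


def central_moment(values, order):
    """Population central moment of the given order."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def shape_flags(values):
    """Compute moment-based skewness and excess kurtosis and apply the thresholds."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("constant column: skewness and kurtosis are undefined")
    skew = central_moment(values, 3) / m2 ** 1.5
    excess_kurtosis = central_moment(values, 4) / m2 ** 2 - 3.0
    return {
        "skew": skew,
        "excess_kurtosis": excess_kurtosis,
        "skew_flagged": abs(skew) >= SKEW_THRESHOLD,
        "kurtosis_flagged": excess_kurtosis > KURTOSIS_THRESHOLD,
    }


# Nine identical readings and one large spike: strongly right-skewed.
flags = shape_flags([1.0] * 9 + [100.0])
print(flags["skew_flagged"], flags["kurtosis_flagged"])
```

In this example the column exceeds the skewness limit but stays under the kurtosis limit, which shows that the two audits can disagree on the same data.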
Categorical (3 audits)
| Audit | Threshold |
|---|---|
| category_frequency_audit | Observational |
| rare_category_audit | Category < 0.5% frequency |
| category_imbalance_audit | Dominant category > 70% |
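The two categorical thresholds can be checked from a single frequency count. The following is a minimal sketch under that assumption; `category_report` is a hypothetical helper, not part of DataWatcher's API.

```python
from collections import Counter

# Thresholds from the Categorical table.
RARE_THRESHOLD = 0.005      # category share below 0.5%
DOMINANT_THRESHOLD = 0.70   # dominant category share above 70%


def category_report(values):
    """Return rare categories and the dominant category, if any (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot audit an empty column")
    shares = {category: n / total for category, n in counts.items()}
    rare = sorted(c for c, share in shares.items() if share < RARE_THRESHOLD)
    top_category, top_share = max(shares.items(), key=lambda item: item[1])
    dominant = top_category if top_share > DOMINANT_THRESHOLD else None
    return {"rare": rare, "dominant": dominant, "shares": shares}


# 1,000 payments: "card" dominates and "cheque" appears only 3 times (0.3%).
payments = ["card"] * 800 + ["cash"] * 197 + ["cheque"] * 3
report = category_report(payments)
print(report["rare"], report["dominant"])
```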
ML (5 audits)
| Audit | Threshold | Source |
|---|---|---|
| cardinality_audit | > 30% unique values | Industry ML best practice |
| identifier_risk_audit | > 90% unique values | GDPR / ML risk |
| target_validation_audit | Target column validity | - |
| class_imbalance_audit | Majority class > 75% | Japkowicz & Stephen (2002) |
| leakage_audit | \|Pearson r\| > 0.90 with target | Industry standard |
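The leakage threshold is a plain correlation check, which can be sketched in a few lines of standard-library Python. The function names and the example columns are hypothetical, and returning 0.0 for a constant column is an assumption made here to keep the example simple; DataWatcher's own handling may differ.

```python
import math

LEAKAGE_THRESHOLD = 0.90  # |Pearson r| above this is treated as suspicious


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: correlation with a constant column is reported as 0.0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def leakage_suspects(features, target):
    """Names of features whose |r| with the target exceeds the threshold."""
    return sorted(
        name for name, column in features.items()
        if abs(pearson_r(column, target)) > LEAKAGE_THRESHOLD
    )


target = [0, 1, 0, 1, 1, 0]
features = {
    "refund_issued": [0, 1, 0, 1, 1, 0],  # a copy of the target: classic leakage
    "age": [25, 30, 41, 22, 37, 33],
}
suspects = leakage_suspects(features, target)
print(suspects)
```

Pearson correlation only captures linear relationships, so a check like this can miss leakage through non-linear or categorical features.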
ML Readiness Score
Score = 100 − Σ(severity_weight × audit_weight)
Severity weights: INFO=0, LOW=3, MEDIUM=7, HIGH=15, CRITICAL=25
Audit weights (examples): leakage=3.0, target_validation=3.0, invalid_values=2.0
Grades:
≥ 90 → EXCELLENT
≥ 75 → GOOD
≥ 60 → FAIR
< 60 → POOR
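The formula and grade bands above can be worked through in a short sketch. Several details here are assumptions rather than documented behavior: audits without a listed weight default to 1.0, the score is clamped at 0, and the function and key names are hypothetical. The example findings were chosen to reproduce the score of 84 and penalty of 16.0 shown in the Quick Start output.

```python
# Severity weights as listed above.
SEVERITY_WEIGHTS = {"INFO": 0, "LOW": 3, "MEDIUM": 7, "HIGH": 15, "CRITICAL": 25}

# Example audit weights as listed above; the full table is not documented here.
AUDIT_WEIGHTS = {"leakage": 3.0, "target_validation": 3.0, "invalid_values": 2.0}
DEFAULT_AUDIT_WEIGHT = 1.0  # assumption for audits without a listed weight


def readiness_score(findings):
    """Return (score, total_penalty) for a list of (audit, severity) pairs."""
    total_penalty = sum(
        SEVERITY_WEIGHTS[severity] * AUDIT_WEIGHTS.get(audit, DEFAULT_AUDIT_WEIGHT)
        for audit, severity in findings
    )
    # Assumption: the score is clamped so it never drops below 0.
    return max(0.0, 100.0 - total_penalty), total_penalty


def grade(score):
    """Map a score to the grade bands listed above."""
    if score >= 90:
        return "EXCELLENT"
    if score >= 75:
        return "GOOD"
    if score >= 60:
        return "FAIR"
    return "POOR"


# One MEDIUM finding at weight 1.0 (7) plus one LOW leakage finding at weight 3.0 (9).
findings = [("missing_value", "MEDIUM"), ("leakage", "LOW")]
score, penalty = readiness_score(findings)
print(score, penalty, grade(score))
```

Because the audit weight multiplies the severity weight, a LOW finding on a heavily weighted audit such as leakage costs more than a MEDIUM finding on a default-weight audit.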
Extending with Custom Audits
```python
from datawatcher import AuditEngine, AuditRegistry, AuditResult, BaseAudit
from datawatcher.loaders.factory import load_dataset


class MyCustomAudit(BaseAudit):
    audit_name = "my_custom_audit"
    category = "custom"

    def run(self, dataset, context=None):
        df = dataset.df
        # ... your logic ...
        return AuditResult(
            audit_name=self.audit_name,
            category=self.category,
            passed=True,
            severity="INFO",
            findings={"message": "All good"},
            recommendations=[],
        )


# Use programmatically: register the audit, load a dataset, and run the engine
registry = AuditRegistry()
registry.register(MyCustomAudit())

dataset = load_dataset("data.csv")
engine = AuditEngine(registry)
results = engine.run(dataset, context={"target": "label"})
```
Return Value Structure
audit_csv() and audit_dataframe() return:
```python
{
    "audit_results": [AuditResult, ...],
    "ml_readiness": {
        "score": 84,
        "grade": "GOOD",
        "total_penalty": 16.0,
        "severity_breakdown": {...}
    },
    "risk_summary": {
        "risk_level": "LOW",
        "top_risks": ["audit_name", ...],
        "high_risk_audits": [...],
        "medium_risk_audits": [...]
    },
    "metadata": {
        "rows": 10000,
        "columns": 25,
        "memory_usage_mb": 4.2
    },
    "semantic_types": {
        "column_name": "numeric",
        ...
    }
}
```
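One way to use this structure is as a quality gate in a training pipeline. The sketch below reads only keys shown above and runs on a hand-written stand-in dictionary rather than real library output. The `readiness_gate` helper, its `min_score` default, and the assumption that `high_risk_audits` holds audit names are all hypothetical.

```python
def readiness_gate(results, min_score=75.0):
    """Return (ok, reasons) for a result dictionary shaped like the one above."""
    readiness = results["ml_readiness"]
    risks = results["risk_summary"]
    reasons = []
    if readiness["score"] < min_score:
        reasons.append(
            f"score {readiness['score']} is below the minimum of {min_score}"
        )
    if risks["high_risk_audits"]:
        reasons.append(f"{len(risks['high_risk_audits'])} high-risk audit(s) reported")
    return (not reasons, reasons)


# Stand-in for the return value of audit_csv() / audit_dataframe().
sample_results = {
    "ml_readiness": {"score": 84, "grade": "GOOD", "total_penalty": 16.0},
    "risk_summary": {
        "risk_level": "LOW",
        "top_risks": ["missing_value_audit"],
        "high_risk_audits": [],
        "medium_risk_audits": ["missing_value_audit"],
    },
}

ok, reasons = readiness_gate(sample_results)
print(ok, reasons)
```

Returning the reasons alongside the boolean lets a CI job print why a dataset was rejected instead of failing silently.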
License
MIT © Ranjeet