One-command dataset health diagnostics — the Ruff of datasets.

dataruff

The Ruff of datasets. One command to discover, explain, score, and fix data quality problems in Pandas DataFrames and CSV/Excel files.

from datadoctor import audit

audit(df)
Data Quality Score: 81/100

Issues Found (5):
  !      42 duplicate rows
  ~      13 invalid email  (column: email)
  !       3 empty columns
  ~       7 outlier  (column: salary)
  .       2 inconsistent date format  (column: created_at)

Rows: 10,000 | Columns: 12

Install

pip install dataruff

The distribution is published as dataruff; the examples in this README import it as datadoctor.

Optionally install rich for prettier terminal output:

pip install "dataruff[rich]"

Quick start

import pandas as pd
from datadoctor import audit, fix, score, validate, detect_pii

df = pd.read_csv("customers.csv")

# Full health report
audit(df)

# Get numeric score
s = score(df)
print(s.overall)   # 81
print(s.to_dict()) # {'overall': 81, 'completeness': 92, ...}

# Auto-fix common issues
clean_df = fix(df)

# Validate against a schema
result = validate(df, schema={
    "email": "email",
    "age":   "0-120",
    "id":    "unique",
})

# PII detection
report = detect_pii(df)
print(report.columns_with_pii)
# {'email': ['email'], 'phone': ['phone'], 'uid': ['aadhaar']}

API reference

| Function | Description | Returns |
|---|---|---|
| audit(df) | Print full health report | InvestigationReport |
| investigate(df) | Structured issue breakdown | InvestigationReport |
| score(df) | Data quality score | ScoreBreakdown |
| fix(df) | Auto-repair common issues | pd.DataFrame |
| validate(df, schema) | Check schema constraints | dict |
| compare(old, new) | Diff two datasets | ComparisonReport |
| detect_pii(df) | Find PII columns | PIIReport |
| mask_pii(df) | Redact PII values | pd.DataFrame |
| detect_drift(old, new) | Distribution drift analysis | DriftReport |
| find_anomalies(df) | Anomaly / outlier detection | dict |

All functions accept a DataFrame, CSV path, or XLSX path as input.


Scoring formula

| Dimension | Weight | Measures |
|---|---|---|
| Completeness | 25% | Non-null ratio across all cells |
| Validity | 25% | Format correctness (emails, dates, types) |
| Consistency | 20% | Uniform types and formats per column |
| Uniqueness | 20% | Absence of duplicate rows |
| Schema compliance | 10% | Adherence to user-provided schema |
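
To make the weighting concrete, here is a minimal sketch of how five per-dimension scores could be combined into one overall number. It assumes a plain weighted average rounded to the nearest integer, which this README does not confirm. The function name `overall_score`, the dictionary keys other than `completeness`, and the sub-score values are hypothetical and chosen only to illustrate the arithmetic.

```python
# Weights from the scoring table, as integer percentages (they sum to 100).
WEIGHTS = {
    "completeness": 25,
    "validity": 25,
    "consistency": 20,
    "uniqueness": 20,
    "schema_compliance": 10,  # hypothetical key name
}


def overall_score(sub_scores: dict[str, float]) -> int:
    """Combine per-dimension scores (each 0-100) into one weighted score.

    Assumption: a simple weighted average, rounded to the nearest integer.
    """
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return round(total / 100)


# Hypothetical sub-scores, not output from the library.
example = {
    "completeness": 92,
    "validity": 80,
    "consistency": 75,
    "uniqueness": 70,
    "schema_compliance": 90,
}
print(overall_score(example))  # 81
```

Keeping the weights as integer percentages and dividing once at the end avoids small floating-point drift in the sum.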

fix() — what gets repaired

| Issue | Fix applied |
|---|---|
| Duplicate rows | Removed |
| Leading/trailing whitespace | Stripped |
| Boolean strings (yes/no/true/false) | Converted to bool |
| Mixed date formats | Normalized to YYYY-MM-DD |
| Missing numeric values | Filled with column median |
| Missing string values | Filled with column mode |
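
The repairs listed above can be sketched in a few lines of dependency-free Python. This is an illustration of the rules, not the library's implementation: it works on a list of dict rows rather than a DataFrame, the helper names `clean_value` and `fix_rows` are hypothetical, the order of the steps is an assumption, and date normalization is left out to keep the sketch short.

```python
from statistics import median, mode

BOOL_STRINGS = {"yes": True, "true": True, "no": False, "false": False}


def clean_value(value):
    """Strip whitespace and convert boolean-like strings to bool."""
    if isinstance(value, str):
        value = value.strip()
        return BOOL_STRINGS.get(value.lower(), value)
    return value


def is_number(value):
    """True for int/float values, excluding bool (a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def fix_rows(rows):
    """Apply a subset of the repairs in the table to a list of dict rows."""
    # 1. Strip whitespace and convert boolean strings (new dicts, input untouched).
    cleaned = [{k: clean_value(v) for k, v in row.items()} for row in rows]

    # 2. Drop exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in cleaned:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 3. Fill missing values: median for numeric columns, mode otherwise.
    columns = {name for row in unique for name in row}
    for col in columns:
        present = [r[col] for r in unique if r.get(col) is not None]
        if not present:
            continue  # an entirely empty column has nothing to fill from
        fill = median(present) if all(map(is_number, present)) else mode(present)
        for r in unique:
            if r.get(col) is None:
                r[col] = fill
    return unique


rows = [
    {"name": " Alice ", "active": "yes", "salary": 50000},
    {"name": "Alice", "active": "Yes", "salary": 50000},  # duplicate once cleaned
    {"name": "Bob", "active": "no", "salary": None},
    {"name": None, "active": "no", "salary": 70000},
]
fixed = fix_rows(rows)
print(len(fixed))          # 3
print(fixed[1]["salary"])  # 60000.0
```

Cleaning before deduplicating means rows that differ only in whitespace or boolean spelling collapse into one; a real implementation may choose a different order.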

validate() — schema rules

validate(df, schema={
    "email":   "email",          # valid email format
    "age":     "0-120",          # numeric range
    "user_id": "unique",         # no duplicates
    "price":   "positive",       # > 0
    "code":    "not_null",       # no missing values
    "ref":     "regex:[A-Z]{3}", # custom regex
})
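
As a rough sketch of how rule strings like these might be interpreted, the function below checks one column of values against one rule and returns the positions that fail. The function name `violations`, the loose email pattern, the inclusive range bounds, full-string regex matching, and the choice to skip missing values outside `not_null` are all assumptions; the library may behave differently on each point.

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately loose
RANGE_RE = re.compile(r"(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)")


def is_number(value):
    """True for int/float values, excluding bool (a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def violations(values, rule):
    """Return the positions in `values` that break `rule`."""
    if rule == "not_null":
        return [i for i, v in enumerate(values) if v is None]
    if rule == "unique":
        seen, bad = set(), []
        for i, v in enumerate(values):
            if v in seen:
                bad.append(i)
            seen.add(v)
        return bad

    # The remaining rules are per-value predicates.
    if rule == "email":
        ok = lambda v: isinstance(v, str) and bool(EMAIL_RE.fullmatch(v))
    elif rule == "positive":
        ok = lambda v: is_number(v) and v > 0
    elif rule.startswith("regex:"):
        pattern = re.compile(rule.removeprefix("regex:"))
        ok = lambda v: isinstance(v, str) and bool(pattern.fullmatch(v))
    elif match := RANGE_RE.fullmatch(rule):
        low, high = float(match[1]), float(match[2])
        ok = lambda v: is_number(v) and low <= v <= high  # inclusive bounds
    else:
        raise ValueError(f"unknown rule: {rule!r}")

    # Missing values are skipped here and left to the not_null rule.
    return [i for i, v in enumerate(values) if v is not None and not ok(v)]


print(violations([25, 130, -1], "0-120"))             # [1, 2]
print(violations(["ABC", "abcd"], "regex:[A-Z]{3}"))  # [1]
```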

detect_pii() — supported PII types

| Type | Example |
|---|---|
| email | alice@example.com |
| phone | 9876543210 |
| aadhaar | 2345 6789 0123 |
| pan | ABCDE1234F |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
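
Pattern-based PII detection of this kind can be sketched with one regular expression per type. The patterns below are simplified guesses that match the example values in the table; they are not the library's actual rules, and the function name `detect_pii_columns` and the 50% match threshold are hypothetical.

```python
import re

# One simplified pattern per PII type; each must match the whole value.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\d{10}"),
    "aadhaar": re.compile(r"\d{4}\s?\d{4}\s?\d{4}"),
    "pan": re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card": re.compile(r"(?:\d{4}[ -]?){3}\d{4}"),
}


def detect_pii_columns(table, threshold=0.5):
    """Map each column name to the PII types found in it.

    `table` is a dict of column name -> list of values. A column is flagged
    for a type when at least `threshold` of its non-missing values match.
    """
    found = {}
    for column, values in table.items():
        texts = [str(v).strip() for v in values if v is not None]
        if not texts:
            continue
        kinds = [
            kind
            for kind, pattern in PII_PATTERNS.items()
            if sum(bool(pattern.fullmatch(t)) for t in texts) / len(texts) >= threshold
        ]
        if kinds:
            found[column] = kinds
    return found


table = {
    "email": ["alice@example.com"],
    "phone": ["9876543210", None],
    "uid": ["2345 6789 0123"],
    "city": ["Pune", "Delhi"],
}
print(detect_pii_columns(table))
# {'email': ['email'], 'phone': ['phone'], 'uid': ['aadhaar']}
```

Matching on content rather than column names is what lets a column called `uid` be reported as `aadhaar`.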

CLI

# Audit a CSV file
dataruff audit customers.csv

# Output as JSON
dataruff audit customers.csv --json

# Fix issues and write cleaned file
dataruff fix customers.csv
# -> customers_clean.csv

# Compare two datasets
dataruff compare old.csv new.csv

# Data quality score
dataruff score customers.csv

# PII detection
dataruff detect-pii customers.csv

# Mask PII
dataruff mask-pii customers.csv
# -> customers_masked.csv

Architecture

datadoctor/
├── analyzers/       # DuplicateAnalyzer, NullAnalyzer, TypeAnalyzer,
│                    # FormatAnalyzer, OutlierAnalyzer, PIIAnalyzer, DriftAnalyzer
├── scoring/         # Weighted scoring engine
├── fixing/          # Auto-remediation rules
└── reporting/       # Terminal (rich + plain fallback) and JSON output

No LLMs. No API calls. Everything deterministic and offline.


Requirements

  • Python 3.10+
  • pandas, numpy, scipy, scikit-learn, openpyxl, python-dateutil

License

MIT
