
One-command dataset health diagnostics - the Ruff of datasets.


dataruff


The Ruff of datasets. One command to discover, explain, score, and fix data quality problems in pandas DataFrames and CSV/Excel files.

from dataruff import audit

audit(df)
Data Quality Score: 81/100

Issues Found (5):
  !      42 duplicate rows
  ~      13 invalid email  (column: email)
  !       3 empty columns
  ~       7 outlier  (column: salary)
  .       2 inconsistent date format  (column: created_at)

Rows: 10,000 | Columns: 12

Install

pip install dataruff

Optionally install rich for prettier terminal output:

pip install "dataruff[rich]"

Quick start

import pandas as pd
from dataruff import audit, fix, score, validate, detect_pii

df = pd.read_csv("customers.csv")

# Full health report
audit(df)

# Get numeric score
s = score(df)
print(s.overall)   # 81
print(s.to_dict()) # {'overall': 81, 'completeness': 92, ...}

# Auto-fix common issues
clean_df = fix(df)

# Validate against a schema
result = validate(df, schema={
    "email": "email",
    "age":   "0-120",
    "id":    "unique",
})

# PII detection
report = detect_pii(df)
print(report.columns_with_pii)
# {'email': ['email'], 'phone': ['phone'], 'uid': ['aadhaar']}

API reference

| Function | Description | Returns |
| --- | --- | --- |
| audit(df) | Print full health report | InvestigationReport |
| investigate(df) | Structured issue breakdown | InvestigationReport |
| score(df) | Data quality score | ScoreBreakdown |
| fix(df) | Auto-repair common issues | pd.DataFrame |
| validate(df, schema) | Check schema constraints | dict |
| compare(old, new) | Diff two datasets | ComparisonReport |
| detect_pii(df) | Find PII columns | PIIReport |
| mask_pii(df) | Redact PII values | pd.DataFrame |
| detect_drift(old, new) | Distribution drift analysis | DriftReport |
| find_anomalies(df) | Anomaly / outlier detection | dict |

All functions accept a DataFrame, CSV path, or XLSX path as input.
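How dataruff dispatches on the input type is not described here. As a rough illustration only, accepting "a DataFrame, CSV path, or XLSX path" could look like the sketch below. The helper name `load_input` is hypothetical and not part of the dataruff API.

```python
from pathlib import Path

import pandas as pd


def load_input(data):
    """Hypothetical helper: return a DataFrame for any supported input."""
    # DataFrames pass through untouched.
    if isinstance(data, pd.DataFrame):
        return data

    # Anything else is treated as a file path and dispatched on its suffix.
    path = Path(data)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        return pd.read_excel(path)  # pandas uses openpyxl for .xlsx files
    raise ValueError(f"unsupported input: {path.name}")
```

With a helper like this, every public function can start by normalizing its argument and then work on a DataFrame only.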


Scoring formula

| Dimension | Weight | Measures |
| --- | --- | --- |
| Completeness | 25% | Non-null ratio across all cells |
| Validity | 25% | Format correctness (emails, dates, types) |
| Consistency | 20% | Uniform types and formats per column |
| Uniqueness | 20% | Absence of duplicate rows |
| Schema compliance | 10% | Adherence to user-provided schema |
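The weights sum to 100%, so the overall score can be read as a weighted average of the five dimension scores. The sketch below works through that arithmetic. It assumes each dimension is scored 0-100 and that the result is rounded to an integer; the per-dimension values are made up for illustration, and the actual aggregation inside dataruff may differ.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "completeness": 25,
    "validity": 25,
    "consistency": 20,
    "uniqueness": 20,
    "schema_compliance": 10,
}


def overall_score(dimensions: dict[str, float]) -> int:
    """Combine per-dimension scores (0-100) into one weighted score."""
    total = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return round(total / 100)


# Illustrative values only, not output from dataruff.
example = {
    "completeness": 92,
    "validity": 80,
    "consistency": 75,
    "uniqueness": 70,
    "schema_compliance": 90,
}
print(overall_score(example))  # 81
```

Here the contributions are 23 + 20 + 15 + 14 + 9 = 81.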

fix() — what gets repaired

| Issue | Fix applied |
| --- | --- |
| Duplicate rows | Removed |
| Leading/trailing whitespace | Stripped |
| Boolean strings (yes/no/true/false) | Converted to bool |
| Mixed date formats | Normalized to YYYY-MM-DD |
| Missing numeric values | Filled with column median |
| Missing string values | Filled with column mode |
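To make these repairs concrete, here is a minimal pandas sketch of four of them: whitespace stripping, duplicate removal, median fill, and mode fill. It is not dataruff's implementation. The helper name `basic_fix` and the order of the steps are assumptions, and the boolean and date conversions are left out.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def basic_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply four of the repairs listed above."""
    out = df.copy()

    # Strip leading/trailing whitespace from string values only.
    for col in out.columns:
        if not is_numeric_dtype(out[col]):
            out[col] = out[col].map(
                lambda v: v.strip() if isinstance(v, str) else v
            )

    # Remove exact duplicate rows (after stripping, so " Ann" and "Ann" collapse).
    out = out.drop_duplicates().reset_index(drop=True)

    # Fill gaps: median for numeric columns, most frequent value otherwise.
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


raw = pd.DataFrame({
    "name": [" Ann", "Ann", "Bob", None, "Bob ", "Cy"],
    "age": [30.0, 30.0, None, 40.0, 50.0, 60.0],
})
clean = basic_fix(raw)
print(clean)
```

Stripping before deduplicating is a deliberate choice in this sketch: rows that differ only by stray whitespace are treated as duplicates.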

validate() — schema rules

validate(df, schema={
    "email":   "email",          # valid email format
    "age":     "0-120",          # numeric range
    "user_id": "unique",         # no duplicates
    "price":   "positive",       # > 0
    "code":    "not_null",       # no missing values
    "ref":     "regex:[A-Z]{3}", # custom regex
})
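The rule strings above are compact, so it may help to see how they could be interpreted. The following is a sketch under stated assumptions, not dataruff's parser: `count_violations` is a hypothetical helper, missing values are only counted by `not_null`, the email pattern is deliberately loose, and `regex:` rules must match the whole value.

```python
import math
import re
from collections import Counter

# Deliberately loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
# Matches range rules such as "0-120".
RANGE_RE = re.compile(r"(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)")


def _missing(value) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_violations(values, rule: str) -> int:
    """Hypothetical helper: count values in one column that break `rule`."""
    values = list(values)
    present = [v for v in values if not _missing(v)]

    if rule == "not_null":
        return len(values) - len(present)
    if rule == "unique":
        # Every repeat beyond the first occurrence counts as a violation.
        return sum(n - 1 for n in Counter(present).values() if n > 1)
    if rule == "positive":
        return sum(1 for v in present if not v > 0)
    if rule == "email":
        return sum(1 for v in present if not EMAIL_RE.fullmatch(str(v)))
    if rule.startswith("regex:"):
        pattern = re.compile(rule[len("regex:"):])
        return sum(1 for v in present if not pattern.fullmatch(str(v)))

    range_match = RANGE_RE.fullmatch(rule)
    if range_match:
        low, high = float(range_match.group(1)), float(range_match.group(2))
        return sum(1 for v in present if not low <= v <= high)
    raise ValueError(f"unknown rule: {rule!r}")
```

A pandas column can be passed directly, for example `count_violations(df["age"], "0-120")`, since a Series is iterable.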

detect_pii() — supported PII types

| Type | Example |
| --- | --- |
| email | alice@example.com |
| phone | 9876543210 |
| aadhaar | 2345 6789 0123 |
| pan | ABCDE1234F |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
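Since the package is described as deterministic and offline, pattern matching is one plausible way to recognize these types. The patterns below are assumptions written to fit the example values in the table; they are not taken from dataruff. A production detector would likely add checksum validation, such as Luhn for card numbers, to cut false positives.

```python
import re

# Illustrative patterns only, shaped after the examples in the table above.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\d{10}"),
    "aadhaar": re.compile(r"\d{4} ?\d{4} ?\d{4}"),
    "pan": re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card": re.compile(r"\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}"),
}


def classify(value: str) -> list[str]:
    """Hypothetical helper: return the PII types whose pattern matches the whole value."""
    text = value.strip()
    return [name for name, pattern in PII_PATTERNS.items() if pattern.fullmatch(text)]


for sample in ["alice@example.com", "2345 6789 0123", "hello"]:
    print(sample, "->", classify(sample))
```

A column-level report could then flag a column when most of its non-null values classify as the same type.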

CLI

# Audit a CSV file
dataruff audit customers.csv

# Output as JSON
dataruff audit customers.csv --json

# Fix issues and write cleaned file
dataruff fix customers.csv
# -> customers_clean.csv

# Compare two datasets
dataruff compare old.csv new.csv

# Data quality score
dataruff score customers.csv

# PII detection
dataruff detect-pii customers.csv

# Mask PII
dataruff mask-pii customers.csv
# -> customers_masked.csv

Architecture

dataruff/
├── analyzers/       # DuplicateAnalyzer, NullAnalyzer, TypeAnalyzer,
│                    # FormatAnalyzer, OutlierAnalyzer, PIIAnalyzer, DriftAnalyzer
├── scoring/         # Weighted scoring engine
├── fixing/          # Auto-remediation rules
└── reporting/       # Terminal (rich + plain fallback) and JSON output

No LLMs. No API calls. Everything deterministic and offline.


Requirements

  • Python 3.10+
  • pandas, numpy, scipy, scikit-learn, openpyxl, python-dateutil

License

MIT
