One-command dataset health diagnostics — the Ruff of datasets.

dataruff

The Ruff of datasets. One command to discover, explain, score, and fix data quality problems in pandas DataFrames and CSV/Excel files.

from datadoctor import audit

audit(df)

Data Quality Score: 81/100

Issues Found (5):
  !      42 duplicate rows
  ~      13 invalid email  (column: email)
  !       3 empty columns
  ~       7 outlier  (column: salary)
  .       2 inconsistent date format  (column: created_at)

Rows: 10,000 | Columns: 12

Install

pip install dataruff

Optionally install rich for prettier terminal output:

pip install "dataruff[rich]"

Quick start

import pandas as pd
from datadoctor import audit, fix, score, validate, detect_pii

df = pd.read_csv("customers.csv")

# Full health report
audit(df)

# Get numeric score
s = score(df)
print(s.overall)   # 81
print(s.to_dict()) # {'overall': 81, 'completeness': 92, ...}

# Auto-fix common issues
clean_df = fix(df)

# Validate against a schema
result = validate(df, schema={
    "email": "email",
    "age":   "0-120",
    "id":    "unique",
})

# PII detection
report = detect_pii(df)
print(report.columns_with_pii)
# {'email': ['email'], 'phone': ['phone'], 'uid': ['aadhaar']}

API reference

| Function | Description | Returns |
| --- | --- | --- |
| `audit(df)` | Print full health report | `InvestigationReport` |
| `investigate(df)` | Structured issue breakdown | `InvestigationReport` |
| `score(df)` | Data quality score | `ScoreBreakdown` |
| `fix(df)` | Auto-repair common issues | `pd.DataFrame` |
| `validate(df, schema)` | Check schema constraints | `dict` |
| `compare(old, new)` | Diff two datasets | `ComparisonReport` |
| `detect_pii(df)` | Find PII columns | `PIIReport` |
| `mask_pii(df)` | Redact PII values | `pd.DataFrame` |
| `detect_drift(old, new)` | Distribution drift analysis | `DriftReport` |
| `find_anomalies(df)` | Anomaly / outlier detection | `dict` |

All functions accept a DataFrame, CSV path, or XLSX path as input.
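
To make the "DataFrame, CSV path, or XLSX path" contract concrete, here is a minimal sketch of how such input could be normalized with plain pandas. The helper name `load_input` is hypothetical and is not part of the documented API; how the library actually dispatches on input type is an assumption.

```python
from pathlib import Path
from typing import Union

import pandas as pd


def load_input(source: Union[pd.DataFrame, str, Path]) -> pd.DataFrame:
    """Hypothetical helper: return a DataFrame for any of the three input kinds."""
    # A DataFrame is passed through unchanged.
    if isinstance(source, pd.DataFrame):
        return source

    # Otherwise treat the argument as a file path and dispatch on its extension.
    path = Path(source)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        return pd.read_excel(path)  # pandas uses openpyxl for .xlsx files
    raise ValueError(f"Unsupported input: {path.name!r}")
```

With a helper like this, each public function can call it once at the top and work on a DataFrame from there on.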

Scoring formula

| Dimension | Weight | Measures |
| --- | --- | --- |
| Completeness | 25% | Non-null ratio across all cells |
| Validity | 25% | Format correctness (emails, dates, types) |
| Consistency | 20% | Uniform types and formats per column |
| Uniqueness | 20% | Absence of duplicate rows |
| Schema compliance | 10% | Adherence to user-provided schema |
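
The weights above sum to 100%, so the overall score can be read as a weighted average of the five dimension scores. The sketch below works through one example. The dimension key names (other than `completeness`), the rounding to the nearest integer, and every dimension value except the `completeness` of 92 shown in the quick start are assumptions chosen for illustration.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "completeness": 25,
    "validity": 25,
    "consistency": 20,
    "uniqueness": 20,
    "schema_compliance": 10,
}


def overall_score(dimensions: dict[str, float]) -> int:
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    total = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return round(total / 100)  # rounding rule is an assumption


# Hypothetical dimension scores; only completeness=92 appears in the README.
example = {
    "completeness": 92,
    "validity": 76,
    "consistency": 80,
    "uniqueness": 85,
    "schema_compliance": 60,
}
print(overall_score(example))  # 81
```

Here 0.25·92 + 0.25·76 + 0.20·80 + 0.20·85 + 0.10·60 = 23 + 19 + 16 + 17 + 6 = 81.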

`fix()` — what gets repaired

| Issue | Fix applied |
| --- | --- |
| Duplicate rows | Removed |
| Leading/trailing whitespace | Stripped |
| Boolean strings (`yes/no/true/false`) | Converted to `bool` |
| Mixed date formats | Normalized to `YYYY-MM-DD` |
| Missing numeric values | Filled with column median |
| Missing string values | Filled with column mode |
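
For readers who want to see what these repairs look like in plain pandas, here is a rough sketch of most of them. It is not the library's implementation: the function name `basic_fix`, the order of the steps, and the tie-breaking for the mode are assumptions, and date normalization is left out.

```python
import pandas as pd

BOOL_STRINGS = {"yes": True, "no": False, "true": True, "false": False}


def _is_number_col(s: pd.Series) -> bool:
    """True for int/float columns, False for bool and everything else."""
    types = pd.api.types
    return types.is_numeric_dtype(s) and not types.is_bool_dtype(s)


def _is_bool_string(v: object) -> bool:
    return isinstance(v, str) and v.lower() in BOOL_STRINGS


def basic_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative approximation of the repairs in the table (dates omitted)."""
    out = df.copy()
    other_cols = [c for c in out.columns if not _is_number_col(out[c])]

    # 1. Strip leading/trailing whitespace from string cells.
    for col in other_cols:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # 2. Remove exact duplicate rows (after stripping, so "a " and "a" match).
    out = out.drop_duplicates().reset_index(drop=True)

    # 3. Fill missing values: median for numeric columns, mode for the rest.
    for col in out.columns:
        if _is_number_col(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])

    # 4. Convert columns holding only yes/no/true/false strings to bool.
    for col in other_cols:
        values = out[col]
        if values.notna().all() and values.map(_is_bool_string).all():
            out[col] = values.map(lambda v: BOOL_STRINGS[v.lower()]).astype(bool)
    return out
```

Stripping before deduplicating is a deliberate choice in this sketch: rows that differ only by stray whitespace then collapse into one.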

`validate()` — schema rules

validate(df, schema={
    "email":   "email",          # valid email format
    "age":     "0-120",          # numeric range
    "user_id": "unique",         # no duplicates
    "price":   "positive",       # > 0
    "code":    "not_null",       # no missing values
    "ref":     "regex:[A-Z]{3}", # custom regex
})
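
As an illustration of how rule strings like these can be interpreted, the sketch below turns one rule into a per-row pass/fail mask. The helper name `passes`, the email pattern, and the choice to apply `regex:` patterns to the whole value are assumptions; the library's own matching semantics and its returned `dict` layout are not documented here.

```python
import re

import pandas as pd

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately loose
RANGE_RE = re.compile(r"(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)")


def passes(series: pd.Series, rule: str) -> pd.Series:
    """Hypothetical helper: boolean mask, True where a value satisfies the rule."""
    if rule == "not_null":
        return series.notna()
    if rule == "unique":
        # Every member of a duplicated group is flagged, not just the repeats.
        return ~series.duplicated(keep=False)
    if rule == "positive":
        return pd.to_numeric(series, errors="coerce") > 0
    if rule == "email":
        pattern = EMAIL_RE
    elif rule.startswith("regex:"):
        pattern = re.compile(rule[len("regex:"):])
    else:
        bounds = RANGE_RE.fullmatch(rule)
        if bounds is None:
            raise ValueError(f"Unknown rule: {rule!r}")
        low, high = float(bounds.group(1)), float(bounds.group(2))
        # Inclusive on both ends; non-numeric values fail.
        return pd.to_numeric(series, errors="coerce").between(low, high)
    # Shared path for the two pattern-based rules: match the whole value.
    return series.map(
        lambda v: isinstance(v, str) and pattern.fullmatch(v) is not None
    ).astype(bool)
```

Returning a mask rather than a single boolean makes it easy to report how many rows fail a rule and which ones.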

`detect_pii()` — supported PII types

| Type | Example |
| --- | --- |
| `email` | `alice@example.com` |
| `phone` | `9876543210` |
| `aadhaar` | `2345 6789 0123` |
| `pan` | `ABCDE1234F` |
| `ssn` | `123-45-6789` |
| `credit_card` | `4111 1111 1111 1111` |
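
Pattern-based PII detection of this kind is commonly built on regular expressions. The sketch below shows one pattern per type, each written only to match the example format in the table. The patterns and the helper name `classify` are assumptions, not the library's actual rules; a production detector would typically accept more formats and add checksum validation (for example Luhn for card numbers and Verhoeff for Aadhaar).

```python
import re

# One illustrative pattern per PII type, matching the example formats above.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\d{10}"),
    "aadhaar": re.compile(r"\d{4} \d{4} \d{4}"),
    "pan": re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card": re.compile(r"\d{4} \d{4} \d{4} \d{4}"),
}


def classify(value: str) -> list[str]:
    """Hypothetical helper: names of all PII types whose pattern matches the whole value."""
    text = value.strip()
    return [name for name, pattern in PII_PATTERNS.items() if pattern.fullmatch(text)]


print(classify("2345 6789 0123"))  # ['aadhaar']
print(classify("hello"))           # []
```

Matching the whole value with `fullmatch` keeps the types from overlapping: a 16-digit card number does not also register as an Aadhaar or phone number.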

CLI

# Audit a CSV file
dataruff audit customers.csv

# Output as JSON
dataruff audit customers.csv --json

# Fix issues and write cleaned file
dataruff fix customers.csv
# -> customers_clean.csv

# Compare two datasets
dataruff compare old.csv new.csv

# Data quality score
dataruff score customers.csv

# PII detection
dataruff detect-pii customers.csv

# Mask PII
dataruff mask-pii customers.csv
# -> customers_masked.csv

Architecture

datadoctor/
├── analyzers/       # DuplicateAnalyzer, NullAnalyzer, TypeAnalyzer,
│                    # FormatAnalyzer, OutlierAnalyzer, PIIAnalyzer, DriftAnalyzer
├── scoring/         # Weighted scoring engine
├── fixing/          # Auto-remediation rules
└── reporting/       # Terminal (rich + plain fallback) and JSON output

No LLMs. No API calls. Everything deterministic and offline.

Requirements

Python 3.10+
pandas, numpy, scipy, scikit-learn, openpyxl, python-dateutil

License

MIT

Release history

0.1.1 (May 31, 2026)
0.1.0 (May 31, 2026, this version)
