One-command dataset health diagnostics - the Ruff of datasets.
dataruff
The Ruff of datasets. One command to discover, explain, score, and fix data quality problems in pandas DataFrames and CSV/Excel files.
from dataruff import audit
audit(df)
Data Quality Score: 81/100
Issues Found (5):
! 42 duplicate rows
~ 13 invalid email (column: email)
! 3 empty columns
~ 7 outlier (column: salary)
. 2 inconsistent date format (column: created_at)
Rows: 10,000 | Columns: 12
Install
pip install dataruff
Optionally install rich for prettier terminal output:
pip install dataruff[rich]
Quick start
import pandas as pd
from dataruff import audit, fix, score, validate, detect_pii
df = pd.read_csv("customers.csv")
# Full health report
audit(df)
# Get numeric score
s = score(df)
print(s.overall) # 81
print(s.to_dict()) # {'overall': 81, 'completeness': 92, ...}
# Auto-fix common issues
clean_df = fix(df)
# Validate against a schema
result = validate(df, schema={
    "email": "email",
    "age": "0-120",
    "id": "unique",
})
# PII detection
report = detect_pii(df)
print(report.columns_with_pii)
# {'email': ['email'], 'phone': ['phone'], 'uid': ['aadhaar']}
API reference
| Function | Description | Returns |
|---|---|---|
| `audit(df)` | Print full health report | InvestigationReport |
| `investigate(df)` | Structured issue breakdown | InvestigationReport |
| `score(df)` | Data quality score | ScoreBreakdown |
| `fix(df)` | Auto-repair common issues | pd.DataFrame |
| `validate(df, schema)` | Check schema constraints | dict |
| `compare(old, new)` | Diff two datasets | ComparisonReport |
| `detect_pii(df)` | Find PII columns | PIIReport |
| `mask_pii(df)` | Redact PII values | pd.DataFrame |
| `detect_drift(old, new)` | Distribution drift analysis | DriftReport |
| `find_anomalies(df)` | Anomaly / outlier detection | dict |
All functions accept a DataFrame, CSV path, or XLSX path as input.
Scoring formula
| Dimension | Weight | Measures |
|---|---|---|
| Completeness | 25% | Non-null ratio across all cells |
| Validity | 25% | Format correctness (emails, dates, types) |
| Consistency | 20% | Uniform types and formats per column |
| Uniqueness | 20% | Absence of duplicate rows |
| Schema compliance | 10% | Adherence to user-provided schema |
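As a rough illustration of how these weights could combine into a single number, the sketch below takes a plain weighted average of per-dimension scores. The dimension values and the rounding step are assumptions made for the example; the table does not say how dataruff rounds or how it treats a missing schema.

```python
# Weights copied from the table above; everything else here is illustrative.
WEIGHTS = {
    "completeness": 0.25,
    "validity": 0.25,
    "consistency": 0.20,
    "uniqueness": 0.20,
    "schema_compliance": 0.10,
}


def overall_score(dimensions: dict) -> int:
    """Weighted average of 0-100 dimension scores, rounded to an integer (assumed)."""
    total = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return round(total)


# Made-up dimension scores, not output from dataruff.
example = {
    "completeness": 92,
    "validity": 80,
    "consistency": 75,
    "uniqueness": 70,
    "schema_compliance": 100,
}
print(overall_score(example))
```

Because the weights sum to 100%, a dataset that scores 100 on every dimension scores 100 overall.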
fix() — what gets repaired
| Issue | Fix applied |
|---|---|
| Duplicate rows | Removed |
| Leading/trailing whitespace | Stripped |
| Boolean strings (yes/no/true/false) | Converted to bool |
| Mixed date formats | Normalized to YYYY-MM-DD |
| Missing numeric values | Filled with column median |
| Missing string values | Filled with column mode |
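To make a few of these repairs concrete, here is a standard-library sketch of three of them applied to plain Python values: boolean-string conversion, date normalization, and median filling. The helper names and the list of candidate date formats are hypothetical, and dataruff's own rules (which operate on DataFrames) may behave differently.

```python
from datetime import datetime
from statistics import median

# Hypothetical helpers for illustration; not dataruff's implementation.
BOOL_STRINGS = {"yes": True, "true": True, "no": False, "false": False}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")  # assumed candidate formats


def to_bool(value: str):
    """Map yes/no/true/false (any case, padded) to bool; return other strings unchanged."""
    return BOOL_STRINGS.get(value.strip().lower(), value)


def normalize_date(value: str) -> str:
    """Try each assumed format in order; return YYYY-MM-DD, or the input if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


def fill_median(values: list) -> list:
    """Replace None entries with the median of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return values
    mid = median(present)
    return [mid if v is None else v for v in values]
```

Note that a string such as 05/03/2024 is ambiguous between day-first and month-first; this sketch resolves it by format order, which is a design choice rather than something the table specifies.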
validate() — schema rules
validate(df, schema={
    "email": "email",         # valid email format
    "age": "0-120",           # numeric range
    "user_id": "unique",      # no duplicates
    "price": "positive",      # > 0
    "code": "not_null",       # no missing values
    "ref": "regex:[A-Z]{3}",  # custom regex
})
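One way to read these rule strings is as a small dispatch over a column's values. The sketch below is a hypothetical interpreter, not dataruff's code: the function name `column_passes`, the loose email pattern, the choice to ignore missing values for every rule except `not_null`, and the use of full-string matching for `regex:` rules are all assumptions.

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately loose
RANGE_RE = re.compile(r"(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)")  # e.g. "0-120"


def column_passes(values: list, rule: str) -> bool:
    """Return True if every value in the column satisfies the rule string."""
    present = [v for v in values if v is not None]
    if rule == "not_null":
        return len(present) == len(values)
    if rule == "unique":
        return len(set(present)) == len(present)
    if rule == "positive":
        return all(v > 0 for v in present)
    if rule == "email":
        return all(EMAIL_RE.fullmatch(str(v)) for v in present)
    if rule.startswith("regex:"):
        pattern = re.compile(rule[len("regex:"):])
        return all(pattern.fullmatch(str(v)) for v in present)
    bounds = RANGE_RE.fullmatch(rule)
    if bounds:
        low, high = float(bounds.group(1)), float(bounds.group(2))
        return all(low <= v <= high for v in present)
    raise ValueError(f"unknown rule: {rule!r}")
```

For example, `column_passes([25, 40, 121], "0-120")` fails because 121 is out of range, while `column_passes(["ABC", "XYZ"], "regex:[A-Z]{3}")` passes.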
detect_pii() — supported PII types
| Type | Example |
|---|---|
| email | alice@example.com |
| phone | 9876543210 |
| aadhaar | 2345 6789 0123 |
| pan | ABCDE1234F |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
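The example values above suggest format-based matching. As a minimal sketch under that assumption, the patterns below classify each example from the table; they are illustrative guesses, not the rules dataruff ships, and a production detector would likely add checksum validation (for instance Luhn for card numbers) to cut false positives.

```python
import re
from typing import Optional

# Illustrative patterns only; each must match the whole (stripped) value.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "pan": re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card": re.compile(r"\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}"),
    "aadhaar": re.compile(r"\d{4} ?\d{4} ?\d{4}"),
    "phone": re.compile(r"\d{10}"),
}


def classify(value: str) -> Optional[str]:
    """Return the first PII type whose pattern matches the whole value, else None."""
    text = value.strip()
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.fullmatch(text):
            return pii_type
    return None


def columns_with_pii(table: dict) -> dict:
    """Map each column name to the sorted PII types found in it; skip clean columns."""
    found = {}
    for column, values in table.items():
        types = sorted({classify(v) for v in values} - {None})
        if types:
            found[column] = types
    return found
```

Here `table` is a plain dict of column name to list of strings, standing in for a DataFrame so the sketch stays dependency-free.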
CLI
# Audit a CSV file
dataruff audit customers.csv
# Output as JSON
dataruff audit customers.csv --json
# Fix issues and write cleaned file
dataruff fix customers.csv
# -> customers_clean.csv
# Compare two datasets
dataruff compare old.csv new.csv
# Data quality score
dataruff score customers.csv
# PII detection
dataruff detect-pii customers.csv
# Mask PII
dataruff mask-pii customers.csv
# -> customers_masked.csv
Architecture
dataruff/
├── analyzers/ # DuplicateAnalyzer, NullAnalyzer, TypeAnalyzer,
│ # FormatAnalyzer, OutlierAnalyzer, PIIAnalyzer, DriftAnalyzer
├── scoring/ # Weighted scoring engine
├── fixing/ # Auto-remediation rules
└── reporting/ # Terminal (rich + plain fallback) and JSON output
No LLMs. No API calls. Everything deterministic and offline.
Requirements
- Python 3.10+
- pandas, numpy, scipy, scikit-learn, openpyxl, python-dateutil
License
MIT
File details
Details for the file dataruff-0.1.1.tar.gz.
File metadata
- Download URL: dataruff-0.1.1.tar.gz
- Upload date:
- Size: 26.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 628134024c33bb59281343eb30362af98ff0f3cfbdc7f700bf278c11bda373c1 |
| MD5 | 1dc2177ecb7bd4c7102523078b7b90ab |
| BLAKE2b-256 | bcee1999e5d735e1e9031108ebc87f95f8fdfa19568fb103d24af2ae879752eb |
File details
Details for the file dataruff-0.1.1-py3-none-any.whl.
File metadata
- Download URL: dataruff-0.1.1-py3-none-any.whl
- Upload date:
- Size: 24.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | af426d8a55a6d2b9f22257c74c052f1d832db6e309a50829b8248a9f53fa9847 |
| MD5 | ce5109b34b996a84275f17033fdfd906 |
| BLAKE2b-256 | e63b9e6095c47b5ba06a9c3c2322e5cb14c0c8392d5b1501eb82413c8bbe188e |