Audit messy DataFrames, auto-fix issues, and run five-method correlation analysis, with zero dependencies beyond pandas.
dfdoctor 🩺
The data quality library that doesn't just tell you what's wrong: it tells you what to do next, fixes it for you, and shows you the full picture.
What is dfdoctor?
dfdoctor is a lightweight, zero-dependency Python library (beyond pandas) that audits a messy DataFrame and gives you:
- Plain-English explanations of every issue found
- Priority scores so you know what to fix first
- Automatic fixes with a full before/after comparison
- Five correlation methods (Pearson, Spearman, Kendall τ, Cramér's V, and Phi-k), all from scratch, no scipy
- ASCII charts in the terminal and SVG heatmaps in the HTML report
- A CLI tool so you can audit any CSV without writing a line of code
Most data tools give you statistics. dfdoctor gives you a treatment plan.
from dfdoctor import audit
report = audit(df)
report.pretty_print()
══════════════════════════════════════════════════════════════════
dfdoctor - Dataset Audit Report
══════════════════════════════════════════════════════════════════
Rows        : 10,000
Columns     : 12
Duplicates  : 215
Memory      : 2.4 MB
Issues Found: 8 (high: 3, medium: 4, low: 1)
──────────────────────────────────────────────────────────────────
!!! [signup_date] (HIGH)
'signup_date' looks like a date column stored as text (98% parseable).
Why it matters : Dates as strings block time-series ops and sorting.
Recommendation : df['signup_date'] = pd.to_datetime(df['signup_date'])
Safe auto-fix  : No | Score: 2.55
────────────────────────────────────────────────────────────────
...
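The "98% parseable" figure in the sample output suggests a detector that tries to parse each string and reports the share that succeeds. Here is a minimal sketch of that idea using only the standard library. The helper name `parseable_fraction` and the single ISO format are assumptions made for illustration, not dfdoctor's actual implementation.

```python
from datetime import datetime


def parseable_fraction(values, fmt="%Y-%m-%d"):
    """Share of non-empty strings that parse as dates in the given format."""
    candidates = [v for v in values if isinstance(v, str) and v.strip()]
    if not candidates:
        return 0.0
    parsed = 0
    for value in candidates:
        try:
            datetime.strptime(value.strip(), fmt)
            parsed += 1
        except ValueError:
            pass  # not a date in this format
    return parsed / len(candidates)


# Hypothetical sample column: three of four values are valid ISO dates.
signup_dates = ["2024-01-05", "2024-02-17", "not a date", "2024-03-01"]
fraction = parseable_fraction(signup_dates)
print(f"{fraction:.0%} parseable")
```

A real detector would likely try several formats and set a threshold above which the column is flagged. Both choices are left open here.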
Installation
pip install dfdoctor
Requirements: Python 3.9+ · pandas 1.3+ · no other dependencies.
To install with development tools:
pip install "dfdoctor[dev]"
Five-minute quick start
import pandas as pd
from dfdoctor import audit, auto_fix, compare, correlate
df = pd.read_csv("your_data.csv")
# 1. Audit: find every issue
report = audit(df)
report.pretty_print()
# 2. Fix: apply safe fixes automatically
cleaned, log = auto_fix(df)
print(log) # ["Dropped all-null column 'notes'", "Converted 'revenue' to numeric", ...]
# 3. Compare: see exactly what changed
compare(df, cleaned).pretty_print()
# 4. Correlate: full five-method correlation analysis
corr = report.correlations()
corr.pretty_print()
# 5. Visualise: ASCII charts in terminal, HTML report with SVG heatmaps
report.plot()
report.to_html("audit_report.html")
Feature walkthrough
Audit
audit(df) runs every detector and returns an AuditReport with all issues ranked by priority.
from dfdoctor import audit
report = audit(df)
report.pretty_print() # formatted terminal output
report.summary() # dict: row/col/dup/memory counts + issue totals
report.high_priority() # list of HIGH severity issues
report.sorted_by_priority() # all issues, highest score first
report.by_column("revenue") # issues for one specific column
report.to_dict() # full JSON-serialisable output
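The dataset-level numbers in `report.summary()` (rows, columns, duplicates, memory) can all be derived from plain pandas calls. The sketch below shows one way to compute them. The function name `basic_summary` and the dictionary keys are hypothetical and may not match the keys dfdoctor returns.

```python
import pandas as pd


def basic_summary(df: pd.DataFrame) -> dict:
    """Dataset-level counts computed with pandas only (illustrative keys)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # duplicated() marks every repeat of an earlier row
        "duplicates": int(df.duplicated().sum()),
        # deep=True also counts the memory held by string objects
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
    }


# Hypothetical sample: row 2 repeats row 1.
df = pd.DataFrame({"id": [1, 2, 2, 3], "city": ["Oslo", "Rome", "Rome", "Lima"]})
summary = basic_summary(df)
print(summary["rows"], summary["columns"], summary["duplicates"])
```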
Auto-fix
auto_fix() applies safe, reversible fixes automatically. Risky fixes (like deduplication) are opt-in with safe_only=False.
from dfdoctor import auto_fix, compare
# Safe fixes only (default)
cleaned, log = auto_fix(df)
# All fixes including risky ones
cleaned, log = auto_fix(df, safe_only=False)
# See a full before/after breakdown
compare(df, cleaned).pretty_print()
Safe fixes applied automatically:
| Issue | Fix |
|---|---|
| All-null column | Drop the column |
| Constant column | Drop the column |
| Numeric stored as string | pd.to_numeric() |
| Suspected identifier | Cast to string |
Risky fixes (opt-in with safe_only=False):
| Issue | Fix |
|---|---|
| Duplicate rows | drop_duplicates() |
| Suspicious placeholders | Replace with pd.NA |
| Date stored as string | pd.to_datetime() |
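To make the safe fixes concrete, here is a rough pandas-only sketch of three of them: dropping all-null columns, dropping constant columns, and converting numeric strings. It illustrates the kind of transformation involved under simplified rules. It is not dfdoctor's code, and `sketch_safe_fixes` is a hypothetical name.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def sketch_safe_fixes(df: pd.DataFrame):
    """Return a cleaned copy plus a log of what changed (illustrative rules)."""
    out = df.copy()
    log = []
    for col in list(out.columns):
        series = out[col]
        if series.isna().all():
            out = out.drop(columns=col)
            log.append(f"Dropped all-null column '{col}'")
        elif series.nunique(dropna=False) == 1:
            out = out.drop(columns=col)
            log.append(f"Dropped constant column '{col}'")
        elif is_object_dtype(series) or is_string_dtype(series):
            converted = pd.to_numeric(series, errors="coerce")
            # Convert only when every non-null value parses as a number.
            if converted.notna().sum() == series.notna().sum():
                out[col] = converted
                log.append(f"Converted '{col}' to numeric")
    return out, log


# Hypothetical messy frame for the demonstration.
df = pd.DataFrame({
    "notes": [None, None, None],
    "source": ["web", "web", "web"],
    "revenue": ["10", "20.5", "30"],
    "name": ["a", "b", "c"],
})
cleaned, log = sketch_safe_fixes(df)
print(list(cleaned.columns))
print(log)
```

The input frame is copied rather than modified, so the original stays available for a before/after comparison.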
Correlation analysis
correlate(df) computes all five correlation methods with zero extra dependencies: no scipy, no statsmodels, nothing beyond pandas and numpy.
from dfdoctor import correlate
corr = correlate(df) # or: report.correlations()
corr.pretty_print()
══════════════════════════════════════════════════════════════════
dfdoctor - Correlation Report
══════════════════════════════════════════════════════════════════
Pearson r (numeric × numeric, linear) - 4 cols
─────────────────────────────────────────────
         revenue  spend  visits    age
revenue     1.00   0.87    0.43  -0.12
spend       0.87   1.00    0.51  -0.09
...
Phi-k (ALL columns × ALL columns, 0=none 1=perfect) - 8 cols
─────────────────────────────────────────────────────────────
...covers numeric AND categorical columns in one matrix...
Top Correlated Pairs (|value| ≥ 0.4):
────────────────────────────────────────
revenue × spend  [pearson  ] █████████████████░░░ +0.870 (strong)
country × region [cramers_v] █████████████░░░░░░░ +0.650 (strong)
| Method | Type | Range | What it measures |
|---|---|---|---|
| Pearson r | Numeric × Numeric | −1 … +1 | Linear relationship |
| Spearman ρ | Numeric × Numeric | −1 … +1 | Rank-order relationship |
| Kendall τ | Numeric × Numeric | −1 … +1 | Concordance (robust on ties) |
| Cramér's V | Categorical × Categorical | 0 … 1 | Association strength |
| Phi-k | Any × Any | 0 … 1 | Universal association (numeric bins → Cramér's V) |
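For readers curious how a categorical measure can be computed without scipy, the following sketch implements the textbook (uncorrected) Cramér's V in pure Python: build the contingency counts, compute the chi-squared statistic, then normalise by the sample size and the smaller table dimension. It is a from-scratch illustration of the formula and makes no claim about how dfdoctor implements it internally.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equal-length categorical sequences."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0  # association is undefined with fewer than two categories
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Region is fully determined by country, so V is 1.
perfect = cramers_v(["US", "US", "FR", "FR"], ["NA", "NA", "EU", "EU"])
# Every combination occurs equally often, so V is 0.
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(perfect, independent)
```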
Exploratory data analysis
from dfdoctor import quick_eda
insights = quick_eda(df, target="churn")
insights.pretty_print()
print(insights.high_missing_columns) # [{"column": "notes", "missing_pct": 0.82}, ...]
print(insights.skewed_columns) # ["revenue", "session_length"]
print(insights.strong_correlations) # [("revenue", "spend", 0.87)]
print(insights.target_correlations) # {"revenue": 0.61, "spend": 0.55, ...}
print(insights.top_findings) # plain-English list of key insights
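Two of these fields map onto short pandas expressions. The sketch below reproduces the shape of `high_missing_columns` and `skewed_columns` with plain pandas. The thresholds (50% missing, absolute skewness above 1.0) and the function names are assumptions chosen for the example, not documented dfdoctor defaults.

```python
import pandas as pd


def high_missing(df: pd.DataFrame, threshold: float = 0.5):
    """Columns whose share of missing values is at or above the threshold."""
    shares = df.isna().mean()
    return [
        {"column": col, "missing_pct": round(float(pct), 2)}
        for col, pct in shares.items()
        if pct >= threshold
    ]


def skewed(df: pd.DataFrame, threshold: float = 1.0):
    """Numeric columns whose absolute sample skewness exceeds the threshold."""
    skewness = df.select_dtypes(include="number").skew()
    return [col for col, value in skewness.items() if abs(value) > threshold]


# Hypothetical sample: 'notes' is mostly empty, 'revenue' has one extreme value.
df = pd.DataFrame({
    "notes": [None, None, None, None, "x"],
    "revenue": [1, 1, 1, 2, 100],
    "age": [20, 30, 40, 50, 60],
})
print(high_missing(df))
print(skewed(df))
```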
Visualisations
Two output modes, zero extra dependencies:
Terminal: ASCII bar charts
report.plot()
# or standalone:
from dfdoctor import plot_ascii
plot_ascii(report)
══════════════════════════════════════════════════════════════════
dfdoctor - Visualizations
══════════════════════════════════════════════════════════════════
Issue Severity Breakdown
────────────────────────────────────────────────────
HIGH    █████████████████████░░░░░░░ 3.0
MEDIUM  ████████████████████████████ 4.0
LOW     ███████░░░░░░░░░░░░░░░░░░░░░ 1.0
Top Missing-Value Columns (% missing)
────────────────────────────────────────────────────
notes   ███████████████████████░░░░░ 82.0%
phone   ██████░░░░░░░░░░░░░░░░░░░░░░ 23.0%
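A terminal bar chart like the one above needs nothing more than string multiplication. This is a small sketch of the idea, with a hypothetical `ascii_bar` helper and plain `#` and `.` characters standing in for whatever glyphs dfdoctor actually prints.

```python
def ascii_bar(label, value, max_value, width=28, fill="#", empty="."):
    """Render one fixed-width bar, scaled so max_value fills the whole width."""
    filled = round(width * value / max_value) if max_value else 0
    filled = max(0, min(width, filled))
    return f"{label:<8}{fill * filled}{empty * (width - filled)} {value:.1f}"


# Severity counts taken from the sample report shown earlier.
counts = {"HIGH": 3, "MEDIUM": 4, "LOW": 1}
top = max(counts.values())
for label, value in counts.items():
    print(ascii_bar(label, value, top))
```

Scaling against the largest value keeps the longest bar at full width, which makes small datasets as readable as large ones.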
HTML: self-contained report with SVG charts
report.to_html("audit_report.html") # write to file
html = report.to_html() # get string
The HTML report includes a stats dashboard, colour-coded issue table, SVG bar charts, and five interactive SVG correlation heatmaps, all in a single self-contained file with no external assets.
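A self-contained report is possible because SVG is just text that can be inlined into the HTML. As an illustration, this sketch builds a tiny horizontal bar chart as an SVG string with the standard library only. The function name, sizes, and colour are invented for the example and are not taken from dfdoctor's `viz.py`.

```python
from xml.sax.saxutils import escape


def svg_bar_chart(data, bar_height=18, gap=6, max_width=200, label_width=80):
    """Return an inline SVG string with one horizontal bar per item."""
    top = max(data.values(), default=0) or 1  # avoid dividing by zero
    height = len(data) * (bar_height + gap)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{label_width + max_width + 40}" height="{height}">'
    ]
    for i, (label, value) in enumerate(data.items()):
        y = i * (bar_height + gap)
        width = max_width * value / top
        parts.append(
            f'<text x="0" y="{y + bar_height - 4}" font-size="12">'
            f'{escape(str(label))}</text>'
        )
        parts.append(
            f'<rect x="{label_width}" y="{y}" width="{width:.1f}" '
            f'height="{bar_height}" fill="#4c78a8"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Missing-value percentages from the sample output above.
svg = svg_bar_chart({"notes": 82.0, "phone": 23.0})
html = f"<!doctype html><html><body>{svg}</body></html>"
print(len(html) > 0)
```

Because the chart is embedded as markup, the resulting file can be emailed or archived without any image or script dependencies.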
Command-line interface
No Python code required. Audit any CSV, TSV, or Excel file directly from your terminal:
# Audit a file
dfdoctor audit data.csv
# Auto-fix and save
dfdoctor fix data.csv --output cleaned.csv
# Apply all fixes (including risky)
dfdoctor fix data.csv --all --output cleaned.csv
# Generate HTML report
dfdoctor html data.csv --output report.html
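The project structure notes that the CLI is built on argparse. The sketch below shows how a parser with these three subcommands could be laid out. Only the command names and the `--output` and `--all` flags come from the examples above. Everything else (help texts, defaults, the `apply_all` attribute name) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the audit / fix / html commands."""
    parser = argparse.ArgumentParser(prog="dfdoctor")
    sub = parser.add_subparsers(dest="command", required=True)

    audit_cmd = sub.add_parser("audit", help="print an audit report")
    audit_cmd.add_argument("path")

    fix_cmd = sub.add_parser("fix", help="apply fixes and save the result")
    fix_cmd.add_argument("path")
    fix_cmd.add_argument("--output", default=None)
    # 'all' shadows a builtin, so store the flag under a clearer name
    fix_cmd.add_argument("--all", action="store_true", dest="apply_all")

    html_cmd = sub.add_parser("html", help="write an HTML report")
    html_cmd.add_argument("path")
    html_cmd.add_argument("--output", default=None)
    return parser


args = build_parser().parse_args(
    ["fix", "data.csv", "--all", "--output", "cleaned.csv"]
)
print(args.command, args.path, args.apply_all, args.output)
```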
Reading files
from dfdoctor import read_file
df = read_file("data.csv") # CSV
df = read_file("data.tsv") # TSV
df = read_file("data.xlsx") # Excel (requires openpyxl)
df = read_file("data.xls") # Legacy Excel
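A reader like this usually comes down to dispatching on the file extension. Here is a minimal sketch of that pattern using pandas readers. `read_table` is a hypothetical stand-in, and the real `read_file` may handle encodings, engines, and errors differently.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(path):
    """Pick a pandas reader based on the file extension (illustrative)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".tsv":
        return pd.read_csv(path, sep="\t")
    if suffix in {".xlsx", ".xlsm", ".xls"}:
        # Excel support needs an optional engine such as openpyxl
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Write a small TSV to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.tsv"
    sample.write_text("a\tb\n1\t2\n3\t4\n")
    df = read_table(sample)
print(df.shape)
```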
Prioritised cleaning suggestions
from dfdoctor import suggest_cleaning
suggestions = suggest_cleaning(df) # sorted by priority score, highest first
for issue in suggestions:
    print(f"[{issue.severity.upper()}] {issue.message}")
    print(f" → {issue.recommendation}\n")
What dfdoctor detects
| Issue | Severity | Auto-fixable |
|---|---|---|
| All-null column | HIGH | ✅ Safe |
| High missing values (≥ 50%) | HIGH | ❌ |
| Moderate missing values (≥ 20%) | MEDIUM | ❌ |
| Duplicate rows | HIGH | ⚠️ Risky |
| Constant column (one unique value) | MEDIUM | ✅ Safe |
| Near-constant column (≥ 95% one value) | MEDIUM | ❌ |
| Numeric stored as string | MEDIUM | ✅ Safe |
| Date column stored as string | HIGH | ⚠️ Risky |
| Mixed date formats in one column | HIGH | ❌ |
| Suspected identifier column | MEDIUM | ✅ Safe |
| High-cardinality categorical | MEDIUM | ❌ |
| Inconsistent category labels (e.g. "US" vs "U.S.") | MEDIUM | ❌ |
| Suspicious placeholder values ("NA", "?", "unknown") | LOW | ⚠️ Risky |
| Statistical outliers (IQR method) | MEDIUM | ❌ |
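The IQR method named in the last row flags values that fall more than a multiple of the interquartile range outside the first and third quartiles. The sketch below uses the conventional multiplier of 1.5 and linear-interpolation quantiles. dfdoctor's exact multiplier and quantile method are not stated here, so treat both as assumptions.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = (len(sorted_values) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; too-small samples yield none."""
    if len(values) < 4:
        return []
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical revenue column with one extreme value.
revenue = [10, 12, 11, 13, 12, 11, 250]
print(iqr_outliers(revenue))
```

For this sample the quartiles are 11 and 12.5, so the fences sit at 8.75 and 14.75 and only 250 is flagged.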
Priority scoring
Every issue gets a numeric score, so you know what to tackle first without guessing:
priority_score = severity_weight × confidence × impact
| Component | Description |
|---|---|
| severity_weight | HIGH = 3, MEDIUM = 2, LOW = 1 |
| confidence | 0.0–1.0, how certain the rule is |
| impact | 0.0–1.0, how much this affects downstream analysis |
for issue in report.sorted_by_priority():
    print(f"{issue.priority_score:.2f} [{issue.severity}] {issue.column} - {issue.message}")
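As a worked example of the formula, the snippet below scores three made-up issues and sorts them. The columns, confidence values, and impact values are invented for illustration. Only the weights and the multiplication come from the formula above.

```python
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def priority_score(severity, confidence, impact):
    """severity_weight x confidence x impact, as defined above."""
    return SEVERITY_WEIGHT[severity] * confidence * impact


# (column, severity, confidence, impact): hypothetical values
issues = [
    ("phone", "medium", 0.9, 0.5),
    ("signup_date", "high", 0.9, 0.8),
    ("status", "low", 1.0, 0.4),
]
ranked = sorted(issues, key=lambda item: priority_score(*item[1:]), reverse=True)
for column, severity, confidence, impact in ranked:
    score = priority_score(severity, confidence, impact)
    print(f"{score:.2f} [{severity}] {column}")
```

Read in reverse, the formula also explains the sample report: a HIGH issue with a score of 2.55 implies that confidence × impact equals 0.85.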
The Issue object
Every issue returned by audit() or suggest_cleaning() is an Issue dataclass:
| Field | Type | Description |
|---|---|---|
| column | str \| None | Column the issue belongs to (None = dataset-level) |
| issue_type | str | Machine-readable key, e.g. "date_as_string" |
| severity | str | "high", "medium", or "low" |
| confidence | float | 0.0–1.0 |
| impact | float | 0.0–1.0 |
| message | str | Plain-English description |
| why_it_matters | str | Why this issue is a problem |
| recommendation | str | Exact code fix to apply |
| safe_to_auto_fix | bool | Whether auto_fix() will apply this by default |
| priority_score | float | severity_weight × confidence × impact |
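To show how these fields fit together, here is a stand-in dataclass with the same field names and a JSON-friendly export. It is a sketch, not the class defined in dfdoctor's `types.py`. In particular, deriving `priority_score` as a property and the confidence and impact values in the example are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

_WEIGHTS = {"high": 3, "medium": 2, "low": 1}


@dataclass
class IssueSketch:
    """Illustrative mirror of the documented Issue fields."""
    column: Optional[str]
    issue_type: str
    severity: str
    confidence: float
    impact: float
    message: str
    why_it_matters: str
    recommendation: str
    safe_to_auto_fix: bool

    @property
    def priority_score(self) -> float:
        return _WEIGHTS[self.severity] * self.confidence * self.impact

    def to_dict(self) -> dict:
        data = asdict(self)
        data["priority_score"] = round(self.priority_score, 2)
        return data


issue = IssueSketch(
    column="signup_date",
    issue_type="date_as_string",
    severity="high",
    confidence=0.85,  # hypothetical value
    impact=1.0,       # hypothetical value
    message="'signup_date' looks like a date column stored as text.",
    why_it_matters="Dates as strings block time-series ops and sorting.",
    recommendation="df['signup_date'] = pd.to_datetime(df['signup_date'])",
    safe_to_auto_fix=False,
)
payload = json.dumps(issue.to_dict())
print(issue.to_dict()["priority_score"])
```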
Why zero dependencies?
Comparable libraries such as ydata-profiling, sweetviz, and dataprep pull in matplotlib, scipy, seaborn, and dozens more. dfdoctor requires only pandas, which you already have.
This means:
- Works in any environment: CI/CD pipelines, serverless functions, Docker containers, Jupyter, Colab, bare scripts
- Installs in seconds with no dependency conflicts
- Five correlation methods including Kendall τ and Phi-k, all implemented with pure numpy, no scipy required
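As an indication of how little machinery a rank correlation needs, here is a straightforward O(n²) implementation of Kendall's τ-b (the tie-adjusted variant) in plain Python. It is meant as a readable reference for the statistic. The library's own implementation is described as numpy-based and may differ in variant and performance.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b by comparing every pair of observations."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both terms
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # (pairs not tied in y) * (pairs not tied in x)
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Two swapped neighbours out of ten pairs: 8 concordant, 2 discordant.
tau = kendall_tau_b([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(tau, 3))
```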
Project structure
dfdoctor/
├── src/
│   └── dfdoctor/
│       ├── __init__.py          # public API exports
│       ├── audit.py             # main audit() function
│       ├── types.py             # AuditReport, Issue dataclasses
│       ├── suggest.py           # suggest_cleaning()
│       ├── eda.py               # quick_eda(), EDAReport
│       ├── fix.py               # auto_fix()
│       ├── compare.py           # compare(), CompareReport
│       ├── correlations.py      # correlate(), five methods, zero-dep
│       ├── viz.py               # plot_ascii(), SVG chart generators
│       ├── cli.py               # dfdoctor CLI (argparse)
│       ├── utils.py             # read_file(), memory helpers
│       └── rules/
│           ├── missing.py       # null / high-missing detection
│           ├── duplicates.py    # duplicate row detection
│           ├── datatypes.py     # numeric-as-string, type inference
│           ├── identifiers.py   # suspected ID column detection
│           ├── dates.py         # date-as-string, mixed formats
│           ├── cardinality.py   # high-cardinality categoricals
│           ├── categories.py    # inconsistent labels, placeholders
│           └── outliers.py      # IQR-based outlier detection
├── tests/                       # 132 tests, 0 warnings
├── demo/
│   ├── messy_sales_data.csv     # example messy dataset (215 rows)
│   └── run_demo.py              # full end-to-end demo script
├── pyproject.toml
├── LICENSE
└── README.md
API reference
audit(df) → AuditReport
report = audit(df)
report.pretty_print()
report.summary() # → dict
report.to_dict() # → dict (JSON-serialisable)
report.high_priority() # → list[Issue]
report.sorted_by_priority() # → list[Issue]
report.by_column("col") # → list[Issue]
report.plot() # print ASCII charts
report.to_html("report.html") # save HTML report
report.correlations() # → CorrelationReport
auto_fix(df, safe_only=True) → tuple[DataFrame, list[str]]
cleaned, log = auto_fix(df) # safe fixes only
cleaned, log = auto_fix(df, safe_only=False) # all fixes
compare(df_before, df_after) → CompareReport
rep = compare(df, cleaned)
rep.pretty_print()
rep.to_dict()
correlate(df) → CorrelationReport
corr = correlate(df)
corr.pearson_matrix # dict[str, dict[str, float]]
corr.spearman_matrix # dict[str, dict[str, float]]
corr.kendall_matrix # dict[str, dict[str, float]]
corr.cramers_matrix # dict[str, dict[str, float]]
corr.phik_matrix # dict[str, dict[str, float]] (ALL column pairs)
corr.top_pairs # list[CorrelationPair], sorted by |value|
corr.pretty_print()
corr.to_dict()
quick_eda(df, target=None) → EDAReport
insights = quick_eda(df, target="churn")
insights.pretty_print()
insights.top_findings # list[str]
insights.strong_correlations # list[tuple]
insights.skewed_columns # list[str]
insights.high_missing_columns # list[dict]
insights.target_correlations # dict[str, float]
suggest_cleaning(df) → list[Issue]
Returns issues sorted by priority score (highest first).
read_file(path) → DataFrame
Supports .csv, .tsv, .xlsx, .xls, .xlsm.
plot_ascii(report) → None
Prints ASCII bar charts for issue severity, missing values, and outliers.
Contributing
Pull requests are welcome. To get started:
git clone https://github.com/ajayvarmaramineni/dfdoctor
cd dfdoctor
pip install -e ".[dev]"
pytest tests/
Please open an issue first to discuss what you'd like to change. All contributions should include tests and pass with 0 warnings.
License
MIT © Ajay Ramineni