Audit messy DataFrames, auto-fix issues, and run five-method correlation analysis — zero dependencies beyond pandas.

dfdoctor 🩺

The data quality library that doesn't just tell you what's wrong — it tells you what to do next, fixes it for you, and shows you the full picture.

What is dfdoctor?

dfdoctor is a lightweight, zero-dependency Python library (beyond pandas) that audits a messy DataFrame and gives you:

Plain-English explanations of every issue found
Priority scores so you know what to fix first
Automatic fixes with a full before/after comparison
Five correlation methods — Pearson, Spearman, Kendall τ, Cramér's V, and Phi-k — all from scratch, no scipy
ASCII charts in the terminal and SVG heatmaps in the HTML report
A CLI tool so you can audit any CSV without writing a line of code

Most data tools give you statistics. dfdoctor gives you a treatment plan.

from dfdoctor import audit

report = audit(df)
report.pretty_print()

════════════════════════════════════════════════════════════════════
  dfdoctor — Dataset Audit Report
════════════════════════════════════════════════════════════════════
  Rows        : 10,000
  Columns     : 12
  Duplicates  : 215
  Memory      : 2.4 MB

  Issues Found: 8  (high: 3, medium: 4, low: 1)
════════════════════════════════════════════════════════════════════

  !!! [signup_date]  (HIGH)
      'signup_date' looks like a date column stored as text (98% parseable).
      Why it matters : Dates as strings block time-series ops and sorting.
      Recommendation : df['signup_date'] = pd.to_datetime(df['signup_date'])
      Safe auto-fix  : No   |  Score: 2.55
  ────────────────────────────────────────────────────────────────
  ...

Installation

pip install dfdoctor

Requirements: Python 3.9+ · pandas 1.3+ · no other required dependencies (reading .xlsx files additionally needs openpyxl).

To install with development tools:

pip install "dfdoctor[dev]"

Five-minute quick start

import pandas as pd
from dfdoctor import audit, auto_fix, compare, correlate

df = pd.read_csv("your_data.csv")

# 1. Audit — find every issue
report = audit(df)
report.pretty_print()

# 2. Fix — apply safe fixes automatically
cleaned, log = auto_fix(df)
print(log)   # ["Dropped all-null column 'notes'", "Converted 'revenue' to numeric", ...]

# 3. Compare — see exactly what changed
compare(df, cleaned).pretty_print()

# 4. Correlate — full five-method correlation analysis
corr = report.correlations()
corr.pretty_print()

# 5. Visualise — ASCII charts in terminal, HTML report with SVG heatmaps
report.plot()
report.to_html("audit_report.html")

Feature walkthrough

Audit

audit(df) runs every detector and returns an AuditReport with all issues ranked by priority.

from dfdoctor import audit

report = audit(df)

report.pretty_print()          # formatted terminal output
report.summary()               # dict: row/col/dup/memory counts + issue totals
report.high_priority()         # list of HIGH severity issues
report.sorted_by_priority()    # all issues, highest score first
report.by_column("revenue")    # issues for one specific column
report.to_dict()               # full JSON-serialisable output

Auto-fix

auto_fix() applies safe, reversible fixes automatically. Risky fixes (like deduplication) are opt-in with safe_only=False.

from dfdoctor import auto_fix, compare

# Safe fixes only (default)
cleaned, log = auto_fix(df)

# All fixes including risky ones
cleaned, log = auto_fix(df, safe_only=False)

# See a full before/after breakdown
compare(df, cleaned).pretty_print()

Safe fixes applied automatically:

| Issue | Fix |
| --- | --- |
| All-null column | Drop the column |
| Constant column | Drop the column |
| Numeric stored as string | `pd.to_numeric()` |
| Suspected identifier | Cast to string |
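
To make the safe fixes concrete, here is a minimal sketch in plain Python over a dict of column lists. It is not dfdoctor's implementation: `safe_fixes_sketch`, its helpers, and the exact log wording are hypothetical, and the identifier fix is left out because the README does not say how identifiers are detected.

```python
def _is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and value != value)


def _as_number(text):
    """Return float(text) if text is a numeric string, else None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def safe_fixes_sketch(columns):
    """Apply three of the listed safe fixes to {name: [values]}.

    Hypothetical helper: returns (cleaned_columns, log), mirroring the
    (cleaned, log) pair that auto_fix() is documented to return.
    """
    cleaned, log = {}, []
    for name, values in columns.items():
        present = [v for v in values if not _is_null(v)]
        if not present:
            log.append(f"Dropped all-null column '{name}'")
            continue
        # Assumption: "constant" means no missing values and one distinct value.
        if len(present) == len(values) and len(set(present)) == 1:
            log.append(f"Dropped constant column '{name}'")
            continue
        # Convert only when every non-missing value is a numeric string.
        if all(isinstance(v, str) for v in present) and all(
            _as_number(v) is not None for v in present
        ):
            cleaned[name] = [None if _is_null(v) else float(v) for v in values]
            log.append(f"Converted '{name}' to numeric")
            continue
        cleaned[name] = list(values)
    return cleaned, log


raw = {
    "notes": [None, None, None],
    "source": ["web", "web", "web"],
    "revenue": ["10.5", "20", None],
    "country": ["US", "DE", "US"],
}
cleaned, log = safe_fixes_sketch(raw)
```

The input dict is never mutated, so the original data stays available for a before/after comparison.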

Risky fixes (opt-in with safe_only=False):

| Issue | Fix |
| --- | --- |
| Duplicate rows | `drop_duplicates()` |
| Suspicious placeholders | Replace with `pd.NA` |
| Date stored as string | `pd.to_datetime()` |
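
A short sketch of why placeholder replacement and date conversion can lose information. It uses only the standard library; the placeholder set and the two date formats are assumptions chosen for illustration, and `None` stands in for `pd.NA`. None of these names come from dfdoctor.

```python
from datetime import datetime

# Assumed placeholder tokens; the README gives "NA", "?" and "unknown" as examples.
PLACEHOLDERS = {"na", "?", "unknown"}
# Assumed candidate formats; a real detector would likely try more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def replace_placeholders(values):
    """Swap placeholder strings for None (standing in for pd.NA)."""
    return [
        None if isinstance(v, str) and v.strip().lower() in PLACEHOLDERS else v
        for v in values
    ]


def parse_date(text):
    """Return a datetime for the first matching format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except (TypeError, ValueError):
            continue
    return None


def parseable_share(values):
    """Fraction of non-missing values that parse as a date."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(parse_date(v) is not None for v in present) / len(present)


signup = ["2024-01-05", "unknown", "05/02/2024", "soon", "2024-03-01"]
cleaned = replace_placeholders(signup)
share = parseable_share(cleaned)
parsed = [parse_date(v) if v is not None else None for v in cleaned]
```

Here `"soon"` silently becomes missing after conversion, and `"05/02/2024"` is read as 5 February only because of the assumed day-first format. Side effects like these are a plausible reason to keep such fixes opt-in.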

Correlation analysis

correlate(df) computes all five correlation methods with zero extra dependencies — no scipy, no statsmodels, nothing beyond pandas and numpy.

from dfdoctor import correlate

corr = correlate(df)          # or: report.correlations()
corr.pretty_print()

════════════════════════════════════════════════════════════════════
  dfdoctor — Correlation Report
════════════════════════════════════════════════════════════════════

  Pearson r  (numeric × numeric, linear) — 4 cols
  ─────────────────────────────────────────────
            revenue    spend    visits    age
  revenue      1.00     0.87      0.43  -0.12
  spend        0.87     1.00      0.51  -0.09
  ...

  Phi-k  (ALL columns × ALL columns, 0=none 1=perfect) — 8 cols
  ─────────────────────────────────────────────────────────────
  ...covers numeric AND categorical columns in one matrix...

  Top Correlated Pairs  (|value| ≥ 0.4):
  ────────────────────────────────────────
  revenue          × spend           [pearson ]  ████████████████████  +0.870  (strong)
  country          × region          [cramers_v]  █████████████░░░░░░░  +0.650  (strong)

| Method | Type | Range | What it measures |
| --- | --- | --- | --- |
| Pearson r | Numeric × Numeric | −1 … +1 | Linear relationship |
| Spearman ρ | Numeric × Numeric | −1 … +1 | Rank-order relationship |
| Kendall τ | Numeric × Numeric | −1 … +1 | Concordance (robust on ties) |
| Cramér's V | Categorical × Categorical | 0 … 1 | Association strength |
| Phi-k | Any × Any | 0 … 1 | Universal association (numeric bins → Cramér's V) |
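
As an illustration of computing one of these measures without scipy, the following sketch implements Cramér's V in plain Python from a chi-square statistic. It uses the uncorrected textbook formula V = sqrt(χ² / (n · (min(r, c) − 1))); whether dfdoctor applies a bias correction is not stated, so treat `cramers_v` as a hypothetical stand-in rather than the library's code.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        # A column with a single category carries no association signal.
        return 0.0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


country = ["US", "US", "DE", "DE"]
region = ["NA", "NA", "EU", "EU"]   # fully determined by country
plan = ["a", "b", "a", "b"]         # unrelated to country

perfect = cramers_v(country, region)
unrelated = cramers_v(country, plan)
```

In this toy data `perfect` is 1.0 and `unrelated` is 0.0, the two ends of the 0 … 1 range in the table.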

Exploratory data analysis

from dfdoctor import quick_eda

insights = quick_eda(df, target="churn")

insights.pretty_print()

print(insights.high_missing_columns)   # [{"column": "notes", "missing_pct": 0.82}, ...]
print(insights.skewed_columns)         # ["revenue", "session_length"]
print(insights.strong_correlations)    # [("revenue", "spend", 0.87)]
print(insights.target_correlations)    # {"revenue": 0.61, "spend": 0.55, ...}
print(insights.top_findings)           # plain-English list of key insights
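
The README does not say how `skewed_columns` is decided. One common approach is to flag columns whose sample skewness exceeds a threshold in absolute value. The sketch below assumes the moment-based coefficient g1 = m3 / m2^1.5 and a threshold of 1.0; both are illustrative choices, and the function names are hypothetical.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0.0 for short or constant data)."""
    n = len(values)
    if n < 3:
        return 0.0
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


def skewed_columns_sketch(columns, threshold=1.0):
    """Names of columns whose |skewness| exceeds the assumed threshold."""
    return [name for name, vals in columns.items() if abs(skewness(vals)) > threshold]


data = {
    "age": [1, 2, 3, 4, 5],                         # symmetric
    "revenue": [1, 1, 1, 1, 1, 1, 1, 1, 1, 100],    # long right tail
}
flagged = skewed_columns_sketch(data)
```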

Visualisations

Two output modes, zero extra dependencies:

Terminal — ASCII bar charts

report.plot()
# or standalone:
from dfdoctor import plot_ascii
plot_ascii(report)

════════════════════════════════════════════════════════════════════
  dfdoctor — Visualizations
════════════════════════════════════════════════════════════════════

  Issue Severity Breakdown
  ────────────────────────────────────────────────────
  HIGH        ████████████░░░░░░░░░░░░░░░░   3.0
  MEDIUM      ████████████████████████████   4.0
  LOW         ███████░░░░░░░░░░░░░░░░░░░░░   1.0

  Top Missing-Value Columns  (% missing)
  ────────────────────────────────────────────────────
  notes             ████████████████████████████  82.0%
  phone             ████████░░░░░░░░░░░░░░░░░░░░  23.0%
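
Bars like these need nothing beyond string repetition. The sketch below scales each value against the largest value in the chart and fills a fixed-width bar; the width of 28, the rounding rule, and the function names are assumptions for illustration, not dfdoctor's actual rendering code.

```python
def ascii_bar(value, max_value, width=28, full="█", empty="░"):
    """Render value as a bar in which max_value fills the whole width."""
    if max_value <= 0:
        filled = 0
    else:
        clamped = min(max(value, 0), max_value)
        filled = round(width * clamped / max_value)
    return full * filled + empty * (width - filled)


def bar_chart(rows, width=28, suffix=""):
    """Format (label, value) pairs as aligned lines of text."""
    peak = max((value for _, value in rows), default=0)
    label_width = max((len(label) for label, _ in rows), default=0)
    return [
        f"{label:<{label_width}}  {ascii_bar(value, peak, width)}  {value:.1f}{suffix}"
        for label, value in rows
    ]


lines = bar_chart([("notes", 82.0), ("phone", 23.0)], suffix="%")
for line in lines:
    print(line)
```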

HTML — self-contained report with SVG charts

report.to_html("audit_report.html")   # write to file
html = report.to_html()               # get string

The HTML report includes a stats dashboard, colour-coded issue table, SVG bar charts, and five interactive SVG correlation heatmaps — all in a single self-contained file with no external assets.

Command-line interface

No Python code required. Audit any CSV, TSV, or Excel file directly from your terminal:

# Audit a file
dfdoctor audit data.csv

# Auto-fix and save
dfdoctor fix data.csv --output cleaned.csv

# Apply all fixes (including risky)
dfdoctor fix data.csv --all --output cleaned.csv

# Generate HTML report
dfdoctor html data.csv --output report.html
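
For readers curious how such a command layout maps onto `argparse`, here is a minimal sketch that accepts the four invocations shown above. The parser structure, the `apply_all` attribute name, and the optional `--output` are assumptions; only the subcommand names and flags are taken from the examples.

```python
import argparse


def build_parser():
    """Build a parser for the audit / fix / html subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dfdoctor")
    sub = parser.add_subparsers(dest="command", required=True)

    audit_cmd = sub.add_parser("audit", help="print an audit report")
    audit_cmd.add_argument("path")

    fix_cmd = sub.add_parser("fix", help="apply fixes and save the result")
    fix_cmd.add_argument("path")
    fix_cmd.add_argument("--all", action="store_true", dest="apply_all",
                         help="also apply risky fixes")
    fix_cmd.add_argument("--output", default=None)

    html_cmd = sub.add_parser("html", help="write an HTML report")
    html_cmd.add_argument("path")
    html_cmd.add_argument("--output", default=None)
    return parser


args = build_parser().parse_args(
    ["fix", "data.csv", "--all", "--output", "cleaned.csv"]
)
# In this sketch, --all is the CLI counterpart of safe_only=False.
safe_only = not args.apply_all
```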

Reading files

from dfdoctor import read_file

df = read_file("data.csv")       # CSV
df = read_file("data.tsv")       # TSV
df = read_file("data.xlsx")      # Excel (requires openpyxl)
df = read_file("data.xls")       # Legacy Excel

Prioritised cleaning suggestions

from dfdoctor import suggest_cleaning

suggestions = suggest_cleaning(df)   # sorted by priority score, highest first

for issue in suggestions:
    print(f"[{issue.severity.upper()}]  {issue.message}")
    print(f"  → {issue.recommendation}\n")

What dfdoctor detects

| Issue | Severity | Auto-fixable |
| --- | --- | --- |
| All-null column | HIGH | ✅ Safe |
| High missing values (≥ 50%) | HIGH | — |
| Moderate missing values (≥ 20%) | MEDIUM | — |
| Duplicate rows | HIGH | ⚠️ Risky |
| Constant column (one unique value) | MEDIUM | ✅ Safe |
| Near-constant column (≥ 95% one value) | MEDIUM | — |
| Numeric stored as string | MEDIUM | ✅ Safe |
| Date column stored as string | HIGH | ⚠️ Risky |
| Mixed date formats in one column | HIGH | — |
| Suspected identifier column | MEDIUM | ✅ Safe |
| High-cardinality categorical | MEDIUM | — |
| Inconsistent category labels (e.g. "US" vs "U.S.") | MEDIUM | — |
| Suspicious placeholder values ("NA", "?", "unknown") | LOW | ⚠️ Risky |
| Statistical outliers (IQR method) | MEDIUM | — |
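
The IQR method in the last row can be sketched with the standard library alone. This version uses Tukey's conventional 1.5 × IQR fences and the "inclusive" quartile definition from `statistics.quantiles`; the README does not state which multiplier or quartile rule dfdoctor uses, so both are assumptions, and `iqr_outliers` is a hypothetical name.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        # Too few points for quartiles to be meaningful.
        return []
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


session_length = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(session_length)
```

For this sample the quartiles are 10.75 and 12.25, so the fences sit at 8.5 and 14.5 and only the value 95 is flagged.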

Priority scoring

Every issue gets a numeric score so you know what to tackle first — no more guessing:

priority_score = severity_weight × confidence × impact

| Component | Description |
| --- | --- |
| `severity_weight` | HIGH = 3, MEDIUM = 2, LOW = 1 |
| `confidence` | 0.0 – 1.0 — how certain the rule is |
| `impact` | 0.0 – 1.0 — how much this affects downstream analysis |
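
A worked example of the formula. The score of 2.55 shown for the HIGH `signup_date` issue in the sample report is consistent with, for instance, a confidence of 0.85 and an impact of 1.0; those two inputs are assumed here for illustration, since the report only prints the final score. `priority_score` below is a hypothetical helper, not the library's function.

```python
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def priority_score(severity, confidence, impact):
    """severity_weight x confidence x impact, as defined in the table above."""
    if not (0.0 <= confidence <= 1.0 and 0.0 <= impact <= 1.0):
        raise ValueError("confidence and impact must be within 0.0 and 1.0")
    return SEVERITY_WEIGHT[severity.lower()] * confidence * impact


# Assumed inputs that reproduce the 2.55 from the sample report.
date_issue = priority_score("high", 0.85, 1.0)

# A shaky HIGH finding can rank below a well-supported MEDIUM one.
shaky_high = priority_score("high", 0.5, 0.8)      # 3 * 0.5 * 0.8
solid_medium = priority_score("medium", 1.0, 0.9)  # 2 * 1.0 * 0.9
```

Because confidence and impact multiply the weight, severity alone does not fix the ordering.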

for issue in report.sorted_by_priority():
    print(f"{issue.priority_score:.2f}  [{issue.severity}]  {issue.column}  —  {issue.message}")

The `Issue` object

Every issue returned by audit() or suggest_cleaning() is an Issue dataclass:

| Field | Type | Description |
| --- | --- | --- |
| `column` | `str \| None` | Column the issue belongs to (`None` = dataset-level) |
| `issue_type` | `str` | Machine-readable key, e.g. `"date_as_string"` |
| `severity` | `str` | `"high"`, `"medium"`, or `"low"` |
| `confidence` | `float` | 0.0 – 1.0 |
| `impact` | `float` | 0.0 – 1.0 |
| `message` | `str` | Plain-English description |
| `why_it_matters` | `str` | Why this issue is a problem |
| `recommendation` | `str` | Exact code fix to apply |
| `safe_to_auto_fix` | `bool` | Whether `auto_fix()` will apply this by default |
| `priority_score` | `float` | `severity_weight × confidence × impact` |
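
The field list translates directly into a dataclass. The sketch below is a stand-in named `IssueSketch` (the real class lives in dfdoctor and may differ in defaults or methods); it shows how such records can be sorted by score, filtered by column, and serialised to JSON. The second record and its `"duplicate_rows"` key are invented for the example, as are the confidence and impact values.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class IssueSketch:
    """Stand-in with the same fields as the table above."""
    column: Optional[str]
    issue_type: str
    severity: str
    confidence: float
    impact: float
    message: str
    why_it_matters: str
    recommendation: str
    safe_to_auto_fix: bool
    priority_score: float


issues = [
    IssueSketch(
        column=None, issue_type="duplicate_rows", severity="high",
        confidence=0.5, impact=0.8,
        message="Some rows are exact duplicates.",
        why_it_matters="Duplicates can inflate counts and totals.",
        recommendation="df = df.drop_duplicates()",
        safe_to_auto_fix=False, priority_score=1.2,
    ),
    IssueSketch(
        column="signup_date", issue_type="date_as_string", severity="high",
        confidence=0.85, impact=1.0,
        message="'signup_date' looks like a date column stored as text.",
        why_it_matters="Dates as strings block time-series ops and sorting.",
        recommendation="df['signup_date'] = pd.to_datetime(df['signup_date'])",
        safe_to_auto_fix=False, priority_score=2.55,
    ),
]

ranked = sorted(issues, key=lambda i: i.priority_score, reverse=True)
dataset_level = [i for i in issues if i.column is None]
payload = json.dumps([asdict(i) for i in ranked])
```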

Why zero dependencies?

Comparable profiling libraries (ydata-profiling, sweetviz, dataprep) typically pull in heavy plotting and scientific stacks such as matplotlib, scipy, or seaborn, plus many transitive dependencies. dfdoctor requires only pandas, which you already have.

This means:

Works in any environment: CI/CD pipelines, serverless functions, Docker containers, Jupyter, Colab, bare scripts
Installs in seconds with no dependency conflicts
Five correlation methods including Kendall τ and Phi-k — all implemented with pure numpy, no scipy required
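
To give a sense of how little is needed for a rank correlation without scipy, here is a sketch of Spearman's ρ as Pearson's r applied to average ranks. It is written in plain Python rather than numpy to stay self-contained, and it is an illustration of the general technique, not dfdoctor's code; all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


visits = [1, 2, 3, 4, 5]
revenue = [1, 8, 27, 64, 125]   # monotonic but not linear

rho = spearman(visits, revenue)
r = pearson(visits, revenue)
```

On this monotonic but non-linear data `rho` reaches 1.0 while `r` stays below it, which is the difference between the rank-order and linear measures.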

Project structure

dfdoctor/
├── src/
│   └── dfdoctor/
│       ├── __init__.py        # public API exports
│       ├── audit.py           # main audit() function
│       ├── types.py           # AuditReport, Issue dataclasses
│       ├── suggest.py         # suggest_cleaning()
│       ├── eda.py             # quick_eda(), EDAReport
│       ├── fix.py             # auto_fix()
│       ├── compare.py         # compare(), CompareReport
│       ├── correlations.py    # correlate(), five methods, zero-dep
│       ├── viz.py             # plot_ascii(), SVG chart generators
│       ├── cli.py             # dfdoctor CLI (argparse)
│       ├── utils.py           # read_file(), memory helpers
│       └── rules/
│           ├── missing.py     # null / high-missing detection
│           ├── duplicates.py  # duplicate row detection
│           ├── datatypes.py   # numeric-as-string, type inference
│           ├── identifiers.py # suspected ID column detection
│           ├── dates.py       # date-as-string, mixed formats
│           ├── cardinality.py # high-cardinality categoricals
│           ├── categories.py  # inconsistent labels, placeholders
│           └── outliers.py    # IQR-based outlier detection
├── tests/                     # 132 tests, 0 warnings
├── demo/
│   ├── messy_sales_data.csv   # example messy dataset (215 rows)
│   └── run_demo.py            # full end-to-end demo script
├── pyproject.toml
├── LICENSE
└── README.md

API reference

`audit(df) → AuditReport`

report = audit(df)
report.pretty_print()
report.summary()               # → dict
report.to_dict()               # → dict (JSON-serialisable)
report.high_priority()         # → list[Issue]
report.sorted_by_priority()    # → list[Issue]
report.by_column("col")        # → list[Issue]
report.plot()                  # print ASCII charts
report.to_html("report.html")  # save HTML report
report.correlations()          # → CorrelationReport

`auto_fix(df, safe_only=True) → tuple[DataFrame, list[str]]`

cleaned, log = auto_fix(df)                  # safe fixes only
cleaned, log = auto_fix(df, safe_only=False) # all fixes

`compare(df_before, df_after) → CompareReport`

rep = compare(df, cleaned)
rep.pretty_print()
rep.to_dict()

`correlate(df) → CorrelationReport`

corr = correlate(df)
corr.pearson_matrix     # dict[str, dict[str, float]]
corr.spearman_matrix    # dict[str, dict[str, float]]
corr.kendall_matrix     # dict[str, dict[str, float]]
corr.cramers_matrix     # dict[str, dict[str, float]]
corr.phik_matrix        # dict[str, dict[str, float]] — ALL column pairs
corr.top_pairs          # list[CorrelationPair], sorted by |value|
corr.pretty_print()
corr.to_dict()

`quick_eda(df, target=None) → EDAReport`

insights = quick_eda(df, target="churn")
insights.pretty_print()
insights.top_findings           # list[str]
insights.strong_correlations    # list[tuple]
insights.skewed_columns         # list[str]
insights.high_missing_columns   # list[dict]
insights.target_correlations    # dict[str, float]

`suggest_cleaning(df) → list[Issue]`

Returns issues sorted by priority score (highest first).

`read_file(path) → DataFrame`

Supports .csv, .tsv, .xlsx, .xls, .xlsm.

`plot_ascii(report) → None`

Prints ASCII bar charts for issue severity, missing values, and outliers.

Contributing

Pull requests are welcome. To get started:

git clone https://github.com/ajayvarmaramineni/dfdoctor
cd dfdoctor
pip install -e ".[dev]"
pytest tests/

Please open an issue first to discuss what you'd like to change. All contributions should include tests and pass with 0 warnings.

License

This version

0.3.0

Apr 4, 2026
