Dataset Auto-Diagnosis Python Library — find and fix data problems before model training.

These details have not been verified by PyPI

Project links

Project description

DataDiagnose

A zero-dependency Python library that audits your dataset and prescribes fixes before you train your model.

The Problem This Solves

Every beginner data scientist goes through the same painful experience. You download a dataset, you are excited to build your first model, you train it — and the results are terrible. 50% accuracy. Predictions that make no sense. Hours of confusion.

In most cases the model is not the problem. The data is broken. There are missing values pulling your model in the wrong direction. An outlier like an age of 900 years is destroying your statistics. Your target column has 95% of rows saying "no", so your model just learned to say "no" for everything.

Experienced data scientists know to check for these things before touching a model. They run a full diagnostic on the dataset first. DataDiagnose automates that diagnostic in one function call.

What It Does

Give DataDiagnose any dataset and it returns:

A health score from 0 to 100 showing how clean your data is
A list of every problem detected, with severity (CRITICAL / HIGH / MEDIUM / LOW)
A specific fix suggestion for every problem
Model recommendations based on your data characteristics
Feature engineering hints based on your column names
A Pandas DataFrame of all column statistics with one call (new in v1.0.3)

Eight problems are detected automatically:

Problem	What It Means
Missing Values	Null, empty, or NaN values in any column
Outliers	Extreme values detected by both IQR and Z-score methods
Skewness	Lopsided distributions that hurt linear models
Class Imbalance	One class vastly outnumbering others in your target
Data Leakage	Columns that secretly contain the answer
Duplicate Rows	Identical rows that bias your model
Constant Columns	Columns with zero variation — zero information
High Cardinality	ID-like columns with almost all unique values

What Is New in v1.0.4

1. Cleaner internal architecture

The redundant class imbalance pre-filter in core.py has been removed. All detection guards (regression target guard, numeric feature guard, binary column check) are now handled entirely inside detectors.py where they belong. This makes the code easier to read and extend — the detector is fully self-contained.

No change to behaviour — the output you get from diagnose() is identical to v1.0.3.

2. Smarter class imbalance detection (carried forward from v1.0.2)

Discrete numeric feature columns like bedrooms, bathrooms, stories, parking, AGE, AUCTION YEAR are no longer falsely flagged as imbalanced. The rule is precise:

Text/categorical columns (COUNTRY, TEAM, PhoneService) → imbalance checked ✅
Binary numeric columns (SeniorCitizen with values 0/1) → imbalance checked ✅
Target column → imbalance checked ✅
Discrete numeric features with 3+ unique values → stats recorded, no issue raised ✅

3. Proportional health scoring (carried forward from v1.0.2)

Wide datasets with 20+ columns no longer score 0/100. The scoring cap means a dataset with many minor skewness warnings scores fairly, while genuinely broken data with multiple CRITICAL issues still scores near 0.

Installation

DataDiagnose has zero runtime dependencies for its core features. It runs entirely on Python's standard library.

# Once published on PyPI
pip install datadiagnose

For now, copy the datadiagnose/ folder into your project and import directly.

To use get_stats_df() or pass a DataFrame directly, pandas must be installed:

pip install pandas

Quick Start

from datadiagnose import diagnose

dataset = {
    "age":    [25, 30, None, 22, 900, 28],
    "income": [50000, 60000, None, 48000, 52000, 61000],
    "city":   ["KOL", "MUM", "KOL", "DEL", "KOL", "KOL"],
    "target": [1, 0, 1, 0, 1, 0],
}

report = diagnose(dataset, target_col="target")
print(report)

Output:

==============================================================
  DATADIAGNOSE REPORT — DATASET
==============================================================
  Rows    : 6
  Columns : 4
  Score   : 84/100   Needs Work
--------------------------------------------------------------
  Issues Found (2)

  1. MEDIUM
     Missing Values in 'age'
     -> 16.7% of values are missing.
     Fix: Fill 'age' with median (numeric) or mode (categorical).

  2. MEDIUM
     Missing Values in 'income'
     -> 16.7% of values are missing.
     Fix: Fill 'income' with median (numeric) or mode (categorical).
...

Works With pandas — Three Ways

import pandas as pd
import datadiagnose as dd

df = pd.read_csv("titanic.csv")

# Way 1 — Pass DataFrame directly (new in v1.0.1)
report = dd.diagnose(df, target_col="Survived")

# Way 2 — Convert manually (original method, still works)
report = dd.diagnose(df.to_dict(orient="list"), target_col="Survived")

# Way 3 — Get all stats as a DataFrame instantly (new in v1.0.1)
df_stats = dd.get_stats_df(df, target_col="Survived")
print(df_stats)

Full API Reference

`diagnose(dataset, target_col=None, dataset_name="dataset")`

The main function. Runs all eight detectors and returns a DiagnosisReport.

New in v1.0.1: accepts pandas DataFrames directly without conversion.

report = diagnose(dataset, target_col="label", dataset_name="Titanic")

report.score          # int — health score 0-100
report.issues         # list of Issue objects
report.suggestions    # list of fix strings
report.model_types    # list of recommended model names
report.column_reports # dict — {col_name: ColumnReport}
report.n_rows         # int
report.n_cols         # int

`get_stats_df(dataset, target_col=None)` — new in v1.0.1

Runs a full diagnosis and returns all column statistics as a transposed Pandas DataFrame. Metrics are rows, your dataset columns are the headers.

import datadiagnose as dd

df_stats = dd.get_stats_df(my_dataset, target_col="target")
# Returns DataFrame — perfect for Jupyter notebooks
# Requires: pip install pandas

`report.to_df()` — new in v1.0.1

Method directly on the DiagnosisReport object. Does the same thing as get_stats_df() but you call it after you already have a report.

report   = diagnose(my_dataset, target_col="target")
df_stats = report.to_df()

`quick_scan(dataset, target_col=None)`

One-liner that runs the diagnosis and immediately prints the report to your terminal. Returns the DiagnosisReport object in case you need it.

quick_scan(dataset, target_col="label")

`health_score(dataset, target_col=None)`

Returns only the integer score (0–100). Perfect for quality gates in automated pipelines — block model training if data quality is too low.

score = health_score(dataset, target_col="label")

if score < 70:
    raise ValueError(f"Data quality too low: {score}/100. Fix issues first.")

`list_issues(dataset, target_col=None)`

Returns a concise list of (severity, title) tuples for quick scanning without reading the full report.

for severity, title in list_issues(dataset, "label"):
    print(f"[{severity}] {title}")

`get_suggestions(dataset, target_col=None)`

Returns only the list of actionable fix suggestions as plain strings.

for tip in get_suggestions(dataset, "label"):
    print("-", tip)

`column_summary(dataset, col_name, target_col=None)`

Returns the ColumnReport for one specific column — all the statistics computed for that column during diagnosis.

rep = column_summary(dataset, "age")
print(rep.details)
# {'type': 'numeric', 'mean': '27.4', 'std': '3.2', 'missing_pct': '10.0%', ...}

Understanding the Health Score

Every dataset starts at 100. Each detected issue deducts points based on its severity:

Severity	Points Lost	Example
CRITICAL	25	Data leakage confirmed, >60% missing values
HIGH	15	>30% missing, class imbalance ratio 10:1+
MEDIUM	8	Moderate skewness, some outliers, mild imbalance
LOW	3	A few minor outliers, slight skewness

Score	Status	What To Do
80 – 100	Healthy	Data is ready for modelling
50 – 79	Needs Work	Fix HIGH and CRITICAL issues before training
0 – 49	Critical	Do not train models yet

Project Structure

datadiagnose/
│
├── datadiagnose/          <- The Python package
│   ├── __init__.py        <- Public API
│   ├── core.py            <- Main diagnose() engine + get_stats_df()
│   ├── detectors.py       <- All 8 detector functions
│   ├── models.py          <- DiagnosisReport, Issue, ColumnReport + to_df()
│   └── utils.py           <- Pure math helpers (no dependencies)
│
├── tests/
│   ├── __init__.py
│   ├── sample_data.py     <- 11 sample datasets with known problems
│   ├── test_detectors.py  <- 60 unit tests for each detector
│   └── test_core.py       <- 80 integration tests for the full API
│
├── examples/
│   ├── basic_usage.py          <- Start here — every function shown
│   ├── pandas_integration.py   <- How to use with pandas DataFrames
│   └── student_dataset_demo.py <- Full workflow, step by step
│
├── .github/workflows/tests.yml <- Auto-run tests on every push
├── .flake8                     <- Code style configuration
├── .gitignore
├── LICENSE
├── README.md
└── pyproject.toml

Running the Tests

DataDiagnose has 140 tests covering every detector, every public function, and edge cases. Tests use only Python's built-in unittest — no pytest required (though pytest works too).

# Run all 140 tests at once
python -m unittest discover -s tests -v

# Run just the detector tests
python -m unittest tests.test_detectors -v

# Run just the core API tests
python -m unittest tests.test_core -v

# Run one specific test class
python -m unittest tests.test_detectors.TestMissingValueDetector -v

# Run one specific test
python -m unittest tests.test_detectors.TestMissingValueDetector.test_critical_missing_severity -v

All 140 tests should pass with output ending in:

----------------------------------------------------------------------
Ran 140 tests in 0.08s

OK

Running the Examples

# Simplest introduction — run this first
python examples/basic_usage.py

# How to use with pandas DataFrames
python examples/pandas_integration.py

# A full realistic data cleaning workflow, step by step
python examples/student_dataset_demo.py

Why Zero Runtime Dependencies?

DataDiagnose uses only Python's built-in math, statistics, and collections modules for its core logic. Pandas is optional and only needed if you want get_stats_df() or report.to_df(). This design was deliberate:

Works everywhere — any Python 3.7+ environment, nothing extra to install for the core library.

No version conflicts — requiring numpy or pandas would create compatibility problems for users who already have specific versions pinned in their projects.

Educational — every algorithm (IQR outlier detection, Pearson correlation, skewness) is written from scratch in plain Python. You can read the source code and learn exactly how each calculation works, not just how to call a black box.

Lightweight — five Python files, around 1000 lines, no compiled extensions, no C dependencies.

Design Decisions

Why does it detect problems but not fix them automatically?

Automatically modifying data without human judgment is dangerous. Filling missing values with the wrong strategy can make your model worse, not better. Whether you should drop a column or impute it — and what value to impute with — depends on your domain knowledge and what the data actually means in your context. DataDiagnose gives you the diagnosis and a specific recommendation for each issue. The decision and the action remain yours.

Why does it accept a dict-of-lists as its native format?

The original design needed zero dependencies, which meant no pandas. A plain Python dict is the simplest possible format. As of v1.0.1, DataFrames are also accepted directly. Internally, the library always converts to dict-of-lists format because column-wise operations on a dict are faster than on a DataFrame for this use case.

How to Contribute

Contributions are welcome. Here are some ideas from the roadmap:

HTML report export — a self-contained HTML file with charts and colour-coded issue cards
Correlation matrix — detect multicollinearity between feature columns
Web dashboard — a Flask/FastAPI endpoint where you upload a CSV and get a diagnosis in the browser
More detectors — mixed type columns, date range anomalies, text language detection

To contribute:

Fork the repository on GitHub
Create a branch: git checkout -b feature/my-improvement
Write your code and add tests for it
Confirm all 140 existing tests still pass: python -m unittest discover -s tests -v
Open a pull request with a clear description of what you changed and why

Changelog

v1.0.4

Refactor: Removed redundant class imbalance pre-filter from core.py — all guards are now handled entirely inside detectors.py. No behaviour change.

v1.0.3

No functional changes — version bump, confirmed all v1.0.2 fixes working correctly.

v1.0.2

Fixed: Proportional health scoring system — wide datasets (20+ columns) no longer score 0/100 due to accumulated medium/low issues
Fixed: Discrete numeric feature columns (bedrooms, bathrooms, stories, AGE, AUCTION YEAR) no longer falsely flagged as class-imbalanced. Only text columns, binary 0/1 columns, and the explicit target column are checked for imbalance.

v1.0.1

New: diagnose() now accepts pandas DataFrames directly — no .to_dict() needed
New: get_stats_df(dataset, target_col) — returns full column statistics as a Pandas DataFrame
New: report.to_df() — method on DiagnosisReport to get stats table as DataFrame
Fixed: Class imbalance detector no longer incorrectly flags continuous regression targets (e.g. house prices, salaries)

v1.0.0 — Initial Release

Eight detectors: missing values, outliers, skewness, class imbalance, data leakage, duplicate rows, constant columns, high cardinality
Feature engineering hints based on column name patterns (datetime, text, geo)
Model recommendation engine for regression, binary, and multiclass classification
Health score system (0–100) with severity-based point deductions
Full public API: diagnose, quick_scan, health_score, list_issues, get_suggestions, column_summary
140 unit and integration tests
Zero runtime dependencies

License

This project is licensed under the MIT License — see the LICENSE file for the full text.

In plain English: you can use this code for anything, including commercial projects, as long as you keep the copyright notice with the author's name in any copy you distribute.

Author

Nilotpal Dhar

Built as a beginner Python project to learn how data science diagnostics work from first principles. Every algorithm in this library — IQR outlier detection, Pearson correlation, skewness calculation — is implemented from scratch in plain Python so that reading the source code teaches you how the mathematics actually works, not just how to call a function.

If this library helped you, please star the repository on GitHub. If you found a bug or have a feature idea, open an issue — all feedback is welcome.

GitHub Repository

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.4

Apr 12, 2026

1.0.3

Mar 24, 2026

1.0.2

Mar 22, 2026

1.0.1

Mar 22, 2026

1.0.0

Mar 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datadiagnose-1.0.4.tar.gz (44.9 kB view details)

Uploaded Apr 12, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

datadiagnose-1.0.4-py3-none-any.whl (30.7 kB view details)

Uploaded Apr 12, 2026 Python 3

File details

Details for the file datadiagnose-1.0.4.tar.gz.

File metadata

Download URL: datadiagnose-1.0.4.tar.gz
Upload date: Apr 12, 2026
Size: 44.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for datadiagnose-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`6d8f60b7d516064032546ecda100a2f5c6f1dce79dbad35cbc3bed586d6692cf`
MD5	`ab3505e99ada630252bed308ba53fd1c`
BLAKE2b-256	`95d1f8910ec4a04118ea26e8df7dfb66458248c6b53ce9616dac4c22a1469426`

See more details on using hashes here.

File details

Details for the file datadiagnose-1.0.4-py3-none-any.whl.

File metadata

Download URL: datadiagnose-1.0.4-py3-none-any.whl
Upload date: Apr 12, 2026
Size: 30.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for datadiagnose-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`12e5c065cfe845aa665f660b4c9932dbbbad2290d07afa5439c80d256e2fa656`
MD5	`afb7f53995e88770d2779e4194f2a0b9`
BLAKE2b-256	`24d4a20ad65a6452e2ea49bd5885127292f9885236f1b09f49fe049652b5b6f0`

See more details on using hashes here.

datadiagnose 1.0.4

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

DataDiagnose

The Problem This Solves

What It Does

What Is New in v1.0.4

1. Cleaner internal architecture

2. Smarter class imbalance detection (carried forward from v1.0.2)

3. Proportional health scoring (carried forward from v1.0.2)

Installation

Quick Start

Works With pandas — Three Ways

Full API Reference

diagnose(dataset, target_col=None, dataset_name="dataset")

get_stats_df(dataset, target_col=None) — new in v1.0.1

report.to_df() — new in v1.0.1

quick_scan(dataset, target_col=None)

health_score(dataset, target_col=None)

list_issues(dataset, target_col=None)

get_suggestions(dataset, target_col=None)

column_summary(dataset, col_name, target_col=None)

Understanding the Health Score

Project Structure

Running the Tests

Running the Examples

Why Zero Runtime Dependencies?

Design Decisions

How to Contribute

Changelog

v1.0.4

v1.0.3

v1.0.2

v1.0.1

v1.0.0 — Initial Release

License

Author

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`diagnose(dataset, target_col=None, dataset_name="dataset")`

`get_stats_df(dataset, target_col=None)` — new in v1.0.1

`report.to_df()` — new in v1.0.1

`quick_scan(dataset, target_col=None)`

`health_score(dataset, target_col=None)`

`list_issues(dataset, target_col=None)`

`get_suggestions(dataset, target_col=None)`

`column_summary(dataset, col_name, target_col=None)`