
Dataset Auto-Diagnosis Python Library — find and fix data problems before model training.


DataDiagnose

A zero-dependency Python library that audits your dataset and prescribes fixes before you train your model.



The Problem This Solves

Every beginner data scientist goes through the same painful experience. You download a dataset, you are excited to build your first model, you train it — and the results are terrible. 50% accuracy. Predictions that make no sense. Hours of confusion.

In most cases the model is not the problem. The data is broken. There are missing values pulling your model in the wrong direction. An outlier like an age of 900 years is destroying your statistics. Your target column has 95% of rows saying "no", so your model just learned to say "no" for everything.

Experienced data scientists know to check for these things before touching a model. They run a full diagnostic on the dataset first. DataDiagnose automates that diagnostic in one function call.


What It Does

Give DataDiagnose any dataset and it returns:

  • A health score from 0 to 100 showing how clean your data is
  • A list of every problem detected, with severity (CRITICAL / HIGH / MEDIUM / LOW)
  • A specific fix suggestion for every problem
  • Model recommendations based on your data characteristics
  • Feature engineering hints based on your column names
  • A pandas DataFrame of all column statistics with one call (new in v1.0.1)

Eight problems are detected automatically:

Problem            What It Means
-----------------  --------------------------------------------------------
Missing Values     Null, empty, or NaN values in any column
Outliers           Extreme values detected by both IQR and Z-score methods
Skewness           Lopsided distributions that hurt linear models
Class Imbalance    One class vastly outnumbering others in your target
Data Leakage       Columns that secretly contain the answer
Duplicate Rows     Identical rows that bias your model
Constant Columns   Columns with zero variation, and therefore zero information
High Cardinality   ID-like columns with almost all unique values
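
To make the outlier row concrete, here is a minimal standard-library sketch of the two methods it names. The function names, the 1.5 × IQR fence, the Z-score threshold of 3, and the "inclusive" quartile method are illustrative assumptions, not necessarily what detectors.py does. It also relies on statistics.quantiles, which needs Python 3.8 or later.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; None entries are skipped."""
    data = [v for v in values if v is not None]
    if len(data) < 4:
        return []  # too few points for meaningful quartiles
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    data = [v for v in values if v is not None]
    if len(data) < 2:
        return []
    mean = statistics.mean(data)
    std = statistics.stdev(data)
    if std == 0:
        return []  # constant column: nothing can be an outlier
    return [v for v in data if abs(v - mean) / std > threshold]


# Hypothetical column: twenty plausible ages, one missing value, one impossible age.
ages = list(range(20, 40)) + [None, 900]
print(iqr_outliers(ages))     # [900]
print(zscore_outliers(ages))  # [900]
```

Running both checks is useful because they fail differently. In a very small sample a single extreme value inflates the standard deviation so much that its own Z-score can stay below the threshold, while the IQR fence is unaffected by how extreme the value is.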

What Is New in v1.0.4

1. Cleaner internal architecture

The redundant class imbalance pre-filter in core.py has been removed. All detection guards (regression target guard, numeric feature guard, binary column check) are now handled entirely inside detectors.py where they belong. This makes the code easier to read and extend — the detector is fully self-contained.

No change to behaviour — the output you get from diagnose() is identical to v1.0.3.

2. Smarter class imbalance detection (carried forward from v1.0.2)

Discrete numeric feature columns like bedrooms, bathrooms, stories, parking, AGE, AUCTION YEAR are no longer falsely flagged as imbalanced. The rule is precise:

  • Text/categorical columns (COUNTRY, TEAM, PhoneService) → imbalance checked ✅
  • Binary numeric columns (SeniorCitizen with values 0/1) → imbalance checked ✅
  • Target column → imbalance checked ✅
  • Discrete numeric features with 3+ unique values → stats recorded, no issue raised ✅
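
The rule above can be sketched as a small predicate. This is an illustration under stated assumptions, not the code in detectors.py: the name should_check_imbalance and its parameters are hypothetical, and the regression target guard mentioned earlier (continuous targets are not checked) is deliberately left out to keep the sketch short.

```python
def should_check_imbalance(values, is_target=False):
    """Decide whether a column is a candidate for the class imbalance check.

    Hypothetical helper mirroring the four bullet points:
    target, text/categorical, and binary numeric columns are checked;
    discrete numeric features with 3+ unique values are not.
    """
    present = [v for v in values if v is not None]
    if not present:
        return False  # nothing to measure in an empty column
    if is_target:
        return True   # the explicit target column is always checked
    if any(isinstance(v, str) for v in present):
        return True   # text / categorical column
    # Numeric feature: only a binary column (exactly two distinct values) qualifies.
    return len(set(present)) == 2


print(should_check_imbalance(["Yes", "No", "No"]))       # True  (text column)
print(should_check_imbalance([0, 1, 1, 0, None]))        # True  (binary numeric)
print(should_check_imbalance([2, 3, 4, 3, 2]))           # False (e.g. bedrooms)
print(should_check_imbalance([2, 3, 4], is_target=True)) # True  (target column)
```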

3. Proportional health scoring (carried forward from v1.0.2)

Wide datasets with 20+ columns no longer score 0/100. The scoring cap means a dataset with many minor skewness warnings scores fairly, while genuinely broken data with multiple CRITICAL issues still scores near 0.


Installation

DataDiagnose has zero runtime dependencies for its core features. It runs entirely on Python's standard library.

# Install from PyPI
pip install datadiagnose

Alternatively, copy the datadiagnose/ folder into your project and import directly.

To use get_stats_df() or pass a DataFrame directly, pandas must be installed:

pip install pandas

Quick Start

from datadiagnose import diagnose

dataset = {
    "age":    [25, 30, None, 22, 900, 28],
    "income": [50000, 60000, None, 48000, 52000, 61000],
    "city":   ["KOL", "MUM", "KOL", "DEL", "KOL", "KOL"],
    "target": [1, 0, 1, 0, 1, 0],
}

report = diagnose(dataset, target_col="target")
print(report)

Output:

==============================================================
  DATADIAGNOSE REPORT — DATASET
==============================================================
  Rows    : 6
  Columns : 4
  Score   : 84/100   Needs Work
--------------------------------------------------------------
  Issues Found (2)

  1. MEDIUM
     Missing Values in 'age'
     -> 16.7% of values are missing.
     Fix: Fill 'age' with median (numeric) or mode (categorical).

  2. MEDIUM
     Missing Values in 'income'
     -> 16.7% of values are missing.
     Fix: Fill 'income' with median (numeric) or mode (categorical).
...

Works With pandas — Three Ways

import pandas as pd
import datadiagnose as dd

df = pd.read_csv("titanic.csv")

# Way 1 — Pass DataFrame directly (new in v1.0.1)
report = dd.diagnose(df, target_col="Survived")

# Way 2 — Convert manually (original method, still works)
report = dd.diagnose(df.to_dict(orient="list"), target_col="Survived")

# Way 3 — Get all stats as a DataFrame instantly (new in v1.0.1)
df_stats = dd.get_stats_df(df, target_col="Survived")
print(df_stats)

Full API Reference

diagnose(dataset, target_col=None, dataset_name="dataset")

The main function. Runs all eight detectors and returns a DiagnosisReport.

New in v1.0.1: accepts pandas DataFrames directly without conversion.

report = diagnose(dataset, target_col="label", dataset_name="Titanic")

report.score          # int — health score 0-100
report.issues         # list of Issue objects
report.suggestions    # list of fix strings
report.model_types    # list of recommended model names
report.column_reports # dict — {col_name: ColumnReport}
report.n_rows         # int
report.n_cols         # int

get_stats_df(dataset, target_col=None) (new in v1.0.1)

Runs a full diagnosis and returns all column statistics as a transposed Pandas DataFrame. Metrics are rows, your dataset columns are the headers.

import datadiagnose as dd

df_stats = dd.get_stats_df(my_dataset, target_col="target")
# Returns DataFrame — perfect for Jupyter notebooks
# Requires: pip install pandas

report.to_df() (new in v1.0.1)

Method directly on the DiagnosisReport object. Does the same thing as get_stats_df() but you call it after you already have a report.

report   = diagnose(my_dataset, target_col="target")
df_stats = report.to_df()

quick_scan(dataset, target_col=None)

One-liner that runs the diagnosis and immediately prints the report to your terminal. Returns the DiagnosisReport object in case you need it.

quick_scan(dataset, target_col="label")

health_score(dataset, target_col=None)

Returns only the integer score (0–100). Perfect for quality gates in automated pipelines — block model training if data quality is too low.

score = health_score(dataset, target_col="label")

if score < 70:
    raise ValueError(f"Data quality too low: {score}/100. Fix issues first.")

list_issues(dataset, target_col=None)

Returns a concise list of (severity, title) tuples for quick scanning without reading the full report.

for severity, title in list_issues(dataset, "label"):
    print(f"[{severity}] {title}")

get_suggestions(dataset, target_col=None)

Returns only the list of actionable fix suggestions as plain strings.

for tip in get_suggestions(dataset, "label"):
    print("-", tip)

column_summary(dataset, col_name, target_col=None)

Returns the ColumnReport for one specific column — all the statistics computed for that column during diagnosis.

rep = column_summary(dataset, "age")
print(rep.details)
# {'type': 'numeric', 'mean': '27.4', 'std': '3.2', 'missing_pct': '10.0%', ...}

Understanding the Health Score

Every dataset starts at 100. Each detected issue deducts points based on its severity:

Severity   Points Lost   Example
--------   -----------   ------------------------------------------------
CRITICAL   25            Data leakage confirmed, >60% missing values
HIGH       15            >30% missing, class imbalance ratio 10:1+
MEDIUM     8             Moderate skewness, some outliers, mild imbalance
LOW        3             A few minor outliers, slight skewness

Score      Status        What To Do
--------   -----------   ------------------------------------------------
80 – 100   Healthy       Data is ready for modelling
50 – 79    Needs Work    Fix HIGH and CRITICAL issues before training
0 – 49     Critical      Do not train models yet
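
The arithmetic behind the two tables can be sketched in a few lines. This is a simplified illustration: the names PENALTY, naive_score, and status_for are hypothetical, and the sketch applies plain per-issue deductions only. It does not model the proportional scoring cap introduced in v1.0.2, so the library may score wide datasets higher than this does.

```python
# Points lost per issue, copied from the severity table.
PENALTY = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 8, "LOW": 3}


def naive_score(severities):
    """Start at 100, subtract the penalty for each issue, and never go below 0."""
    return max(0, 100 - sum(PENALTY[s] for s in severities))


def status_for(score):
    """Map a 0-100 score to the status bands from the second table."""
    if score >= 80:
        return "Healthy"
    if score >= 50:
        return "Needs Work"
    return "Critical"


print(naive_score(["MEDIUM", "MEDIUM"]))  # 84
print(naive_score(["CRITICAL"] * 5))      # 0
print(status_for(60))                     # Needs Work
```

Two MEDIUM issues cost 8 points each, which matches the 84/100 shown in the Quick Start output.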

Project Structure

datadiagnose/
│
├── datadiagnose/          <- The Python package
│   ├── __init__.py        <- Public API
│   ├── core.py            <- Main diagnose() engine + get_stats_df()
│   ├── detectors.py       <- All 8 detector functions
│   ├── models.py          <- DiagnosisReport, Issue, ColumnReport + to_df()
│   └── utils.py           <- Pure math helpers (no dependencies)
│
├── tests/
│   ├── __init__.py
│   ├── sample_data.py     <- 11 sample datasets with known problems
│   ├── test_detectors.py  <- 60 unit tests for each detector
│   └── test_core.py       <- 80 integration tests for the full API
│
├── examples/
│   ├── basic_usage.py          <- Start here — every function shown
│   ├── pandas_integration.py   <- How to use with pandas DataFrames
│   └── student_dataset_demo.py <- Full workflow, step by step
│
├── .github/workflows/tests.yml <- Auto-run tests on every push
├── .flake8                     <- Code style configuration
├── .gitignore
├── LICENSE
├── README.md
└── pyproject.toml

Running the Tests

DataDiagnose has 140 tests covering every detector, every public function, and edge cases. Tests use only Python's built-in unittest — no pytest required (though pytest works too).

# Run all 140 tests at once
python -m unittest discover -s tests -v

# Run just the detector tests
python -m unittest tests.test_detectors -v

# Run just the core API tests
python -m unittest tests.test_core -v

# Run one specific test class
python -m unittest tests.test_detectors.TestMissingValueDetector -v

# Run one specific test
python -m unittest tests.test_detectors.TestMissingValueDetector.test_critical_missing_severity -v

All 140 tests should pass with output ending in:

----------------------------------------------------------------------
Ran 140 tests in 0.08s

OK

Running the Examples

# Simplest introduction — run this first
python examples/basic_usage.py

# How to use with pandas DataFrames
python examples/pandas_integration.py

# A full realistic data cleaning workflow, step by step
python examples/student_dataset_demo.py

Why Zero Runtime Dependencies?

DataDiagnose uses only Python's built-in math, statistics, and collections modules for its core logic. Pandas is optional and only needed if you want get_stats_df() or report.to_df(). This design was deliberate:

Works everywhere — any Python 3.7+ environment, nothing extra to install for the core library.

No version conflicts — requiring numpy or pandas would create compatibility problems for users who already have specific versions pinned in their projects.

Educational — every algorithm (IQR outlier detection, Pearson correlation, skewness) is written from scratch in plain Python. You can read the source code and learn exactly how each calculation works, not just how to call a black box.

Lightweight — five Python files, around 1000 lines, no compiled extensions, no C dependencies.
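
As a taste of what "from scratch in plain Python" looks like, here is a minimal sketch of two of the calculations named above. The function names are illustrative and the formulas are the textbook ones (Pearson's r and the population moment coefficient of skewness); the versions in utils.py may differ in details such as missing-value handling or sample corrections.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation; report 0
    return cov / (sx * sy)


def skewness(xs):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    if m2 == 0:
        return 0.0  # constant column: no spread, so no skew
    return m3 / m2 ** 1.5


print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0
print(skewness([1, 2, 3]))                      # 0.0
print(skewness([1, 1, 1, 10]) > 0)              # True (long right tail)
```

One plausible use, stated here as an assumption rather than a description of the library: a feature whose correlation with the target is very close to +1 or -1 is a natural candidate for a data leakage warning.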


Design Decisions

Why does it detect problems but not fix them automatically?

Automatically modifying data without human judgment is dangerous. Filling missing values with the wrong strategy can make your model worse, not better. Whether you should drop a column or impute it — and what value to impute with — depends on your domain knowledge and what the data actually means in your context. DataDiagnose gives you the diagnosis and a specific recommendation for each issue. The decision and the action remain yours.

Why does it accept a dict-of-lists as its native format?

The original design needed zero dependencies, which meant no pandas. A plain Python dict is the simplest possible format. As of v1.0.1, DataFrames are also accepted directly. Internally, the library always converts to dict-of-lists format because column-wise operations on a dict are faster than on a DataFrame for this use case.
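
A minimal sketch of how such a normalisation step could look without importing pandas. The helper name to_columns is hypothetical and the real conversion in core.py may work differently; the only pandas call assumed is DataFrame.to_dict(orient="list"), the same one used in the manual conversion example earlier.

```python
def to_columns(data):
    """Normalise supported inputs to a dict of column name -> list of values."""
    if isinstance(data, dict):
        # Already column-oriented; copy each column into a plain list.
        return {name: list(col) for name, col in data.items()}
    to_dict = getattr(data, "to_dict", None)
    if callable(to_dict):
        # Duck-typed DataFrame support: no pandas import is needed here.
        return to_dict(orient="list")
    raise TypeError(f"unsupported dataset type: {type(data).__name__}")


columns = to_columns({"age": (25, 30, None), "city": ["KOL", "MUM", "KOL"]})
print(columns["age"])  # [25, 30, None]
```

Checking for a to_dict method instead of importing pandas keeps the core import-free, which is consistent with the zero-dependency goal described above.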


How to Contribute

Contributions are welcome. Here are some ideas from the roadmap:

  • HTML report export — a self-contained HTML file with charts and colour-coded issue cards
  • Correlation matrix — detect multicollinearity between feature columns
  • Web dashboard — a Flask/FastAPI endpoint where you upload a CSV and get a diagnosis in the browser
  • More detectors — mixed type columns, date range anomalies, text language detection

To contribute:

  1. Fork the repository on GitHub
  2. Create a branch: git checkout -b feature/my-improvement
  3. Write your code and add tests for it
  4. Confirm all 140 existing tests still pass: python -m unittest discover -s tests -v
  5. Open a pull request with a clear description of what you changed and why

Changelog

v1.0.4

  • Refactor: Removed redundant class imbalance pre-filter from core.py — all guards are now handled entirely inside detectors.py. No behaviour change.

v1.0.3

  • No functional changes — version bump, confirmed all v1.0.2 fixes working correctly.

v1.0.2

  • Fixed: Proportional health scoring system — wide datasets (20+ columns) no longer score 0/100 due to accumulated medium/low issues
  • Fixed: Discrete numeric feature columns (bedrooms, bathrooms, stories, AGE, AUCTION YEAR) no longer falsely flagged as class-imbalanced. Only text columns, binary 0/1 columns, and the explicit target column are checked for imbalance.

v1.0.1

  • New: diagnose() now accepts pandas DataFrames directly — no .to_dict() needed
  • New: get_stats_df(dataset, target_col) — returns full column statistics as a Pandas DataFrame
  • New: report.to_df() — method on DiagnosisReport to get stats table as DataFrame
  • Fixed: Class imbalance detector no longer incorrectly flags continuous regression targets (e.g. house prices, salaries)

v1.0.0 — Initial Release

  • Eight detectors: missing values, outliers, skewness, class imbalance, data leakage, duplicate rows, constant columns, high cardinality
  • Feature engineering hints based on column name patterns (datetime, text, geo)
  • Model recommendation engine for regression, binary, and multiclass classification
  • Health score system (0–100) with severity-based point deductions
  • Full public API: diagnose, quick_scan, health_score, list_issues, get_suggestions, column_summary
  • 140 unit and integration tests
  • Zero runtime dependencies

License

This project is licensed under the MIT License — see the LICENSE file for the full text.

In plain English: you can use this code for anything, including commercial projects, as long as you keep the copyright notice with the author's name in any copy you distribute.

Copyright (c) 2026 Nilotpal Dhar


Author

Nilotpal Dhar

Built as a beginner Python project to learn how data science diagnostics work from first principles. Every algorithm in this library — IQR outlier detection, Pearson correlation, skewness calculation — is implemented from scratch in plain Python so that reading the source code teaches you how the mathematics actually works, not just how to call a function.

If this library helped you, please star the repository on GitHub. If you found a bug or have a feature idea, open an issue — all feedback is welcome.

