Dataset-agnostic data health report for tabular datasets

yreport

yreport is a lightweight, pipeline-ready data validation and deep diagnostics library for tabular ML datasets. It analyses data quality, detects potential issues, and provides honest, actionable diagnostics without making unsafe assumptions.

Unlike heavy EDA tools, yreport is designed to be pipeline-friendly, explainable, configurable, and production-aware.

Why yreport?

Most EDA libraries generate large HTML reports, make aggressive assumptions (e.g. one-hot encoding everything), and are hard to integrate into ML pipelines.

yreport focuses on decisions, not decoration.

It helps answer:

Is this dataset usable?
Which columns are problematic?
What should be fixed first?
Where should I be careful before modelling?
Are my datetime columns healthy and leakage-free?
Which categorical columns will drift in production?

Features

Weighted Data Health Score (0–100)
Automatic column type detection
Missing value diagnostics with confidence levels
High-cardinality categorical detection
Numeric skewness and outlier analysis
Honest categorical handling (no forced one-hot / ordinal)
User override support
Non-contradictory recommendations
JSON and Markdown export
scikit-learn Pipeline integration
Lightweight and fast

New in v0.1.4 (Deep Diagnostics):

Datetime column diagnostics (gaps, frequency, timezone, future dates)
Categorical drift readiness checks
Missing pattern clustering (MCAR / MAR / MNAR inference)
Temporal leakage detection

Installation

Install from PyPI

pip install yreport

Install from source (recommended)

git clone https://github.com/yogeshkardile/yreport.git
cd yreport
pip install -e .

Core Concept

yreport does not modify your data.

It:

Inspects datasets
Reports potential issues
Suggests actions, each with a confidence level

It does not:

Apply transformations
Guess encoding methods
Perform feature engineering

This makes it safe and transparent.

Quick Start

import pandas as pd
from yreport import data_health_report

df = pd.read_csv("data.csv")

report = data_health_report(df)
report.summary()

Example console output:

Data Health Score: 87.95/100
Rows: 891 | Columns: 12
Duplicate Rows: 0

Numeric Columns    : ['Age', 'Fare', 'SibSp']
Categorical Columns: ['Embarked', 'Sex']
DateTime Columns   : ['booking_date']

Missing Percentage:
  - Cabin: 77.1%
  - Age: 19.87%

Warnings:
  - high_missing: ['Cabin']
  - high_cardinality: ['Name', 'Ticket']

Datetime Diagnostics:
  - booking_date: freq=D, issues=['no issues detected']

Categorical Drift Readiness:
  - Embarked: low drift risk (confidence=LOW)

Missing Pattern Clusters:
  MCAR: ['Age'] — safe to impute or drop; MNAR: ['Cabin'] — consider indicator variable

Temporal Leakage Detection:
  - booking_date: no leakage detected (confidence=LOW)

What the Report Includes

1. Data Health Score

A weighted score (0–100) based on:

Missing values (weight: 0.5)
Duplicate rows (weight: 0.3)
High-cardinality features (weight: 0.2)
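
The weights above are listed without the formula that combines them. The sketch below shows one plausible reading, assuming each component is a penalty fraction between 0 and 1 and the score is 100 × (1 − weighted penalty). The function name `sketch_health_score` and the formula itself are illustrative assumptions, not yreport's actual implementation.

```python
# Hypothetical sketch: the exact formula used by yreport is not documented here.
# Assumption: each component is a penalty fraction in [0, 1], and the score is
# 100 * (1 - weighted sum of penalties) using the weights listed above.

WEIGHTS = {"missing": 0.5, "duplicates": 0.3, "high_cardinality": 0.2}


def sketch_health_score(missing_frac, duplicate_frac, high_card_frac):
    """Combine three penalty fractions into a 0-100 score (illustrative only)."""
    penalties = {
        "missing": missing_frac,
        "duplicates": duplicate_frac,
        "high_cardinality": high_card_frac,
    }
    for name, value in penalties.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    weighted_penalty = sum(WEIGHTS[key] * penalties[key] for key in WEIGHTS)
    return round(100.0 * (1.0 - weighted_penalty), 2)


# 10% missing cells, no duplicate rows, half of the columns high-cardinality.
score = sketch_health_score(0.1, 0.0, 0.5)
```

Under this reading a clean dataset scores 100, and missing values cost the most because they carry the largest weight.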

2. Column Type Detection

Automatically detects:

Numeric columns
Categorical columns
Datetime columns

3. Missing Value Diagnostics

Missing percentage per column
Drop or impute recommendations
Confidence levels: HIGH / MEDIUM

4. Categorical Diagnostics

Flags categorical columns that require encoding
Detects high-cardinality features (> 50 unique values)
Does not assume one-hot or ordinal encoding

5. Numeric Diagnostics

For numeric columns:

Skewness
Outlier percentage (IQR method)
Transform suggestions (log / robust)
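
To make the IQR method concrete, here is a minimal sketch in plain Python. It assumes the common Tukey rule (values outside Q1 − 1.5×IQR and Q3 + 1.5×IQR count as outliers); the multiplier and the quartile interpolation yreport uses are not stated in this README, so treat `k=1.5`, the "inclusive" quantile method, and the function name as assumptions.

```python
import statistics


def iqr_outlier_percentage(values, k=1.5):
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    clean = [v for v in values if v is not None]
    if len(clean) < 4:
        return 0.0  # too few points for meaningful quartiles
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = sum(1 for v in clean if v < lower or v > upper)
    return 100.0 * outliers / len(clean)


# One extreme value among ten observations.
fares = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
pct = iqr_outlier_percentage(fares)
```

A column with a high outlier percentage and strong skew is the typical case where a log or robust transform is worth considering.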

6. Datetime Diagnostics (v0.1.4)

For each datetime column:

Inferred frequency (daily, monthly, irregular, …)
Gap detection — flags gaps more than 5× the median gap
Timezone awareness check
Future-date contamination count
Monotonicity check
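
The gap rule above (flag gaps larger than 5× the median gap) and the future-date count can be sketched with the standard library alone. The function names and the list-of-datetimes input are assumptions made for illustration; yreport itself works on DataFrame columns.

```python
from datetime import datetime
from statistics import median


def find_large_gaps(timestamps, factor=5.0):
    """Return (start, end) pairs where the gap exceeds factor x the median gap."""
    ordered = sorted(timestamps)
    pairs = list(zip(ordered, ordered[1:]))
    if not pairs:
        return []
    gap_seconds = [(end - start).total_seconds() for start, end in pairs]
    threshold = factor * median(gap_seconds)
    return [pair for pair, gap in zip(pairs, gap_seconds) if gap > threshold]


def count_future_dates(timestamps, now):
    """Count timestamps later than the reference time `now`."""
    return sum(1 for ts in timestamps if ts > now)


# Daily data with a 15-day hole between Jan 5 and Jan 20.
days = [datetime(2024, 1, d) for d in (1, 2, 3, 4, 5, 20, 21)]
large_gaps = find_large_gaps(days)
```

Using the median rather than the mean keeps a single large hole from inflating the threshold it is compared against.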

7. Categorical Drift Readiness (v0.1.4)

Flags columns likely to behave differently at inference:

High cardinality (> 50 unique values) — unseen categories at inference time
Rare categories (< 1% frequency) — vulnerable to distribution shift
Near-constant columns (> 95% one category) — low signal
High-entropy distributions — encoding instability risk
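
The checks above can be approximated from category frequencies alone. The sketch below uses the thresholds quoted in this section (50 unique values, 1% frequency, 95% dominance). The README gives no cut-off for "high entropy", so the sketch only reports a normalized entropy value; the function name and return shape are hypothetical.

```python
import math
from collections import Counter


def drift_flags(values, max_unique=50, rare_threshold=0.01, dominant_threshold=0.95):
    """Summarise drift-related risks for one categorical column (illustrative)."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    freqs = {category: n / total for category, n in counts.items()}
    # Shannon entropy in bits, scaled by its maximum for this many categories.
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    max_entropy = math.log2(len(freqs)) if len(freqs) > 1 else 0.0
    return {
        "high_cardinality": len(freqs) > max_unique,
        # Assumes category labels are mutually comparable (e.g. all strings).
        "rare_categories": sorted(c for c, p in freqs.items() if p < rare_threshold),
        "near_constant": max(freqs.values()) > dominant_threshold,
        "normalized_entropy": entropy / max_entropy if max_entropy else 0.0,
    }


# 97% "S", 2.5% "C", 0.5% "Q": near-constant, with one rare category.
embarked = ["S"] * 970 + ["C"] * 25 + ["Q"] * 5
flags = drift_flags(embarked)
```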

8. Missing Pattern Clustering (v0.1.4)

Groups columns by their shared missing-value pattern and infers the likely mechanism:

Mechanism	Meaning	Suggested Action
MCAR	Missing completely at random — isolated pattern	Safe to impute or drop
MAR	Missing at random — shared pattern with other columns	Impute using related columns
MNAR	Missing not at random — structural reason suspected	Add indicator variable or apply domain fix
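
Grouping columns by their missing-value pattern can be sketched as follows. This assumes "shared pattern" means an identical missing mask across rows, which is the strictest possible reading; yreport's actual clustering and its MNAR inference are not described here. The labels "isolated" and "shared" map loosely onto the MCAR and MAR rows of the table, and MNAR is left out because it cannot be inferred from the mask alone.

```python
from collections import defaultdict


def group_by_missing_pattern(rows, columns):
    """Group columns whose missing-value masks are identical (illustrative)."""
    groups = defaultdict(list)
    for col in columns:
        mask = tuple(row.get(col) is None for row in rows)
        if any(mask):  # fully observed columns have no pattern to report
            groups[mask].append(col)
    return [
        {"columns": cols, "pattern": "shared" if len(cols) > 1 else "isolated"}
        for cols in groups.values()
    ]


rows = [
    {"a": 1, "b": None, "c": None, "d": 1},
    {"a": None, "b": 2, "c": 3, "d": 1},
    {"a": 1, "b": None, "c": None, "d": 1},
]
clusters = group_by_missing_pattern(rows, ["a", "b", "c", "d"])
```

Here `b` and `c` are always missing together, so they land in one group, while `a` forms its own group and `d` is skipped because it has no missing values.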

9. Temporal Leakage Detection (v0.1.4)

Scans datetime columns for train/test leakage risks:

Future-dated rows (post-prediction information in features)
Target-period overlap (values in the last 10% of the date range)
High index correlation (> 0.95) — column encodes row ordering
Near-duplicate datetime columns (correlation > 0.98)
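
As an illustration of the index-correlation check, the sketch below computes the Pearson correlation between row position and elapsed time, then applies the 0.95 threshold quoted above. Taking the absolute value (so reverse-sorted data is also flagged) is an assumption, as are the function names.

```python
import math
from datetime import datetime, timedelta


def index_correlation(timestamps):
    """Pearson correlation between row position and elapsed time."""
    n = len(timestamps)
    if n < 2:
        return 0.0
    xs = list(range(n))
    ys = [(ts - timestamps[0]).total_seconds() for ts in timestamps]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined
    return cov / math.sqrt(var_x * var_y)


def encodes_row_order(timestamps, threshold=0.95):
    """True when the column tracks row order closely (assumed abs() comparison)."""
    return abs(index_correlation(timestamps)) > threshold


start = datetime(2024, 1, 1)
sorted_days = [start + timedelta(days=i) for i in range(30)]
shuffled_days = [start + timedelta(days=d) for d in (0, 9, 1, 8, 2, 7, 3, 6, 4, 5)]
```

A sorted datetime column is flagged because a random train/test split over it would mix past and future rows; the interleaved example stays well below the threshold.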

User Overrides

Automatic detection is never perfect. yreport allows explicit user control.

data_health_report(
    df,
    ignore_cols=[...],
    drop_cols=[...],
    categorical_cols=[...],
    numeric_cols=[...]
)

Override	Purpose
`ignore_cols`	Completely ignore columns from all analysis
`drop_cols`	Force drop columns in recommendations
`categorical_cols`	Force categorical treatment
`numeric_cols`	Force numeric treatment

Rules:

User intent always overrides automation
A column belongs to only one semantic type
Ignored or dropped columns are excluded everywhere
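
The three rules can be read as a small resolution step applied on top of automatic detection. The sketch below is a hypothetical illustration of that precedence, not yreport's internal code; in particular, raising an error when a column is forced into two types is an assumption about how a conflict might be handled.

```python
def resolve_column_types(detected, ignore_cols=(), drop_cols=(),
                         categorical_cols=(), numeric_cols=()):
    """Apply user overrides on top of detected types (illustrative only)."""
    conflict = set(categorical_cols) & set(numeric_cols)
    if conflict:
        # A column belongs to only one semantic type.
        raise ValueError(f"columns forced into two types: {sorted(conflict)}")
    excluded = set(ignore_cols) | set(drop_cols)
    forced = {col: "categorical" for col in categorical_cols}
    forced.update({col: "numeric" for col in numeric_cols})
    resolved = {}
    for col, detected_type in detected.items():
        if col in excluded:
            continue  # ignored or dropped columns are excluded everywhere
        resolved[col] = forced.get(col, detected_type)  # user intent wins
    return resolved


detected = {"Pclass": "numeric", "Name": "categorical",
            "Age": "numeric", "Cabin": "categorical"}
resolved = resolve_column_types(
    detected, ignore_cols=["Name"], drop_cols=["Cabin"], categorical_cols=["Pclass"]
)
```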

Exporting Reports

JSON export (machine-readable):

report.to_json("report.json")
# or get the dict directly
data = report.to_json()

The JSON output includes all deep diagnostic fields:

{
  "health_score": 87.95,
  "shape": {"rows": 891, "columns": 12},
  "numeric_diagnostics": { ... },
  "datetime_diagnostics": { ... },
  "drift_readiness": { ... },
  "missing_patterns": { ... },
  "leakage_report": { ... }
}
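
Because the export is plain JSON, it can drive an automated check, for example in CI. The sketch below parses a trimmed copy of the structure shown above and applies a threshold; the `passes_quality_gate` helper and the cut-off of 80 are hypothetical choices, not part of yreport.

```python
import json

# A trimmed stand-in for the contents of report.json, using only fields shown above.
raw = '{"health_score": 87.95, "shape": {"rows": 891, "columns": 12}}'
report_data = json.loads(raw)


def passes_quality_gate(report, min_score=80.0):
    """Return True when the dataset is non-empty and scores at least min_score."""
    has_rows = report.get("shape", {}).get("rows", 0) > 0
    return has_rows and report.get("health_score", 0.0) >= min_score


ok = passes_quality_gate(report_data)
```

In a real pipeline the same function would read the file written by `report.to_json("report.json")` and fail the job when it returns `False`.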

Markdown export (human-readable):

report.to_markdown("report.md")

The Markdown report covers every section, deep diagnostics included, and renders the missing pattern clusters as a formatted table.

scikit-learn Pipeline Integration

yreport provides a no-op sklearn inspector that lets you observe data during training without interfering with the model or the pipeline.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yreport import YReportInspector

pipe = Pipeline([
    ("inspect", YReportInspector(
        categorical_cols=["Pclass"],
        ignore_cols=["Name"]
    )),
    ("model", LogisticRegression(max_iter=1000))
])

# X_train and y_train are your own training features and labels
pipe.fit(X_train, y_train)

# Access full report after fit
pipe.named_steps["inspect"].report_.summary()
pipe.named_steps["inspect"].report_.to_markdown("train_report.md")

# Access deep diagnostics directly
drift = pipe.named_steps["inspect"].report_.drift_readiness
leakage = pipe.named_steps["inspect"].report_.leakage_report

Model trains normally
Data remains unchanged
Full report (including deep diagnostics) is available after fit()

Report Fields Reference

Field	Type	Description
`health_score`	float	Weighted score 0–100
`shape`	dict	`{rows, columns}`
`column_types`	dict	`{numeric, categorical, datetime}` lists
`missing_percentage`	dict	Per-column missing %
`duplicate_rows`	int	Count of duplicate rows
`warnings`	dict	`high_missing`, `high_cardinality` lists
`recommendations`	dict	`encoding` and `missing` action dicts
`numeric_diagnostics`	dict	Skewness, outliers, transform advice
`datetime_diagnostics`	dict	Frequency, gaps, timezone, future dates (v0.1.4)
`drift_readiness`	dict	Per-column drift risk flags (v0.1.4)
`missing_patterns`	dict	MCAR/MAR/MNAR cluster analysis (v0.1.4)
`leakage_report`	dict	Temporal leakage risks per datetime column (v0.1.4)

Testing

Run tests from the project root:

pytest

The test suite includes:

sklearn pipeline compatibility test
Core API regression protection
Deep diagnostics coverage

Design Philosophy

Principle	Approach
Correctness > Automation	Reports issues; does not silently fix them
Transparency > Guessing	All recommendations include confidence levels
Diagnostics > Decoration	Output is structured and machine-readable
User intent > Heuristics	Overrides always take precedence

yreport will never silently apply transformations.

What yreport is NOT

An AutoML tool
A feature engineering pipeline
A visualization-heavy EDA library
An encoding decision engine

This is intentional. yreport is a diagnostics layer, not a transformation layer.

License

This project is licensed under the terms specified in the LICENSE file.

Release history

0.1.4 (this version): Jun 21, 2026
0.1.3: Jan 1, 2026
0.1.1: Dec 19, 2025
0.1.0: Dec 19, 2025

