
Dataset-agnostic data health report for tabular datasets


yreport

Requires Python 3.10+. Test coverage: 94%.

yreport is a lightweight, pipeline-ready data validation and deep diagnostics library for tabular ML datasets. It analyses data quality, detects potential issues, and provides honest, actionable diagnostics without making unsafe assumptions.

Unlike heavy EDA tools, yreport is designed to be pipeline-friendly, explainable, configurable, and production-aware.

Why yreport?

Most EDA libraries generate large HTML reports, make aggressive assumptions (e.g. one-hot encoding every categorical column), and are hard to integrate into ML pipelines.

yreport focuses on decisions, not decoration.

It helps answer:

  • Is this dataset usable?
  • Which columns are problematic?
  • What should be fixed first?
  • Where should I be careful before modelling?
  • Are my datetime columns healthy and leakage-free?
  • Which categorical columns will drift in production?

Features

  • Weighted Data Health Score (0–100)
  • Automatic column type detection
  • Missing value diagnostics with confidence levels
  • High-cardinality categorical detection
  • Numeric skewness and outlier analysis
  • Honest categorical handling (no forced one-hot / ordinal)
  • User override support
  • Non-contradictory recommendations
  • JSON and Markdown export
  • scikit-learn Pipeline integration
  • Lightweight and fast
  • v0.1.4 — Deep Diagnostics:
    • Datetime column diagnostics (gaps, frequency, timezone, future dates)
    • Categorical drift readiness checks
    • Missing pattern clustering (MCAR / MAR / MNAR inference)
    • Temporal leakage detection

Installation

Install from PyPI

pip install yreport

Install from source (recommended)

git clone https://github.com/yogeshkardile/yreport.git
cd yreport
pip install -e .

Core Concept

yreport does not modify your data.

It:

  • Inspects datasets
  • Reports potential issues
  • Suggests actions, each with a confidence level

It does not:

  • Apply transformations
  • Guess encoding methods
  • Perform feature engineering

Because the input is never mutated, the report is safe to run at any stage, and every suggestion can be reviewed before you act on it.

Quick Start

import pandas as pd
from yreport import data_health_report

df = pd.read_csv("data.csv")

report = data_health_report(df)
report.summary()

Example console output:

Data Health Score: 87.95/100
Rows: 891 | Columns: 12
Duplicate Rows: 0

Numeric Columns    : ['Age', 'Fare', 'SibSp']
Categorical Columns: ['Embarked', 'Sex']
DateTime Columns   : ['booking_date']

Missing Percentage:
  - Cabin: 77.1%
  - Age: 19.87%

Warnings:
  - high_missing: ['Cabin']
  - high_cardinality: ['Name', 'Ticket']

Datetime Diagnostics:
  - booking_date: freq=D, issues=['no issues detected']

Categorical Drift Readiness:
  - Embarked: low drift risk (confidence=LOW)

Missing Pattern Clusters:
  MCAR: ['Age'] — safe to impute or drop; MNAR: ['Cabin'] — consider indicator variable

Temporal Leakage Detection:
  - booking_date: no leakage detected (confidence=LOW)

What the Report Includes

1. Data Health Score

A weighted score (0–100) based on:

  • Missing values (weight: 0.5)
  • Duplicate rows (weight: 0.3)
  • High-cardinality features (weight: 0.2)
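
The exact formula behind the score is not given here. As a rough illustration only, the sketch below assumes that each component is expressed as a penalty between 0 and 1 and that the score is 100 minus the weighted sum of those penalties. The function name `weighted_health_score` and the penalty definitions are assumptions made for this example, not yreport's implementation.

```python
# Hypothetical sketch: NOT yreport's actual scoring formula.
# Only the three weights come from the list above.
WEIGHTS = {"missing": 0.5, "duplicates": 0.3, "high_cardinality": 0.2}


def weighted_health_score(missing_frac, duplicate_frac, high_card_frac):
    """Combine three penalty fractions (each in [0, 1]) into a 0-100 score."""
    penalties = {
        "missing": missing_frac,
        "duplicates": duplicate_frac,
        "high_cardinality": high_card_frac,
    }
    for name, value in penalties.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    weighted_penalty = sum(WEIGHTS[key] * penalties[key] for key in WEIGHTS)
    return round(100.0 * (1.0 - weighted_penalty), 2)


# Assumed inputs: 8% of cells missing, no duplicate rows,
# a quarter of the columns flagged as high-cardinality.
score = weighted_health_score(missing_frac=0.08, duplicate_frac=0.0, high_card_frac=0.25)
print(score)
```

Under these assumptions a dataset with no problems scores 100, and the missing-value penalty moves the score more than the other two because it carries the largest weight.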

2. Column Type Detection

Automatically detects:

  • Numeric columns
  • Categorical columns
  • Datetime columns

3. Missing Value Diagnostics

  • Missing percentage per column
  • Drop or impute recommendations
  • Confidence levels: HIGH / MEDIUM

4. Categorical Diagnostics

  • Flags categorical columns that require encoding
  • Detects high-cardinality features (> 50 unique values)
  • Does not assume one-hot or ordinal encoding

5. Numeric Diagnostics

For numeric columns:

  • Skewness
  • Outlier percentage (IQR method)
  • Transform suggestions (log / robust)
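
For readers unfamiliar with the IQR method, here is a dependency-free sketch of how an outlier percentage can be computed. It assumes the conventional 1.5 × IQR fences and linear interpolation for the quartiles; the source does not state which fence multiplier or quantile method yreport uses, and `iqr_outlier_percentage` is a name chosen for this example.

```python
import statistics


def iqr_outlier_percentage(values, k=1.5):
    """Return the share (in %) of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = [v for v in values if v is not None]  # ignore missing entries
    if len(clean) < 2:
        return 0.0
    # method="inclusive" interpolates linearly between the observed points.
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = sum(1 for v in clean if v < lower or v > upper)
    return 100.0 * outliers / len(clean)


# Illustrative values only; None marks a missing entry.
fares = [7.25, 7.9, 8.05, 8.5, 9.0, 10.5, 13.0, 26.0, 512.3, None]
pct = iqr_outlier_percentage(fares)
print(f"{pct:.1f}% of non-missing values are outliers")
```

In this sample, two of the nine non-missing values fall above the upper fence, which is the kind of right-skewed column where a log or robust transform would typically be suggested.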

6. Datetime Diagnostics (v0.1.4)

For each datetime column:

  • Inferred frequency (daily, monthly, irregular, …)
  • Gap detection — flags gaps more than 5× the median gap
  • Timezone awareness check
  • Future-date contamination count
  • Monotonicity check
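
The checks above can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not yreport's code: the function name `datetime_checks`, its return keys, and the use of naive (timezone-unaware) datetimes are all choices made for this example. Only the "5× the median gap" rule comes from the list above.

```python
from datetime import datetime, timedelta
from statistics import median


def datetime_checks(timestamps, now, gap_factor=5.0):
    """Return large gaps, future-date count, and monotonicity for one column."""
    ordered = sorted(timestamps)
    pairs = list(zip(ordered, ordered[1:]))
    gap_seconds = [(b - a).total_seconds() for a, b in pairs]
    large_gaps = []
    if gap_seconds:
        # A gap is "large" when it exceeds gap_factor times the median gap.
        threshold = gap_factor * median(gap_seconds)
        large_gaps = [pair for pair, gap in zip(pairs, gap_seconds) if gap > threshold]
    return {
        "large_gaps": large_gaps,
        "future_count": sum(1 for t in timestamps if t > now),
        # Monotonicity is checked on the original row order, not the sorted copy.
        "is_monotonic": all(a <= b for a, b in zip(timestamps, timestamps[1:])),
    }


start = datetime(2024, 1, 1)
dates = [start + timedelta(days=d) for d in (0, 1, 2, 3, 10, 11, 12)]
result = datetime_checks(dates, now=datetime(2024, 1, 11, 12))
print(result["large_gaps"], result["future_count"], result["is_monotonic"])
```

Passing `now` explicitly keeps the future-date count reproducible, which matters when the same check runs in tests and in scheduled jobs.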

7. Categorical Drift Readiness (v0.1.4)

Flags columns likely to behave differently at inference:

  • High cardinality (> 50 unique values) — unseen categories at inference time
  • Rare categories (< 1% frequency) — vulnerable to distribution shift
  • Near-constant columns (> 95% one category) — low signal
  • High-entropy distributions — encoding instability risk
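
To make the thresholds above concrete, here is a small sketch that applies them to a single categorical column. It is an illustration, not yreport's implementation: `drift_flags` and its return keys are hypothetical, and because no entropy threshold is stated above, the sketch only reports a normalized entropy (0 = one category, 1 = uniform) rather than flagging on it.

```python
import math
from collections import Counter


def drift_flags(values, max_unique=50, rare_frac=0.01, dominant_frac=0.95):
    """Flag drift risks for one categorical column (None marks a missing value)."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {"high_cardinality": False, "rare_categories": [],
                "near_constant": False, "normalized_entropy": 0.0}
    shares = {cat: n / total for cat, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in shares.values())
    # Divide by the maximum possible entropy so the result lies in [0, 1].
    max_entropy = math.log2(len(shares)) if len(shares) > 1 else 1.0
    return {
        "high_cardinality": len(shares) > max_unique,
        "rare_categories": sorted(c for c, p in shares.items() if p < rare_frac),
        "near_constant": max(shares.values()) > dominant_frac,
        "normalized_entropy": entropy / max_entropy,
    }


# Illustrative column: one dominant category and one rare category.
embarked = ["S"] * 960 + ["C"] * 35 + ["Q"] * 5
flags = drift_flags(embarked)
print(flags)
```

Here the column has only three categories, so it is not high-cardinality, yet it is near-constant (96% "S") and contains a rare category ("Q" at 0.5%).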

8. Missing Pattern Clustering (v0.1.4)

Groups columns by their shared missing-value pattern and infers the likely mechanism:

Mechanism | Meaning | Suggested Action
--- | --- | ---
MCAR | Missing completely at random (isolated pattern) | Safe to impute or drop
MAR | Missing at random (shared pattern with other columns) | Impute using related columns
MNAR | Missing not at random (structural reason suspected) | Add indicator variable or apply domain fix
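
The grouping step can be illustrated as follows. This sketch covers only the first half of the idea, clustering columns whose values are missing in exactly the same rows; it does not attempt to infer MCAR, MAR, or MNAR, since the inference rules are not described here. The function name `group_by_missing_pattern` and the sample columns are hypothetical.

```python
from collections import defaultdict


def group_by_missing_pattern(table):
    """Group column names whose values are missing in exactly the same rows.

    `table` maps column name -> list of values; None marks a missing value.
    """
    groups = defaultdict(list)
    for name, column in table.items():
        mask = tuple(value is None for value in column)
        if any(mask):  # fully observed columns have no pattern to cluster
            groups[mask].append(name)
    return sorted(sorted(cols) for cols in groups.values())


table = {
    "age": [22, None, 35, None, 54],
    "cabin": ["C85", None, None, "E46", None],
    "deck": ["C", None, None, "E", None],
    "fare": [7.25, 71.3, 8.05, 53.1, 8.46],
}
clusters = group_by_missing_pattern(table)
print(clusters)
```

A cluster with a single column corresponds to an isolated pattern, while a cluster with several columns corresponds to a shared pattern, which is where imputing from related columns becomes plausible.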

9. Temporal Leakage Detection (v0.1.4)

Scans datetime columns for train/test leakage risks:

  • Future-dated rows (post-prediction information in features)
  • Target-period overlap (values in the last 10% of the date range)
  • High index correlation (> 0.95) — column encodes row ordering
  • Near-duplicate datetime columns (correlation > 0.98)
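
As an illustration of the index-correlation check, the sketch below computes the Pearson correlation between a datetime column and its row position and compares it with the 0.95 threshold listed above. The helper names (`pearson`, `encodes_row_order`) are made up for this example, and the way yreport actually computes the correlation may differ.

```python
import math
from datetime import datetime, timedelta


def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def encodes_row_order(timestamps, threshold=0.95):
    """True when the column is almost perfectly correlated with row position."""
    origin = timestamps[0]
    seconds = [(t - origin).total_seconds() for t in timestamps]
    positions = list(range(len(timestamps)))
    return abs(pearson(positions, seconds)) > threshold


start = datetime(2024, 1, 1)
ordered = [start + timedelta(hours=h) for h in range(48)]
interleaved = ordered[::2] + ordered[1::2]  # same values, different row order

print(encodes_row_order(ordered), encodes_row_order(interleaved))
```

A column that passes this test effectively doubles as a row counter, so a model can learn the row ordering from it instead of a genuine temporal signal.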

User Overrides

Automatic detection is never perfect. yreport allows explicit user control.

data_health_report(
    df,
    ignore_cols=[...],
    drop_cols=[...],
    categorical_cols=[...],
    numeric_cols=[...]
)

Override | Purpose
--- | ---
ignore_cols | Completely ignore columns from all analysis
drop_cols | Force drop columns in recommendations
categorical_cols | Force categorical treatment
numeric_cols | Force numeric treatment

Rules:

  • User intent always overrides automation
  • A column belongs to only one semantic type
  • Ignored or dropped columns are excluded everywhere
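
The rules above can be read as a small resolution step that runs after automatic detection. The following sketch shows one way such a step could work; `resolve_column_types` is a hypothetical helper, not part of yreport's API, and raising an error when a column is forced into two types is an assumption about how conflicts might be handled.

```python
def resolve_column_types(detected, ignore_cols=(), drop_cols=(),
                         categorical_cols=(), numeric_cols=()):
    """Apply user overrides on top of auto-detected column types.

    `detected` maps column name -> "numeric" | "categorical" | "datetime".
    """
    conflicting = set(categorical_cols) & set(numeric_cols)
    if conflicting:
        # A column belongs to only one semantic type.
        raise ValueError(f"Columns forced to two types: {sorted(conflicting)}")
    excluded = set(ignore_cols) | set(drop_cols)
    resolved = {}
    for name, auto_type in detected.items():
        if name in excluded:
            continue  # ignored or dropped columns are excluded everywhere
        if name in categorical_cols:
            resolved[name] = "categorical"  # user intent overrides detection
        elif name in numeric_cols:
            resolved[name] = "numeric"
        else:
            resolved[name] = auto_type
    return resolved


detected = {"Pclass": "numeric", "Name": "categorical",
            "Age": "numeric", "Ticket": "categorical"}
resolved = resolve_column_types(detected, ignore_cols=["Name"],
                                drop_cols=["Ticket"], categorical_cols=["Pclass"])
print(resolved)
```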

Exporting Reports

JSON export (machine-readable):

report.to_json("report.json")
# or get the dict directly
data = report.to_json()

The JSON output includes all deep diagnostic fields:

{
  "health_score": 87.95,
  "shape": {"rows": 891, "columns": 12},
  "numeric_diagnostics": { ... },
  "datetime_diagnostics": { ... },
  "drift_readiness": { ... },
  "missing_patterns": { ... },
  "leakage_report": { ... }
}
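
Because the export is plain JSON, it can be consumed by any downstream check. The sketch below shows one possible quality gate built on the `health_score` field; the helper `check_health` and the threshold values are hypothetical and not part of yreport.

```python
import json


def check_health(report, min_score=80.0):
    """Return a list of human-readable problems; an empty list means the gate passes."""
    problems = []
    score = report.get("health_score")
    if score is None:
        problems.append("health_score missing from report")
    elif score < min_score:
        problems.append(f"health_score {score} is below {min_score}")
    return problems


# In practice the dict would be loaded from the exported file, e.g.
#   with open("report.json") as fh: report = json.load(fh)
report = json.loads('{"health_score": 87.95, "shape": {"rows": 891, "columns": 12}}')
problems = check_health(report, min_score=90.0)
print(problems)
```

In a CI job, a non-empty `problems` list could be turned into a non-zero exit code so that a degraded dataset blocks the pipeline.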

Markdown export (human-readable):

report.to_markdown("report.md")

The Markdown report covers every section, deep diagnostics included, and renders the missing pattern clusters as a formatted table.

scikit-learn Pipeline Integration

yreport provides a no-op sklearn inspector that lets you observe data during training without interfering with the model or the pipeline.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yreport import YReportInspector

pipe = Pipeline([
    ("inspect", YReportInspector(
        categorical_cols=["Pclass"],
        ignore_cols=["Name"]
    )),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)

# Access full report after fit
pipe.named_steps["inspect"].report_.summary()
pipe.named_steps["inspect"].report_.to_markdown("train_report.md")

# Access deep diagnostics directly
drift = pipe.named_steps["inspect"].report_.drift_readiness
leakage = pipe.named_steps["inspect"].report_.leakage_report

With the inspector in place:

  • Model trains normally
  • Data remains unchanged
  • Full report (including deep diagnostics) is available after fit()
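
The pass-through idea behind such an inspector can be sketched without any dependencies. The class below is a hypothetical stand-in, not `YReportInspector` itself: it records a summary during `fit` and returns the input untouched from `transform`. A real scikit-learn step would normally subclass `sklearn.base.BaseEstimator` and `TransformerMixin`; that is left out here to keep the example self-contained.

```python
class PassThroughInspector:
    """Hypothetical no-op pipeline step: observes data, never changes it."""

    def __init__(self, inspect_fn):
        self.inspect_fn = inspect_fn

    def fit(self, X, y=None):
        # The trailing underscore follows the scikit-learn convention
        # for attributes that only exist after fitting.
        self.report_ = self.inspect_fn(X)
        return self

    def transform(self, X):
        return X  # the very same object, unmodified

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def count_rows_and_columns(rows):
    """Tiny placeholder for a real diagnostic function."""
    return {"rows": len(rows), "columns": len(rows[0]) if rows else 0}


X_train = [[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]]
step = PassThroughInspector(count_rows_and_columns)
X_out = step.fit_transform(X_train)
print(step.report_, X_out is X_train)
```

Returning the same object from `transform` is what keeps the step invisible to the downstream model.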

Report Fields Reference

Field | Type | Description
--- | --- | ---
health_score | float | Weighted score 0–100
shape | dict | {rows, columns}
column_types | dict | {numeric, categorical, datetime} lists
missing_percentage | dict | Per-column missing %
duplicate_rows | int | Count of duplicate rows
warnings | dict | high_missing, high_cardinality lists
recommendations | dict | encoding and missing action dicts
numeric_diagnostics | dict | Skewness, outliers, transform advice
datetime_diagnostics | dict | Frequency, gaps, timezone, future dates (v0.1.4)
drift_readiness | dict | Per-column drift risk flags (v0.1.4)
missing_patterns | dict | MCAR/MAR/MNAR cluster analysis (v0.1.4)
leakage_report | dict | Temporal leakage risks per datetime column (v0.1.4)

Testing

Run tests from the project root:

pytest

The test suite includes:

  • sklearn pipeline compatibility test
  • Core API regression protection
  • Deep diagnostics coverage

Design Philosophy

Principle | Approach
--- | ---
Correctness > Automation | Reports issues; does not silently fix them
Transparency > Guessing | All recommendations include confidence levels
Diagnostics > Decoration | Output is structured and machine-readable
User intent > Heuristics | Overrides always take precedence

yreport will never silently apply transformations.

What yreport is NOT

  • An AutoML tool
  • A feature engineering pipeline
  • A visualization-heavy EDA library
  • An encoding decision engine

This is intentional. yreport is a diagnostics layer, not a transformation layer.

License

This project is licensed under the terms specified in the LICENSE file.

