Dataset-agnostic data health report for tabular datasets

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

📊 yreport

yreport is a lightweight, dataset-agnostic data health reporting library for tabular datasets.
It analyzes data quality, detects potential issues, and provides honest, actionable diagnostics without making unsafe assumptions.

Unlike heavy EDA tools, yreport is designed to be:

Pipeline-friendly
Explainable
Configurable
Production-aware

🚀 Why yreport?

Most EDA libraries:

Generate large HTML reports
Make aggressive assumptions (e.g. one-hot everything)
Are hard to integrate into ML pipelines

yreport focuses on decisions, not decoration.

It helps answer:

Is this dataset usable?
Which columns are problematic?
What should be fixed first?
Where should I be careful before modeling?

✨ Features

Weighted Data Health Score (0–100)
Automatic column type detection
Missing value diagnostics with confidence levels
High-cardinality categorical detection
Numeric skewness and outlier analysis
Honest categorical handling (no forced one-hot / ordinal)
User override support
Non-contradictory recommendations
JSON and Markdown export
scikit-learn Pipeline integration
Lightweight and fast

📦 Installation

Install from source (recommended)

git clone https://github.com/your-username/yreport.git
cd yreport
pip install -e .

🧠 Core Concept

yreport does not modify your data.

It:

Inspects datasets
Reports potential issues
Suggests actions with confidence

It does not:

Apply transformations
Guess encoding methods
Perform feature engineering

This makes it safe and transparent.

Basic usage:

import pandas as pd
from yreport import data_health_report

df = pd.read_csv("data.csv")

report = data_health_report(df)
report.summary()

Example console output:

Data Health Score: 87.95/100
Rows: 891 | No_Columns: 12

Warnings:

high_missing: ['Cabin']
high_cardinality: ['Name', 'Ticket']

📋 What the Report Includes

1️⃣ Data Health Score

A weighted score based on:

Missing values
Duplicate rows
High-cardinality features
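
As a rough illustration of how a weighted score over these three factors could be combined, here is a minimal sketch. The `health_score` helper, its inputs, and the weights are assumptions made for this example; yreport's actual formula is not documented here.

```python
def health_score(missing_frac, duplicate_frac, high_card_frac,
                 weights=(0.5, 0.3, 0.2)):
    """Combine three penalty fractions (each in [0, 1]) into a 0-100 score.

    The default weights are hypothetical and must sum to 1.
    """
    fractions = (missing_frac, duplicate_frac, high_card_frac)
    if any(not 0.0 <= f <= 1.0 for f in fractions):
        raise ValueError("each fraction must be between 0 and 1")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Weighted penalty: 0 means a clean dataset, 1 means the worst case
    penalty = sum(w * f for w, f in zip(weights, fractions))
    return round(100.0 * (1.0 - penalty), 2)


# 10% missing cells, 2% duplicate rows, 2 of 12 columns high-cardinality
score = health_score(0.10, 0.02, 2 / 12)
```

Keeping the weights as an explicit parameter makes it easy to see which factor dominates the score.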

2️⃣ Column Type Detection

Automatically detects:

Numeric columns
Categorical columns
Datetime columns
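
The idea behind type detection can be sketched with plain Python values. This is only an illustration under simple assumptions (a column is a list, `None` marks a missing value); `detect_column_type` is a hypothetical helper, not yreport's detection logic.

```python
from datetime import date
from numbers import Number


def detect_column_type(values):
    """Classify a column as 'numeric', 'categorical', or 'datetime'."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"  # nothing to infer from
    # datetime.datetime is a subclass of date, so this covers both
    if all(isinstance(v, date) for v in present):
        return "datetime"
    # bool counts as a Number in Python, so exclude it explicitly
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "numeric"
    return "categorical"


age_type = detect_column_type([22, None, 26.5])
sex_type = detect_column_type(["male", "female", None])
```

Heuristics like this can misfire (for example, a numeric class code such as `Pclass`), which is why explicit overrides matter.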

3️⃣ Missing Value Diagnostics

Missing percentage per column
Drop or impute recommendations
Confidence levels: HIGH / MEDIUM
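
A per-column diagnostic of this kind might look like the sketch below. The `missing_diagnostics` helper and both thresholds are assumptions for illustration; the rules yreport actually uses to choose between drop and impute, or between HIGH and MEDIUM, are not stated here.

```python
def missing_diagnostics(columns, drop_above=50.0, high_conf_margin=20.0):
    """Return missing percentage, suggested action, and confidence per column.

    `columns` maps column name -> list of values, where None means missing.
    `drop_above` and `high_conf_margin` are hypothetical thresholds.
    """
    result = {}
    for name, values in columns.items():
        if not values:
            continue  # empty column: the percentage is undefined
        pct = 100.0 * sum(v is None for v in values) / len(values)
        if pct == 0.0:
            continue  # complete columns need no recommendation
        action = "drop" if pct > drop_above else "impute"
        # Far from the threshold -> HIGH confidence, close to it -> MEDIUM
        far = abs(pct - drop_above) >= high_conf_margin
        result[name] = {
            "missing_pct": round(pct, 2),
            "action": action,
            "confidence": "HIGH" if far else "MEDIUM",
        }
    return result


data = {
    "Age": [22, None, 26, 35, 54, 2, 27, 14, 4, 58],
    "Cabin": [None, "C85", None, None, None, None, None, None, None, "G6"],
    "Deck": [None, "C", None, "C", None, "E", None, "G", "A", "B"],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.07, 11.13, 30.07],
}
report = missing_diagnostics(data)
```

Here a column close to the drop threshold gets MEDIUM confidence, signalling that the recommendation deserves a manual look.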

4️⃣ Categorical Diagnostics

Flags categorical columns that require encoding
Detects high-cardinality features
Does not assume one-hot or ordinal encoding
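
High-cardinality detection usually comes down to counting distinct values. The sketch below shows one possible rule; `is_high_cardinality` and both thresholds are hypothetical, chosen only to make the idea concrete.

```python
def is_high_cardinality(values, max_unique=20, max_unique_ratio=0.5):
    """Flag a categorical column with too many distinct values.

    Both thresholds are hypothetical; None is treated as missing.
    """
    present = [v for v in values if v is not None]
    if not present:
        return False
    n_unique = len(set(present))
    # Flag on either an absolute count or a share of distinct values
    return n_unique > max_unique or n_unique / len(present) > max_unique_ratio


sex = ["male", "female"] * 50
names = [f"Passenger {i}" for i in range(100)]
flags = {"Sex": is_high_cardinality(sex), "Name": is_high_cardinality(names)}
```

Note that the function only flags the column; it does not pick an encoding, in line with the list above.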

5️⃣ Numeric Diagnostics

For numeric columns:

Skewness
Outlier percentage (IQR method)
Transform suggestions (log / robust)
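
To make the skewness and IQR checks concrete, here is a minimal standard-library sketch. The `numeric_diagnostics` helper, the skewness threshold, and the rule that maps results to a "log" or "robust" hint are assumptions for illustration, not yreport's implementation.

```python
import statistics


def numeric_diagnostics(values, skew_threshold=1.0):
    """Return skewness, IQR outlier percentage, and a transform hint.

    The threshold and the hint rules are hypothetical.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Population skewness: third central moment divided by std cubed
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / (n * std ** 3)

    # IQR rule: values beyond 1.5 * IQR from the quartiles count as outliers
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_pct = 100.0 * sum(v < low or v > high for v in values) / n

    if skew > skew_threshold and min(values) > 0:
        hint = "log"      # strong right skew on strictly positive data
    elif outlier_pct > 0:
        hint = "robust"   # outliers present: prefer robust scaling
    else:
        hint = None
    return {"skewness": round(skew, 3),
            "outlier_pct": round(outlier_pct, 2),
            "transform": hint}


fares = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
diag = numeric_diagnostics(fares)
```

With this sample, the single value 100 lies above the upper IQR fence, so one value in ten is counted as an outlier.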

🧩 User Overrides

Automatic detection is never perfect. yreport allows explicit user control.

Supported Overrides:

data_health_report(
    df,
    ignore_cols=[...],
    drop_cols=[...],
    categorical_cols=[...],
    numerical_cols=[...]
)

Meaning of Overrides:

Override	Purpose
`ignore_cols`	Completely ignore columns
`drop_cols`	Force drop columns
`categorical_cols`	Force categorical treatment
`numerical_cols`	Force numeric treatment

Rules:

User intent always overrides automation
A column belongs to only one semantic type
Ignored or dropped columns are excluded everywhere
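
The three rules above can be sketched as a small resolution step. `resolve_column_roles` is a hypothetical helper written for this example; only the parameter names mirror the overrides listed earlier.

```python
def resolve_column_roles(detected, ignore_cols=(), drop_cols=(),
                         categorical_cols=(), numerical_cols=()):
    """Apply user overrides on top of auto-detected column types.

    `detected` maps column name -> auto-detected type, for example
    "numeric", "categorical", or "datetime".
    """
    # Rule: a column belongs to only one semantic type
    overlap = set(categorical_cols) & set(numerical_cols)
    if overlap:
        raise ValueError(f"columns forced to two types: {sorted(overlap)}")

    # Rule: ignored or dropped columns are excluded everywhere
    excluded = set(ignore_cols) | set(drop_cols)
    roles = {}
    for col, auto_type in detected.items():
        if col in excluded:
            continue
        # Rule: user intent always overrides automation
        if col in categorical_cols:
            roles[col] = "categorical"
        elif col in numerical_cols:
            roles[col] = "numeric"
        else:
            roles[col] = auto_type
    return roles


detected = {"Pclass": "numeric", "Name": "categorical",
            "Age": "numeric", "Cabin": "categorical"}
roles = resolve_column_roles(detected, ignore_cols=["Name"],
                             drop_cols=["Cabin"], categorical_cols=["Pclass"])
```

Rejecting a column that is forced to both types keeps the result unambiguous instead of silently picking one.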

📤 Exporting Reports

JSON Export (machine-readable):

report.to_json("report.json")   # write the report to a file
data = report.to_json()         # or get the JSON data directly

Markdown Export (human-readable):

report.to_markdown("report.md")

🤖 scikit-learn Pipeline Integration

yreport provides a no-op sklearn inspector.

Why?

Observe data during training
Do not interfere with models
Keep pipelines clean

Example

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yreport import YReportInspector

pipe = Pipeline([
    ("inspect", YReportInspector(
        categorical_cols=["Pclass"],
        ignore_cols=["Name"]
    )),
    ("model", LogisticRegression(max_iter=1000))
])

# X_train and y_train are your training features and labels, defined elsewhere
pipe.fit(X_train, y_train)

pipe.named_steps["inspect"].report_.summary()
pipe.named_steps["inspect"].report_.to_markdown("train_report.md")

✔ Model trains normally
✔ Data remains unchanged
✔ Report is available after fit()
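
For readers unfamiliar with the pattern, a pass-through pipeline step can be sketched in a few lines. `NoOpInspector` below is a hypothetical stand-in, not the source of `YReportInspector`; a real scikit-learn transformer would typically also subclass `BaseEstimator` and `TransformerMixin`, which is omitted here to keep the example dependency-free.

```python
class NoOpInspector:
    """Hypothetical pass-through step: observes X in fit(), never changes it."""

    def fit(self, X, y=None):
        # A real inspector would run its diagnostics here; this sketch
        # only records the row count. The trailing underscore follows the
        # scikit-learn convention for attributes set during fit().
        self.report_ = {"n_rows": len(X)}
        return self

    def transform(self, X):
        # No-op: hand the very same object to the next pipeline step
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = [[1.0, 2.0], [3.0, 4.0]]
inspector = NoOpInspector()
out = inspector.fit_transform(X)
```

Because `transform` returns its input untouched, the downstream model sees exactly the data it would have seen without the inspector.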

🧪 Testing

Run tests from the project root:

pytest

Includes:

sklearn pipeline compatibility test
Core API regression protection

🧠 Design Philosophy

Correctness > Automation
Transparency > Guessing
Diagnostics > Decoration
User intent > Heuristics

yreport will never silently apply transformations.

🚧 What yreport is NOT

AutoML tool
Feature engineering pipeline
Visualization-heavy EDA
Encoding decision engine

This is intentional.

Release history

0.1.4 (Jun 21, 2026)
0.1.3 (Jan 1, 2026)
0.1.1 (Dec 19, 2025)
0.1.0 (Dec 19, 2025), this version

Download files

Source Distribution

yreport-0.1.0.tar.gz (10.5 kB)

Uploaded Dec 19, 2025 Source

Built Distribution

yreport-0.1.0-py3-none-any.whl (9.0 kB)

Uploaded Dec 19, 2025 Python 3

File details

Details for the file yreport-0.1.0.tar.gz.

File metadata

Download URL: yreport-0.1.0.tar.gz
Upload date: Dec 19, 2025
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for yreport-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0979a9915dbbd0f1e1a85f41ba78a5bc67a052d98d544acae678d6ac07e3533c`
MD5	`9414287bed604cfd502275011c846dd7`
BLAKE2b-256	`37d69f536516e3db13b9db79973c96fdbba5556365a79ef35a030fe611d777c6`

File details

Details for the file yreport-0.1.0-py3-none-any.whl.

File metadata

Download URL: yreport-0.1.0-py3-none-any.whl
Upload date: Dec 19, 2025
Size: 9.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for yreport-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5d74b623b3f606b1b6a53b82696a693ca3d3369ac345ec033ff86192138127ad`
MD5	`70b571b3746d9b309daa1532fcad43f2`
BLAKE2b-256	`d5cf9c6df59671c6fb164004d3cebfa055253428e8ad7ab6a085aef7754f00d3`
