Dataset-agnostic data health report for tabular datasets

📊 yreport

📦 Install

pip install yreport

yreport is a lightweight, dataset-agnostic data health reporting library for tabular datasets.
It analyzes data quality, detects potential issues, and provides honest, actionable diagnostics without making unsafe assumptions.

Unlike heavy EDA tools, yreport is designed to be:

  • Pipeline-friendly
  • Explainable
  • Configurable
  • Production-aware

🚀 Why yreport?

Most EDA libraries:

  • Generate large HTML reports
  • Make aggressive assumptions (e.g. one-hot everything)
  • Are hard to integrate into ML pipelines

yreport focuses on decisions, not decoration.

It helps answer:

  • Is this dataset usable?
  • Which columns are problematic?
  • What should be fixed first?
  • Where should I be careful before modeling?

✨ Features

  • Weighted Data Health Score (0–100)
  • Automatic column type detection
  • Missing value diagnostics with confidence levels
  • High-cardinality categorical detection
  • Numeric skewness and outlier analysis
  • Honest categorical handling (no forced one-hot / ordinal)
  • User override support
  • Non-contradictory recommendations
  • JSON and Markdown export
  • scikit-learn Pipeline integration
  • Lightweight and fast

📦 Installation

Install from source (recommended)

git clone https://github.com/Yogesh942134/yreport.git
cd yreport
pip install -e .

🧠 Core Concept

yreport does not modify your data.

It:

  • Inspects datasets
  • Reports potential issues
  • Suggests actions with confidence

It does not:

  • Apply transformations
  • Guess encoding methods
  • Perform feature engineering

This makes it safe and transparent.

Basic usage:

import pandas as pd
from yreport import data_health_report

df = pd.read_csv("data.csv")

report = data_health_report(df)
report.summary()

Example Console Output:

Data Health Score: 87.95/100
Rows: 891 | No_Columns: 12

Warnings:

  • high_missing: ['Cabin']
  • high_cardinality: ['Name', 'Ticket']

📋 What the Report Includes

1️⃣ Data Health Score

A weighted score based on:

  • Missing values
  • Duplicate rows
  • High-cardinality features
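
The exact weights yreport uses are not stated here. As a rough illustration of how a weighted score of this kind could be assembled with pandas, the sketch below uses made-up weights and a hypothetical `sketch_health_score` helper; neither is part of the yreport API.

```python
import pandas as pd

# Hypothetical weights for illustration only; not yreport's actual values.
WEIGHTS = {"missing": 0.5, "duplicates": 0.3, "high_cardinality": 0.2}


def sketch_health_score(df: pd.DataFrame, cardinality_ratio: float = 0.5) -> float:
    """Return a 0-100 score where 100 means no detected issues."""
    if df.empty:
        return 0.0

    # Fraction of all cells that are missing.
    missing_frac = float(df.isna().to_numpy().mean())

    # Fraction of rows that exactly duplicate an earlier row.
    duplicate_frac = float(df.duplicated().mean())

    # Fraction of non-numeric columns whose unique-value ratio is high.
    cat_cols = df.select_dtypes(exclude="number").columns
    if len(cat_cols) > 0:
        ratios = df[cat_cols].nunique() / len(df)
        high_card_frac = float((ratios > cardinality_ratio).mean())
    else:
        high_card_frac = 0.0

    penalty = (
        WEIGHTS["missing"] * missing_frac
        + WEIGHTS["duplicates"] * duplicate_frac
        + WEIGHTS["high_cardinality"] * high_card_frac
    )
    return round(100.0 * (1.0 - penalty), 2)
```

Because the weights sum to 1 and each component is a fraction between 0 and 1, the result stays within 0 to 100.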

2️⃣ Column Type Detection

Automatically detects:

  • Numeric columns
  • Categorical columns
  • Datetime columns
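
How yreport performs detection internally is not described here. One plausible approach, sketched below, is to bucket columns by dtype using `pandas.api.types`. The `sketch_detect_types` helper is hypothetical, and treating booleans as categorical is an assumption made for this example.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_detect_types(df: pd.DataFrame) -> dict:
    """Group column names into numeric, datetime, and categorical buckets."""
    detected = {"numeric": [], "datetime": [], "categorical": []}
    for col in df.columns:
        series = df[col]
        if ptypes.is_bool_dtype(series):
            # Assumption: booleans are treated as two-level categories.
            detected["categorical"].append(col)
        elif ptypes.is_numeric_dtype(series):
            detected["numeric"].append(col)
        elif ptypes.is_datetime64_any_dtype(series):
            detected["datetime"].append(col)
        else:
            # Strings, generic objects, and pandas categoricals land here.
            detected["categorical"].append(col)
    return detected
```

A dtype-only check like this cannot tell that an integer column such as a class label is really categorical, which is one reason explicit overrides are useful.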

3️⃣ Missing Value Diagnostics

  • Missing percentage per column
  • Drop or impute recommendations
  • Confidence levels: HIGH / MEDIUM
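
As an illustration of how per-column missing percentages might be turned into drop or impute suggestions, here is a minimal sketch. The thresholds, the mapping from percentage to confidence, and the `sketch_missing_diagnostics` helper are all assumptions for this example, not yreport's documented behavior.

```python
import pandas as pd


def sketch_missing_diagnostics(
    df: pd.DataFrame,
    drop_threshold: float = 60.0,    # assumed cut-off, in percent
    medium_threshold: float = 20.0,  # assumed cut-off, in percent
) -> pd.DataFrame:
    """Return one row per column with missing values, plus a suggested action."""
    missing_pct = df.isna().mean() * 100
    rows = []
    for col, pct in missing_pct.items():
        if pct == 0:
            continue  # nothing to report for complete columns
        if pct >= drop_threshold:
            # Mostly empty: dropping is the clear choice.
            action, confidence = "drop", "HIGH"
        elif pct >= medium_threshold:
            # Middle ground: imputation is plausible but less certain.
            action, confidence = "impute", "MEDIUM"
        else:
            # Only a few gaps: imputation is low risk.
            action, confidence = "impute", "HIGH"
        rows.append({
            "column": col,
            "missing_pct": round(float(pct), 2),
            "action": action,
            "confidence": confidence,
        })
    return pd.DataFrame(rows, columns=["column", "missing_pct", "action", "confidence"])
```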

4️⃣ Categorical Diagnostics

  • Flags categorical columns that require encoding
  • Detects high-cardinality features
  • Does not assume one-hot or ordinal encoding
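
A common way to flag high-cardinality columns is to compare the number of distinct values with the number of non-null rows. The sketch below assumes a ratio threshold of 0.5 and a hypothetical `sketch_high_cardinality` helper; yreport's own criterion may differ.

```python
import pandas as pd


def sketch_high_cardinality(
    df: pd.DataFrame,
    categorical_cols: list,
    max_unique_ratio: float = 0.5,  # assumed threshold
) -> list:
    """Return categorical columns whose distinct-value ratio exceeds the threshold."""
    flagged = []
    for col in categorical_cols:
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # an all-missing column has no cardinality to measure
        unique_ratio = non_null.nunique() / len(non_null)
        if unique_ratio > max_unique_ratio:
            flagged.append(col)
    return flagged
```

Note that the function only reports the columns; choosing an encoding for them is left to the user, in line with the points above.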

5️⃣ Numeric Diagnostics

For numeric columns:

  • Skewness
  • Outlier percentage (IQR method)
  • Transform suggestions (log / robust)
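
The three checks above can be sketched with plain pandas. The IQR rule here is the standard 1.5 × IQR fence; the skewness limit, the rule for choosing between "log" and "robust", and the `sketch_numeric_diagnostics` helper are assumptions made for illustration only.

```python
import pandas as pd


def sketch_numeric_diagnostics(series: pd.Series, skew_limit: float = 1.0) -> dict:
    """Compute skewness, IQR outlier percentage, and a transform hint."""
    values = series.dropna()
    skewness = float(values.skew())

    # IQR method: points beyond 1.5 * IQR from the quartiles count as outliers.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_pct = float(((values < lower) | (values > upper)).mean() * 100)

    # Assumed rule of thumb: log for strong right skew on strictly positive
    # data, robust scaling when outliers are present, otherwise no suggestion.
    if skewness > skew_limit and (values > 0).all():
        suggestion = "log"
    elif outlier_pct > 0:
        suggestion = "robust"
    else:
        suggestion = None

    return {
        "skewness": round(skewness, 3),
        "outlier_pct": round(outlier_pct, 2),
        "suggestion": suggestion,
    }
```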

🧩 User Overrides

Automatic detection is never perfect. yreport allows explicit user control.

Supported Overrides:

data_health_report(
    df,
    ignore_cols=[...],
    drop_cols=[...],
    categorical_cols=[...],
    numerical_cols=[...]
)

Meaning of Overrides:

Override            Purpose
ignore_cols         Completely ignore columns
drop_cols           Force drop columns
categorical_cols    Force categorical treatment
numerical_cols      Force numeric treatment

Rules:

  • User intent always overrides automation
  • A column belongs to only one semantic type
  • Ignored or dropped columns are excluded everywhere
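
To make these rules concrete, here is a minimal sketch of how overrides could be merged with auto-detected types. The `sketch_resolve_columns` helper and its return shape are hypothetical, and raising an error when a column is forced to two types is an assumption; the parameter names mirror the overrides listed above.

```python
def sketch_resolve_columns(
    all_cols,
    detected_numeric,
    detected_categorical,
    ignore_cols=(),
    drop_cols=(),
    categorical_cols=(),
    numerical_cols=(),
):
    """Apply user overrides on top of auto-detected column types."""
    overlap = set(categorical_cols) & set(numerical_cols)
    if overlap:
        # A column belongs to only one semantic type.
        raise ValueError(f"Columns forced to two types: {sorted(overlap)}")

    # Ignored or dropped columns are excluded everywhere.
    excluded = set(ignore_cols) | set(drop_cols)

    numeric, categorical = [], []
    for col in all_cols:
        if col in excluded:
            continue
        # User intent is checked first, then the automatic detection.
        if col in categorical_cols:
            categorical.append(col)
        elif col in numerical_cols:
            numeric.append(col)
        elif col in detected_numeric:
            numeric.append(col)
        elif col in detected_categorical:
            categorical.append(col)
    return {"numeric": numeric, "categorical": categorical}
```

For example, forcing `Pclass` to categorical moves it out of the detected numeric group, while an ignored `Name` column appears in neither group.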

📤 Exporting Reports

JSON Export (machine-readable):

report.to_json("report.json")
data = report.to_json()

Markdown Export (human-readable):

report.to_markdown("report.md")

🤖 scikit-learn Pipeline Integration

yreport provides a no-op sklearn inspector. Why?

  • Observe data during training
  • Do not interfere with models
  • Keep pipelines clean

Example

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yreport import YReportInspector

pipe = Pipeline([
    ("inspect", YReportInspector(
        categorical_cols=["Pclass"],
        ignore_cols=["Name"]
    )),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)

pipe.named_steps["inspect"].report_.summary()
pipe.named_steps["inspect"].report_.to_markdown("train_report.md")

✔ Model trains normally
✔ Data remains unchanged
✔ Report is available after fit()
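
For readers curious how a no-op inspection step can work in general, the sketch below shows the pattern with scikit-learn's `BaseEstimator` and `TransformerMixin`. `SketchInspector` is a hypothetical stand-in written for this example, not the source of `YReportInspector`, and the contents of its `report_` dictionary are an assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchInspector(BaseEstimator, TransformerMixin):
    """Pass-through step that records simple statistics during fit."""

    def fit(self, X, y=None):
        frame = pd.DataFrame(X)
        # The trailing underscore follows the sklearn convention for
        # attributes that only exist after fitting.
        self.report_ = {
            "rows": len(frame),
            "columns": frame.shape[1],
            "missing_pct": (frame.isna().mean() * 100).round(2).to_dict(),
        }
        return self

    def transform(self, X):
        # No-op: hand the input to the next step untouched.
        return X
```

Because `transform` returns its input unchanged, the downstream model sees exactly the data it would have seen without the inspection step.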

🧪 Testing

Run tests from the project root:

pytest

The test suite includes:

  • sklearn pipeline compatibility test
  • Core API regression protection

🧠 Design Philosophy

  • Correctness > Automation
  • Transparency > Guessing
  • Diagnostics > Decoration
  • User intent > Heuristics

yreport will never silently apply transformations.

🚧 What yreport is NOT

  • AutoML tool
  • Feature engineering pipeline
  • Visualization-heavy EDA
  • Encoding decision engine

This is intentional.
