Dataset-agnostic data health report for tabular datasets

📊 yreport

yreport is a lightweight, dataset-agnostic data health reporting library for tabular datasets.
It analyzes data quality, detects potential issues, and provides honest, actionable diagnostics without making unsafe assumptions.

Unlike heavy EDA tools, yreport is designed to be:

  • Pipeline-friendly
  • Explainable
  • Configurable
  • Production-aware

🚀 Why yreport?

Most EDA libraries:

  • Generate large HTML reports
  • Make aggressive assumptions (e.g. one-hot everything)
  • Are hard to integrate into ML pipelines

yreport focuses on decisions, not decoration.

It helps answer:

  • Is this dataset usable?
  • Which columns are problematic?
  • What should be fixed first?
  • Where should I be careful before modeling?

✨ Features

  • Weighted Data Health Score (0–100)
  • Automatic column type detection
  • Missing value diagnostics with confidence levels
  • High-cardinality categorical detection
  • Numeric skewness and outlier analysis
  • Honest categorical handling (no forced one-hot / ordinal)
  • User override support
  • Non-contradictory recommendations
  • JSON and Markdown export
  • scikit-learn Pipeline integration
  • Lightweight and fast

📦 Installation

Install from source (recommended)

git clone https://github.com/your-username/yreport.git
cd yreport
pip install -e .

🧠 Core Concept

yreport does not modify your data.

It:

  • Inspects datasets
  • Reports potential issues
  • Suggests actions with confidence

It does not:

  • Apply transformations
  • Guess encoding methods
  • Perform feature engineering

This makes it safe and transparent.

Basic usage:

import pandas as pd
from yreport import data_health_report

df = pd.read_csv("data.csv")

report = data_health_report(df)
report.summary()

Example console output:

Data Health Score: 87.95/100
Rows: 891 | No_Columns: 12

Warnings:

  • high_missing: ['Cabin']
  • high_cardinality: ['Name', 'Ticket']

📋 What the Report Includes

1️⃣ Data Health Score

A weighted score based on:

  • Missing values
  • Duplicate rows
  • High-cardinality features
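As a rough illustration of how a weighted score over these three signals could be computed, here is a minimal sketch. The weights, the cardinality threshold, and the function name `sketch_health_score` are assumptions made for this example; they are not yreport's actual values or API.

```python
import pandas as pd

# Hypothetical weights and threshold, chosen for illustration only;
# they are not yreport's actual values.
WEIGHTS = {"missing": 0.5, "duplicates": 0.3, "high_cardinality": 0.2}
CARDINALITY_RATIO = 0.5


def sketch_health_score(df: pd.DataFrame) -> float:
    """Return a 0-100 score; 100 means no penalty was found."""
    if df.empty:
        return 0.0
    # Fraction of all cells that are missing.
    missing = float(df.isna().mean().mean())
    # Fraction of rows that repeat an earlier row.
    duplicates = float(df.duplicated().mean())
    # Fraction of non-numeric columns whose unique-value ratio is high.
    non_numeric = df.select_dtypes(exclude="number")
    if non_numeric.shape[1] == 0:
        high_card = 0.0
    else:
        ratios = non_numeric.nunique() / len(df)
        high_card = float((ratios > CARDINALITY_RATIO).mean())
    penalty = (
        WEIGHTS["missing"] * missing
        + WEIGHTS["duplicates"] * duplicates
        + WEIGHTS["high_cardinality"] * high_card
    )
    return round(100 * (1 - penalty), 2)
```

Because the weights sum to 1, the penalty stays between 0 and 1 and the score stays within 0 to 100.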

2️⃣ Column Type Detection

Automatically detects:

  • Numeric columns
  • Categorical columns
  • Datetime columns
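One way to approximate this kind of detection is with pandas' dtype helpers. The sketch below is an assumption about how such a check might look, not yreport's implementation; `sketch_detect_types` is a hypothetical name, and treating booleans as categorical is a choice made for this example.

```python
import pandas as pd
from pandas.api import types as ptypes


def sketch_detect_types(df: pd.DataFrame) -> dict:
    """Map each column name to 'numeric', 'datetime', or 'categorical'."""
    detected = {}
    for col in df.columns:
        series = df[col]
        if ptypes.is_bool_dtype(series):
            # pandas counts bool as numeric, so check it first (illustrative choice).
            detected[col] = "categorical"
        elif ptypes.is_numeric_dtype(series):
            detected[col] = "numeric"
        elif ptypes.is_datetime64_any_dtype(series):
            detected[col] = "datetime"
        else:
            # Strings, objects, and pandas categoricals all land here.
            detected[col] = "categorical"
    return detected
```

A dtype-only check cannot tell that an integer column such as a class label is really categorical, which is the kind of case user overrides exist for.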

3️⃣ Missing Value Diagnostics

  • Missing percentage per column
  • Drop or impute recommendations
  • Confidence levels: HIGH / MEDIUM
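The following sketch shows one plausible way to derive a drop-or-impute suggestion with a confidence level from the missing percentage. The 50% threshold, the 20-point margin, and the name `sketch_missing_diagnostics` are hypothetical; yreport's real rules may differ.

```python
import pandas as pd

DROP_THRESHOLD = 50.0  # hypothetical: above this % missing, suggest dropping
MARGIN = 20.0          # hypothetical: this close to the threshold means MEDIUM


def sketch_missing_diagnostics(df: pd.DataFrame) -> dict:
    """Return a suggestion for every column that has missing values."""
    results = {}
    missing_pct = df.isna().mean() * 100
    for col, pct in missing_pct.items():
        if pct == 0:
            continue  # nothing to report for complete columns
        action = "drop" if pct > DROP_THRESHOLD else "impute"
        # Confidence is lower when the column sits near the decision boundary.
        near_boundary = abs(pct - DROP_THRESHOLD) < MARGIN
        results[col] = {
            "missing_pct": round(float(pct), 2),
            "action": action,
            "confidence": "MEDIUM" if near_boundary else "HIGH",
        }
    return results
```

The function only returns suggestions; it never alters the DataFrame.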

4️⃣ Categorical Diagnostics

  • Flags categorical columns that require encoding
  • Detects high-cardinality features
  • Does not assume one-hot or ordinal encoding
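High-cardinality detection can be sketched as a unique-value ratio check. The function name `sketch_high_cardinality` and the default ratio of 0.5 are assumptions for illustration, not yreport's actual rule.

```python
import pandas as pd


def sketch_high_cardinality(df: pd.DataFrame, categorical_cols, max_ratio=0.5):
    """Return the categorical columns whose unique-value ratio exceeds max_ratio."""
    flagged = []
    for col in categorical_cols:
        non_null = df[col].dropna()
        if len(non_null) == 0:
            continue  # an all-missing column has no cardinality to measure
        if non_null.nunique() / len(non_null) > max_ratio:
            flagged.append(col)
    return flagged
```

Note that the sketch only flags columns. In keeping with the points above, it does not pick an encoding for them.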

5️⃣ Numeric Diagnostics

For numeric columns:

  • Skewness
  • Outlier percentage (IQR method)
  • Transform suggestions (log / robust)
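To make the three numeric checks concrete, here is a minimal sketch using pandas' `skew()` and the usual 1.5 × IQR fences. The limits, the rule for choosing between "log" and "robust", and the name `sketch_numeric_diagnostics` are assumptions for this example only.

```python
import pandas as pd


def sketch_numeric_diagnostics(series: pd.Series, skew_limit=1.0, outlier_limit=5.0):
    """Compute skewness, IQR outlier percentage, and a transform suggestion."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Share of values outside the IQR fences, as a percentage.
    outlier_pct = float(((values < lower) | (values > upper)).mean() * 100)
    skew = float(values.skew())
    suggestion = None
    if abs(skew) > skew_limit and (values > 0).all():
        suggestion = "log"      # hypothetical rule: skewed and strictly positive
    elif outlier_pct > outlier_limit:
        suggestion = "robust"   # hypothetical rule: many outliers, suggest robust scaling
    return {
        "skewness": round(skew, 2),
        "outlier_pct": round(outlier_pct, 2),
        "suggestion": suggestion,
    }
```

The result is a suggestion only. Applying a log transform or robust scaling is left to the caller.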

🧩 User Overrides

Automatic detection is never perfect. yreport allows explicit user control.

Supported Overrides:

data_health_report(
    df,
    ignore_cols=[...],
    drop_cols=[...],
    categorical_cols=[...],
    numerical_cols=[...]
)

Meaning of overrides:

Override          Purpose
ignore_cols       Completely ignore columns
drop_cols         Force drop columns
categorical_cols  Force categorical treatment
numerical_cols    Force numeric treatment

Rules:

  • User intent always overrides automation
  • A column belongs to only one semantic type
  • Ignored or dropped columns are excluded everywhere
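The rules above could be applied roughly as in the sketch below. It assumes the detected types arrive as a plain dict. The function name `sketch_resolve_overrides` is hypothetical, and so is the choice to raise an error when a column is forced into two types.

```python
def sketch_resolve_overrides(
    detected,
    ignore_cols=(),
    drop_cols=(),
    categorical_cols=(),
    numerical_cols=(),
):
    """Apply user overrides on top of automatically detected column types."""
    # A column may belong to only one semantic type.
    conflict = set(categorical_cols) & set(numerical_cols)
    if conflict:
        raise ValueError(f"Columns forced into two types: {sorted(conflict)}")

    # Ignored or dropped columns are excluded everywhere.
    excluded = set(ignore_cols) | set(drop_cols)

    resolved = {}
    for col, kind in detected.items():
        if col in excluded:
            continue
        # User intent overrides the automatic guess.
        if col in categorical_cols:
            kind = "categorical"
        elif col in numerical_cols:
            kind = "numeric"
        resolved[col] = kind
    return resolved
```

For example, forcing `Pclass` to categorical while ignoring `Name` leaves every other column with its detected type.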

📤 Exporting Reports

JSON Export (machine-readable):

report.to_json("report.json")   # write the report to a file
data = report.to_json()         # or call without a path to get the data back

Markdown Export (human-readable):

report.to_markdown("report.md")

🤖 scikit-learn Pipeline Integration

yreport provides a no-op scikit-learn inspector: a pipeline step that records a report during fit() and passes the data through unchanged. Why?

  • Observe data during training
  • Do not interfere with models
  • Keep pipelines clean

Example

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yreport import YReportInspector

pipe = Pipeline([
    ("inspect", YReportInspector(
        categorical_cols=["Pclass"],
        ignore_cols=["Name"]
    )),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)

pipe.named_steps["inspect"].report_.summary()
pipe.named_steps["inspect"].report_.to_markdown("train_report.md")

✔ Model trains normally
✔ Data remains unchanged
✔ Report is available after fit()
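For readers curious how a no-op inspector step can work, the sketch below builds one on scikit-learn's `BaseEstimator` and `TransformerMixin`. It is not the source of `YReportInspector`. The class name `SketchInspector` and the contents of its `report_` dict are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchInspector(BaseEstimator, TransformerMixin):
    """Pass-through step that records a small summary of X during fit."""

    def fit(self, X, y=None):
        frame = pd.DataFrame(X)
        # Stored with a trailing underscore, the sklearn convention for
        # attributes that only exist after fit().
        self.report_ = {
            "rows": int(frame.shape[0]),
            "columns": int(frame.shape[1]),
            "missing_cells": int(frame.isna().sum().sum()),
        }
        return self

    def transform(self, X):
        # No-op: the data reaches the next step exactly as it came in.
        return X
```

Because `transform` returns its input untouched, placing such a step ahead of a model does not change what the model sees.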

🧪 Testing

Run tests from the project root:

pytest

The test suite includes:

  • sklearn pipeline compatibility test
  • Core API regression protection

🧠 Design Philosophy

  • Correctness > Automation
  • Transparency > Guessing
  • Diagnostics > Decoration
  • User intent > Heuristics

yreport will never silently apply transformations.

🚧 What yreport is NOT

  • AutoML tool
  • Feature engineering pipeline
  • Visualization-heavy EDA
  • Encoding decision engine

This is intentional.
