
Zero-friction AutoML + Data Cleaning Toolkit


🚀 KaizenStat


KaizenStat is a zero-friction, production-grade AutoML, automated data cleaning, and model explanation engine. It lets you audit datasets, repair data issues, benchmark models with hardware-aware optimization, export standalone pipeline code, and host web-based dashboards, all from a single command or Python import.


🎯 Core Philosophy

  • Zero-Friction AutoML: No complex configuration files. Pass your dataset, name your target, and KaizenStat does the rest.
  • Production Crash-Proofing: Automatically handles messy real-world data issues: high-cardinality ID columns, datetime parsing, missing values, class imbalance, and label encoding.
  • Explainable AI: Breaks open the "black box" by generating standalone, human-readable Python training code reproducing the best-found pipeline.
  • Hybrid Interface: Every CLI command has an equivalent call in the Python API.

📦 Installation

Install the core package with zero heavy external dependencies:

pip install kaizenstat

Optional Drivers & Accelerators

Tailor KaizenStat to your specific workload:

# Quote the extras so shells such as zsh do not expand the square brackets
pip install "kaizenstat[ui]"     # Install Streamlit for web dashboards
pip install "kaizenstat[gpu]"    # Install XGBoost with GPU/MPS support
pip install "kaizenstat[fast]"   # Install Polars for ultra-fast CSV parsing
pip install "kaizenstat[all]"    # Install all optional components

⚔️ CLI & Python API Feature Matrix

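KaizenStat is designed around a single unified vocabulary. Every CLI command has a direct, equivalent function in the Python SDK. The conversational functions (analyze, ask, ask_followup) are available from Python only and are marked with "-" in the Command column.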

Command | Python API | Purpose
kz audit | KaizenStat.audit() | 🔍 Runs a diagnostic sweep (missing values, duplicates, imbalance, dead features).
kz heal | KaizenStat.heal() | 🩹 Cleans, imputes, parses datetimes, drops IDs, and encodes string labels.
kz benchmark | KaizenStat.benchmark() | 🚀 Automatically trains, optimizes, and ranks model pipelines.
kz auto | KaizenStat.auto() | ⚡ Orchestrates the entire pipeline in sequence (Audit ➔ Heal ➔ Benchmark).
kz explain | KaizenStat.explain() | 💬 Generates plain-English diagnostic summaries and model recommendations.
kz codegen | KaizenStat.codegen() | 📝 Generates standalone, dependency-free Python code for the best model.
kz export-model | KaizenStat.save_model() | 💾 Trains the top pipeline and saves it directly to a .joblib binary.
kz report | KaizenStat.report() | 📊 Generates an interactive HTML profiling report with Chart.js.
kz serve | KaizenStat.serve() | 🌐 Launches a local web dashboard to explore the data and run predictions.
- | KaizenStat.analyze() | 🧠 Executes auto-intelligence analysis over dataset context using LLM reasoning.
- | KaizenStat.ask() | 🤖 Answers complex developer queries about accuracy, data quality, or anomalies.
- | KaizenStat.ask_followup() | 🔁 Maintains multi-turn conversation memory with the data reasoning engine.
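
To make the audit row concrete, here is a minimal sketch of the kind of checks a diagnostic sweep might run (missing values, duplicate rows, dead features). It is not KaizenStat's implementation: the function name sketch_audit, the list-of-dicts input, and the returned dictionary shape are all assumptions chosen so the example runs with the standard library alone.

```python
from collections import Counter


def sketch_audit(rows, target):
    """Illustrative diagnostic sweep over a list of dict rows (hypothetical helper)."""
    columns = list(rows[0].keys()) if rows else []
    # Missing values: count None or empty-string cells per column.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    # Duplicates: every repeat of an already-seen row counts once.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    # Dead features: non-target columns that hold a single distinct value.
    dead = [
        col for col in columns
        if col != target and len({row.get(col) for row in rows}) <= 1
    ]
    return {"missing": missing, "duplicates": duplicates, "dead_features": dead}


rows = [
    {"age": 31, "city": "Oslo", "country": "NO", "churn": 0},
    {"age": None, "city": "Bergen", "country": "NO", "churn": 1},
    {"age": 31, "city": "Oslo", "country": "NO", "churn": 0},
]
report = sketch_audit(rows, target="churn")
print(report)
```

In this toy data the sketch flags one missing age, one duplicate row, and country as a dead feature. The real findings object returned by KaizenStat.audit() may be structured differently.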

💡 Quick Start Guide

1. Python SDK Usage

from kaizenstat import KaizenStat
import pandas as pd

# Load your dataset
df = pd.read_csv("dataset.csv")

# 1. Diagnose issues
findings = KaizenStat.audit(df, target="target_column")

# 2. Automatically heal dirty data
clean_df = KaizenStat.heal(df, target="target_column")

# 3. Benchmark models with cross-validation
leaderboard = KaizenStat.benchmark(clean_df, target="target_column")

# 4. Generate standalone code for reproduction
KaizenStat.codegen("dataset.csv", target="target_column", output_path="reproduce.py")

# 5. Dual-Mode Conversational AI (OpenRouter powered)
# Runs automated structured AI analysis
analysis = KaizenStat.analyze(df, target="target_column")

# Ask custom developer queries about data or pipeline
KaizenStat.ask("Why is model accuracy low, and what are the dataset's flaws?")

# Multi-turn conversation with memory context
KaizenStat.ask_followup("What should I do to handle the missing values or high cardinality?")

2. Command Line Interface (CLI)

# Get quick help and list commands
kz --help

# Run the full pipeline in one command
kz auto dataset.csv --target target_column

# Repair a dataset and save the clean file
kz heal dataset.csv --target target_column -o cleaned_dataset.csv

# Launch a local Streamlit app to preview and test model performance
kz serve dataset.csv --target target_column --port 8501

🧠 Behind the Scenes: Core Engines

1. Hardware-Aware Execution

KaizenStat automatically checks your environment using detect_device(). It uses CUDA on NVIDIA GPUs and MPS on Apple Silicon Macs (M1/M2/M3) to accelerate training when optional dependencies such as xgboost are installed.
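
The internals of detect_device() are not shown here, so the following is only a sketch of how such a probe could work using cheap standard-library checks. The function name sketch_detect_device and both heuristics (an nvidia-smi binary on PATH, macOS on arm64) are assumptions, not KaizenStat's actual logic.

```python
import platform
import shutil


def sketch_detect_device():
    """Return "cuda", "mps", or "cpu" from simple environment probes (hypothetical helper)."""
    # Assumption: an nvidia-smi binary on PATH implies a usable NVIDIA GPU.
    if shutil.which("nvidia-smi") is not None:
        return "cuda"
    # Assumption: macOS on arm64 means Apple Silicon with Metal available.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"
    # Fall back to CPU when neither accelerator is detected.
    return "cpu"


device = sketch_detect_device()
print(f"Selected device: {device}")
```

A production probe would more likely ask the installed ML library whether the accelerator is usable, since a driver binary alone does not guarantee a working GPU build.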

2. Smart Model Selection

The benchmarking engine adjusts its logic dynamically based on the dataset properties:

  • Large Datasets (>100k rows): Excludes slow estimators (like Gradient Boosting) on standard CPU hosts to prevent compute lockups.
  • High-Cardinality Categoricals: Optimizes feature preprocessors and prioritizes tree-based models (Random Forests, Gradient Boosting, XGBoost).
  • Float Targets: Detects values with a continuous numeric profile and switches the entire pipeline to regression mode automatically.
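
The three rules above can be sketched as plain Python. This is an illustration under stated assumptions, not the benchmarking engine itself: the function names, the model labels, and the rule that any non-integer float marks a regression target are all hypothetical. Only the 100k-row threshold comes from the text.

```python
LARGE_DATASET_ROWS = 100_000  # threshold quoted in the list above


def infer_task(target_values):
    """Treat a target containing any non-integer float as regression (assumed rule)."""
    for value in target_values:
        if isinstance(value, float) and not value.is_integer():
            return "regression"
    return "classification"


def choose_candidates(n_rows, has_high_cardinality, on_gpu):
    """Pick model family labels; the labels are illustrative, not KaizenStat names."""
    candidates = ["linear", "random_forest", "gradient_boosting"]
    if on_gpu:
        candidates.append("xgboost")
    # Drop the slow estimator for large data on CPU-only hosts.
    if n_rows > LARGE_DATASET_ROWS and not on_gpu:
        candidates.remove("gradient_boosting")
    # Move tree-based models to the front for high-cardinality categoricals.
    if has_high_cardinality:
        tree_based = {"random_forest", "gradient_boosting", "xgboost"}
        candidates.sort(key=lambda name: name not in tree_based)
    return candidates


print(infer_task([1.5, 2.0]))
print(choose_candidates(250_000, has_high_cardinality=False, on_gpu=False))
```

Because Python's sort is stable, the prioritization step keeps the original relative order within the tree-based and non-tree-based groups.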

3. Automatic Imbalance Correction

During data healing, KaizenStat computes the share of each target class. If the majority class accounts for more than 65% of the samples (a skew beyond 65% / 35%), it adjusts model parameters dynamically (e.g. setting class_weight="balanced" in scikit-learn estimators).
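
As a rough sketch of that rule, the helper below computes the majority-class share and returns extra estimator arguments when it exceeds 65%. The function name imbalance_params and its return shape are assumptions for illustration; class_weight="balanced" is the scikit-learn setting named in the text.

```python
from collections import Counter

IMBALANCE_THRESHOLD = 0.65  # majority share taken from the 65% / 35% rule above


def imbalance_params(labels, threshold=IMBALANCE_THRESHOLD):
    """Return extra estimator keyword arguments when the target is skewed (hypothetical helper)."""
    if not labels:
        return {}
    counts = Counter(labels)
    # Share of samples that belong to the most frequent class.
    majority_share = max(counts.values()) / len(labels)
    if majority_share > threshold:
        return {"class_weight": "balanced"}
    return {}


skewed = imbalance_params([0] * 80 + [1] * 20)
even = imbalance_params([0] * 55 + [1] * 45)
print(skewed, even)
```

The returned dictionary could be unpacked into any scikit-learn estimator that accepts class_weight, for example LogisticRegression(**skewed).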


🛠 Developer Guide

Setting up a local workspace

To contribute or run local enhancements:

  1. Clone the repository:
    git clone https://github.com/masuddarrahaman/KaizenStat-Library.git
    cd KaizenStat-Library
    
  2. Install the package in editable mode with all optional drivers:
    pip install -e ".[all]"
    
  3. Run tests or validation:
    python3 -m unittest discover -s tests
    

📄 License

Distributed under the MIT License. See LICENSE for details.
