Automated end-to-end ML pipeline: clean → EDA → feature selection → train 12 classifiers or 13 regressors → HTML reports + CSV snapshots

These details have not been verified by PyPI

Project description

ds-eval-kit

Automated end-to-end Machine Learning Pipeline — one import, one call.

ds-eval-kit takes a raw dataset and delivers cleaned data, interactive HTML reports, trained model comparisons, and a CSV snapshot of each major pipeline stage, with no boilerplate code required.

✨ Features

Stage	What it does	Output
Load	CSV · Excel · JSON · Parquet	`01_raw_data.csv`
Clean	Null imputation · duplicate removal · type coercion	`02_cleaned_data.csv`
EDA	Distribution · correlation · missing-value charts	`eda_report.html`
Encode	One-hot / label / ordinal (auto-detected)	—
Outliers	IQR / Z-score clipping	—
Scale	Standard · MinMax · Robust	`03_encoded_scaled_data.csv`
Feature selection	Correlation + importance (union), then RFE	`04_selected_features.csv` · `feature_selection_report.html`
Split	Stratified train/test	`05_train_set.csv` · `06_test_set.csv`
Train	12 classifiers or 13 regressors with cross-validation	`07_model_results.csv`
Report	Side-by-side model comparison with confusion matrices / residuals	`model_accuracy_report.html`

📦 Installation

pip install ds-eval-kit

Requirements

Python ≥ 3.12 and the following packages (all installed automatically):

pandas>=2.0   numpy>=1.26   scikit-learn>=1.3   matplotlib>=3.7
seaborn>=0.12  plotly>=5.15  scipy>=1.11  statsmodels>=0.14
xgboost>=2.0  lightgbm>=4.0  catboost>=1.2  pyarrow>=14.0  openpyxl>=3.1
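
If an install misbehaves, a quick way to see what is actually present is to query the installed distributions. The sketch below uses only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of ds-eval-kit, and the script reports versions without comparing them against the minimums above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the requirements list above.
REQUIRED = [
    "pandas", "numpy", "scikit-learn", "matplotlib", "seaborn", "plotly",
    "scipy", "statsmodels", "xgboost", "lightgbm", "catboost", "pyarrow",
    "openpyxl",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python >= 3.12:", sys.version_info >= (3, 12))
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:14s} {ver or 'MISSING'}")
```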

🚀 Quickstart

Interactive mode (zero code)

from ds_eval_kit import ml_process

pipe = ml_process(output_dir="ml_output")
pipe.get_ml()
# → prompts: dataset path, target column, classification or regression

Programmatic mode

from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="ml_output",   # all files saved here
    export_csv=True,           # save 7 CSV snapshots (default: True)
    scaling_method="standard",
    handle_outliers=True,
)

result = pipe.run(
    dataset_path="used_cars.csv",
    target="price",
    problem_type="regression",   # or "classification" or None (auto-detect)
)

# HTML reports
print(result["eda_report"])       # → ml_output/eda_report.html
print(result["feature_report"])   # → ml_output/feature_selection_report.html
print(result["model_report"])     # → ml_output/model_accuracy_report.html

# CSV snapshots
for name, path in result["csv_files"].items():
    print(f"{name}  →  {path}")

# Intermediate DataFrames (in memory)
raw_df       = result["dataframes"]["raw"]
clean_df     = result["dataframes"]["cleaned"]
processed_df = result["dataframes"]["processed"]
selected_df  = result["dataframes"]["selected"]
train_df     = result["dataframes"]["train"]
test_df      = result["dataframes"]["test"]

# Model metrics table
import pandas as pd
metrics = pd.DataFrame(result["results"])
print(metrics.sort_values("R2", ascending=False))

📂 Output Files

After running the pipeline you will find the following files inside `output_dir`:

ml_output/
├── 01_raw_data.csv                  ← original loaded dataset
├── 02_cleaned_data.csv              ← after null/duplicate/type fixes
├── 03_encoded_scaled_data.csv       ← after encoding + outlier clip + scaling
├── 04_selected_features.csv         ← after feature selection
├── 05_train_set.csv                 ← X_train + y (target) column
├── 06_test_set.csv                  ← X_test  + y (target) column
├── 07_model_results.csv             ← all model metrics in one table
├── eda_report.html                  ← interactive EDA charts
├── feature_selection_report.html    ← feature importance & VIF table
└── model_accuracy_report.html       ← model comparison dashboard

CSV snapshot reference

File	Stage	Key columns
`01_raw_data.csv`	Raw load	original columns
`02_cleaned_data.csv`	After cleaning	original columns, nulls filled
`03_encoded_scaled_data.csv`	After preprocessing	numeric columns only
`04_selected_features.csv`	After feature selection	selected columns + target
`05_train_set.csv`	Train split	features + target
`06_test_set.csv`	Test split	features + target
`07_model_results.csv`	Model results	Model, CV Score, Test metrics…

⚙️ Configuration Reference

ml_process(
    test_size               = 0.2,          # fraction held out for testing
    random_state            = 42,           # global seed
    cv_folds                = 5,            # cross-validation folds
    handle_outliers         = True,         # IQR clipping
    scaling_method          = "standard",   # "standard" | "minmax" | "robust"
    encoding_method         = "auto",       # "auto" | "onehot" | "label" | "ordinal"
    feature_selection_method= "all",        # "all" | "correlation" | "rfe" | "importance"
    generate_plots          = True,         # include charts in HTML reports
    output_dir              = ".",          # where to save all output files
    export_csv              = True,         # save CSV snapshots at every stage
)
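
The `handle_outliers` comment above mentions IQR clipping. As a rough illustration of that technique (not the package's own code), the sketch below clips values to the usual Tukey fences with NumPy. The helper name `clip_iqr` and the 1.5 multiplier are assumptions; ds-eval-kit may use a different factor.

```python
import numpy as np


def clip_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)


prices = [10, 12, 11, 13, 12, 95]   # 95 is an obvious outlier
clipped = clip_iqr(prices)
print(clipped)
```

Clipping keeps every row and only pulls extreme values back to the fence, which is why the row count of the snapshots does not change at this stage.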

`scaling_method`

Value	Algorithm	Best for
`"standard"`	StandardScaler (z-score)	Most cases, SVM, linear models
`"minmax"`	MinMaxScaler (0–1)	Neural networks, KNN
`"robust"`	RobustScaler (median/IQR)	Data with many outliers
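
To see how the three options differ on skewed data, here is a small NumPy sketch of the formulas that the corresponding scikit-learn scalers apply with their default settings. The function names are illustrative only; the package itself presumably calls the scikit-learn classes named in the table.

```python
import numpy as np


def standard_scale(x):
    # z-score: subtract the mean, divide by the population standard deviation
    return (x - x.mean()) / x.std()


def minmax_scale(x):
    # map the minimum to 0 and the maximum to 1
    return (x - x.min()) / (x.max() - x.min())


def robust_scale(x):
    # centre on the median, divide by the interquartile range (Q3 - Q1)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)


x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier
for name, fn in [("standard", standard_scale),
                 ("minmax", minmax_scale),
                 ("robust", robust_scale)]:
    print(name, np.round(fn(x), 3))
```

With the outlier present, min-max scaling squeezes the first four values close to 0, while the robust variant keeps them evenly spaced around the median.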

`encoding_method`

Value	Behaviour
`"auto"`	One-hot for ≤ 10 categories, label encoding otherwise
`"onehot"`	Always one-hot
`"label"`	Always label encoding
`"ordinal"`	Ordinal encoding (preserves order)
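
The `"auto"` rule can be pictured with a few lines of pandas. This is a sketch of the behaviour described in the table, not the package's implementation: `encode_auto` and `max_onehot` are hypothetical names, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd


def encode_auto(df, max_onehot=10):
    """One-hot encode low-cardinality columns, label-encode the rest."""
    out = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns pass through untouched
        if df[col].nunique() <= max_onehot:
            dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # integer codes follow the sorted order of the distinct values
            out[col] = df[col].astype("category").cat.codes
    return out


cars = pd.DataFrame({
    "fuel": ["gas", "diesel", "ev"] * 4,        # 3 categories  -> one-hot
    "vin": [f"v{i:02d}" for i in range(12)],    # 12 categories -> label codes
    "km": range(12),
})
print(encode_auto(cars).head())
```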

`feature_selection_method`

Value	Behaviour
`"all"`	Runs correlation + importance, unions results, then applies RFE
`"correlation"`	Drops features with pairwise correlation > 0.90
`"rfe"`	Recursive Feature Elimination with a RandomForest estimator
`"importance"`	Keeps the top-k features by RandomForest importance score
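
For the `"correlation"` option, a common way to drop highly correlated features is to scan the upper triangle of the absolute correlation matrix. The sketch below shows that approach with pandas; `drop_correlated` is a hypothetical helper, and both the use of absolute Pearson correlation and the choice of which column in a pair to drop are assumptions about how the 0.90 rule is applied.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.90):
    """Drop every column whose |Pearson r| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 3],
})
kept, dropped = drop_correlated(features)
print("dropped:", dropped)
```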

🤖 Models Trained

Classification (12 models)

Model	Library
Logistic Regression	scikit-learn
Decision Tree	scikit-learn
Random Forest	scikit-learn
Gradient Boosting	scikit-learn
AdaBoost	scikit-learn
Extra Trees	scikit-learn
SVM (RBF kernel)	scikit-learn
K-Nearest Neighbours	scikit-learn
Gaussian Naïve Bayes	scikit-learn
XGBoost	xgboost
LightGBM	lightgbm
CatBoost	catboost

Metrics: Accuracy · Precision · Recall · F1 · CV Score
Extras: Confusion matrix per model
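
For reference, the four test metrics all derive from the confusion matrix. The following pure-Python sketch covers the binary case only; `binary_metrics` is an illustrative helper, and how ds-eval-kit averages these metrics for multi-class targets is not stated here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # rows are actual classes, columns are predicted classes
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```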

Regression (13 models)

All of the above except Logistic Regression and Naïve Bayes (10 models) + Linear Regression · Ridge · Lasso

Metrics: R² · MAE · RMSE · CV Score
Extras: Actual vs Predicted scatter per model
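
The regression metrics follow the standard definitions, written out below in plain Python so the numbers in the results table are easy to sanity-check. `regression_metrics` is an illustrative helper, not a function exported by the package.

```python
import math


def regression_metrics(y_true, y_pred):
    """R², MAE and RMSE from paired actual and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "RMSE": math.sqrt(ss_res / n),
    }


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(regression_metrics(actual, predicted))
```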

📊 Accessing Results Programmatically

import pandas as pd
from ds_eval_kit import ml_process

pipe = ml_process(output_dir="ml_output")
result = pipe.run("data.csv", "target")   # a classification dataset

# Best classification model by test accuracy
df_res = pd.DataFrame(result["results"])
best = df_res.sort_values("Accuracy", ascending=False).iloc[0]
print(f"Best model: {best['Model']}  accuracy={best['Accuracy']:.4f}")

# Load the cleaned CSV for further work
clean = pd.read_csv(result["csv_files"]["02_cleaned_data.csv"])

# Use the train DataFrame directly (no disk I/O)
train_df = result["dataframes"]["train"]

Disabling CSV export

pipe = ml_process(export_csv=False)   # HTML reports only, no CSVs

📁 Supported Dataset Formats

Extension	Format
`.csv`	Comma-separated values
`.xlsx` / `.xls`	Microsoft Excel
`.json`	JSON (records or columns orientation)
`.parquet`	Apache Parquet
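
Loading by extension maps naturally onto the pandas readers. The sketch below shows one way such a dispatch could look; `load_dataset` and `READERS` are hypothetical names, and the package's real loader may pass extra options to each reader.

```python
from pathlib import Path

import pandas as pd

# extension → pandas reader, mirroring the table above
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_dataset(path):
    """Hypothetical helper: pick a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return READERS[suffix](path)
```

Note that pandas reads legacy `.xls` files through the separate `xlrd` package, which does not appear in the requirements list; how ds-eval-kit handles that case is not stated here.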

🧪 Running Tests

pip install "ds-eval-kit[dev]"   # quotes keep shells such as zsh from expanding the brackets
pytest ds_eval_kit/tests/ -v

📖 Examples

Titanic (classification)

from ds_eval_kit import ml_process

pipe = ml_process(output_dir="titanic_output", export_csv=True)
result = pipe.run("titanic.csv", target="Survived", problem_type="classification")

House prices (regression)

from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="house_output",
    scaling_method="robust",
    feature_selection_method="importance",
)
result = pipe.run("house_prices.csv", target="SalePrice", problem_type="regression")

Custom split & folds

from ds_eval_kit import ml_process

pipe = ml_process(test_size=0.25, cv_folds=10, random_state=0)
result = pipe.run("data.csv", "label")
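
To get a feel for what these two settings mean in rows, the sketch below reproduces the arithmetic scikit-learn's `train_test_split` and `KFold` use. It assumes ds-eval-kit delegates to those utilities and cross-validates on the training portion only, neither of which is confirmed here; `split_sizes` is an illustrative name.

```python
import math


def split_sizes(n_rows, test_size=0.2, cv_folds=5):
    """Row counts for a hold-out split followed by k-fold CV on the training rows."""
    n_test = math.ceil(test_size * n_rows)     # the test share is rounded up
    n_train = n_rows - n_test
    base, extra = divmod(n_train, cv_folds)    # the first `extra` folds get one more row
    fold_sizes = [base + 1] * extra + [base] * (cv_folds - extra)
    return n_train, n_test, fold_sizes


n_train, n_test, folds = split_sizes(1003, test_size=0.25, cv_folds=10)
print(n_train, n_test, folds)
```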

📝 License

MIT — see LICENSE.

🙌 Contributing

Pull requests are welcome. Please run `ruff check .` and `black .` before submitting.

Project details


This version: 1.1.0 (Jun 28, 2026)

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

ds_eval_kit-1.1.0-py3-none-any.whl (37.3 kB)

Uploaded Jun 28, 2026 Python 3

File details

Details for the file ds_eval_kit-1.1.0-py3-none-any.whl.

File metadata

Download URL: ds_eval_kit-1.1.0-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 37.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for ds_eval_kit-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9140db4d8fb8695ee4652d924f55a070ef7fc9a04d2649ba5bd39c05a9674258`
MD5	`bee92b88aa37d0146717d6720afdfab3`
BLAKE2b-256	`a241781d29bf83c6a4fd28b9b5579a383e2dba7b51765dd8843bd29d34bcec7e`
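
A downloaded wheel can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: `sha256_of` and `verify` are illustrative helper names, and the wheel path is whatever location you saved the file to.

```python
import hashlib

# SHA256 digest copied from the table above
EXPECTED_SHA256 = "9140db4d8fb8695ee4652d924f55a070ef7fc9a04d2649ba5bd39c05a9674258"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("ds_eval_kit-1.1.0-py3-none-any.whl")` should return `True` for an intact download of this release.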
