Automated end-to-end ML pipeline: clean → EDA → feature selection → train 12/13 models → HTML reports + CSV snapshots

Project description

ds-eval-kit

Automated end-to-end Machine Learning Pipeline — one import, one call.

ds-eval-kit takes a raw dataset and delivers cleaned data, interactive HTML reports, and trained model comparisons — plus a full CSV trail of every pipeline stage — without writing a single line of boilerplate.


✨ Features

Stage             | What it does                                                      | Output
Load              | CSV · Excel · JSON · Parquet                                      |
Clean             | Null imputation · duplicate removal · type coercion               | 02_cleaned_data.csv
EDA               | Distribution · correlation · missing-value charts                 | eda_report.html
Encode            | One-hot / label / ordinal (auto-detected)                         |
Outliers          | IQR / Z-score clipping                                            |
Scale             | Standard · MinMax · Robust                                        | 03_encoded_scaled_data.csv
Feature selection | Correlation + RFE + importance (union)                            | 04_selected_features.csv · feature_selection_report.html
Split             | Stratified train/test                                             | 05_train_set.csv · 06_test_set.csv
Train             | 12 classifiers or 13 regressors with cross-validation             | 07_model_results.csv
Report            | Side-by-side model comparison with confusion matrices / residuals | model_accuracy_report.html
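
The Split stage is described as stratified. As an illustration of what that usually means, here is a minimal standard-library sketch that holds out the same fraction of every class. The function name stratified_split and its rounding rule are assumptions for this example, not ds-eval-kit internals.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share.

    Hypothetical helper for illustration; not part of ds-eval-kit.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class hold-out size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "b" for i in test_idx), "rows of class 'b' in the test set")
```

With an 80/20 class mix and test_size=0.2, the test set keeps the same 80/20 mix, which is the point of stratifying.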

📦 Installation

pip install ds-eval-kit

Requirements

Python ≥ 3.12 and the following packages (all installed automatically):

pandas>=2.0   numpy>=1.26   scikit-learn>=1.3   matplotlib>=3.7
seaborn>=0.12  plotly>=5.15  scipy>=1.11  statsmodels>=0.14
xgboost>=2.0  lightgbm>=4.0  catboost>=1.2  pyarrow>=14.0  openpyxl>=3.1

🚀 Quickstart

Interactive mode (zero code)

from ds_eval_kit import ml_process

pipe = ml_process(output_dir="ml_output")
pipe.get_ml()
# → prompts: dataset path, target column, classification or regression

Programmatic mode

from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="ml_output",   # all files saved here
    export_csv=True,           # save 7 CSV snapshots (default: True)
    scaling_method="standard",
    handle_outliers=True,
)

result = pipe.run(
    dataset_path="used_cars.csv",
    target="price",
    problem_type="regression",   # or "classification" or None (auto-detect)
)

# HTML reports
print(result["eda_report"])       # → ml_output/eda_report.html
print(result["feature_report"])   # → ml_output/feature_selection_report.html
print(result["model_report"])     # → ml_output/model_accuracy_report.html

# CSV snapshots
for name, path in result["csv_files"].items():
    print(f"{name}: {path}")

# Intermediate DataFrames (in memory)
raw_df       = result["dataframes"]["raw"]
clean_df     = result["dataframes"]["cleaned"]
processed_df = result["dataframes"]["processed"]
selected_df  = result["dataframes"]["selected"]
train_df     = result["dataframes"]["train"]
test_df      = result["dataframes"]["test"]

# Model metrics table
import pandas as pd
metrics = pd.DataFrame(result["results"])
print(metrics.sort_values("R2", ascending=False))

📂 Output Files

After running the pipeline you will find the following files inside output_dir:

ml_output/
├── 01_raw_data.csv                  ← original loaded dataset
├── 02_cleaned_data.csv              ← after null/duplicate/type fixes
├── 03_encoded_scaled_data.csv       ← after encoding + outlier clip + scaling
├── 04_selected_features.csv         ← after feature selection
├── 05_train_set.csv                 ← X_train + y (target) column
├── 06_test_set.csv                  ← X_test  + y (target) column
├── 07_model_results.csv             ← all model metrics in one table
├── eda_report.html                  ← interactive EDA charts
├── feature_selection_report.html    ← feature importance & VIF table
└── model_accuracy_report.html       ← model comparison dashboard
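
A quick way to confirm a run produced everything in the tree above is to compare the directory against the expected names. This is a small sketch using only pathlib; the helper missing_outputs is hypothetical, and the file names are copied from the listing above.

```python
from pathlib import Path

# File names taken from the output tree in this README.
EXPECTED_FILES = [
    "01_raw_data.csv",
    "02_cleaned_data.csv",
    "03_encoded_scaled_data.csv",
    "04_selected_features.csv",
    "05_train_set.csv",
    "06_test_set.csv",
    "07_model_results.csv",
    "eda_report.html",
    "feature_selection_report.html",
    "model_accuracy_report.html",
]


def missing_outputs(output_dir, expect_csv=True):
    """List expected files that are absent from output_dir.

    Pass expect_csv=False when the pipeline ran with export_csv=False,
    so only the three HTML reports are checked.
    """
    out = Path(output_dir)
    names = [n for n in EXPECTED_FILES if expect_csv or not n.endswith(".csv")]
    return [n for n in names if not (out / n).is_file()]


print(missing_outputs("ml_output"))
```

An empty list means every expected file is present.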

CSV snapshot reference

File                       | Stage                   | Key columns
01_raw_data.csv            | Raw load                | original columns
02_cleaned_data.csv        | After cleaning          | original columns, nulls filled
03_encoded_scaled_data.csv | After preprocessing     | numeric columns only
04_selected_features.csv   | After feature selection | selected columns + target
05_train_set.csv           | Train split             | features + target
06_test_set.csv            | Test split              | features + target
07_model_results.csv       | Model results           | Model, CV Score, Test metrics…

⚙️ Configuration Reference

ml_process(
    test_size               = 0.2,          # fraction held out for testing
    random_state            = 42,           # global seed
    cv_folds                = 5,            # cross-validation folds
    handle_outliers         = True,         # IQR clipping
    scaling_method          = "standard",   # "standard" | "minmax" | "robust"
    encoding_method         = "auto",       # "auto" | "onehot" | "label" | "ordinal"
    feature_selection_method= "all",        # "all" | "correlation" | "rfe" | "importance"
    generate_plots          = True,         # include charts in HTML reports
    output_dir              = ".",          # where to save all output files
    export_csv              = True,         # save CSV snapshots at every stage
)
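
The handle_outliers option is annotated as IQR clipping. A minimal sketch of that technique is shown here, assuming the common 1.5 × IQR fences; the multiplier and the helper name clip_iqr are assumptions, and the package may use different bounds.

```python
import statistics


def clip_iqr(values, k=1.5):
    """Clip each value into [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; k=1.5 is the conventional choice, not a
    documented ds-eval-kit default.
    """
    # "inclusive" uses linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


data = [10, 12, 11, 13, 12, 11, 95]
print(clip_iqr(data))
```

Clipping keeps every row and only pulls extreme values back to the fence, unlike outlier removal, which drops rows.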

scaling_method

Value      | Algorithm                 | Best for
"standard" | StandardScaler (z-score)  | Most cases, SVM, linear models
"minmax"   | MinMaxScaler (0–1)        | Neural networks, KNN
"robust"   | RobustScaler (median/IQR) | Data with many outliers

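To make the difference between the three scalers concrete, here is a standard-library sketch of the formulas they are named after, applied to one column containing an outlier. The helper names are illustrative only, and the sketch ignores constant columns (which would divide by zero).

```python
import statistics


def standard_scale(xs):
    """z-score: subtract the mean, divide by the population std."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def minmax_scale(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Subtract the median, divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


xs = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
print(minmax_scale(xs))
print(robust_scale(xs))
```

With min-max scaling the outlier squeezes the four ordinary values into the bottom few percent of the range, while the robust version keeps them evenly spaced around zero. That is why "robust" is suggested for data with many outliers.
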
encoding_method

Value     | Behaviour
"auto"    | One-hot for ≤ 10 categories, label encoding otherwise
"onehot"  | Always one-hot
"label"   | Always label encoding
"ordinal" | Ordinal encoding (preserves order)

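The "auto" rule can be sketched in a few lines. The helpers below are hypothetical and only mirror the rule as stated in the table; how the package counts missing values or orders label codes is not documented here, so alphabetical ordering is an assumption.

```python
def choose_encoding(values, max_onehot=10):
    """Apply the documented "auto" rule: one-hot up to 10 categories."""
    n_categories = len(set(values))
    return "onehot" if n_categories <= max_onehot else "label"


def one_hot(values):
    """One 0/1 column per category (categories sorted alphabetically)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def label_encode(values):
    """Map each category to an integer code (alphabetical order assumed)."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


colours = ["red", "green", "blue", "green"]
print(choose_encoding(colours))
print(one_hot(colours))
print(label_encode(colours))
```

One-hot encoding adds a column per category, which is why a cap on the category count is a sensible default for high-cardinality columns.
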
feature_selection_method

Value         | Behaviour
"all"         | Runs correlation + importance, unions results, then applies RFE
"correlation" | Drops features with pairwise correlation > 0.90
"rfe"         | Recursive Feature Elimination with a RandomForest estimator
"importance"  | Keeps the top-k features by RandomForest importance score

🤖 Models Trained

Classification (12 models)

Model                | Library
Logistic Regression  | scikit-learn
Decision Tree        | scikit-learn
Random Forest        | scikit-learn
Gradient Boosting    | scikit-learn
AdaBoost             | scikit-learn
Extra Trees          | scikit-learn
SVM (RBF kernel)     | scikit-learn
K-Nearest Neighbours | scikit-learn
Gaussian Naïve Bayes | scikit-learn
XGBoost              | xgboost
LightGBM             | lightgbm
CatBoost             | catboost

Metrics: Accuracy · Precision · Recall · F1 · CV Score
Extras: Confusion matrix per model
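
For reference, the four test metrics can be derived from a confusion matrix. The sketch below covers the binary case only; how ds-eval-kit averages precision, recall, and F1 for multi-class targets is not stated here, so treat this as a definition of the terms rather than the package's exact computation.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1, and the 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        # Rows are true classes, columns are predicted classes.
        "confusion": [[tn, fp], [fn, tp]],
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```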

Regression (13 models)

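The regression counterparts of the models above, excluding Logistic Regression and Gaussian Naïve Bayes (neither has a regression variant), plus Linear Regression · Ridge · Lasso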

Metrics: R² · MAE · RMSE · CV Score
Extras: Actual vs Predicted scatter per model
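
The three regression metrics follow standard definitions, sketched here with the standard library. The function name is illustrative, and the dictionary keys simply reuse the labels from the list above.

```python
import math


def regression_metrics(y_true, y_pred):
    """R² (coefficient of determination), MAE, and RMSE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "RMSE": math.sqrt(ss_res / n),
    }


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(regression_metrics(y_true, y_pred))
```

R² is unitless and higher is better, while MAE and RMSE are in the units of the target and lower is better. RMSE penalises large errors more heavily than MAE.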


📊 Accessing Results Programmatically

result = pipe.run("data.csv", "target")

# Best classification model by test accuracy
import pandas as pd
df_res = pd.DataFrame(result["results"])
best = df_res.sort_values("Accuracy", ascending=False).iloc[0]
print(f"Best model: {best['Model']}  accuracy={best['Accuracy']:.4f}")

# Load the cleaned CSV for further work
clean = pd.read_csv(result["csv_files"]["02_cleaned_data.csv"])

# Use the train DataFrame directly (no disk I/O)
train_df = result["dataframes"]["train"]

Disabling CSV export

pipe = ml_process(export_csv=False)   # HTML reports only, no CSVs

📁 Supported Dataset Formats

Extension    | Format
.csv         | Comma-separated values
.xlsx / .xls | Microsoft Excel
.json        | JSON (records or columns orientation)
.parquet     | Apache Parquet
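
If you want to validate a path before handing it to the pipeline, the extension table can be expressed as a lookup. This is a hypothetical sketch: the mapping to pandas reader names (read_csv, read_excel, read_json, read_parquet) reflects what those pandas functions handle, not necessarily how ds-eval-kit loads files internally.

```python
from pathlib import Path

# Extension -> name of the pandas reader that handles that format.
READERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".xls": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
}


def reader_name(path):
    """Return the pandas reader name for a dataset path, or raise ValueError."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix not in READERS:
        raise ValueError(f"Unsupported dataset format: {suffix or '(no extension)'}")
    return READERS[suffix]


print(reader_name("used_cars.csv"))
print(reader_name("sales/2024.PARQUET"))
```

With pandas imported as pd, the matching loader could then be called as getattr(pd, reader_name(path))(path).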

🧪 Running Tests

pip install "ds-eval-kit[dev]"   # quotes stop shells such as zsh from expanding the brackets
pytest ds_eval_kit/tests/ -v

📖 Examples

Titanic (classification)

from ds_eval_kit import ml_process

pipe = ml_process(output_dir="titanic_output", export_csv=True)
result = pipe.run("titanic.csv", target="Survived", problem_type="classification")

House prices (regression)

from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="house_output",
    scaling_method="robust",
    feature_selection_method="importance",
)
result = pipe.run("house_prices.csv", target="SalePrice", problem_type="regression")

Custom split & folds

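from ds_eval_kit import ml_process

# Hold out 25% for testing, use 10-fold cross-validation, and fix the seed
pipe = ml_process(test_size=0.25, cv_folds=10, random_state=0)
result = pipe.run("data.csv", "label")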

📝 License

MIT — see LICENSE.


🙌 Contributing

Pull requests are welcome. Please run ruff check . and black . before submitting.

Built Distribution

ds_eval_kit-1.1.0-py3-none-any.whl (37.3 kB)

Uploaded Python 3

File details

Details for the file ds_eval_kit-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: ds_eval_kit-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 37.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for ds_eval_kit-1.1.0-py3-none-any.whl
Algorithm   | Hash digest
SHA256      | 9140db4d8fb8695ee4652d924f55a070ef7fc9a04d2649ba5bd39c05a9674258
MD5         | bee92b88aa37d0146717d6720afdfab3
BLAKE2b-256 | a241781d29bf83c6a4fd28b9b5579a383e2dba7b51765dd8843bd29d34bcec7e
