Automated end-to-end ML pipeline: clean → EDA → feature selection → train 12/13 models → HTML reports + CSV snapshots
ds-eval-kit
Automated end-to-end Machine Learning Pipeline — one import, one call.
ds-eval-kit takes a raw dataset and delivers cleaned data, interactive HTML reports, and trained model comparisons, plus a CSV snapshot of every pipeline stage, without hand-written preprocessing or training boilerplate.
✨ Features
| Stage | What it does | Output |
|---|---|---|
| Load | CSV · Excel · JSON · Parquet | — |
| Clean | Null imputation · duplicate removal · type coercion | 02_cleaned_data.csv |
| EDA | Distribution · correlation · missing-value charts | eda_report.html |
| Encode | One-hot / label / ordinal (auto-detected) | — |
| Outliers | IQR / Z-score clipping | — |
| Scale | Standard · MinMax · Robust | 03_encoded_scaled_data.csv |
| Feature selection | Correlation + RFE + importance (union) | 04_selected_features.csv · feature_selection_report.html |
| Split | Stratified train/test | 05_train_set.csv · 06_test_set.csv |
| Train | 12 classifiers or 13 regressors with cross-validation | 07_model_results.csv |
| Report | Side-by-side model comparison with confusion matrices / residuals | model_accuracy_report.html |
📦 Installation
pip install ds-eval-kit
Requirements
Python ≥ 3.12 and the following packages (all installed automatically):
pandas>=2.0 numpy>=1.26 scikit-learn>=1.3 matplotlib>=3.7
seaborn>=0.12 plotly>=5.15 scipy>=1.11 statsmodels>=0.14
xgboost>=2.0 lightgbm>=4.0 catboost>=1.2 pyarrow>=14.0 openpyxl>=3.1
🚀 Quickstart
Interactive mode (zero code)
from ds_eval_kit import ml_process
pipe = ml_process(output_dir="ml_output")
pipe.get_ml()
# → prompts: dataset path, target column, classification or regression
Programmatic mode
from ds_eval_kit import ml_process
pipe = ml_process(
    output_dir="ml_output",     # all files saved here
    export_csv=True,            # save 7 CSV snapshots (default: True)
    scaling_method="standard",
    handle_outliers=True,
)
result = pipe.run(
dataset_path="used_cars.csv",
target="price",
problem_type="regression", # or "classification" or None (auto-detect)
)
# HTML reports
print(result["eda_report"]) # → ml_output/eda_report.html
print(result["feature_report"]) # → ml_output/feature_selection_report.html
print(result["model_report"]) # → ml_output/model_accuracy_report.html
# CSV snapshots
for name, path in result["csv_files"].items():
    print(f"{name} → {path}")
# Intermediate DataFrames (in memory)
raw_df = result["dataframes"]["raw"]
clean_df = result["dataframes"]["cleaned"]
processed_df = result["dataframes"]["processed"]
selected_df = result["dataframes"]["selected"]
train_df = result["dataframes"]["train"]
test_df = result["dataframes"]["test"]
# Model metrics table
import pandas as pd
metrics = pd.DataFrame(result["results"])
print(metrics.sort_values("R2", ascending=False))
📂 Output Files
After running the pipeline you will find the following files inside output_dir:
ml_output/
├── 01_raw_data.csv ← original loaded dataset
├── 02_cleaned_data.csv ← after null/duplicate/type fixes
├── 03_encoded_scaled_data.csv ← after encoding + outlier clip + scaling
├── 04_selected_features.csv ← after feature selection
├── 05_train_set.csv ← X_train + y (target) column
├── 06_test_set.csv ← X_test + y (target) column
├── 07_model_results.csv ← all model metrics in one table
├── eda_report.html ← interactive EDA charts
├── feature_selection_report.html ← feature importance & VIF table
└── model_accuracy_report.html ← model comparison dashboard
CSV snapshot reference
| File | Stage | Key columns |
|---|---|---|
| 01_raw_data.csv | Raw load | original columns |
| 02_cleaned_data.csv | After cleaning | original columns, nulls filled |
| 03_encoded_scaled_data.csv | After preprocessing | numeric columns only |
| 04_selected_features.csv | After feature selection | selected columns + target |
| 05_train_set.csv | Train split | features + target |
| 06_test_set.csv | Test split | features + target |
| 07_model_results.csv | Model results | Model, CV Score, Test metrics… |
⚙️ Configuration Reference
ml_process(
    test_size=0.2,                   # fraction held out for testing
    random_state=42,                 # global seed
    cv_folds=5,                      # cross-validation folds
    handle_outliers=True,            # IQR clipping
    scaling_method="standard",       # "standard" | "minmax" | "robust"
    encoding_method="auto",          # "auto" | "onehot" | "label" | "ordinal"
    feature_selection_method="all",  # "all" | "correlation" | "rfe" | "importance"
    generate_plots=True,             # include charts in HTML reports
    output_dir=".",                  # where to save all output files
    export_csv=True,                 # save CSV snapshots at every stage
)
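The handle_outliers comment above describes IQR clipping. A minimal sketch of that technique in pandas, assuming the conventional 1.5 × IQR fences; the helper name clip_outliers_iqr and the multiplier k are illustrative and are not part of the ds-eval-kit API:

```python
import pandas as pd


def clip_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip every numeric column to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        # Values outside the fences are pulled back to the fence, not dropped.
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Hypothetical data: one extreme price among otherwise similar values.
demo = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 500.0]})
clipped = clip_outliers_iqr(demo)
print(clipped)
```

Clipping keeps the row count unchanged, which is one reason it is often preferred over dropping outlier rows in an automated pipeline.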
scaling_method
| Value | Algorithm | Best for |
|---|---|---|
| "standard" | StandardScaler (z-score) | Most cases, SVM, linear models |
| "minmax" | MinMaxScaler (0–1) | Neural networks, KNN |
| "robust" | RobustScaler (median/IQR) | Data with many outliers |
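To see how the three options differ on data with an outlier, the scikit-learn scalers named in the table can be compared directly. This is a sketch only: the SCALERS mapping and the scale helper are hypothetical names, and how ds-eval-kit wires the option internally is not shown here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical mapping from the scaling_method string to a scaler class.
SCALERS = {
    "standard": StandardScaler,
    "minmax": MinMaxScaler,
    "robust": RobustScaler,
}


def scale(X: np.ndarray, method: str = "standard") -> np.ndarray:
    """Fit the chosen scaler on X and return the transformed array."""
    if method not in SCALERS:
        raise ValueError(f"Unknown scaling method: {method!r}")
    return SCALERS[method]().fit_transform(X)


# One feature with a single large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
standard = scale(X, "standard")
minmax = scale(X, "minmax")
robust = scale(X, "robust")

for name, arr in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(name, arr.ravel().round(3))
```

With MinMaxScaler the outlier squeezes the other four values close to 0, while RobustScaler centres on the median and keeps them spread out, which matches the "data with many outliers" recommendation.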
encoding_method
| Value | Behaviour |
|---|---|
| "auto" | One-hot for ≤ 10 categories, label encoding otherwise |
| "onehot" | Always one-hot |
| "label" | Always label encoding |
| "ordinal" | Ordinal encoding (preserves order) |
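The "auto" rule can be illustrated with plain pandas. The sketch below assumes "≤ 10 categories" means at most 10 distinct values in a column; auto_encode and max_onehot are hypothetical names, and integer category codes stand in for label encoding.

```python
import pandas as pd


def auto_encode(df: pd.DataFrame, max_onehot: int = 10) -> pd.DataFrame:
    """One-hot encode low-cardinality columns, integer-code the rest."""
    out = df.copy()
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    for col in cat_cols:
        if out[col].nunique() <= max_onehot:
            # Low cardinality: one indicator column per category.
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # High cardinality: a single integer code per category.
            out[col] = out[col].astype("category").cat.codes
    return out


# Hypothetical data: "fuel" has 3 categories, "model" has 12.
cars = pd.DataFrame(
    {
        "fuel": ["petrol", "diesel", "electric"] * 4,
        "model": [f"m{i}" for i in range(12)],
        "price": [float(1000 + i) for i in range(12)],
    }
)
encoded = auto_encode(cars)
print(encoded.columns.tolist())
```

The threshold keeps the column count bounded: a 12-category column stays a single column instead of expanding into 12 indicators.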
feature_selection_method
| Value | Behaviour |
|---|---|
| "all" | Runs correlation + importance, unions results, then applies RFE |
| "correlation" | Drops features with pairwise correlation > 0.90 |
| "rfe" | Recursive Feature Elimination with a RandomForest estimator |
| "importance" | Keeps the top-k features by RandomForest importance score |
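As an illustration of the "correlation" rule, the sketch below drops one feature from every pair whose absolute Pearson correlation exceeds 0.90. Which member of a pair ds-eval-kit discards is not stated above, so keeping the earlier column is an assumption, and drop_correlated is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def drop_correlated(X: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Drop the later column of each pair with |correlation| > threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + 1,              # exact linear copy of "a"
        "c": rng.normal(size=200),   # independent feature
    }
)
reduced = drop_correlated(features)
print(reduced.columns.tolist())
```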
🤖 Models Trained
Classification (12 models)
| Model | Library |
|---|---|
| Logistic Regression | scikit-learn |
| Decision Tree | scikit-learn |
| Random Forest | scikit-learn |
| Gradient Boosting | scikit-learn |
| AdaBoost | scikit-learn |
| Extra Trees | scikit-learn |
| SVM (RBF kernel) | scikit-learn |
| K-Nearest Neighbours | scikit-learn |
| Gaussian Naïve Bayes | scikit-learn |
| XGBoost | xgboost |
| LightGBM | lightgbm |
| CatBoost | catboost |
Metrics: Accuracy · Precision · Recall · F1 · CV Score
Extras: Confusion matrix per model
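For readers who want to reproduce the classification metrics by hand, here is a minimal scikit-learn sketch on a hypothetical binary problem. The averaging strategy ds-eval-kit uses for multi-class targets is not stated above, so only the binary case is shown.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions for eight test rows.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

clf_metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
}
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(clf_metrics)
print(cm)
```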
Regression (13 models)
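The regressor counterparts of the models above, excluding Logistic Regression and Gaussian Naïve Bayes (which have no regression variant), plus Linear Regression · Ridge · Lasso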
Metrics: R² · MAE · RMSE · CV Score
Extras: Actual vs Predicted scatter per model
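The regression metrics can likewise be checked by hand with scikit-learn. The values below are hypothetical, and RMSE is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions for four test rows.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

reg_metrics = {
    "R2": r2_score(y_true, y_pred),
    "MAE": mean_absolute_error(y_true, y_pred),
    "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
}
print(reg_metrics)
```

RMSE penalises the single 1.0 error more heavily than MAE does, which is why the two are usually reported together.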
📊 Accessing Results Programmatically
result = pipe.run("data.csv", "target")
# Best classification model by test accuracy
import pandas as pd
df_res = pd.DataFrame(result["results"])
best = df_res.sort_values("Accuracy", ascending=False).iloc[0]
print(f"Best model: {best['Model']} accuracy={best['Accuracy']:.4f}")
# Load the cleaned CSV for further work
clean = pd.read_csv(result["csv_files"]["02_cleaned_data.csv"])
# Use the train DataFrame directly (no disk I/O)
train_df = result["dataframes"]["train"]
Disabling CSV export
pipe = ml_process(export_csv=False) # HTML reports only, no CSVs
📁 Supported Dataset Formats
| Extension | Format |
|---|---|
| .csv | Comma-separated values |
| .xlsx / .xls | Microsoft Excel |
| .json | JSON (records or columns orientation) |
| .parquet | Apache Parquet |
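Loading by extension can be pictured as a small dispatch table over the pandas readers. This is a sketch under that assumption; READERS and load_dataset are hypothetical names, not the loader ds-eval-kit ships.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_dataset(path) -> pd.DataFrame:
    """Pick a reader from the file extension and load the dataset."""
    suffix = Path(path).suffix.lower()
    reader = READERS.get(suffix)
    if reader is None:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return reader(path)


# Round-trip a tiny CSV through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    pd.DataFrame({"x": [1, 2], "y": [3, 4]}).to_csv(csv_path, index=False)
    loaded = load_dataset(csv_path)

print(loaded)
```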
🧪 Running Tests
pip install "ds-eval-kit[dev]"
pytest ds_eval_kit/tests/ -v
📖 Examples
Titanic (classification)
from ds_eval_kit import ml_process
pipe = ml_process(output_dir="titanic_output", export_csv=True)
result = pipe.run("titanic.csv", target="Survived", problem_type="classification")
House prices (regression)
from ds_eval_kit import ml_process
pipe = ml_process(
    output_dir="house_output",
    scaling_method="robust",
    feature_selection_method="importance",
)
result = pipe.run("house_prices.csv", target="SalePrice", problem_type="regression")
Custom split & folds
pipe = ml_process(test_size=0.25, cv_folds=10, random_state=0)
result = pipe.run("data.csv", "label")
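For orientation, test_size and cv_folds correspond to familiar scikit-learn concepts. The sketch below performs a stratified 75/25 split and 10-fold cross-validation by hand on synthetic data, with LogisticRegression as a stand-in model; it illustrates the settings and is not the library's internal code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (200 rows, 8 features).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# test_size=0.25 holds out a quarter of the rows; stratify keeps class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# cv=10 yields one validation score per fold, computed on the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV mean={cv_scores.mean():.3f}  test accuracy={test_accuracy:.3f}")
```

More folds give a steadier cross-validation estimate at the cost of proportionally more model fits.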
📝 License
MIT — see LICENSE.
🙌 Contributing
Pull requests are welcome. Please run ruff check . and black . before submitting.