pymlpipeline
End-to-End ML Pipeline — data cleaning, model training, evaluation and prediction.
Works with GCP (BigQuery / GCS / Vertex AI) and local CSV files from the same config.
Features
| | Preprocessor | Model Builder | Predictor |
|---|---|---|---|
| Data input | BigQuery, Local CSV, Demo | BigQuery, GCS, Local CSV | BigQuery, GCS, Local CSV |
| Output | Cleaned CSV + BQ table + GCS | Trained .pkl models + GCS | Predictions CSV |
| Report | Word .docx preprocessing report | Word .docx model evaluation report | — |
| Environment | GCP or local (auto-detected) | GCP or local | GCP or local |
Preprocessor
- Reads from BigQuery (4 query modes) or local CSV
- Full column profile CSV uploaded to GCS before target selection
- Target encoding, stratified reload, identifier sidecar
- Keyword drop, high-null drop, dtype normalisation, imputation, outlier handling
- EDA charts, correlation filter, one-hot/label encoding, normalisation
- Writes output to BigQuery and/or local folder
Model Builder
- Reads from BigQuery output table or local CSV
- 81 scikit-learn models (33 classifiers, 25 regressors, 23 clusterers), plus XGBoost, LightGBM and CatBoost when installed
- 5-method feature importance (MI, F-stat, Random Forest, Permutation, RFE)
- Correlation-based top-N feature selection
- Full evaluation: AUC-ROC, PR curve, MCC, Kappa, Log-Loss, Brier score, calibration plot, learning curve
- AI-generated training script via Gemini 2.5 Pro on Vertex AI (no API key)
- Saves all .pkl models + best_model.pkl + predict.py to GCS and locally (see the loading sketch below)
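Every trained estimator is pickled individually, which makes a finished run easy to inspect outside the CLI. A minimal sketch, assuming the .pkl files are joblib pickles of fitted estimators and the run-folder layout from Quick Start below (the generated predict.py remains the authoritative loader):

```python
from pathlib import Path

import joblib

# Assumption: run folder layout matches the Quick Start example below.
models_dir = Path("ml_model_output/2026-03-21_14-35-00/models")

# Load every pickled artifact from the run; best_model.pkl is among them.
for pkl in sorted(models_dir.glob("*.pkl")):
    estimator = joblib.load(pkl)
    print(f"{pkl.name}: {type(estimator).__name__}")
```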
Installation
```bash
# Core only (local CSV, no GCP)
pip install pymlpipeline

# With GCP support (BigQuery + GCS)
pip install "pymlpipeline[gcp]"

# With Vertex AI / Gemini code generation
pip install "pymlpipeline[gcp,vertex]"

# With XGBoost, LightGBM, CatBoost
pip install "pymlpipeline[gcp,vertex,boosting]"

# Everything
pip install "pymlpipeline[all]"
```
Python 3.10+ required.
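Before running against GCP, it can help to confirm which optional extras actually resolved; this check is plain Python and not pymlpipeline-specific (the extra-to-package mapping is assumed from the install commands above):

```python
import importlib.util

def can_import(name: str) -> bool:
    """Return True if the module is importable in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

# Assumed mapping from each extra to the packages it pulls in.
extras = {
    "gcp": ["google.cloud.bigquery", "google.cloud.storage"],
    "vertex": ["vertexai"],
    "boosting": ["xgboost", "lightgbm", "catboost"],
}

for extra, modules in extras.items():
    missing = [m for m in modules if not can_import(m)]
    print(f"[{extra}] {'OK' if not missing else 'missing: ' + ', '.join(missing)}")
```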
Quick Start
1 · Initialise config
```bash
pymlpipeline init
# Creates pipeline_config.yaml in the current directory
# Edit it for your environment (see Configuration below)
```
2 · Preprocess data
```bash
pymlpipeline preprocess --config pipeline_config.yaml
```
Outputs:
```text
ml_pipeline_output/2026-03-21_14-30-00/
  profile/  column_profile.csv            ← review this first
  output/   processed_output.csv
  report/   ML_Preprocessing_Report.docx
  charts/   *.png
```
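column_profile.csv is the checkpoint to read before choosing a target. A quick look with pandas (the exact profile columns are an assumption about the report layout, so adjust to what you see):

```python
import pandas as pd

profile = pd.read_csv(
    "ml_pipeline_output/2026-03-21_14-30-00/profile/column_profile.csv"
)
print(profile.shape)
# Typically one row per input column, with stats such as dtype and null counts.
print(profile.head(20))
```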
3 · Train models
```bash
pymlpipeline build --config pipeline_config.yaml
```
Outputs:
```text
ml_model_output/2026-03-21_14-35-00/
  models/  *.pkl  best_model.pkl  scaler.pkl  predict.py
  charts/  confusion matrix, ROC, PR, learning curve, calibration, ...
  report/  ML_Model_Report.docx  results.json
  code/    model_training_code.py  gemini_prompt.txt
```
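results.json collects the per-model scores behind the report; a sketch for ranking models from it, assuming a {model_name: {metric: value}} layout (the real schema may differ):

```python
import json

with open("ml_model_output/2026-03-21_14-35-00/report/results.json") as f:
    results = json.load(f)

# Assumed layout: {"Random Forest": {"f1": 0.91, ...}, ...}; adapt if it differs.
ranked = sorted(results.items(), key=lambda kv: kv[1].get("f1", 0.0), reverse=True)
for name, metrics in ranked[:5]:
    print(f"{name}: f1={metrics.get('f1')}")
```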
4 · Predict on new data
```bash
# Local CSV
pymlpipeline predict \
  --model ml_model_output/.../models/best_model.pkl \
  --scaler ml_model_output/.../models/scaler.pkl \
  --data new_data.csv

# BigQuery table
pymlpipeline predict \
  --model models/best_model.pkl \
  --scaler models/scaler.pkl \
  --bq my-project.my_dataset.new_customers

# GCS file
pymlpipeline predict \
  --model models/best_model.pkl \
  --scaler models/scaler.pkl \
  --gcs gs://my-bucket/data/new_data.csv
```
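Under the hood the predict command is a plain load-transform-predict flow; roughly equivalent Python, assuming joblib pickles and new data that already carries the training feature columns:

```python
import joblib
import pandas as pd

run = "ml_model_output/2026-03-21_14-35-00"  # your run folder

model = joblib.load(f"{run}/models/best_model.pkl")
scaler = joblib.load(f"{run}/models/scaler.pkl")

# New data must contain the same feature columns used in training.
X_new = pd.read_csv("new_data.csv")
preds = model.predict(scaler.transform(X_new))

pd.DataFrame({"prediction": preds}).to_csv("predictions.csv", index=False)
```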
Configuration
A single pipeline_config.yaml controls both tools. Run pymlpipeline init to get a pre-filled template.
Environment
```yaml
pipeline:
  environment: "auto"        # auto | gcp | local
  data_source: "bigquery"    # bigquery | csv | demo
```
| environment | Behaviour |
|---|---|
| auto | GCP if google-cloud-* + ADC credentials are available, otherwise local |
| gcp | Force GCP mode — fail clearly if libraries/credentials are missing |
| local | Skip all GCP calls; read/write local files only |
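Auto-detection reduces to two questions: are the google-cloud client libraries importable, and can ADC produce credentials? A sketch of that logic (illustrative only; the package's actual check may differ):

```python
def detect_environment() -> str:
    """Return "gcp" when client libraries and ADC credentials are available."""
    try:
        import google.auth
        from google.cloud import bigquery  # noqa: F401  (presence check only)

        google.auth.default()  # raises DefaultCredentialsError without ADC
        return "gcp"
    except Exception:
        return "local"

print(detect_environment())
```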
Local mode (no GCP)
```yaml
pipeline:
  environment: "local"
  data_source: "csv"

local:
  csv_path: "/path/to/your/data.csv"   # single file
  csv_folder: ""                       # or point to a folder (newest CSV used)
  separator: ","
  encoding: "utf-8"
```
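When csv_folder is set, the newest CSV in that folder wins; the selection amounts to something like this (a sketch, not the package's code):

```python
from pathlib import Path

def newest_csv(folder: str) -> Path:
    """Pick the most recently modified *.csv file in a folder."""
    csvs = sorted(Path(folder).glob("*.csv"), key=lambda p: p.stat().st_mtime)
    if not csvs:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    return csvs[-1]
```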
GCP mode
```yaml
pipeline:
  environment: "gcp"
  data_source: "bigquery"

bigquery:
  project_id: "my-gcp-project"
  dataset_id: "my_dataset"
  table_id: "my_table"
  query_mode: "full_table"   # full_table | columns | filter | custom_sql

gcs:
  bucket: "my-ml-bucket"
  base_folder: "preprocessing/runs"
```
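For orientation, full_table mode corresponds to a SELECT * over the configured table; with the official BigQuery client that read looks like this (requires the [gcp] extra and ADC; an illustration, not the package's internal code):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
df = client.query(
    "SELECT * FROM `my-gcp-project.my_dataset.my_table`"
).to_dataframe()
print(df.shape)
```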
GCP authentication (no API key needed)
```bash
# Local development
gcloud auth application-default login

# CI / servers — set env var
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# GCE / Cloud Run / GKE — automatic, no setup needed
```
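To verify ADC is resolvable before kicking off a run, google-auth can report the active credentials and project:

```python
import google.auth

# Raises DefaultCredentialsError if no ADC source is configured.
credentials, project = google.auth.default()
print(f"ADC OK, billing project: {project}")
```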
Gemini 2.5 Pro code generation
```yaml
gemini:
  vertex_project: "my-gcp-project"    # billing target
  vertex_location: "us-central1"
```
No API key — uses ADC. Falls back to a static template if Vertex AI is unavailable.
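For reference, a minimal ADC-based Vertex AI call with the vertexai SDK; the model id and prompt here are illustrative, and pymlpipeline's actual invocation may differ:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-2.5-pro")  # model id is an assumption
response = model.generate_content("Write a scikit-learn training script for ...")
print(response.text)
```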
Python API
Both tools can be used programmatically:
```python
from pymlpipeline import run_pipeline, run_model_builder
from pymlpipeline import preprocessor_cfg, model_cfg

# Preprocessing
preprocessor_cfg.load("pipeline_config.yaml")
df_clean, df_ids, report_path = run_pipeline()

# Model building
model_cfg.load("pipeline_config.yaml")
run_model_builder()
```
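run_pipeline() returns the cleaned frame and the identifier sidecar separately; continuing the snippet above, identifiers can be rejoined when needed (assuming both frames share the original row index):

```python
# Continue from the snippet above: rejoin identifiers with cleaned features.
df_full = df_ids.join(df_clean)
df_full.to_csv("cleaned_with_ids.csv", index=False)
print(report_path)  # path to the generated .docx report
```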
Models Available
Classification (33 total + XGBoost/LightGBM/CatBoost when installed)
| Category | Models |
|---|---|
| 🚀 Boosting | Gradient Boosting, Hist GBM, AdaBoost, XGBoost, XGBoost(dart), LightGBM, LightGBM(DART/GOSS), CatBoost, CatBoost(balanced) |
| 🌲 Forest | Random Forest, Extra Trees |
| 📐 Linear | Logistic Regression (L1/L2), Ridge, SGD, Passive-Aggressive, Perceptron |
| ⚡ SVM | RBF, Linear, Poly, Nu-SVM, Linear SVC |
| 🧠 Neural | MLP (3 sizes) |
| 📍 KNN | k=3, k=5, k=11 |
| 📊 Naive Bayes | Gaussian, Bernoulli, Complement |
| Others | LDA, QDA, Gaussian Process, Label Spreading/Propagation |
Regression (25 + optional), Clustering (23 algorithms)
Full lists shown in the interactive model selection menu.
Supported Evaluation Metrics
Classification: Accuracy, Precision, Recall, F1 (weighted), ROC-AUC, Average Precision, MCC, Cohen's Kappa, Log-Loss, Brier Score, CV score
Regression: MAE, RMSE, R², MAPE, CV R²
Clustering: Silhouette, Calinski-Harabasz, Davies-Bouldin
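All of these are standard scikit-learn metrics, so the less common classification scores can be reproduced directly; a self-contained example with toy arrays:

```python
from sklearn.metrics import (
    brier_score_loss, cohen_kappa_score, log_loss, matthews_corrcoef,
)

y_true = [0, 1, 1, 0, 1]             # ground-truth labels
y_pred = [0, 1, 0, 0, 1]             # predicted labels
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]   # positive-class probabilities

print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("Kappa:   ", cohen_kappa_score(y_true, y_pred))
print("Log-Loss:", log_loss(y_true, y_prob))
print("Brier:   ", brier_score_loss(y_true, y_prob))
```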
License
MIT — see LICENSE