
pymlpipeline

End-to-End ML Pipeline — data cleaning, model training, evaluation and prediction.
Works with GCP (BigQuery / GCS / Vertex AI) and local CSV files from the same config.


Features

              Preprocessor                     Model Builder                       Predictor
Data input    BigQuery, Local CSV, Demo        BigQuery, GCS, Local CSV            BigQuery, GCS, Local CSV
Output        Cleaned CSV + BQ table + GCS     Trained .pkl models + GCS           Predictions CSV
Report        Word .docx preprocessing report  Word .docx model evaluation report  n/a
Environment   GCP or local (auto-detected)     GCP or local                        GCP or local

Preprocessor

  • Reads from BigQuery (4 query modes) or local CSV
  • Full column profile CSV uploaded to GCS before target selection
  • Target encoding, stratified reload, identifier sidecar
  • Keyword drop, high-null drop, dtype normalisation, imputation, outlier handling
  • EDA charts, correlation filter, one-hot/label encoding, normalisation
  • Writes output to BigQuery and/or local folder
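
The core cleaning steps above (high-null drop, median imputation, outlier handling) can be sketched in plain Python. This is an illustrative outline, not the package's internals; the 50% null threshold and the 1.5×IQR clipping rule are assumptions:

```python
import statistics

def clean_columns(data, null_threshold=0.5, iqr_k=1.5):
    """Drop high-null columns, impute medians, clip IQR outliers.

    `data` maps column name -> list of float-or-None values.
    """
    cleaned = {}
    for name, values in data.items():
        n = len(values)
        nulls = sum(v is None for v in values)
        if n == 0 or nulls / n > null_threshold:
            continue  # high-null drop: too many missing values
        present = sorted(v for v in values if v is not None)
        med = statistics.median(present)
        filled = [med if v is None else v for v in values]  # median imputation
        # Outlier handling: clip to [Q1 - k*IQR, Q3 + k*IQR]
        q1 = present[len(present) // 4]
        q3 = present[(3 * len(present)) // 4]
        lo, hi = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
        cleaned[name] = [min(max(v, lo), hi) for v in filled]
    return cleaned
```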

Model Builder

  • Reads from BigQuery output table or local CSV
  • 81 models: sklearn (33 classifiers, 25 regressors, 23 clusterers) + XGBoost, LightGBM, CatBoost
  • 5-method feature importance (MI, F-stat, Random Forest, Permutation, RFE)
  • Correlation-based top-N feature selection
  • Full evaluation: AUC-ROC, PR curve, MCC, Kappa, Log-Loss, Brier score, calibration plot, learning curve
  • AI-generated training script via Gemini 2.5 Pro on Vertex AI (no API key)
  • Saves all .pkl models + best_model.pkl + predict.py to GCS and locally

Installation

# Core only (local CSV, no GCP)
pip install pymlpipeline

# With GCP support (BigQuery + GCS)
pip install "pymlpipeline[gcp]"

# With Vertex AI / Gemini code generation
pip install "pymlpipeline[gcp,vertex]"

# With XGBoost, LightGBM, CatBoost
pip install "pymlpipeline[gcp,vertex,boosting]"

# Everything
pip install "pymlpipeline[all]"

Python 3.10+ required.


Quick Start

1 · Initialise config

pymlpipeline init
# Creates pipeline_config.yaml in the current directory
# Edit it for your environment (see Configuration below)

2 · Preprocess data

pymlpipeline preprocess --config pipeline_config.yaml

Outputs:

ml_pipeline_output/2026-03-21_14-30-00/
  profile/   column_profile.csv          ← review this first
  output/    processed_output.csv
  report/    ML_Preprocessing_Report.docx
  charts/    *.png

3 · Train models

pymlpipeline build --config pipeline_config.yaml

Outputs:

ml_model_output/2026-03-21_14-35-00/
  models/    *.pkl  best_model.pkl  scaler.pkl  predict.py
  charts/    confusion matrix, ROC, PR, learning curve, calibration, ...
  report/    ML_Model_Report.docx  results.json
  code/      model_training_code.py  gemini_prompt.txt

4 · Predict on new data

# Local CSV
pymlpipeline predict \
  --model  ml_model_output/.../models/best_model.pkl \
  --scaler ml_model_output/.../models/scaler.pkl \
  --data   new_data.csv

# BigQuery table
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --bq     my-project.my_dataset.new_customers

# GCS file
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --gcs    gs://my-bucket/data/new_data.csv
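
Under the hood, prediction is just deserialising the saved artefacts and calling predict() on new rows. A minimal stand-in for that flow (the real best_model.pkl holds a scikit-learn estimator; the ThresholdModel class here is a dummy invented to mimic the interface):

```python
import os
import pickle
import tempfile

class ThresholdModel:
    """Stand-in for a pickled estimator exposing predict()."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [int(row[0] > self.threshold) for row in rows]

# Save the artefact the way the model builder would save best_model.pkl ...
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump(ThresholdModel(threshold=0.5), f)

# ... and load + apply it the way `pymlpipeline predict` conceptually does.
with open(path, "rb") as f:
    model = pickle.load(f)

predictions = model.predict([[0.2], [0.9]])
```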

Configuration

A single pipeline_config.yaml controls both tools. Run pymlpipeline init to get a pre-filled template.

Environment

pipeline:
  environment: "auto"    # auto | gcp | local
  data_source:  "bigquery"  # bigquery | csv | demo

environment   Behaviour
auto          GCP if google-cloud-* libraries + ADC credentials are available, otherwise local
gcp           Force GCP mode — fail clearly if libraries/credentials are missing
local         Skip all GCP calls; read/write local files only
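
The auto behaviour can be sketched as follows; the function name and the exact checks (library presence, then ADC lookup) are assumptions about how such detection typically works, not the package's real code:

```python
import importlib.util

def resolve_environment(setting="auto"):
    """Resolve 'auto' to 'gcp' or 'local'; pass explicit settings through."""
    if setting in ("gcp", "local"):
        return setting
    try:
        has_libs = importlib.util.find_spec("google.cloud.bigquery") is not None
    except ModuleNotFoundError:
        has_libs = False          # google-cloud-* not installed at all
    if not has_libs:
        return "local"
    try:
        import google.auth
        google.auth.default()     # raises if no ADC credentials are set up
        return "gcp"
    except Exception:
        return "local"
```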

Local mode (no GCP)

pipeline:
  environment: "local"
  data_source:  "csv"

local:
  csv_path:   "/path/to/your/data.csv"   # single file
  csv_folder: ""                          # or point to a folder (newest CSV used)
  separator:  ","
  encoding:   "utf-8"

GCP mode

pipeline:
  environment: "gcp"
  data_source:  "bigquery"

bigquery:
  project_id:  "my-gcp-project"
  dataset_id:  "my_dataset"
  table_id:    "my_table"
  query_mode:  "full_table"   # full_table | columns | filter | custom_sql

gcs:
  bucket:      "my-ml-bucket"
  base_folder: "preprocessing/runs"
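
The four query_mode settings map naturally onto generated SQL. An illustrative sketch of that mapping (not the package's actual query builder):

```python
def build_query(mode, table, columns=None, where=None, sql=None):
    """Map a query_mode setting to a BigQuery SQL string."""
    if mode == "full_table":
        return f"SELECT * FROM `{table}`"
    if mode == "columns":
        return f"SELECT {', '.join(columns)} FROM `{table}`"
    if mode == "filter":
        return f"SELECT * FROM `{table}` WHERE {where}"
    if mode == "custom_sql":
        return sql                      # caller supplies the full statement
    raise ValueError(f"unknown query_mode: {mode}")
```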

GCP authentication (no API key needed)

# Local development
gcloud auth application-default login

# CI / servers — set env var
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# GCE / Cloud Run / GKE — automatic, no setup needed

Gemini 2.5 Pro code generation

gemini:
  vertex_project:  "my-gcp-project"   # billing target
  vertex_location: "us-central1"

No API key — uses ADC. Falls back to a static template if Vertex AI is unavailable.
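
Structurally, the fallback amounts to: try the Gemini call, and on any failure return the static template. A sketch of that control flow only (the template text and function shape are invented for illustration; the Vertex AI call itself is elided):

```python
STATIC_TEMPLATE = "# static fallback training script\n"

def generate_training_code(prompt, call_gemini=None):
    """Return Gemini-generated code when a working callable is supplied;
    fall back to the static template if it is missing or raises."""
    if call_gemini is None:
        return STATIC_TEMPLATE        # Vertex AI not configured
    try:
        return call_gemini(prompt)
    except Exception:
        return STATIC_TEMPLATE        # Vertex AI unavailable or call failed
```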


Python API

Both tools can be used programmatically:

from pymlpipeline import run_pipeline, run_model_builder
from pymlpipeline import preprocessor_cfg, model_cfg

# Preprocessing
preprocessor_cfg.load("pipeline_config.yaml")
df_clean, df_ids, report_path = run_pipeline()

# Model building
model_cfg.load("pipeline_config.yaml")
run_model_builder()

Models Available

Classification (33 total + XGBoost/LightGBM/CatBoost when installed)

Category       Models
🚀 Boosting    Gradient Boosting, Hist GBM, AdaBoost, XGBoost, XGBoost (dart), LightGBM, LightGBM (DART/GOSS), CatBoost, CatBoost (balanced)
🌲 Forest      Random Forest, Extra Trees
📐 Linear      Logistic Regression (L1/L2), Ridge, SGD, Passive-Aggressive, Perceptron
⚡ SVM         RBF, Linear, Poly, Nu-SVM, Linear SVC
🧠 Neural      MLP (3 sizes)
📍 KNN         k=3, k=5, k=11
📊 Naive Bayes Gaussian, Bernoulli, Complement
Others         LDA, QDA, Gaussian Process, Label Spreading/Propagation

Regression (25 + optional XGBoost/LightGBM/CatBoost), Clustering (23 algorithms)

Full lists shown in the interactive model selection menu.


Supported Evaluation Metrics

Classification: Accuracy, Precision, Recall, F1 (weighted), ROC-AUC, Average Precision, MCC, Cohen's Kappa, Log-Loss, Brier Score, CV score
Regression: MAE, RMSE, R², MAPE, CV R²
Clustering: Silhouette, Calinski-Harabasz, Davies-Bouldin
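
Two of the less common classification metrics are easy to compute by hand; a stdlib sketch of MCC (from confusion-matrix counts) and the Brier score:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```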


License

MIT — see LICENSE
