
pymlpipeline

End-to-End ML Pipeline — data cleaning, model training, evaluation and prediction.
Works with GCP (BigQuery / GCS / Vertex AI) and local CSV files from the same config.


Features

              Preprocessor                     Model Builder                       Predictor
Data input    BigQuery, Local CSV, Demo        BigQuery, GCS, Local CSV            BigQuery, GCS, Local CSV
Output        Cleaned CSV + BQ table + GCS     Trained .pkl models + GCS           Predictions CSV
Report        Word .docx preprocessing report  Word .docx model evaluation report  n/a
Environment   GCP or local (auto-detected)     GCP or local                        GCP or local

Preprocessor

  • Reads from BigQuery (4 query modes) or local CSV
  • Full column profile CSV uploaded to GCS before target selection
  • Target encoding, stratified reload, identifier sidecar
  • Keyword drop, high-null drop, dtype normalisation, imputation, outlier handling
  • EDA charts, correlation filter, one-hot/label encoding, normalisation
  • Writes output to BigQuery and/or local folder
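
The core cleaning steps above (high-null drop, median imputation, outlier handling) can be sketched in plain Python. This is an illustrative outline, not the package's internals; the 50% null threshold and the 1.5×IQR clipping rule are assumptions:

```python
import statistics

def clean_columns(data, null_threshold=0.5, iqr_k=1.5):
    """Drop high-null columns, impute medians, clip IQR outliers.

    `data` maps column name -> list of float-or-None values.
    """
    cleaned = {}
    for name, values in data.items():
        n = len(values)
        nulls = sum(v is None for v in values)
        if n == 0 or nulls / n > null_threshold:
            continue  # high-null drop: too many missing values
        present = sorted(v for v in values if v is not None)
        med = statistics.median(present)
        filled = [med if v is None else v for v in values]  # median imputation
        # Outlier handling: clip to [Q1 - k*IQR, Q3 + k*IQR]
        q1 = present[len(present) // 4]
        q3 = present[(3 * len(present)) // 4]
        lo, hi = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
        cleaned[name] = [min(max(v, lo), hi) for v in filled]
    return cleaned
```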

Model Builder

  • Reads from BigQuery output table or local CSV
  • 81 models: sklearn (33 classifiers, 25 regressors, 23 clusterers) + XGBoost, LightGBM, CatBoost
  • 5-method feature importance (MI, F-stat, Random Forest, Permutation, RFE)
  • Correlation-based top-N feature selection
  • Full evaluation: AUC-ROC, PR curve, MCC, Kappa, Log-Loss, Brier score, calibration plot, learning curve
  • AI-generated training script via Gemini 2.5 Pro on Vertex AI (no API key)
  • Saves all .pkl models + best_model.pkl + predict.py to GCS and locally

Installation

# Core only (local CSV, no GCP)
pip install pymlpipeline

# With GCP support (BigQuery + GCS)
pip install "pymlpipeline[gcp]"

# With Vertex AI / Gemini code generation
pip install "pymlpipeline[gcp,vertex]"

# With XGBoost, LightGBM, CatBoost
pip install "pymlpipeline[gcp,vertex,boosting]"

# Everything
pip install "pymlpipeline[all]"

Python 3.10+ required.


Quick Start

1 · Initialise config

pymlpipeline init
# Creates pipeline_config.yaml in the current directory
# Edit it for your environment (see Configuration below)

2 · Preprocess data

pymlpipeline preprocess --config pipeline_config.yaml

Outputs:

ml_pipeline_output/2026-03-21_14-30-00/
  profile/   column_profile.csv          ← review this first
  output/    processed_output.csv
  report/    ML_Preprocessing_Report.docx
  charts/    *.png

3 · Train models

pymlpipeline build --config pipeline_config.yaml

Outputs:

ml_model_output/2026-03-21_14-35-00/
  models/    *.pkl  best_model.pkl  scaler.pkl  predict.py
  charts/    confusion matrix, ROC, PR, learning curve, calibration, ...
  report/    ML_Model_Report.docx  results.json
  code/      model_training_code.py  gemini_prompt.txt

4 · Predict on new data

# Local CSV
pymlpipeline predict \
  --model  ml_model_output/.../models/best_model.pkl \
  --scaler ml_model_output/.../models/scaler.pkl \
  --data   new_data.csv

# BigQuery table
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --bq     my-project.my_dataset.new_customers

# GCS file
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --gcs    gs://my-bucket/data/new_data.csv
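
Under the hood, prediction is just deserialising the saved artefacts and calling predict() on new rows. A minimal stand-in for that flow (the real best_model.pkl holds a scikit-learn estimator; the ThresholdModel class here is a dummy invented to mimic the interface):

```python
import os
import pickle
import tempfile

class ThresholdModel:
    """Stand-in for a pickled estimator exposing predict()."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [int(row[0] > self.threshold) for row in rows]

# Save the artefact the way the model builder would save best_model.pkl ...
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump(ThresholdModel(threshold=0.5), f)

# ... and load + apply it the way `pymlpipeline predict` conceptually does.
with open(path, "rb") as f:
    model = pickle.load(f)

predictions = model.predict([[0.2], [0.9]])
```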

Configuration

A single pipeline_config.yaml controls both tools. Run pymlpipeline init to get a pre-filled template.

Environment

pipeline:
  environment: "auto"    # auto | gcp | local
  data_source:  "bigquery"  # bigquery | csv | demo

environment   Behaviour
auto          GCP if google-cloud-* libraries + ADC credentials are available, otherwise local
gcp           Force GCP mode — fail clearly if libraries/credentials are missing
local         Skip all GCP calls; read/write local files only
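
The auto behaviour can be sketched as follows; the function name and the exact checks (library presence, then ADC lookup) are assumptions about how such detection typically works, not the package's real code:

```python
import importlib.util

def resolve_environment(setting="auto"):
    """Resolve 'auto' to 'gcp' or 'local'; pass explicit settings through."""
    if setting in ("gcp", "local"):
        return setting
    try:
        has_libs = importlib.util.find_spec("google.cloud.bigquery") is not None
    except ModuleNotFoundError:
        has_libs = False          # google-cloud-* not installed at all
    if not has_libs:
        return "local"
    try:
        import google.auth
        google.auth.default()     # raises if no ADC credentials are set up
        return "gcp"
    except Exception:
        return "local"
```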

Local mode (no GCP)

pipeline:
  environment: "local"
  data_source:  "csv"

local:
  csv_path:   "/path/to/your/data.csv"   # single file
  csv_folder: ""                          # or point to a folder (newest CSV used)
  separator:  ","
  encoding:   "utf-8"

GCP mode

pipeline:
  environment: "gcp"
  data_source:  "bigquery"

bigquery:
  project_id:  "my-gcp-project"
  dataset_id:  "my_dataset"
  table_id:    "my_table"
  query_mode:  "full_table"   # full_table | columns | filter | custom_sql

gcs:
  bucket:      "my-ml-bucket"
  base_folder: "preprocessing/runs"
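
The four query_mode settings map naturally onto generated SQL. An illustrative sketch of that mapping (not the package's actual query builder):

```python
def build_query(mode, table, columns=None, where=None, sql=None):
    """Map a query_mode setting to a BigQuery SQL string."""
    if mode == "full_table":
        return f"SELECT * FROM `{table}`"
    if mode == "columns":
        return f"SELECT {', '.join(columns)} FROM `{table}`"
    if mode == "filter":
        return f"SELECT * FROM `{table}` WHERE {where}"
    if mode == "custom_sql":
        return sql                      # caller supplies the full statement
    raise ValueError(f"unknown query_mode: {mode}")
```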

GCP authentication (no API key needed)

# Local development
gcloud auth application-default login

# CI / servers — set env var
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# GCE / Cloud Run / GKE — automatic, no setup needed

Gemini 2.5 Pro code generation

gemini:
  vertex_project:  "my-gcp-project"   # billing target
  vertex_location: "us-central1"

No API key — uses ADC. Falls back to a static template if Vertex AI is unavailable.
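
Structurally, the fallback amounts to: try the Gemini call, and on any failure return the static template. A sketch of that control flow only (the template text and function shape are invented for illustration; the Vertex AI call itself is elided):

```python
STATIC_TEMPLATE = "# static fallback training script\n"

def generate_training_code(prompt, call_gemini=None):
    """Return Gemini-generated code when a working callable is supplied;
    fall back to the static template if it is missing or raises."""
    if call_gemini is None:
        return STATIC_TEMPLATE        # Vertex AI not configured
    try:
        return call_gemini(prompt)
    except Exception:
        return STATIC_TEMPLATE        # Vertex AI unavailable or call failed
```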


Python API

Both tools can be used programmatically:

from pymlpipeline import run_pipeline, run_model_builder
from pymlpipeline import preprocessor_cfg, model_cfg

# Preprocessing
preprocessor_cfg.load("pipeline_config.yaml")
df_clean, df_ids, report_path = run_pipeline()

# Model building
model_cfg.load("pipeline_config.yaml")
run_model_builder()

Models Available

Classification (33 total + XGBoost/LightGBM/CatBoost when installed)

Category       Models
🚀 Boosting    Gradient Boosting, Hist GBM, AdaBoost, XGBoost, XGBoost (dart), LightGBM, LightGBM (DART/GOSS), CatBoost, CatBoost (balanced)
🌲 Forest      Random Forest, Extra Trees
📐 Linear      Logistic Regression (L1/L2), Ridge, SGD, Passive-Aggressive, Perceptron
⚡ SVM         RBF, Linear, Poly, Nu-SVM, Linear SVC
🧠 Neural      MLP (3 sizes)
📍 KNN         k=3, k=5, k=11
📊 Naive Bayes Gaussian, Bernoulli, Complement
Others         LDA, QDA, Gaussian Process, Label Spreading/Propagation

Regression (25 + optional XGBoost/LightGBM/CatBoost), Clustering (23 algorithms)

Full lists shown in the interactive model selection menu.


Supported Evaluation Metrics

Classification: Accuracy, Precision, Recall, F1 (weighted), ROC-AUC, Average Precision, MCC, Cohen's Kappa, Log-Loss, Brier Score, CV score
Regression: MAE, RMSE, R², MAPE, CV R²
Clustering: Silhouette, Calinski-Harabasz, Davies-Bouldin
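
Two of the less common classification metrics are easy to compute by hand; a stdlib sketch of MCC (from confusion-matrix counts) and the Brier score:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```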


License

MIT — see LICENSE
