Data pipeline: extract → classify → clean → split → model

Project description

Model Pipeline (oxigen-pipeline)

ML pipeline for tabular regression, aimed at air quality (AQI) or other continuous variables. It covers: ingestion → cleaning → split → training (with tuning) → evaluation → HTML report (metrics + SHAP plot in base64) → pipeline diagram.

Repo: Danval-003/model_pipeline (GitHub).

Features

Simple CLI to run the pipeline on a .csv file.
Built-in preprocessing with ColumnTransformer:
- StandardScaler for numeric columns.
- OneHotEncoder(handle_unknown="ignore") with dense output for categorical columns.
Supported regression models: RandomForest, Gradient Boosting, XGBoost, LightGBM.
Hyperparameter tuning with RandomizedSearchCV (cv=3).
SHAP with a consistent background dataset and the plot embedded as base64 in the HTML report.
Pipeline diagram in pipeline_diagram.html.

Installation

pip install -U oxigen-pipeline

Alternative (skips pip's cache): pip install -U --no-cache-dir oxigen-pipeline

Usage (CLI)

oxigen-pipeline --data-path final_dataset.csv --target AQI

--data-path : path to the CSV.
--target : name of the target column (e.g. AQI).

The command prints metrics and generates:

model_report.html (test metrics + SHAP plot embedded as base64).
pipeline_diagram.html (pipeline diagram).
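
Embedding a plot as base64 means the HTML report carries the image inside itself as a data URI, so no separate PNG file is needed. The sketch below shows the general technique with Matplotlib. The helper name `figure_to_img_tag` is hypothetical, and the bar chart with made-up numbers only stands in for the SHAP plot; the package's actual report code may differ.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt


def figure_to_img_tag(fig) -> str:
    """Serialize a Matplotlib figure to PNG and wrap it in an <img> data URI."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="plot">'


# Placeholder chart standing in for the SHAP summary plot (made-up numbers).
fig, ax = plt.subplots(figsize=(4, 3))
ax.barh(["PM2.5", "PM10", "NO2"], [0.6, 0.3, 0.1])
ax.set_xlabel("mean |SHAP value|")

html = f"<html><body><h1>Model report</h1>{figure_to_img_tag(fig)}</body></html>"
print(len(html) > 0)
```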

If your version includes a CLI model selector, you can pass --model RandomForest|GBM|XGBoost|LightGBM. Otherwise, choose the model through the Python API (see below).
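
For readers who want to see how flags like these are typically declared, here is a hypothetical `argparse` sketch. It is not the package's actual CLI code: the function name `build_parser`, the help strings, and the `RandomForest` default are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(prog="oxigen-pipeline")
    parser.add_argument("--data-path", required=True, help="Path to the input CSV.")
    parser.add_argument("--target", required=True, help="Name of the target column.")
    # Optional selector; the default here is an assumption, not documented behavior.
    parser.add_argument(
        "--model",
        default="RandomForest",
        choices=["RandomForest", "GBM", "XGBoost", "LightGBM"],
    )
    return parser


args = build_parser().parse_args(["--data-path", "final_dataset.csv", "--target", "AQI"])
print(args.data_path, args.target, args.model)  # final_dataset.csv AQI RandomForest
```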

Sample data

The pipeline expects a CSV with numeric and/or categorical columns and a continuous target. Typical example (AQI):

Date,Month,Year,Holidays_Count,Days,PM2.5,PM10,NO2,SO2,CO,Ozone,AQI
15.0,5.0,2022,1,7,118.54,257.73,3.57,24.06,1.24,33.43,295.0
...
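
Before running the pipeline it can help to confirm that the file has the expected shape. The following sketch loads the header and sample row shown above from an in-memory string and checks that the target exists and is numeric; the checks are a suggestion, not something the package is documented to perform.

```python
import io

import pandas as pd

# Header and first row copied from the sample above.
csv_text = """Date,Month,Year,Holidays_Count,Days,PM2.5,PM10,NO2,SO2,CO,Ozone,AQI
15.0,5.0,2022,1,7,118.54,257.73,3.57,24.06,1.24,33.43,295.0
"""

df = pd.read_csv(io.StringIO(csv_text))
target = "AQI"

if target not in df.columns:
    raise KeyError(f"target column {target!r} not found")
if not pd.api.types.is_numeric_dtype(df[target]):
    raise TypeError(f"target column {target!r} must be numeric for regression")

X = df.drop(columns=[target])
y = df[target]
print(X.shape, y.iloc[0])  # (1, 11) 295.0
```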

Usage (Python API)

import pandas as pd
from sklearn.model_selection import train_test_split
from oxigen_pipeline.model import train_and_evaluate_model

df = pd.read_csv("final_dataset.csv")

target = "AQI"
X = df.drop(columns=[target])
y = df[target]

# train/val/test split (simple 60/20/20 example)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.40, random_state=42)
X_val,   X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

best_model, best_params, metrics = train_and_evaluate_model(
    X_train, X_val, X_test, y_train, y_val, y_test,
    html_output_path="model_report.html",
    model_name="RandomForest"   # "GBM" | "XGBoost" | "LightGBM"
)

print(best_params, metrics)
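
The 60/20/20 proportions come from splitting twice: 40% is held out first, then that holdout is halved. A quick check on 100 synthetic rows (made-up data, for illustration only) confirms the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows so the percentages map directly to row counts.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.40, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```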

What does it do internally?

Detects column types: numeric (np.number) and categorical (object/category).
Builds ColumnTransformer(num=StandardScaler, cat=OneHotEncoder(handle_unknown="ignore", dense output)).
Pipeline: pre → model.
Tuning with RandomizedSearchCV, selecting by validation R².
Evaluates on the test set and builds model_report.html with:
- MSE, MAE, R².
- SHAP summary plot embedded as base64 (so it always renders, with no extra files).
Also exports pipeline_diagram.html.

Requirements

Python >=3.10
Main dependencies (installed with the package):
- pandas, numpy, scikit-learn, matplotlib, shap, xgboost, lightgbm, joblib.

Development

git clone https://github.com/Danval-003/model_pipeline.git
cd model_pipeline
# Windows activation shown; on Linux/macOS use: source .venv/bin/activate
python -m venv .venv && .\.venv\Scripts\activate
pip install -U pip build
pip install -e .        # editable mode

Build and publishing:

python -m build
# upload with twine (or a GitHub Actions workflow)

Project details

Release history

1.5.0 (Sep 1, 2025)
1.4.0 (Aug 30, 2025), this version
1.3.0 (Aug 30, 2025)
1.2.0 (Aug 30, 2025)
1.1.0 (Aug 30, 2025)

Download files


Source Distribution

oxigen_pipeline-1.4.0.tar.gz (10.1 kB)

Uploaded Aug 30, 2025 Source

Built Distribution

oxigen_pipeline-1.4.0-py3-none-any.whl (10.4 kB)

Uploaded Aug 30, 2025 Python 3

File details

Details for the file oxigen_pipeline-1.4.0.tar.gz.

File metadata

Download URL: oxigen_pipeline-1.4.0.tar.gz
Upload date: Aug 30, 2025
Size: 10.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for oxigen_pipeline-1.4.0.tar.gz
Algorithm	Hash digest
SHA256	`077a79c87b6e10515785386300d25396b8cb78ed42327ae68b31d90c88437fcc`
MD5	`c0e2fed978363b442ff68525d0ff3d2d`
BLAKE2b-256	`0a0762bf9a78a85001b637b5f93560261648b6e21659e241efff8ae8920a065e`


File details

Details for the file oxigen_pipeline-1.4.0-py3-none-any.whl.

File metadata

Download URL: oxigen_pipeline-1.4.0-py3-none-any.whl
Upload date: Aug 30, 2025
Size: 10.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for oxigen_pipeline-1.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2156948ca3c662ffb05f15953e0f1431108ae3e2c0511d69ad2e6e02ac19eae0`
MD5	`ba8de3d17def6acbd2064fdccecf9867`
BLAKE2b-256	`e54bb48c380f7980ccbda285a82e37615250ca66f182b3b65be0ab55ea78ee43`

