# H2ML

A 4-step AutoML pipeline for tabular data that wraps sklearn-compatible estimators. Given a feature matrix and target, it screens all registered models, reduces features via SHAP importance and correlation filtering, and tunes the winner with Optuna — all in one call.
## Installation

```bash
pip install h2ml
# or
uv add h2ml
```

For boosting libraries (LightGBM, XGBoost, CatBoost):

```bash
pip install h2ml[boosting]
# or
uv add h2ml[boosting]
```

For spatial inference via `h2ml.geo.geo_predict` (requires h2mare):

```bash
pip install h2ml[geo]
# or
uv add h2ml[geo]
```
A runnable example using public sklearn datasets is in `examples/quickstart.ipynb`.
## Quick start

```python
import numpy as np
from h2ml import H2MLPipeline, PipelineConfig, PipelineData, TaskType

# Build the data container
store = PipelineData(
    X=X_arr,
    feature_names=feature_cols,
    y=y_arr,
)

# Configure and run
pipeline = H2MLPipeline(config=PipelineConfig(
    task_type=TaskType.CLASSIFICATION,
    metric="AUC",
    n_splits=5,
    n_trials=50,
    verbose=True,
))
result = pipeline.run(store)

# Inspect results
print(result.summary())
print(result.best_model_name, result.best_stage)
```
## Regression with y-transform sweep

```python
config = PipelineConfig(
    task_type=TaskType.REGRESSION,
    metric="R2",
    verbose=True,
)
pipeline = H2MLPipeline(config=config)
result = pipeline.run(store, transforms=["log", "sqrt", "count", "winlog"])
```
Available transform names: `"count"` (identity), `"log"`, `"sqrt"`, `"wincount"`, `"winlog"`, `"winsqrt"`. Winsorize-based transforms are skipped silently when no upper outliers are found.
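As a rough illustration of the naming scheme only — the library's exact definitions (e.g. `log` vs `log1p`, the winsorization quantile) are assumptions here — each `win*` name pairs upper-outlier winsorization with a base transform:

```python
import numpy as np

def winsorize_upper(y, q=0.99):
    """Clip values above the q-th quantile (upper outliers only). The 0.99
    quantile is an assumption, not h2ml's actual setting."""
    return np.minimum(y, np.quantile(y, q))

# Illustrative transform table mirroring the documented names
TRANSFORMS = {
    "count":    lambda y: y,                             # identity
    "log":      lambda y: np.log1p(y),
    "sqrt":     lambda y: np.sqrt(y),
    "wincount": lambda y: winsorize_upper(y),
    "winlog":   lambda y: np.log1p(winsorize_upper(y)),
    "winsqrt":  lambda y: np.sqrt(winsorize_upper(y)),
}

y = np.array([0.0, 1.0, 4.0, 9.0, 100.0])
print(TRANSFORMS["sqrt"](y))   # [ 0.  1.  2.  3. 10.]
```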
## Partial runs

```python
# Screen models only (step 1)
result = pipeline.run_step1_only(store)

# Steps 1–2: run feature selection, then inspect before continuing
result = pipeline.run_step1_to_step2(store)
print(result.selector.importance_summary())
print(result.features_reduced.feature_names)

# Steps 1–3: full model and stage selection without HPO
result = pipeline.run_step1_to_step3(store)

# Resume from step 3 using a result that already has features_reduced
result = pipeline.run_from_step3(result)

# Re-run HPO only on a previously saved result (skips steps 1–3)
result = PipelineResult.load("runs/experiment_01")
result = pipeline.run_step4_only(result)
```
## The 4-step pipeline

| Step | What happens | Key output on `PipelineResult` |
|---|---|---|
| 1 | K-fold CV all models (× optional y-transforms) on all features | `best_model_name`, `step1_agg_df` |
| 2 | Fit best model → SHAP importance → correlation-based feature drop | `features_reduced`, `selector` |
| 3 | K-fold CV all models on reduced features (winning transform only); compare vs step 1 | `best_stage` (`"default"` or `"reduced"`) |
| 4 | Optuna HPO on the winning (model, stage, transform) | `best_params`, `step4_agg_df` |

Step 4 is skipped when the winning model has `opt_enabled=False` in the registry (e.g. LogisticRegression, GaussianNB, KNeighborsClassifier).
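The four steps above can be re-enacted on toy data with plain scikit-learn. This is a sketch of the control flow only, not h2ml's internals: impurity/coefficient importances stand in for SHAP, and a tiny grid stands in for Optuna.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
}

# Step 1: CV-screen every model on all features
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
best_name = max(scores, key=scores.get)

# Step 2: rank features with the winner's importances, keep the top half
winner = models[best_name].fit(X, y)
imp = getattr(winner, "feature_importances_", None)
if imp is None:                                 # linear model: |coef| as proxy
    imp = np.abs(winner.coef_).ravel()
keep = np.argsort(imp)[::-1][:5]

# Step 3: re-screen on the reduced set; keep the better stage
score_red = cross_val_score(models[best_name], X[:, keep], y, cv=5,
                            scoring="roc_auc").mean()
best_stage = "reduced" if score_red >= scores[best_name] else "default"

# Step 4: tune the winner (tiny grid standing in for Optuna)
grid = ([{"n_estimators": n} for n in (50, 100, 200)] if best_name == "rf"
        else [{"C": c} for c in (0.1, 1.0, 10.0)])
X_best = X[:, keep] if best_stage == "reduced" else X
best_params = max(grid, key=lambda p: cross_val_score(
    type(winner)(**{**winner.get_params(), **p}),
    X_best, y, cv=3, scoring="roc_auc").mean())
```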
## PipelineConfig reference

| Parameter | Default | Description |
|---|---|---|
| `task_type` | `TaskType.CLASSIFICATION` | `CLASSIFICATION` or `REGRESSION` |
| `metric` | `"AUC"` | Short metric name for model selection and HPO. Minimisation direction is derived automatically. Classification: `"AUC"`, `"AUC_PR"`, `"F1"`, `"LogLoss"`, `"Brier"`. Regression: `"R2"`, `"MAE"`, `"RMSE"`. |
| `n_splits` | `5` | Folds for steps 1 and 3 |
| `opt_n_splits` | `3` | Folds used inside Optuna (fewer = faster) |
| `corr_threshold` | `0.7` | Correlation threshold for dropping features in step 2. A feature is dropped if it exceeds this value in any of Pearson, Spearman, or Kendall correlation with a higher-ranked feature. |
| `n_trials` | `50` | Optuna trials in step 4 |
| `n_hpo_repeats` | `1` | Independent HPO runs with different fold seeds; best is kept |
| `min_features` | `1` | Minimum features retained after the correlation filter |
| `handle_imbalance` | `False` | Inject `class_weight="balanced"` for supporting classifiers |
| `random_state` | `42` | Global seed |
| `verbose` | `False` | Log step-by-step progress to stdout |
## Spatial CV parameters

Set `store.coords` to an `(n_samples, 2)` array of spatial coordinates to activate spatial cross-validation. All parameters below are ignored when `coords` is None.

| Parameter | Default | Description |
|---|---|---|
| `spatial_cv_method` | `"block"` | `"block"` (quantile-grid) or `"spcv"` (AHC + cluster ensemble) |
| `spatial_cv_metric` | `"euclidean"` | `"euclidean"` or `"haversine"` (expects lat/lon in degrees) |
| `n_blocks_per_fold` | `5` | Blocks per test fold for the block splitter |
| `ahc_threshold` | `None` | AHC distance threshold for spcv; auto-set to the 10th percentile of pairwise distances when None |
| `exact_max_samples` | `5000` | n ≤ this → exact scipy AHC; n > this → approximate sklearn AHC with a k-NN graph |
| `knn_neighbors` | `15` | k for the k-NN connectivity graph in approximate AHC |
| `pca_components` | `0.95` | Variance retained by PCA on block covariates in spcv stage 2 |
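For intuition, a quantile-grid block assignment like the `"block"` method describes can be sketched as follows (illustrative only; h2ml's splitter may bin and group differently):

```python
import numpy as np

def quantile_blocks(coords, n_bins=4):
    """Assign each sample a block id on a quantile grid: each axis is split
    at its empirical quantiles, so every row/column of the grid holds
    roughly the same number of points."""
    edges_x = np.quantile(coords[:, 0], np.linspace(0, 1, n_bins + 1)[1:-1])
    edges_y = np.quantile(coords[:, 1], np.linspace(0, 1, n_bins + 1)[1:-1])
    bx = np.digitize(coords[:, 0], edges_x)   # 0 .. n_bins-1
    by = np.digitize(coords[:, 1], edges_y)
    return bx * n_bins + by                   # one id per grid cell

rng = np.random.default_rng(0)
coords = rng.uniform(size=(400, 2))
blocks = quantile_blocks(coords)
# whole blocks (not individual points) are then grouped into test folds,
# so spatially close samples never straddle the train/test boundary
```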
## Supported models

**Classifiers** — LogisticRegression, GaussianNB, KNeighborsClassifier, RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier, SVC, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, LGBMClassifier\*, CatBoostClassifier\*, XGBClassifier\*

**Regressors** — PoissonRegressor, KNeighborsRegressor, RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, SVR, ExtraTreesRegressor, BaggingRegressor, AdaBoostRegressor, LGBMRegressor\*, CatBoostRegressor\*, XGBRegressor\*

\* Registered only when the corresponding package is installed. Custom models can be injected by passing a `models` list directly to `H2MLPipeline`.
## PipelineResult

```python
result.summary()                      # combined agg DataFrame across all completed stages
result.summary("AUC_Test_Mean")       # sorted by metric
result.completed_steps                # e.g. [1, 2, 3, 4]
result.best_model_name                # winning model
result.best_stage                     # "default" | "reduced" | "optimized"
result.y_transform                    # winning y-transform (regression only)
result.cv_type                        # "spatial" | "random" — set from store.coords
result.cv_warnings                    # list of warning strings for models with failed folds
result.step1_agg_df                   # per-model mean/std metrics from step 1
result.features_reduced               # PipelineData after feature selection
result.selector.importance_summary()  # SHAP importances as a DataFrame
```
## Exporting the final model

```python
from h2ml.pipeline.final_model import FinalModel

final = result.build_final_model()   # fits on the full training set
final.predict(X_new)
final.predict_proba(X_new)           # classification only

final.save("models/final.pkl")
final = FinalModel.load("models/final.pkl")
```

`FinalModel.predict()` accepts a DataFrame (columns aligned by name) or a numpy array (must match `feature_names` order).
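Name-based alignment means a DataFrame can arrive with its columns in any order. A minimal sketch of the idea (illustrative, not FinalModel's code):

```python
import pandas as pd

feature_names = ["a", "b", "c"]                          # order used at fit time
df = pd.DataFrame({"c": [3.0], "a": [1.0], "b": [2.0]})  # columns arrive shuffled
X_aligned = df[feature_names].to_numpy()                 # reordered to fit order
print(X_aligned)   # [[1. 2. 3.]]
```

With a plain numpy array there are no names to align by, which is why the column order must already match `feature_names`.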
## Conformal prediction intervals

`build_final_model()` automatically calibrates a conformal predictor from the out-of-fold CV predictions — no held-out data required.

```python
final = result.build_final_model()

# Regression — 90% prediction interval for each sample
lower, upper = final.predict_interval(X_new, alpha=0.10)

# Classification — prediction set for each sample
sets = final.predict_set(X_new, alpha=0.10)
# sets[i] == [1]    → confident prediction of class 1
# sets[i] == [0]    → confident prediction of class 0
# sets[i] == [0, 1] → uncertain; the true label could be either
```

Both methods work on any input — held-out test samples, a prediction grid, spatial rasters, etc. The `alpha` parameter controls the miscoverage level: `alpha=0.10` targets ≥ 90% coverage.

How it works: nonconformity scores (|y − ŷ| for regression, 1 − p(true class) for classification) are computed from the OOF folds and a single threshold `q` is stored. At inference time the interval is ŷ ± `q` (regression) or the set of classes with score ≤ `q` (classification).
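The calibration step can be sketched in a few lines on synthetic data (illustrative split-conformal regression, not h2ml's internal code; the finite-sample quantile correction is an assumption):

```python
import numpy as np

# Fake out-of-fold predictions: true targets plus model error
rng = np.random.default_rng(0)
y_oof_true = rng.normal(size=1000)
y_oof_pred = y_oof_true + rng.normal(scale=0.5, size=1000)

alpha = 0.10
scores = np.abs(y_oof_true - y_oof_pred)        # nonconformity scores
n = len(scores)
# finite-sample corrected quantile level ceil((n+1)(1-alpha))/n
level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
q = np.quantile(scores, level)                   # the stored threshold

# Inference: constant-width interval around any new prediction
y_new_pred = 0.3
lower, upper = y_new_pred - q, y_new_pred + q
```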
Limitations:
- Intervals are constant-width — the same `q` is added to every prediction, so regions of the input space with higher inherent variance get the same interval as low-variance regions.
- Coverage is marginal, not conditional: the guarantee holds on average over new draws from the training distribution. Predictions on out-of-distribution inputs (e.g. spatial extrapolation beyond the training extent) may not achieve nominal coverage.
- If `result.y_transform` is set, the interval is in the transformed space. Apply `INVERSE_TRANSFORMS[result.y_transform]` to the bounds if you need original-scale intervals.
## Persistence

```python
from h2ml import PipelineResult

result.save("runs/experiment_01")
result = PipelineResult.load("runs/experiment_01")
```

DataFrames are serialised as Parquet, numpy arrays as `.npy`, and Python objects (selector, CV results) as joblib pickles under a single directory.
## Comparing runs

```python
from h2ml.evaluation.compare import compare_results

r1 = pipeline_a.run(store)
r2 = pipeline_b.run(store)
df = compare_results([r1, r2], labels=["baseline", "spatial_cv"], metric="AUC")
```

Returns a DataFrame with one row per result: `Run`, `Metric`, `Best_Model`, `Best_Stage`, `Y_Transform`, `Score_Mean`, `Score_Std`, `Conservative_Bound` (variance-penalised score), `Brier_Mean`, `OOF_Brier`, `N_Features`, `Completed_Steps`.
## Visualization

```python
from h2ml.plots.plots import (
    pipeline_scores,    # model scores across all three pipeline stages
    cv_diagnostics,     # classification or regression diagnostic panel
    shap_importance,    # horizontal bar chart of SHAP feature importances
    shap_summary_plot,  # SHAP beeswarm for the final best model
    shap_dependence,    # scatter + lowess for top-N features
)

pipeline_scores(result, save_path="plots/scores.png")
shap_importance(result.selector, save_path="plots/shap.png")
```

All functions accept an optional `save_path`; omit it to call `plt.show()` instead.
## Spatial inference (h2mare integration)

`h2ml.geo.geo_predict` provides functions for spatial-temporal prediction on gridded data via the companion h2mare package:

```python
from h2ml.geo.geo_predict import predict_map

predict_map(
    model=final,
    indexer=indexer,      # h2mare.ParquetIndexer
    dates=("2020-01", "2020-12"),
    bbox=(lon_min, lat_min, lon_max, lat_max),
    target_col="pm25",
    agg_by="month",
    save_path="maps/pm25_2020.png",
)
```
## RunMetadata

Attach experiment labels to results for multi-run comparison:

```python
from h2ml.evaluation.metrics import RunMetadata

pipeline = H2MLPipeline(
    config=config,
    metadata=RunMetadata(schema="v2_features", target="pm25", batch="2024-01"),
)
```

The labels appear as columns in all fold and agg DataFrames, making it easy to concatenate results across runs.
## Contributing

Contributions are welcome. To set up a development environment:

```bash
git clone https://github.com/h2ugoparra/h2ml
cd h2ml
uv sync --group dev
uv run pytest
```

Please submit issues or pull requests on GitHub.
## License

This project is licensed under the MIT License — see the LICENSE file for details.
## Acknowledgments

This project was developed under the framework of the COSTA project.