H2ML

4-step AutoML pipeline for tabular data: model screening, SHAP feature selection, and Optuna HPO

A 4-step AutoML pipeline for tabular data that wraps sklearn-compatible estimators. Given a feature matrix and target, it screens all registered models, reduces features via SHAP importance and correlation filtering, and tunes the winner with Optuna — all in one call.

Installation

pip install h2ml
# or
uv add h2ml

For boosting libraries (LightGBM, XGBoost, CatBoost):

pip install h2ml[boosting]
# or
uv add h2ml[boosting]

For spatial inference via h2ml.geo.geo_predict (requires h2mare):

pip install h2ml[geo]
# or
uv add h2ml[geo]

A runnable example using public sklearn datasets is in examples/quickstart.ipynb.

Quick start

from sklearn.datasets import load_breast_cancer

from h2ml import H2MLPipeline, PipelineConfig, PipelineData, TaskType

# Load a public sklearn dataset and build the data container
data = load_breast_cancer()
store = PipelineData(
    X=data.data,
    feature_names=list(data.feature_names),
    y=data.target,
)

# Configure and run
pipeline = H2MLPipeline(config=PipelineConfig(
    task_type=TaskType.CLASSIFICATION,
    metric="AUC",
    n_splits=5,
    n_trials=50,
    verbose=True,
))
result = pipeline.run(store)

# Inspect results
print(result.summary())
print(result.best_model_name, result.best_stage)

Regression with y-transform sweep

config = PipelineConfig(
    task_type=TaskType.REGRESSION,
    metric="R2",
    verbose=True,
)
pipeline = H2MLPipeline(config=config)
result = pipeline.run(store, transforms=["log", "sqrt", "count", "winlog"])

Available transform names: "count" (identity), "log", "sqrt", "wincount", "winlog", "winsqrt". Winsorize-based transforms are skipped silently when no upper outliers are found.
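The transform names can be read as follows. This is a sketch of one plausible interpretation (assuming "win" denotes upper-tail winsorisation before the base transform and "log" means log1p; the library's exact definitions may differ):

```python
import numpy as np

def winsorize_upper(y, q=0.99):
    """Clip values above the q-th quantile (upper-tail winsorisation)."""
    return np.minimum(y, np.quantile(y, q))

# One plausible reading of the transform names (an assumption, not the library source)
TRANSFORMS = {
    "count":    lambda y: y,                            # identity
    "log":      lambda y: np.log1p(y),                  # log(1 + y), safe at zero
    "sqrt":     lambda y: np.sqrt(y),
    "wincount": lambda y: winsorize_upper(y),
    "winlog":   lambda y: np.log1p(winsorize_upper(y)),
    "winsqrt":  lambda y: np.sqrt(winsorize_upper(y)),
}

y = np.array([0.0, 1.0, 4.0, 9.0, 1000.0])
y_t = TRANSFORMS["winlog"](y)  # the outlier 1000.0 is clipped before the log
```

On a sample with no upper outliers, winsorisation is a no-op, which matches the silent-skip behaviour described above.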

Partial runs

# Screen models only (step 1)
result = pipeline.run_step1_only(store)

# Steps 1–2: run feature selection, then inspect before continuing
result = pipeline.run_step1_to_step2(store)
print(result.selector.importance_summary())
print(result.features_reduced.feature_names)

# Steps 1–3: full model and stage selection without HPO
result = pipeline.run_step1_to_step3(store)

# Resume from step 3 using a result that already has features_reduced
result = pipeline.run_from_step3(result)

# Re-run HPO only on a previously saved result (skips steps 1–3)
from h2ml import PipelineResult

result = PipelineResult.load("runs/experiment_01")
result = pipeline.run_step4_only(result)

The 4-step pipeline

  • Step 1: K-fold CV of all models (× optional y-transforms) on all features. Key outputs on PipelineResult: best_model_name, step1_agg_df
  • Step 2: fit the best model, compute SHAP importances, drop correlated features. Key outputs: features_reduced, selector
  • Step 3: K-fold CV of all models on the reduced features (winning transform only), compared against step 1. Key output: best_stage ("default" or "reduced")
  • Step 4: Optuna HPO on the winning (model, stage, transform) combination. Key outputs: best_params, step4_agg_df

Step 4 is skipped when the winning model has opt_enabled=False in the registry (e.g. LogisticRegression, GaussianNB, KNeighborsClassifier).

PipelineConfig reference

Parameters (defaults in parentheses):

  • task_type (TaskType.CLASSIFICATION): CLASSIFICATION or REGRESSION
  • metric ("AUC"): short metric name for model selection and HPO; the minimisation direction is derived automatically. Classification: "AUC", "AUC_PR", "F1", "LogLoss", "Brier". Regression: "R2", "MAE", "RMSE".
  • n_splits (5): folds for steps 1 and 3
  • opt_n_splits (3): folds used inside Optuna (fewer = faster)
  • corr_threshold (0.7): correlation threshold for dropping features in step 2. A feature is dropped if it exceeds this value in any of Pearson, Spearman, or Kendall correlation with a higher-ranked feature.
  • n_trials (50): Optuna trials in step 4
  • n_hpo_repeats (1): independent HPO runs with different fold seeds; the best is kept
  • min_features (1): minimum features retained after the correlation filter
  • handle_imbalance (False): inject class_weight="balanced" into classifiers that support it
  • random_state (42): global seed
  • verbose (False): log step-by-step progress to stdout
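The corr_threshold rule can be pictured as a greedy pass over features in importance order. A sketch of the rule as described, not the library's implementation (the helper name correlation_filter is hypothetical):

```python
import pandas as pd

def correlation_filter(df, ranked_features, threshold=0.7, min_features=1):
    """Greedy filter: walk features in importance order and drop any feature
    whose absolute correlation with an already-kept feature exceeds the
    threshold under Pearson, Spearman, or Kendall."""
    corrs = {m: df.corr(method=m).abs() for m in ("pearson", "spearman", "kendall")}
    kept = []
    for f in ranked_features:
        if any(corrs[m].loc[f, k] > threshold for m in corrs for k in kept):
            continue
        kept.append(f)
    # Back-fill in importance order if the filter dropped too many features
    for f in ranked_features:
        if len(kept) >= min_features:
            break
        if f not in kept:
            kept.append(f)
    return kept

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
                   "c": [1, 5, 2, 4, 3]})   # weakly correlated with a
correlation_filter(df, ["a", "b", "c"])     # drops b, keeps a and c
```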

Spatial CV parameters

Set store.coords to an (n_samples, 2) array of spatial coordinates to activate spatial cross-validation. All parameters below are ignored when coords is None.

  • spatial_cv_method ("block"): "block" (quantile-grid) or "spcv" (AHC + cluster ensemble)
  • spatial_cv_metric ("euclidean"): "euclidean" or "haversine" (expects lat/lon in degrees)
  • n_blocks_per_fold (5): blocks per test fold for the block splitter
  • ahc_threshold (None): AHC distance threshold for spcv; auto-set to the 10th percentile of pairwise distances when None
  • exact_max_samples (5000): n ≤ this uses exact scipy AHC; larger n uses approximate sklearn AHC with a k-NN graph
  • knn_neighbors (15): k for the k-NN connectivity graph in approximate AHC
  • pca_components (0.95): variance retained by PCA on block covariates in spcv stage 2
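The "block" method's quantile grid can be sketched as follows. This illustrates quantile-based spatial blocking generically (assuming equal-count bins along each axis; the library's splitter may differ in detail):

```python
import numpy as np

def quantile_blocks(coords, n_bins=4):
    """Assign each sample to a spatial block by binning x and y at their
    quantiles, so each row/column of the grid holds ~equal sample counts."""
    x, y = coords[:, 0], coords[:, 1]
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]    # interior quantile edges
    x_bin = np.digitize(x, np.quantile(x, qs))  # 0 .. n_bins - 1
    y_bin = np.digitize(y, np.quantile(y, qs))
    return x_bin * n_bins + y_bin               # flat block id

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
blocks = quantile_blocks(coords)
# Whole blocks (not individual points) are then grouped into CV test folds,
# keeping training and test samples spatially separated.
```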

Supported models

Classifiers — LogisticRegression, GaussianNB, KNeighborsClassifier, RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier, SVC, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, LGBMClassifier*, CatBoostClassifier*, XGBClassifier*

Regressors — PoissonRegressor, KNeighborsRegressor, RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, SVR, ExtraTreesRegressor, BaggingRegressor, AdaBoostRegressor, LGBMRegressor*, CatBoostRegressor*, XGBRegressor*

* Registered only when the package is installed. Custom models can be injected by passing a models list directly to H2MLPipeline.

PipelineResult

result.summary()                  # combined agg DataFrame across all completed stages
result.summary("AUC_Test_Mean")   # sorted by metric
result.completed_steps            # e.g. [1, 2, 3, 4]
result.best_model_name            # winning model
result.best_stage                 # "default" | "reduced" | "optimized"
result.y_transform                # winning y-transform (regression only)
result.cv_type                    # "spatial" | "random" — set from store.coords
result.cv_warnings                # list of warning strings for models with failed folds
result.step1_agg_df               # per-model mean/std metrics from step 1
result.features_reduced           # PipelineData after feature selection
result.selector.importance_summary()  # SHAP importances as a DataFrame

Exporting the final model

from h2ml.pipeline.final_model import FinalModel

final = result.build_final_model()   # fits on full training set
final.predict(X_new)
final.predict_proba(X_new)           # classification only

final.save("models/final.pkl")
final = FinalModel.load("models/final.pkl")

FinalModel.predict() accepts a DataFrame (columns aligned by name) or a numpy array (must match feature_names order).
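The column alignment described above can be illustrated with plain pandas (a sketch of the behaviour, not the library code):

```python
import pandas as pd

feature_names = ["a", "b", "c"]                          # order stored on the final model
X_new = pd.DataFrame({"c": [30], "a": [10], "b": [20]})  # arbitrary column order
X_aligned = X_new[feature_names].to_numpy()              # reordered by name before predict
```

A numpy array skips this step, which is why its columns must already match feature_names order.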

Conformal prediction intervals

build_final_model() automatically calibrates a conformal predictor from the out-of-fold CV predictions — no held-out data required.

final = result.build_final_model()

# Regression — 90% prediction interval for each sample
lower, upper = final.predict_interval(X_new, alpha=0.10)

# Classification — prediction set for each sample
sets = final.predict_set(X_new, alpha=0.10)
# sets[i] == [1]    → confident prediction of class 1
# sets[i] == [0]    → confident prediction of class 0
# sets[i] == [0, 1] → uncertain; true label could be either

Both methods work on any input — held-out test samples, a prediction grid, spatial rasters, etc. The alpha parameter controls the miscoverage level: alpha=0.10 targets ≥ 90% coverage.

How it works: nonconformity scores (|y − ŷ| for regression, 1 − p(true class) for classification) are computed from the OOF folds and a single threshold q is stored. At inference time the interval is ŷ ± q (regression) or the set of classes with score ≤ q (classification).
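For regression, the threshold computation reduces to a quantile of the out-of-fold residuals. A generic split-conformal sketch of the rule described above (toy numbers, not the package source):

```python
import numpy as np

# Out-of-fold CV residuals act as the calibration set
y_oof_true = np.array([3.0, 5.0, 4.0, 6.0, 2.0, 7.0])
y_oof_pred = np.array([2.5, 5.5, 3.0, 6.5, 2.5, 6.0])
scores = np.abs(y_oof_true - y_oof_pred)          # nonconformity |y - yhat|

alpha = 0.10
n = len(scores)
# Finite-sample-corrected quantile used by split conformal prediction
q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

y_new_pred = np.array([4.2, 5.8])
lower, upper = y_new_pred - q, y_new_pred + q     # constant-width interval yhat ± q
```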

Limitations:

  • Intervals are constant-width — the same q is added to every prediction, so regions of the input space with higher inherent variance get the same interval as low-variance regions.
  • Coverage is marginal, not conditional: the guarantee holds on average over new draws from the training distribution. Predictions on out-of-distribution inputs (e.g. spatial extrapolation beyond the training extent) may not achieve nominal coverage.
  • If result.y_transform is set, the interval is in the transformed space. Apply INVERSE_TRANSFORMS[result.y_transform] to the bounds if you need original-scale intervals.
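Back-transforming the bounds works because the supported transforms are monotone. A numpy sketch (assuming, for illustration only, that "log" means log1p; in practice use the package's INVERSE_TRANSFORMS mapping):

```python
import numpy as np

# Conformal bounds in the transformed space (e.g. from predict_interval)
lower_t = np.array([1.0, 2.0])
upper_t = np.array([1.5, 2.5])

# Hypothetical inverse for a "log" transform, assumed here to be log1p
inverse = np.expm1
lower, upper = inverse(lower_t), inverse(upper_t)
# A monotone increasing inverse preserves the bound ordering
```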

Persistence

from h2ml import PipelineResult

result.save("runs/experiment_01")
result = PipelineResult.load("runs/experiment_01")

DataFrames are serialised as Parquet, numpy arrays as .npy, and Python objects (selector, CV results) as joblib pickles under a single directory.

Comparing runs

from h2ml.evaluation.compare import compare_results

r1 = pipeline_a.run(store)
r2 = pipeline_b.run(store)

df = compare_results([r1, r2], labels=["baseline", "spatial_cv"], metric="AUC")

Returns a DataFrame with one row per result: Run, Metric, Best_Model, Best_Stage, Y_Transform, Score_Mean, Score_Std, Conservative_Bound (variance-penalised score), Brier_Mean, OOF_Brier, N_Features, Completed_Steps.

Visualization

from h2ml.plots.plots import (
    pipeline_scores,    # model scores across all three pipeline stages
    cv_diagnostics,     # classification or regression diagnostic panel
    shap_importance,    # horizontal bar chart of SHAP feature importances
    shap_summary_plot,  # SHAP beeswarm for the final best model
    shap_dependence,    # scatter + lowess for top-N features
)

pipeline_scores(result, save_path="plots/scores.png")
shap_importance(result.selector, save_path="plots/shap.png")

All functions accept an optional save_path; omit it to call plt.show() instead.

Spatial inference (h2mare integration)

h2ml.geo.geo_predict provides functions for spatio-temporal prediction on gridded data via the companion h2mare package:

from h2ml.geo.geo_predict import predict_map

predict_map(
    model=final,
    indexer=indexer,         # h2mare.ParquetIndexer
    dates=("2020-01", "2020-12"),
    bbox=(lon_min, lat_min, lon_max, lat_max),
    target_col="pm25",
    agg_by="month",
    save_path="maps/pm25_2020.png",
)

RunMetadata

Attach experiment labels to results for multi-run comparison:

from h2ml.evaluation.metrics import RunMetadata

pipeline = H2MLPipeline(
    config=config,
    metadata=RunMetadata(schema="v2_features", target="pm25", batch="2024-01"),
)

Labels appear as columns in all fold and agg DataFrames, making it easy to concatenate results across runs.

Contributing

Contributions are welcome. To set up a development environment:

git clone https://github.com/h2ugoparra/h2ml
cd h2ml
uv sync --group dev
uv run pytest

Please submit issues or pull requests on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project was developed under the framework of the COSTA project.
