Interactive Benchmarking for Machine Learning.
Project description
Powerlift
Advancing the state of machine learning?
With 5-10 datasets? Wake me up when I'm dead.
Powerlift is all about testing machine learning techniques across many, many datasets. So many that we ran into design-of-experiments concerns. So many that we had to develop a package for it.
Yes, we run this for InterpretML on as many Docker containers as we can run in parallel. Why wait days for benchmark evaluations when you can wait minutes? Rhetorical question, please don't hurt me.
```python
def trial_filter(task):
    if task.problem == "binary" and task.n_samples <= 10000:
        return ["rf", "svm"]
    return []


def trial_runner(trial):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer

    if trial.task.problem == "binary":
        # The task provides the features, labels, and metadata (including the
        # categorical mask used below).
        X, y, meta = trial.task.data(["X", "y", "meta"])

        # Holdout split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)

        # Build preprocessor
        is_cat = meta["categorical_mask"]
        cat_cols = [idx for idx in range(X.shape[1]) if is_cat[idx]]
        num_cols = [idx for idx in range(X.shape[1]) if not is_cat[idx]]
        cat_ohe_step = ("ohe", OneHotEncoder(sparse_output=True, handle_unknown="ignore"))
        cat_pipe = Pipeline([cat_ohe_step])
        num_pipe = Pipeline([("identity", FunctionTransformer())])
        transformers = [("cat", cat_pipe, cat_cols), ("num", num_pipe, num_cols)]
        ct = Pipeline(
            [
                ("ct", ColumnTransformer(transformers=transformers)),
                (
                    "missing",
                    SimpleImputer(add_indicator=True, strategy="most_frequent"),
                ),
            ]
        )

        # Connect preprocessor with target learner
        if trial.method == "svm":
            clf = Pipeline([("ct", ct), ("est", CalibratedClassifierCV(LinearSVC()))])
        else:
            clf = Pipeline([("ct", ct), ("est", RandomForestClassifier())])

        # Train
        clf.fit(X_tr, y_tr)

        # Predict
        predictions = clf.predict_proba(X_te)[:, 1]

        # Score
        auc = roc_auc_score(y_te, predictions)
        trial.log("auc", auc)

import os

from powerlift.bench import Benchmark, Store
from powerlift.bench import populate_with_datasets

# Initialize database (if needed).
conn_str = f"sqlite:///{os.getcwd()}/powerlift.db"
store = Store(conn_str, force_recreate=False)

# This downloads datasets once and feeds them into the database.
populate_with_datasets(store, cache_dir="~/.powerlift", exist_ok=True)

# Run experiment
benchmark = Benchmark(conn_str, name="SVM vs RF")
benchmark.run(trial_runner, trial_filter)
benchmark.wait_until_complete()
```
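Everything a trial logs (such as the "auc" values above) lands in the database behind `conn_str`. The table layout is an internal detail of the store and not part of the documented API, so the sketch below just peeks at the SQLite file with the standard library to confirm what was written; adapt the queries once you know which tables you care about.

```python
import sqlite3

# Generic inspection helper (not powerlift API): list the tables the store
# created and how many rows each one holds.
with sqlite3.connect("powerlift.db") as conn:
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    for table in tables:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        print(f"{table}: {count} rows")
```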
This can also be run on Azure Container Instances where needed.
```python
# Run experiment (but in ACI).
import os

from powerlift.bench import Benchmark, Store
from powerlift.executors import AzureContainerInstance

store = Store(os.getenv("AZURE_DB_URL"))
azure_tenant_id = os.getenv("AZURE_TENANT_ID")
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
azure_client_id = os.getenv("AZURE_CLIENT_ID")
azure_client_secret = os.getenv("AZURE_CLIENT_SECRET")
resource_group = os.getenv("AZURE_RESOURCE_GROUP")

executor = AzureContainerInstance(
    store,
    azure_tenant_id,
    subscription_id,
    azure_client_id,
    azure_client_secret=azure_client_secret,
    resource_group=resource_group,
    n_running_containers=5,
)

benchmark = Benchmark(store, name="SVM vs RF")
benchmark.run(trial_runner, trial_filter, timeout=10, executor=executor)
benchmark.wait_until_complete()
```
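The ACI example pulls all of its Azure credentials from environment variables. A small preflight check like the sketch below (it only uses the variable names the example already reads; nothing here is powerlift-specific) makes the failure mode obvious when one of them is missing:

```python
import os

# Fail fast if any credential the ACI example expects is not set.
REQUIRED_VARS = [
    "AZURE_DB_URL",
    "AZURE_TENANT_ID",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_RESOURCE_GROUP",
]
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```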
Install
```sh
pip install powerlift[datasets]
```
That's it, go get 'em boss.
Download files
Download the file for your platform.
Source Distributions
No source distribution files available for this release.
Built Distribution
File details
Details for the file powerlift-0.1.12-py3-none-any.whl.
File metadata
- Download URL: powerlift-0.1.12-py3-none-any.whl
- Upload date:
- Size: 28.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 23103f6f1074ddc70fb52ff7c31cb04f0c5d1d105747176e63d098b4e6bc355e
MD5 | 87adb2c60186f7e4f106866a1d7eb9f4
BLAKE2b-256 | 6b7bed0f7f2f9b7532a26bc64b3cf9cbee1c0d27d2cde56254dbb371a083ddbf