Uplift modeling integrated with scikit-learn
Project description
libuplift
Uplift modeling package based on and integrated with scikit-learn.
Authors: Szymon Jaroszewicz, Krzysztof Rudaś
Design goals
The design goal of libuplift is to seamlessly integrate with scikit-learn and follow its conventions as closely as possible. It is possible to use model evaluation and tuning facilities from scikit-learn either directly or as thin wrappers provided by libuplift.
Features
- A comprehensive collection of datasets for uplift modeling (we believe this is the most complete collection of randomized datasets)
  - marketing and advertising datasets
  - medical RCT datasets
- Tight integration with scikit-learn: model evaluation routines can be used just as in scikit-learn
- Meta-models: T/S/X Learners, transformed target learner
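The transformed target (class variable transformation) approach listed above can be illustrated in plain NumPy. This is a sketch of the underlying idea, not the libuplift API: under a 50/50 randomized assignment, define z = 1 when a treated unit responds or a control unit does not; a standard classifier fit on z then estimates (uplift + 1) / 2.

```python
import numpy as np

def transform_target(y, trt):
    """Class variable transformation for uplift modeling.

    With a 50/50 randomized treatment assignment, z = 1 iff a treated
    unit responded or a control unit did not respond.  A classifier
    fit on z estimates (uplift + 1) / 2, so uplift = 2*P(z=1|x) - 1.
    """
    return (y * trt + (1 - y) * (1 - trt)).astype(int)

y   = np.array([1, 0, 1, 0])
trt = np.array([1, 1, 0, 0])
print(transform_target(y, trt))  # [1 0 0 1]
```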
Getting started
To install libuplift simply use
pip install libuplift
or, to get the latest version, install directly from GitHub:
pip install git+https://github.com/jszymon/uplift-sklearn
Let us now build an uplift model on the well-known Hillstrom dataset. Begin with the necessary imports:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
Now fetch the dataset and do basic preprocessing
from libuplift.datasets import fetch_Hillstrom
D = fetch_Hillstrom(as_frame=True)
trt = D.treatment
# encode categorical features, standardize numerical features
ct = ColumnTransformer([("ohe", OneHotEncoder(), list(D.categ_values.keys()))],
remainder=StandardScaler())
X = ct.fit_transform(D.data)
# keep only women's campaign
mask = ~(trt == 1)
X = X[mask]
y = D.target_visit[mask]
trt = (trt[mask] == 2)*1
By libuplift convention, treatments are denoted by successive integers, with 0 indicating controls. Additionally, the special n_trt argument is passed to all methods to indicate the number of treatments (if n_trt is None it will be inferred automatically, but this may be unreliable and is discouraged).
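As a concrete illustration of this convention (a standalone sketch, not library code; the group labels in the comment are hypothetical): with a control group and two treatments, trt contains the values 0, 1, 2 and n_trt=2.

```python
import numpy as np

# control = 0, first treatment = 1, second treatment = 2 (example labels)
trt = np.array([0, 2, 1, 0, 2, 1, 1])

# number of non-control treatments; passing it explicitly is
# recommended over letting the library infer it from the data
n_trt = trt.max()
print(n_trt)             # 2
print(np.bincount(trt))  # per-group counts: [2 3 2]
```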
Now, we're ready to fit an uplift model (TLearner in our case)
X_train, X_test, y_train, y_test, trt_train, trt_test = train_test_split(X, y, trt, train_size=0.7)
from libuplift.meta import TLearnerUpliftClassifier
m = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m.fit(X_train, y_train, trt_train, n_trt=1)
and draw an uplift curve
import matplotlib.pyplot as plt
from libuplift.metrics import uplift_curve, area_under_uplift_curve
score = m.predict(X_test)[:,1]
print("AUUC=", area_under_uplift_curve(y_test, score, trt_test, n_trt=1))
cx, cy = uplift_curve(y_test, score, trt_test, n_trt=1)
plt.plot(cx, cy)
plt.plot([0,1], [0,cy[-1]], "k-")
plt.show()
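To make the metric concrete, here is a minimal NumPy sketch of how an uplift curve can be computed for a single treatment. This is an illustration of the idea only; libuplift's own implementation may differ in details such as tie handling and normalization. Units are sorted by decreasing score, and at each targeted fraction the curve shows the difference between cumulative treated and control response rates, scaled by the fraction targeted.

```python
import numpy as np

def simple_uplift_curve(y, score, trt):
    """Toy uplift curve: x = fraction of population targeted,
    y = (treated response rate - control response rate) among the
    top-scored units, times the fraction targeted."""
    order = np.argsort(-score)                  # descending by score
    y, trt = y[order], trt[order]
    n = len(y)
    cum_t = np.cumsum(y * trt)                  # treated responders so far
    cum_c = np.cumsum(y * (1 - trt))            # control responders so far
    n_t = np.maximum(np.cumsum(trt), 1)         # treated count (avoid /0)
    n_c = np.maximum(np.cumsum(1 - trt), 1)     # control count (avoid /0)
    frac = np.arange(1, n + 1) / n
    gain = (cum_t / n_t - cum_c / n_c) * frac
    return frac, gain

# synthetic example: treatment helps more for high-score units
rng = np.random.default_rng(0)
score = rng.random(1000)
trt = rng.integers(0, 2, 1000)
y = (rng.random(1000) < 0.2 + 0.3 * score * trt).astype(int)
frac, gain = simple_uplift_curve(y, score, trt)
```

The endpoint of such a curve is the overall treated-minus-control response rate, which is why the diagonal reference line above is drawn to cy[-1].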
One can use cross_val_score and GridSearchCV to easily evaluate models or tune their parameters, just as one does in scikit-learn. The functions provided by libuplift are thin wrappers around the original scikit-learn functions, so they behave exactly as they would for standard classifiers.
# import those from libuplift instead of sklearn
from libuplift.model_selection import cross_val_score
from libuplift.model_selection import GridSearchCV
m1 = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m_cv1 = GridSearchCV(m1,
{"base_estimator__C":[1e-1,1,1e1,1e2,1e3]},
cv=3, n_jobs=-1)
# tune regularization of treatment/control models separately
m2 = TLearnerUpliftClassifier(base_estimator=[("model_c", LogisticRegression()),
("model_t", LogisticRegression())])
m_cv2 = GridSearchCV(m2,
{"model_c__C":[1e-1,1,1e1,1e2,1e3],
"model_t__C":[1e-1,1,1e1,1e2,1e3]},
cv=3, n_jobs=-1)
Now evaluate both models using cross-validated Area Under the Uplift Curve (AUUC):
auuc_m1 = np.mean(cross_val_score(m_cv1, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
auuc_m2 = np.mean(cross_val_score(m_cv2, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
print("crossval AUUC m1:", auuc_m1)
print("crossval AUUC m2:", auuc_m2)
Finally, do a permutation test and draw a learning curve. Again, the functions below are thin wrappers around the original scikit-learn functions, so they accept the same set of parameters.
from libuplift.model_selection import permutation_test_score, learning_curve
score, permutation_scores, pv =\
permutation_test_score(m, X, y, trt, n_trt=1, cv=3,
n_permutations=100, scoring="auuc",
verbose=10, n_jobs=-1)
fig, (ax0, ax1) = plt.subplots(ncols=2)
ax0.hist(permutation_scores, density=True, label=f"p-value={pv}")
ax0.axvline(score, color="r")
ax0.set_title("Permutation test")
train_sizes, train_scores, test_scores = learning_curve(m, X, y, trt, n_trt=1, scoring="auuc")
train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
ax1.fill_between(train_sizes,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.1, color='r')
ax1.plot(train_sizes, train_scores_mean, 'ro-', label="Train score")
ax1.fill_between(train_sizes,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.1, color='g')
ax1.plot(train_sizes, test_scores_mean, 'go-', label="Test score")
ax1.legend()
ax1.yaxis.tick_right()
ax1.set_title("Learning curve")
plt.show()
We can see that the model is significantly better than random guessing, and optimal performance seems to be achieved already with 10,000 training records.
Documentation
The documentation is available on Read the Docs.
Download files
Source Distribution
Built Distribution
File details
Details for the file libuplift-0.1.tar.gz.
File metadata
- Download URL: libuplift-0.1.tar.gz
- Upload date:
- Size: 110.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 253b9857abfa87c17d388da70063aeaa277e104d1c00d62a34532c6dc85ec2e0 |
| MD5 | 42e25a0d3ef35db0447c94e76cffca3e |
| BLAKE2b-256 | fcb21f0de0359e309fc7f5657c4c6977849faca6c8ab413aa2bf3dbf6f05ee23 |
File details
Details for the file libuplift-0.1-py3-none-any.whl.
File metadata
- Download URL: libuplift-0.1-py3-none-any.whl
- Upload date:
- Size: 136.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a480e93f6c30a5ec3e66b91af247e5778fe3f26f2d32ce4b93e5bcc8bd29aa7 |
| MD5 | 9ec716a8955c326a2f961421fde42541 |
| BLAKE2b-256 | 79abe3b394a88836bb4302d4fa69ed1964a3762c9f5be48c97d4d4906973064a |