libuplift

Uplift modeling package based on and integrated with scikit-learn.

Authors: Szymon Jaroszewicz, Krzysztof Rudaś

Design goals

The design goal of libuplift is to seamlessly integrate with scikit-learn and follow its conventions as closely as possible. It is possible to use model evaluation and tuning facilities from scikit-learn either directly or as thin wrappers provided by libuplift.

Features

  • A comprehensive collection of datasets for uplift modeling (to our knowledge, the most complete collection of randomized uplift datasets)
    • marketing and advertising datasets
    • medical RCT (randomized controlled trial) datasets
  • Tight integration with scikit-learn: model evaluation routines can be used just as in scikit-learn
  • Meta-models: T/S/X learners, transformed target learner
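To give a flavor of the transformed target approach, here is a minimal, self-contained sketch of the underlying class-variable transformation idea on synthetic data. This is an illustration of the technique, not libuplift's implementation; all variable names and the synthetic data are ours, and the simple 2*p - 1 formula assumes a roughly 50/50 treatment split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
trt = rng.integers(0, 2, size=n)             # 0 = control, 1 = treatment
# synthetic outcome: treatment helps only when the first feature is positive
p = 0.3 + 0.2 * trt * (X[:, 0] > 0)
y = rng.binomial(1, p)

# class-variable transformation: z = 1 iff (treated and responded)
# or (control and did not respond)
z = (y == trt).astype(int)
clf = LogisticRegression().fit(X, z)
# with a balanced treatment split, uplift(x) ~= 2*P(z=1|x) - 1
uplift = 2 * clf.predict_proba(X)[:, 1] - 1
```

A single standard classifier trained on z thus yields an uplift estimate, which is the appeal of the transformed target family of methods.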

Getting started

To install libuplift, simply use

pip install libuplift

or, to get the latest version, install directly from GitHub

pip install git+https://github.com/jszymon/uplift-sklearn

Let us now build an uplift model on the well-known Hillstrom dataset. Begin with the necessary imports:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

Now fetch the dataset and do some basic preprocessing:

from libuplift.datasets import fetch_Hillstrom
D = fetch_Hillstrom(as_frame=True)
trt = D.treatment
# encode categorical features, standardize numerical features
ct = ColumnTransformer([("ohe", OneHotEncoder(), list(D.categ_values.keys()))],
                       remainder=StandardScaler())
X = ct.fit_transform(D.data)
# keep only women's campaign
mask = ~(trt == 1)
X = X[mask]
y = D.target_visit[mask]
trt = (trt[mask] == 2)*1

By libuplift convention, treatments are denoted by consecutive integers, with 0 indicating controls. Additionally, the special n_trt argument is passed to all methods to indicate the number of treatments (if n_trt is None, it will be inferred automatically, but this may be unreliable and is discouraged).
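The convention can be illustrated with a small hypothetical treatment vector. Note that inferring n_trt from the data (e.g. as trt.max()) only works when every treatment actually occurs in the sample, which is why passing it explicitly is safer:

```python
import numpy as np

# three arms: 0 = control, 1 = first treatment, 2 = second treatment
trt = np.array([0, 2, 1, 0, 1, 2, 0])
n_trt = int(trt.max())   # inferred number of non-control treatments: 2
# prefer passing n_trt explicitly rather than relying on this inference
```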

Now we're ready to fit an uplift model (a T-learner in our case):

X_train, X_test, y_train, y_test, trt_train, trt_test = \
    train_test_split(X, y, trt, train_size=0.7)
from libuplift.meta import TLearnerUpliftClassifier  # exact import path may differ; see the docs
m = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m.fit(X_train, y_train, trt_train, n_trt=1)

and draw an uplift curve

import matplotlib.pyplot as plt
from libuplift.metrics import uplift_curve, area_under_uplift_curve

score = m.predict(X_test)[:, 1]
print("AUUC =", area_under_uplift_curve(y_test, score, trt_test, n_trt=1))
cx, cy = uplift_curve(y_test, score, trt_test, n_trt=1)
plt.plot(cx, cy)
plt.plot([0, cx[-1]], [0, cy[-1]], "k-")  # reference line: random targeting
plt.show()

An uplift curve
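For intuition, the quantity behind an uplift curve can be computed by hand: rank subjects by score and, within each top fraction, compare response rates between treated and control subjects. Exact conventions (scaling, normalization) vary between packages, so the sketch below on synthetic data is illustrative rather than libuplift's precise definition:

```python
import numpy as np

def uplift_at_fraction(y, score, trt, frac):
    """Estimated uplift among the top `frac` of subjects ranked by score."""
    order = np.argsort(-score)
    k = max(1, int(frac * len(y)))
    top = order[:k]
    t_mask = trt[top] == 1
    # difference of response rates between treated and control in the top group
    return y[top][t_mask].mean() - y[top][~t_mask].mean()

rng = np.random.default_rng(1)
n = 2000
score = rng.normal(size=n)
trt = rng.integers(0, 2, size=n)
# responders are more likely among treated subjects with high scores
y = rng.binomial(1, 0.2 + 0.3 * trt * (score > 0))
```

At frac=1.0 this reduces to the overall treated-minus-control response rate, which is why every uplift curve ends at the sample's average treatment effect regardless of the model.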

One can use cross_val_score and GridSearchCV to easily evaluate models or tune their parameters, just as one does in scikit-learn. The functions provided by libuplift are thin wrappers around the original scikit-learn functions, so they behave exactly as they would for standard classifiers.

# import those from libuplift instead of sklearn
from libuplift.model_selection import cross_val_score
from libuplift.model_selection import GridSearchCV

m1 = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m_cv1 = GridSearchCV(m1,
                     {"base_estimator__C":[1e-1,1,1e1,1e2,1e3]},
                     cv=3, n_jobs=-1)
# tune regularization of treatment/control models separately
m2 = TLearnerUpliftClassifier(base_estimator=[("model_c", LogisticRegression()),
                                              ("model_t", LogisticRegression())])
m_cv2 = GridSearchCV(m2,
                     {"model_c__C": [1e-1, 1, 1e1, 1e2, 1e3],
                      "model_t__C": [1e-1, 1, 1e1, 1e2, 1e3]},
                     cv=3, n_jobs=-1)

Now evaluate both models using the cross-validated Area Under the Uplift Curve (AUUC):

auuc_m1 = np.mean(cross_val_score(m_cv1, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
auuc_m2 = np.mean(cross_val_score(m_cv2, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
print("crossval AUUC m1:", auuc_m1)
print("crossval AUUC m2:", auuc_m2)

Finally, do a permutation test and draw a learning curve. Again, the functions below are thin wrappers around the original scikit-learn functions, so they accept the same set of parameters.
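The idea behind a permutation test is simple: repeatedly shuffle the labels to destroy any real association, recompute the statistic, and report the fraction of shuffled runs that score at least as well as the observed one. A minimal numpy sketch of this logic on synthetic data (using a plain treatment-effect statistic rather than libuplift's cross-validated AUUC):

```python
import numpy as np

def ate(y, trt):
    """Difference in response rates between treated and control subjects."""
    return y[trt == 1].mean() - y[trt == 0].mean()

rng = np.random.default_rng(2)
n = 1000
trt = rng.integers(0, 2, size=n)
y = rng.binomial(1, 0.2 + 0.1 * trt)         # real treatment effect of 0.1

observed = ate(y, trt)
# null distribution: shuffle treatment assignment, breaking any real effect
perm = np.array([ate(y, rng.permutation(trt)) for _ in range(200)])
# p-value: fraction of permutations at least as extreme as the observed value
pv = (1 + np.sum(perm >= observed)) / (1 + len(perm))
```

A small p-value indicates the observed effect is unlikely under random treatment assignment, which is exactly what permutation_test_score reports for the chosen scoring metric.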

from libuplift.model_selection import permutation_test_score, learning_curve

score, permutation_scores, pv =\
    permutation_test_score(m, X, y, trt, n_trt=1, cv=3,
                           n_permutations=100, scoring="auuc",
                           verbose=10, n_jobs=-1)

fig, (ax0, ax1) = plt.subplots(ncols=2)
ax0.hist(permutation_scores, density=True, label=f"p-value={pv}")
ax0.axvline(score, color="r")
ax0.set_title("Permutation test")

train_sizes, train_scores, test_scores = learning_curve(m, X, y, trt, n_trt=1, scoring="auuc")

train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
ax1.fill_between(train_sizes,
                 train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std,
                 alpha=0.1, color='r')
ax1.plot(train_sizes, train_scores_mean, 'ro-', label="Train score")
ax1.fill_between(train_sizes,
                 test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std,
                 alpha=0.1, color='g')
ax1.plot(train_sizes, test_scores_mean, 'go-', label="Test score")
ax1.legend()
ax1.yaxis.tick_right()
ax1.set_title("Learning curve")
plt.show()

Permutation test and learning curve

We can see that the model is significantly better than random guessing, and near-optimal performance seems to be achieved already with about 10,000 training records.

Documentation

The documentation is available on Read the Docs.

Project details

Release history

This version: 0.1