Machine learning based causal inference/uplift in Python

Development Status
- 5 - Production/Stable
License
- OSI Approved :: BSD License
Operating System
- OS Independent

Contents: Application • Data and Examples • To-Do • References

causeinfer is a Python package for estimating average and conditional average treatment effects using machine learning. It aims to collect both standard and advanced causal inference models and to demonstrate their usage and efficacy, with the broader goal of helping people learn causal inference (CI) techniques across business, medical, and socioeconomic fields. See the documentation for a full outline of the package, including models and datasets.
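
For reference, the two estimands named above are conventionally written in potential-outcomes notation as below. The symbols are standard notation rather than anything taken from the package documentation: Y(1) and Y(0) are a unit's outcomes with and without treatment, W is the treatment indicator, and X the covariates. The last line assumes treatment assignment is independent of the potential outcomes given X, as in a randomized experiment.

```latex
\begin{aligned}
\text{ATE} &= \mathbb{E}\left[ Y(1) - Y(0) \right] \\
\text{CATE}(x) &= \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right] \\
&= \mathbb{E}\left[ Y \mid X = x, W = 1 \right] - \mathbb{E}\left[ Y \mid X = x, W = 0 \right]
\end{aligned}
```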

Installation

causeinfer can be downloaded from PyPI via pip or sourced directly from this repository:

    pip install causeinfer

    git clone https://github.com/andrewtavis/causeinfer.git
    cd causeinfer
    python setup.py install

    import causeinfer

Application

Causal inference algorithms:

Two Model Approach

Separate models for treatment and control groups are trained and combined to derive average treatment effects (Hansotia, 2002).

    from causeinfer.standard_algorithms import TwoModel
    from sklearn.ensemble import RandomForestClassifier

    tm = TwoModel(
        treatment_model=RandomForestClassifier(**kwargs),
        control_model=RandomForestClassifier(**kwargs),
    )
    tm.fit(X=X_train, y=y_train, w=w_train)

    # An array of predictions given a treatment and control model
    tm_preds = tm.predict(X=X_test)

    # An array of predicted treatment class probabilities given models
    tm_probas = tm.predict_proba(X=X_test)
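
To make the idea concrete, here is a minimal standard-library sketch of the two model approach. It is not how causeinfer implements `TwoModel`: the "model" is simply the mean outcome per segment, and the helper names (`fit_rate_model`, `two_model_uplift`) and the toy data are hypothetical. It only shows that one model is fit per group and the predictions are subtracted.

```python
from collections import defaultdict


def fit_rate_model(segments, outcomes):
    """Fit a toy 'model': the mean outcome observed in each segment."""
    totals = defaultdict(lambda: [0.0, 0])
    for seg, y in zip(segments, outcomes):
        totals[seg][0] += y
        totals[seg][1] += 1
    return {seg: s / n for seg, (s, n) in totals.items()}


def two_model_uplift(segments, y, w):
    """Estimate uplift per segment as treated rate minus control rate."""
    treated = [(s, yi) for s, yi, wi in zip(segments, y, w) if wi == 1]
    control = [(s, yi) for s, yi, wi in zip(segments, y, w) if wi == 0]
    model_t = fit_rate_model(*zip(*treated))  # model trained on treated units
    model_c = fit_rate_model(*zip(*control))  # model trained on control units
    # Only segments seen in both groups have a defined difference
    return {s: model_t[s] - model_c[s] for s in model_t if s in model_c}


# Hypothetical toy data: one categorical covariate, binary outcome and treatment
segments = ["new", "new", "new", "new", "loyal", "loyal", "loyal", "loyal"]
w = [1, 1, 0, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 1, 0, 1, 0]

uplift = two_model_uplift(segments, y, w)
print(uplift)
```

In practice the per-segment mean would be replaced by any classifier or regressor, as in the `RandomForestClassifier` usage above.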

Interaction Term Approach

An interaction term between treatment and covariates is added to the data to allow for a basic single model application (Lo, 2002).

    from causeinfer.standard_algorithms import InteractionTerm
    from sklearn.ensemble import RandomForestClassifier

    it = InteractionTerm(model=RandomForestClassifier(**kwargs))
    it.fit(X=X_train, y=y_train, w=w_train)

    # An array of predictions given a treatment and control interaction term
    it_preds = it.predict(X=X_test)

    # An array of predicted treatment class probabilities given interaction terms
    it_probas = it.predict_proba(X=X_test)
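
The feature construction behind this approach can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's internals: `add_interaction_terms`, `uplift_from_single_model`, and the hand-picked linear coefficients are all hypothetical. Each row is extended with the treatment flag and the covariate-by-treatment products, and the effect is the difference between scoring a unit as treated and as control.

```python
def add_interaction_terms(X, w):
    """Return rows of [x_1..x_p, w, x_1*w..x_p*w] for each unit."""
    return [list(x) + [wi] + [xj * wi for xj in x] for x, wi in zip(X, w)]


def uplift_from_single_model(score, X):
    """Score each unit as treated and as control, then take the difference."""
    as_treated = add_interaction_terms(X, [1] * len(X))
    as_control = add_interaction_terms(X, [0] * len(X))
    return [score(t) - score(c) for t, c in zip(as_treated, as_control)]


# Hypothetical "fitted" linear scorer over [x1, x2, w, x1*w, x2*w]
coefs = [0.5, -0.25, 1.0, 2.0, 0.0]


def linear_score(row):
    return sum(c * v for c, v in zip(coefs, row))


X = [[1.0, 2.0], [0.0, 4.0]]
rows = add_interaction_terms(X, [1, 0])
effects = uplift_from_single_model(linear_score, X)
print(rows)
print(effects)
```

With a linear scorer the estimated effect reduces to the treatment coefficient plus the interaction coefficients times the covariates, which is why a single model is enough here.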

Class Transformation Approaches

Units are categorized into two or four classes to derive treatment effects from favorable class attributes (Lai, 2006; Kane, et al, 2014; Shaar, et al, 2016).

    # Binary Class Transformation
    from causeinfer.standard_algorithms import BinaryTransformation
    from sklearn.ensemble import RandomForestRegressor

    bt = BinaryTransformation(model=RandomForestRegressor(**kwargs), regularize=True)
    bt.fit(X=X_train, y=y_train, w=w_train)

    # An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
    bt_probas = bt.predict_proba(X=X_test)

    # Quaternary Class Transformation
    from causeinfer.standard_algorithms import QuaternaryTransformation
    from sklearn.ensemble import RandomForestRegressor

    qt = QuaternaryTransformation(model=RandomForestRegressor(**kwargs), regularize=True)
    qt.fit(X=X_train, y=y_train, w=w_train)

    # An array of predicted probabilities (P(Favorable Class), P(Unfavorable Class))
    qt_probas = qt.predict_proba(X=X_test)
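
As a rough sketch of the relabeling these approaches rely on, the snippet below builds the two-class and four-class targets from an outcome `y` and a treatment flag `w`. The function names and the class codes (`TR`, `TN`, `CR`, `CN`) are illustrative assumptions, and causeinfer may encode classes or combine the predicted probabilities differently. The `2p - 1` conversion assumes treatment and control groups of equal size, following Jaśkowski & Jaroszewicz (2012).

```python
def binary_class(y, w):
    """Favorable (1): treated responders and control non-responders."""
    return 1 if (w == 1 and y == 1) or (w == 0 and y == 0) else 0


def quaternary_class(y, w):
    """TR/TN = treated (non-)responder, CR/CN = control (non-)responder."""
    return ("T" if w == 1 else "C") + ("R" if y == 1 else "N")


def uplift_from_binary(p_favorable):
    """Uplift implied by P(favorable | x), assuming a 50/50 treatment split."""
    return 2.0 * p_favorable - 1.0


y = [1, 0, 1, 0]
w = [1, 1, 0, 0]
binary_labels = [binary_class(yi, wi) for yi, wi in zip(y, w)]
quaternary_labels = [quaternary_class(yi, wi) for yi, wi in zip(y, w)]
print(binary_labels, quaternary_labels)
```

A single classifier is then trained on the transformed label instead of the raw outcome.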

Generalized Random Forest (in progress)

A wrapper application of honest, causality-based splitting random forests via the R/C++ grf package (Athey, Tibshirani, and Wager, 2019).

# Example code in progress

Further Models to Consider

Under consideration for inclusion in causeinfer:

Reflective and Pessimistic Uplift - Shaar, et al (2016)

The X-Learner - Kunzel, et al (2019)

The R-Learner - Nie and Wager (2017)

Double Machine Learning - Chernozhukov, et al (2018)

Information Theory Trees/Forests - Soltys, et al (2015)

Evaluation metrics:

Visualization Metrics and Coefficients

Comparisons across stratified, ordered treatment response groups are used to derive model efficiency.

    import matplotlib.pyplot as plt
    import pandas as pd
    from causeinfer.evaluation import plot_cum_gain, plot_qini

    # tm_effects, it_effects, bt_effects and qt_effects are assumed to be
    # arrays of predicted treatment effects from the models above
    visual_eval_dict = {
        "y_test": y_test,
        "w_test": w_test,
        "two_model": tm_effects,
        "interaction_term": it_effects,
        "binary_trans": bt_effects,
        "quaternary_trans": qt_effects,
    }

    df_visual_eval = pd.DataFrame(visual_eval_dict, columns=visual_eval_dict.keys())
    model_pred_cols = [
        col for col in visual_eval_dict.keys() if col not in ["y_test", "w_test"]
    ]

    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(20, 5))

    plot_cum_gain(
        df=df_visual_eval,
        n=100,
        models=model_pred_cols,
        percent_of_pop=True,
        outcome_col="y_test",
        treatment_col="w_test",
        normalize=True,
        random_seed=42,
        figsize=None,
        fontsize=20,
        axis=ax1,
        legend_metrics=True,
    )

    plot_qini(
        df=df_visual_eval,
        n=100,
        models=model_pred_cols,
        percent_of_pop=True,
        outcome_col="y_test",
        treatment_col="w_test",
        normalize=True,
        random_seed=42,
        figsize=None,
        fontsize=20,
        axis=ax2,
        legend_metrics=True,
    )
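
For readers new to these plots, the following is a minimal sketch of how a Qini curve can be computed from ranked predictions. It uses one common definition: cumulative treated responders minus control responders rescaled to the treated group size. It is not a reproduction of `plot_qini`, which may bin, normalize, or sample differently. The function name `qini_curve` and the toy data are hypothetical.

```python
def qini_curve(y, w, scores):
    """Return the Qini value after each unit, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    resp_t = resp_c = n_t = n_c = 0
    curve = []
    for i in order:
        if w[i] == 1:
            n_t += 1
            resp_t += y[i]
        else:
            n_c += 1
            resp_c += y[i]
        # Scale control responders to the size of the treated group so far
        scaled_control = resp_c * n_t / n_c if n_c else 0.0
        curve.append(resp_t - scaled_control)
    return curve


# Hypothetical test-set outcomes, treatment flags and predicted uplift
y_test = [1, 0, 1, 0, 0, 1]
w_test = [1, 0, 1, 0, 1, 0]
pred = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]

curve = qini_curve(y_test, w_test, pred)
print(curve)
```

A model that ranks units well pushes the curve above the straight line that a random ordering would produce, and the area between the two is what a Qini coefficient summarizes.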

Hillstrom Metrics

CMF Microfinance Metrics

Iterated Model Variance Analysis

Quickly iterate models to derive their average effects and prediction variance. See a full example across all datasets and models in the following notebook.

    from causeinfer.evaluation import iterate_model, eval_table

    n = num_iterations

    avg_preds, all_preds, avg_eval, eval_variance, eval_sd, all_evals = iterate_model(
        model=model,
        X_train=dataset_keys[dataset]["X_train"],
        y_train=dataset_keys[dataset]["y_train"],
        w_train=dataset_keys[dataset]["w_train"],
        X_test=dataset_keys[dataset]["X_test"],
        y_test=dataset_keys[dataset]["y_test"],
        w_test=dataset_keys[dataset]["w_test"],
        tau_test=None,
        n=n,
        pred_type="predict_proba",
        eval_type="qini",
        normalize_eval=False,
        notify_iter=n / 10,
    )

    model_eval_dict[dataset].update(
        {
            str(model)
            .split(".")[-1]
            .split(" ")[0]: {
                "avg_preds": avg_preds,
                "all_preds": all_preds,
                "avg_eval": avg_eval,
                "eval_variance": eval_variance,
                "eval_sd": eval_sd,
                "all_evals": all_evals,
            }
        }
    )

    df_model_eval = eval_table(model_eval_dict, variances=True, annotate_vars=True)

    df_model_eval

| Dataset | TwoModel | InteractionTerm | BinaryTransformation | QuaternaryTransformation |
| --- | --- | --- | --- | --- |
| Hillstrom | 3.541 ± 4.25** | 3.533 ± 4.015** | 2.197 ± 1.439* | 1.483 ± 1.677* |
| Mayo PBC | -0.073 ± 0.114 | -0.135 ± 0.176 | -0.705 ± 0.125 | -0.310 ± 0.123 |
| CMF Microfinance | 16.262 ± 6.648** | 15.448 ± 4.115** | nan | nan |
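
The general pattern of repeating an evaluation and summarizing its spread can be sketched with the standard library alone. This is an assumption-laden illustration rather than the behavior of `iterate_model`: `iterate_evaluation` and `noisy_metric` are hypothetical names, and the noisy metric merely stands in for "refit a model and compute its evaluation score".

```python
import random
import statistics


def iterate_evaluation(evaluate, n, seed=0):
    """Run evaluate(rng) n times and summarize the resulting scores."""
    rng = random.Random(seed)
    scores = [evaluate(rng) for _ in range(n)]
    return {
        "all_evals": scores,
        "avg_eval": statistics.mean(scores),
        "eval_variance": statistics.variance(scores),
        "eval_sd": statistics.stdev(scores),
    }


def noisy_metric(rng):
    """Hypothetical stand-in for 'refit a model and compute its Qini score'."""
    return 3.0 + rng.gauss(0.0, 1.0)


summary = iterate_evaluation(noisy_metric, n=50, seed=42)
print(f"{summary['avg_eval']:.3f} ± {summary['eval_sd']:.3f}")
```

Fixing the seed makes the whole sequence of iterations reproducible, which helps when comparing models on the same dataset.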

GRF Econometric Evaluations (in progress)

Confidence intervals are created using GRF's honesty based, Gaussian asymptotic forest summations.

# Example code in progress

Data and Examples

Business Analytics

Hillstrom Email Marketing

Downloaded and formatted directly with causeinfer (see script).

Example notebook.

    import pandas as pd
    from causeinfer.data import hillstrom

    hillstrom.download_hillstrom()
    data_hillstrom = hillstrom.load_hillstrom(
        user_file_path="datasets/hillstrom.csv", format_covariates=True, normalize=True
    )

    df = pd.DataFrame(
        data_hillstrom["dataset_full"], columns=data_hillstrom["dataset_full_names"]
    )

Criteo Uplift

Download and formatting script in progress.

Example notebook to follow.

Medical Trials

Mayo Clinic PBC

Downloaded and formatted directly with causeinfer (see script).

Also included in the datasets directory for direct download.

Example notebook.

    import pandas as pd
    from causeinfer.data import mayo_pbc

    mayo_pbc.download_mayo_pbc()
    data_mayo_pbc = mayo_pbc.load_mayo_pbc(
        user_file_path="datasets/mayo_pbc.text", format_covariates=True, normalize=True
    )

    df = pd.DataFrame(
        data_mayo_pbc["dataset_full"], columns=data_mayo_pbc["dataset_full_names"]
    )

Pintilie Tamoxifen

Originally accompanied the linked text, but is no longer available from that source. It is included in the datasets directory for direct download.

Formatting script in progress.

Example notebook to follow.

Socioeconomic Analysis

CMF Microfinance

Originally accompanied the linked text, but is no longer available from that source. It is included in the datasets directory for direct download.

Formatted with causeinfer (see script).

Example notebook.

    import pandas as pd
    from causeinfer.data import cmf_micro

    data_cmf_micro = cmf_micro.load_cmf_micro(
        user_file_path="datasets/cmf_micro", format_covariates=True, normalize=True
    )

    df = pd.DataFrame(
        data_cmf_micro["dataset_full"], columns=data_cmf_micro["dataset_full_names"]
    )

Lalonde Job Training

Download and formatting script in progress.

Example notebook to follow.

Simulated Data

Work is currently being done to add a data generator, thus allowing for theoretical tests with known treatment effects.
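
Until that generator exists, the idea of testing against a known treatment effect can be illustrated with a small standard-library sketch. Everything here is hypothetical and unrelated to the planned causeinfer generator: the function name, its parameters, and the data-generating process, in which a single binary covariate decides whether a unit responds to treatment.

```python
import random


def simulate_uplift_data(n, base_rate=0.2, effect=0.3, seed=0):
    """Generate (x, w, y, tau) rows; only units with x == 1 respond to treatment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)    # single binary covariate
        w = rng.randint(0, 1)    # randomized 50/50 treatment assignment
        tau = effect * x         # known individual treatment effect
        p = base_rate + tau * w  # outcome probability for this unit
        y = 1 if rng.random() < p else 0
        rows.append((x, w, y, tau))
    return rows


rows = simulate_uplift_data(n=10_000, seed=1)

# Compare the known average effect with a difference-in-means estimate
true_ate = sum(tau for _, _, _, tau in rows) / len(rows)
treated = [y for _, w, y, _ in rows if w == 1]
control = [y for _, w, y, _ in rows if w == 0]
est_ate = sum(treated) / len(treated) - sum(control) / len(control)
print(round(true_ate, 3), round(est_ate, 3))
```

Because `tau` is recorded for every unit, model predictions can be scored directly against the true effects rather than through ranking-based metrics alone.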

Example notebook to follow.

To-Do

Adding more baseline models and datasets

Converting GRF files to Python and connecting to C++ boilerplate

Creating, improving, and sharing examples

Adding predict to binary_transformation and quaternary_transformation

Updating and refining the documentation

Improving tests for greater code coverage

Similar Projects

Similar packages and modules to causeinfer

Python

https://github.com/uber/causalml

https://github.com/Minyus/causallift

https://github.com/maks-sh/scikit-uplift

https://github.com/duketemon/pyuplift

https://github.com/microsoft/EconML

https://github.com/Microsoft/dowhy

https://github.com/wayfair/pylift/

https://github.com/jszymon/uplift_sklearn

Other Languages

https://github.com/grf-labs/grf (R/C++)

https://github.com/soerenkuenzel/causalToolbox/X-Learner (R/C++)

https://github.com/xnie/rlearner (R)

References

Full list of theoretical references

Big Data and Machine Learning

Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, Vol. 355, No. 6324, February 3, 2017, pp. 483-485.

Athey, S. & Imbens, G. (2015). Machine Learning Methods for Estimating Heterogeneous Causal Effects. Draft version submitted April 5th, 2015, arXiv:1504.01132v1, pp. 1-25.

Athey, S. & Imbens, G. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, Vol. 11, August 2019, pp. 685-725.

Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, Vol. 21, No. 1, February 1, 2018, pp. C1–C68.

Mullainathan, S. & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, Vol. 31, No. 2, Spring 2017, pp. 87-106.

Causal Inference

Athey, S. & Imbens, G. (2017). The State of Applied Econometrics: Causality and Policy Evaluation. Journal of Economic Perspectives, Vol. 31, No. 2, Spring 2017, pp. 3-32.

Athey, S., Tibshirani, J. & Wager, S. (2019). Generalized random forests. The Annals of Statistics, Vol. 47, No. 2 (2019), pp. 1148-1178.

Athey, S. & Wager, S. (2019). Efficient Policy Learning. Draft version submitted on 9 Feb 2017, last revised 16 Sep 2019, arXiv:1702.02896v5, pp. 1-10.

Banerjee, A. et al. (2015). The Miracle of Microfinance? Evidence from a Randomized Evaluation. American Economic Journal: Applied Economics, Vol. 7, No. 1, January 1, 2015, pp. 22-53.

Ding, P. & Li, F. (2018). Causal Inference: A Missing Data Perspective. Statistical Science, Vol. 33, No. 2, 2018, pp. 214-237.

Farrell, M., Liang, T. & Misra S. (2018). Deep Neural Networks for Estimation and Inference: Application to Causal Effects and Other Semiparametric Estimands. Draft version submitted December 2018, arXiv:1809.09953, pp. 1-54.

Gutierrez, P. & Gérardy, JY. (2016). Causal Inference and Uplift Modeling: A review of the literature. JMLR: Workshop and Conference Proceedings 67, 2016, pp. 1–14.

Hitsch, G J. & Misra, S. (2018). Heterogeneous Treatment Effects and Optimal Targeting Policy Evaluation. January 28, 2018, Available at SSRN: ssrn.com/abstract=3111957 or dx.doi.org/10.2139/ssrn.3111957, pp. 1-64.

Powers, S. et al. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine, Vol. 37, No. 11, May 20, 2018, pp. 1767-1787.

Rosenbaum, P. & Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, Vol. 70, pp. 41-55.

Sekhon, J. (2007). The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods. The Oxford Handbook of Political Methodology, Winter 2017, pp. 1-46.

Wager, S. & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, Vol. 113, 2018 - Issue 523, pp. 1228-1242.

Uplift

Devriendt, F. et al. (2018). A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics. Big Data, Vol. 6, No. 1, March 1, 2018, pp. 1-29. Codes found at: data-lab.be/downloads.php.

Hansotia, B. & Rukstales, B. (2002). Incremental value modeling. Journal of Interactive Marketing, Vol. 16, No. 3, Summer 2002, pp. 35-46.

Haupt, J., Jacob, D., Gubela, R. & Lessmann, S. (2019). Affordable Uplift: Supervised Randomization in Controlled Experiments. Draft version submitted on October 1, 2019, arXiv:1910.00393v1, pp. 1-15.

Jaroszewicz, S. & Rzepakowski, P. (2014). Uplift modeling with survival data. Workshop on Health Informatics (HI-KDD) New York City, August 2014, pp. 1-8.

Jaśkowski, M. & Jaroszewicz, S. (2012). Uplift modeling for clinical trial data. In: ICML, 2012, Workshop on machine learning for clinical data analysis. Edinburgh, Scotland, June 2012, pp. 1-8.

Kane, K., Lo, VSY. & Zheng, J. (2014). Mining for the truly responsive customers and prospects using true-lift modeling: Comparison of new and existing methods. Journal of Marketing Analytics, Vol. 2, No. 4, December 2014, pp. 218–238.

Lai, L.Y.-T. (2006). Influential marketing: A new direct marketing strategy addressing the existence of voluntary buyers. Master of Science thesis, Simon Fraser University School of Computing Science, Burnaby, BC, Canada, pp. 1-68.

Lo, VSY. (2002). The true lift model: a novel data mining approach to response modeling in database marketing. SIGKDD Explor 4(2), pp. 78–86.

Lo, VSY. & Pachamanova, D. (2016). From predictive uplift modeling to prescriptive uplift analytics: A practical approach to treatment optimization while accounting for estimation risk. Journal of Marketing Analytics Vol. 3, No. 2, pp. 79–95.

Radcliffe N.J. & Surry, P.D. (1999). Differential response analysis: Modeling true response by isolating the effect of a single action. In Proceedings of Credit Scoring and Credit Control VI. Credit Research Centre, University of Edinburgh Management School.

Radcliffe N.J. & Surry, P.D. (2011). Real-World Uplift Modelling with Significance-Based Uplift Trees. Technical Report TR-2011-1, Stochastic Solutions, 2011, pp. 1-33.

Rzepakowski, P. & Jaroszewicz, S. (2012). Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, Vol. 32, pp. 303–327.

Rzepakowski, P. & Jaroszewicz, S. (2012). Uplift modeling in direct marketing. Journal of Telecommunications and Information Technology, Vol. 2, 2012, pp. 43–50.

Rudaś, K. & Jaroszewicz, S. (2018). Linear regression for uplift modeling. Data Mining and Knowledge Discovery, Vol. 32, No. 5, September 2018, pp. 1275–1305.

Shaar, A., Abdessalem, T. & Segard, O. (2016). Pessimistic Uplift Modeling. ACM SIGKDD, August 2016, San Francisco, California, USA.

Sołtys, M., Jaroszewicz, S. & Rzepakowski, P. (2015). Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery, Vol. 29, No. 6, November 2015, pp. 1531–1559.


Release history

- 1.0.2: Jul 9, 2022
- 1.0.1: Jun 3, 2022
- 1.0.0: Dec 28, 2021
- 0.1.2.1: Apr 28, 2021
- 0.1.1.7: Apr 3, 2021
- 0.1.1.6: Mar 30, 2021
- 0.1.1.5: Mar 28, 2021
- 0.1.1.4: Mar 22, 2021
- 0.1.1.3: Mar 21, 2021 (this version)
- 0.1.1.2: Mar 18, 2021
- 0.1.1.1: Mar 17, 2021
- 0.1.1: Mar 17, 2021
- 0.1.0: Feb 25, 2021
- 0.0.6.1: Jan 27, 2021
- 0.0.6: Jan 27, 2021
- 0.0.5.9: Jan 26, 2021
- 0.0.5.8: Jan 25, 2021
- 0.0.5.7: Dec 12, 2020
- 0.0.5.6: Feb 2, 2020
- 0.0.5.5: Feb 2, 2020
- 0.0.5.4: Feb 2, 2020
- 0.0.5.3: Feb 2, 2020
- 0.0.5.2: Feb 2, 2020
- 0.0.5.1: Feb 2, 2020
- 0.0.5: Jan 27, 2020
- 0.0.4.3: Jan 17, 2020
- 0.0.4.2: Jan 9, 2020
- 0.0.4.1: Jan 9, 2020
- 0.0.4: Nov 21, 2019
- 0.0.3.3: Nov 20, 2019
- 0.0.3.2: Nov 20, 2019
- 0.0.3.1: Nov 20, 2019
- 0.0.3: Nov 20, 2019
- 0.0.2: Nov 20, 2019
- 0.0.1: Nov 20, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

causeinfer-0.1.1.3-py3-none-any.whl (41.9 kB)

Uploaded Mar 21, 2021 (Python 3)

Hashes for causeinfer-0.1.1.3-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `b6b51deff02f426a48aa530f9ea6253ef53d3e18acfb05cb02fb2528c71f99b5` |
| MD5 | `b1c7f57221011b30c119256c15a2fc2e` |
| BLAKE2b-256 | `70a6bd2d1d44d9410fdc1c3a125fb246fdb990101c1de9fac6918a59ce958369` |
