A Python package for machine learning and regression adjustments of treatment effect estimates in randomized experiments

pyregadj

A Python package for regression and machine learning adjustments of treatment effects in randomized experiments.

Installation

You can install the package using pip:

pip install pyregadj

Features

  • Implements difference-in-means estimator for randomized experiments
  • Supports covariate-adjusted linear regression methods: pooled and groupwise
  • Machine learning adjustment with support for Random Forest, Gradient Boosting, Lasso, Ridge, and Elastic Net
  • Optional $K$-fold cross-fitting
  • Clean API for pandas DataFrame input

Quick Start

import pandas as pd
import numpy as np
from pyregadj import RegAdjustRCT

# Generate dummy data
np.random.seed(1988)
n = 200

# Create covariates
age = np.random.normal(35, 10, n)
income = np.random.normal(50000, 20000, n)
education = np.random.normal(14, 3, n)

# Generate treatment assignment (50% treated)
treatment = np.random.binomial(1, 0.5, n)

# Generate outcome with treatment effect
baseline = 100 + 0.5 * age + 0.001 * income + 2 * education + np.random.normal(0, 10, n)
outcome = baseline + 15 * treatment  # True treatment effect = 15

# Create DataFrame
data = pd.DataFrame({
    'y': outcome,
    'd': treatment,
    'age': age,
    'income': income,
    'education': education
})

# Initialize calculator
te = RegAdjustRCT()

# 1. Difference-in-means
result1 = te.calculate(data, outcome='y', treatment='d', adjustment='none')

# 2. Pooled regression
result2 = te.calculate(data, outcome='y', treatment='d', adjustment='pooled', 
                      covariates=['age', 'income', 'education'])

Examples

You can find detailed usage examples in the examples/ directory.

Statistical Methods

This package estimates average treatment effects (ATE) in randomized experiments using unadjusted, regression-adjusted, and machine learning-based estimators. All estimators assume a binary treatment variable, and they provide valid inference under standard assumptions of randomization.

Notation

Let $Y(1)$ and $Y(0)$ denote the potential outcomes under treatment and control, respectively. The observed outcome is

$$ Y = D \cdot Y(1) + (1 - D) \cdot Y(0), $$

where $D \in \{0,1\}$ is the binary treatment indicator. Let $X \in \mathbb{R}^p$ be a vector of covariates. There are $n_1$ units in the treatment group ($D=1$) and $n_0$ units in the control group ($D=0$), and $s_1$ and $s_0$ denote the sample standard deviations of $Y$ in the two groups.

The estimand of interest is the average treatment effect (ATE):

$$ \tau = \mathbb{E}[Y(1) - Y(0)]. $$
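
The notation can be made concrete with a small simulation. The sketch below is illustrative only: the data-generating process and the constant effect of 2.0 are assumptions made for the example, not part of the package. It builds both potential outcomes, draws a randomized $D$, and forms the observed outcome exactly as in the definition of $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical potential outcomes with a constant unit-level effect of 2.0
y0 = rng.normal(loc=10.0, scale=1.0, size=n)  # Y(0)
y1 = y0 + 2.0                                 # Y(1)

# Randomized binary treatment indicator D
d = rng.binomial(1, 0.5, size=n)

# Observed outcome: Y = D * Y(1) + (1 - D) * Y(0)
y = d * y1 + (1 - d) * y0

# Both potential outcomes are visible only in a simulation; with them,
# the sample analogue of tau is the mean of the unit-level differences.
tau_sample = np.mean(y1 - y0)
```

In real data only one of $Y(1)$ and $Y(0)$ is observed per unit, which is why $\tau$ has to be estimated from group comparisons.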

Assumptions

This package assumes that treatment assignment is randomized:

$$ (Y(1), Y(0)) \perp\!\!\!\perp D. $$

When properly implemented, randomization ensures this assumption is satisfied. Under this assumption, the estimators described below are consistent for the ATE.

Unadjusted Estimator

This estimator compares average outcomes in the treatment and control groups:

$$ \hat{\tau}_{\text{unadj}} = \frac{1}{n_1}\sum_{i:D_i=1}Y_i - \frac{1}{n_0}\sum_{i:D_i=0}Y_i. $$

A two-sample $t$-test with pooled variance is used to construct standard errors and confidence intervals:

$$ SE_{\text{unadj}} = \sqrt{ \hat{\sigma}_{\text{unadj}}^2 \left( \frac{1}{n_1} + \frac{1}{n_0} \right) }, $$

where

$$ \hat{\sigma}_{\text{unadj}}^2 = \frac{(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2}{n_1 + n_0 - 2}. $$

This estimator is equivalent to the simple linear model

$$ Y = \alpha + \tau_{\text{unadj}} D + \varepsilon, $$

although the regression framework allows for more flexible variance estimation.
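
As a rough illustration of the formulas above, the following NumPy sketch computes the difference in means with the pooled-variance standard error and checks it against an OLS fit of $Y$ on an intercept and $D$. The function name `diff_in_means` and the simulated data are assumptions made for this example; this is not the package's internal implementation.

```python
import numpy as np

def diff_in_means(y, d):
    """Difference in means with a pooled-variance standard error."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    y1, y0 = y[d], y[~d]
    n1, n0 = y1.size, y0.size
    tau_hat = y1.mean() - y0.mean()
    # Pooled variance built from the two sample variances (ddof=1)
    sigma2 = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
    se = np.sqrt(sigma2 * (1.0 / n1 + 1.0 / n0))
    return tau_hat, se

# Simulated experiment (hypothetical data, true effect = 15)
rng = np.random.default_rng(1988)
n = 200
d = rng.binomial(1, 0.5, n)
y = 100 + 15 * d + rng.normal(0, 10, n)

tau_hat, se = diff_in_means(y, d)

# OLS of Y on an intercept and D gives the same point estimate
X = np.column_stack([np.ones(n), d])
alpha_ols, tau_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# The classical (homoskedastic) OLS standard error matches the pooled one
resid = y - X @ np.array([alpha_ols, tau_ols])
s2 = resid @ resid / (n - 2)
se_ols = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
```

The two standard errors agree because the OLS residual sum of squares equals $(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2$ when the only regressor is the treatment indicator.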

Pooled (Linear) Regression-Adjusted Estimator

This estimator fits a linear regression of the outcome on treatment and covariates:

$$ Y = \alpha + \tau_{\text{pool}} D + X^\top \beta + \varepsilon. $$

The coefficient $\hat{\tau}_{\text{pool}}$ on $D$ estimates the ATE. Standard heteroskedasticity-robust standard errors (HC1) are used for inference. Optionally, covariates can be mean-centered.

This estimator can improve precision over the unadjusted estimator, particularly when covariates are predictive of the outcome.
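
A minimal sketch of this estimator, assuming plain NumPy and a hand-written HC1 sandwich formula, is shown below. The function name `pooled_adjusted`, its `center` argument, and the simulated data are hypothetical and chosen for illustration; the package's own code may differ.

```python
import numpy as np

def pooled_adjusted(y, d, X, center=True):
    """OLS of y on [1, d, X]; returns the coefficient on d and its HC1 SE."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if center:
        X = X - X.mean(axis=0)  # optional mean-centering of covariates
    Z = np.column_stack([np.ones(y.size), d, X])
    n, k = Z.shape
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ beta
    # HC1 sandwich: (Z'Z)^-1 Z' diag(e^2) Z (Z'Z)^-1, scaled by n / (n - k)
    bread = np.linalg.inv(Z.T @ Z)
    meat = Z.T @ (Z * resid[:, None] ** 2)
    cov = n / (n - k) * bread @ meat @ bread
    return beta[1], np.sqrt(cov[1, 1])

# Simulated experiment (hypothetical data, true effect = 2)
rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 2))
d = rng.binomial(1, 0.5, n)
y = 1.0 + 2.0 * d + X @ np.array([3.0, -1.0]) + rng.normal(size=n)

tau_hat, se_hat = pooled_adjusted(y, d, X, center=True)
tau_raw, se_raw = pooled_adjusted(y, d, X, center=False)
```

In this additive model, centering only shifts the intercept, so `tau_hat` and `tau_raw` (and their standard errors) coincide up to floating-point error.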

Groupwise (Linear) Regression-Adjusted Estimator

This estimator fits separate linear models for treatment and control groups:

$$ Y = \alpha_d + X^\top \beta_d + \varepsilon_d, \quad \text{for } D = d \in \{0,1\}. $$

The ATE is estimated as:

$$ \hat{\tau}_{\text{grp}} = \frac{1}{n_1} \sum_{i:D_i=1} \hat{\mu}_1(X_i) - \frac{1}{n_0} \sum_{i:D_i=0} \hat{\mu}_0(X_i), $$

where $\hat{\mu}_d(X_i)$ is the predicted outcome for group $d$ with covariate values $X_i$.

Standard errors are computed using residual variances from the two regressions.

$$ SE_{\text{grp}} = \sqrt{ \hat{\sigma}_{\text{grp}}^2 \left( \frac{1}{n_1} + \frac{1}{n_0} \right) }, $$

where

$$ \hat{\sigma}_{\text{grp}}^2 = \frac{(n_1 - 1)s_{\hat{\varepsilon}_1}^2 + (n_0 - 1)s_{\hat{\varepsilon}_0}^2}{n_1 + n_0 - 2}. $$

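Here $s_{\hat{\varepsilon}_1}^2$ and $s_{\hat{\varepsilon}_0}^2$ are the sample variances of the estimated residuals $\hat{\varepsilon}_1$ and $\hat{\varepsilon}_0$ from the treatment and control regressions, which replace the variances of $Y$ used in the unadjusted estimator. This estimator allows covariate effects to differ by treatment group and may improve robustness and precision (similar to Lin (2013)).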
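
The groupwise formulas can be sketched in NumPy as follows. All names here (`fit_ols`, `predict_ols`, `groupwise_adjusted`) and the simulated data are assumptions for illustration, not the package's API. One caveat worth knowing: for OLS with an intercept, fitted values average to the observed mean within the estimation sample, so the own-group point estimate written above coincides numerically with the difference in means. The sketch therefore also reports a variant, in the spirit of Lin (2013), that averages both predictions over the full sample; whether the package does this is not stated here.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y on [1, X]."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

def groupwise_adjusted(y, d, X):
    t = np.asarray(d) == 1
    n1, n0 = t.sum(), (~t).sum()
    coef1 = fit_ols(X[t], y[t])    # treatment-group regression
    coef0 = fit_ols(X[~t], y[~t])  # control-group regression
    # Own-group averages of the predictions, as in the formula above
    tau_own = predict_ols(coef1, X[t]).mean() - predict_ols(coef0, X[~t]).mean()
    # Assumed variant: average both predictions over all units
    tau_full = (predict_ols(coef1, X) - predict_ols(coef0, X)).mean()
    # Pooled variance of the estimated residuals
    r1 = y[t] - predict_ols(coef1, X[t])
    r0 = y[~t] - predict_ols(coef0, X[~t])
    sigma2 = ((n1 - 1) * r1.var(ddof=1) + (n0 - 1) * r0.var(ddof=1)) / (n1 + n0 - 2)
    se = np.sqrt(sigma2 * (1.0 / n1 + 1.0 / n0))
    return tau_own, tau_full, se

# Simulated experiment (hypothetical data, true effect = 2)
rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
d = rng.binomial(1, 0.5, n)
y = 1.0 + 2.0 * d + X @ np.array([3.0, -1.0]) + rng.normal(size=n)

tau_own, tau_full, se = groupwise_adjusted(y, d, X)
```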

Machine Learning-Adjusted Estimators

These estimators use flexible models (e.g., random forests, gradient boosting, or lasso) to estimate the conditional expectations

$$ \mu_d(X) = \mathbb{E}[Y \mid D=d, X] \quad \text{for } d \in \{0,1\}. $$

The ATE is estimated via:

$$ \hat{\tau}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{m}_1(X_i) - \hat{m}_0(X_i) + \frac{D_i}{\hat{p}_1}\big(Y_i - \hat{m}_1(X_i)\big) - \frac{1 - D_i}{\hat{p}_0}\big(Y_i - \hat{m}_0(X_i)\big) \right], $$

where $\hat{m}_d(X_i)$ is the predicted outcome under treatment $d$ (an estimate of $\mu_d(X_i)$), $\hat{p}_d$ is the empirical share of units with $D = d$, and $n = n_1 + n_0$.

When cross_fit=True, predictions are obtained via sample-splitting and $K$-fold cross-fitting to mitigate overfitting. This means predictions for each unit $i$ are from models trained without $i$’s fold.
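
The cross-fitting idea can be sketched as follows. This is an illustrative assumption about how $K$-fold cross-fitting is typically arranged, not the package's code: the function `cross_fit_predictions` is hypothetical, and a per-arm linear fit stands in for whichever learner (random forest, lasso, and so on) would be used in practice.

```python
import numpy as np

def cross_fit_predictions(y, d, X, n_folds=5, seed=0):
    """Out-of-fold predictions of E[Y | D=1, X] and E[Y | D=0, X]."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    Z = np.column_stack([np.ones(n), X])
    m1, m0 = np.empty(n), np.empty(n)
    for k in range(n_folds):
        held_out = folds == k
        train = ~held_out
        for arm, out in ((1, m1), (0, m0)):
            # Fit on training-fold units in this arm only (linear stand-in learner)
            idx = train & (d == arm)
            coef, *_ = np.linalg.lstsq(Z[idx], y[idx], rcond=None)
            # Predict for every held-out unit, regardless of its own arm
            out[held_out] = Z[held_out] @ coef
    return m1, m0, folds

# Simulated experiment (hypothetical data, true effect = 2)
rng = np.random.default_rng(11)
n = 500
X = rng.normal(size=(n, 2))
d = rng.binomial(1, 0.5, n)
y = 1.0 + 2.0 * d + X @ np.array([3.0, -1.0]) + rng.normal(size=n)

m1_hat, m0_hat, folds = cross_fit_predictions(y, d, X, n_folds=5)
```

Each unit's predictions come from models that never saw that unit's fold, which is the property that limits overfitting bias.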

Variance is estimated from the influence functions (IFs):

$$ SE_{\text{ML}} = \sqrt{\frac{\hat{\sigma}^2_{\text{ML}}}{n}}, $$

where

$$ \hat{\sigma}^2_{\text{ML}} = \widehat{\mathrm{Var}}\!\big(\text{IF}_{1} - \text{IF}_{0}\big), $$

and $\text{IF}_d$ is the respective IF for group $d$.

This estimator offers a flexible and efficient approach, particularly in high-dimensional settings where the linear methods above lose their attractive properties.
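
Given predictions $\hat{m}_1$ and $\hat{m}_0$, the point estimate and its standard error could be computed along these lines. The sketch assumes that $\text{IF}_d$ denotes the per-unit score $\hat{m}_d(X_i)$ plus the weighted residual for group $d$; that reading, the function name `aipw`, and the in-sample linear predictions are assumptions made for this example.

```python
import numpy as np

def aipw(y, d, m1, m0):
    """ML-adjusted ATE estimate and SE from per-unit scores."""
    n = len(y)
    p1 = d.mean()   # empirical share treated
    p0 = 1.0 - p1   # empirical share control
    # Assumed per-unit scores for each arm: prediction plus weighted residual
    if1 = m1 + d / p1 * (y - m1)
    if0 = m0 + (1 - d) / p0 * (y - m0)
    diff = if1 - if0
    tau_hat = diff.mean()
    se = np.sqrt(diff.var(ddof=1) / n)
    return tau_hat, se

# Simulated experiment (hypothetical data, true effect = 2)
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, n)
y = np.sin(x) + 2.0 * d + rng.normal(scale=0.5, size=n)

# Stand-in predictions: per-arm linear fits (not cross-fit, for brevity)
Z = np.column_stack([np.ones(n), x])
coef1, *_ = np.linalg.lstsq(Z[d == 1], y[d == 1], rcond=None)
coef0, *_ = np.linalg.lstsq(Z[d == 0], y[d == 0], rcond=None)
tau_hat, se = aipw(y, d, Z @ coef1, Z @ coef0)

# With predictions identically zero the estimator is the difference in means
tau_zero, _ = aipw(y, d, np.zeros(n), np.zeros(n))
```

The final lines illustrate a useful property: even with uninformative predictions the point estimate falls back to the unadjusted difference in means, so better predictions affect precision rather than validity.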

Practical Considerations

  • All estimators are valid under random assignment, but adjusted estimators (linear or ML) may yield narrower confidence intervals.
  • Mean-centering covariates may improve interpretability but does not affect consistency.
  • The ML estimators rely on user-specified models (rf, gbm, lasso, ridge, elastic-net) and can be cross-fit for robustness.
  • For small samples, linear adjustment may be preferable due to reduced variance.
  • Covariates should not be post-treatment or affected by treatment.
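
The first point can be checked with a small Monte Carlo experiment. The sketch below uses an assumed data-generating process with one strongly predictive covariate and compares the sampling spread of the unadjusted and pooled-adjusted estimates across repeated experiments; it is an illustration, not a benchmark of the package.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_reps, true_tau = 200, 300, 2.0
unadj = np.empty(n_reps)
adj = np.empty(n_reps)

for r in range(n_reps):
    # One simulated experiment with a predictive covariate x
    x = rng.normal(size=n)
    d = rng.binomial(1, 0.5, n)
    y = 1.0 + true_tau * d + 3.0 * x + rng.normal(size=n)
    # Unadjusted: difference in means
    unadj[r] = y[d == 1].mean() - y[d == 0].mean()
    # Pooled adjustment: coefficient on d from OLS of y on [1, d, x]
    Z = np.column_stack([np.ones(n), d, x])
    adj[r] = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Empirical standard deviations of the two estimators across replications
sd_unadj = unadj.std(ddof=1)
sd_adj = adj.std(ddof=1)
```

Both estimators center on the true effect, but the adjusted one varies far less across replications because the covariate absorbs most of the outcome variance.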

References

  • Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295-318.
  • List, J. A., Muir, I., & Sun, G. (2024). Using machine learning for efficient flexible regression adjustment in economic experiments. Econometric Reviews, 44(1), 2-40.
  • Negi, A., & Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40(5), 504-534.
  • Wu, E., & Gagnon-Bartsch, J. A. (2018). The LOOP estimator: Adjusting for covariates in randomized experiments. Evaluation Review, 42(4), 458-488.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

To cite this package in publications, please use the following BibTeX entry:

@misc{yasenov2025pytreateffects,
  author       = {Vasco Yasenov},
  title        = {pytreateffects: Treatment Effect Estimation for Randomized Experiments in Python},
  year         = {2025},
  howpublished = {\url{https://github.com/vyasenov/pytreateffects}},
  note         = {Version 0.1.0}
}
