Propensity score matching for Python with graphical plots
PsmPy
PsmPy provides matching techniques for epidemiological observational studies in Python. Propensity score matching is a statistical matching technique for observational data that helps assess whether a causal link between a treatment or intervention and an outcome of interest is plausible. It does so by balancing a set of covariates across a binary treatment state (received the intervention or not, as in a randomized controlled trial), which controls for potential confounding when comparing outcome measures, such as death or length of stay, between the treatment and control groups. Applying this technique to observational data gives insight into the effect, or lack thereof, of an intervention.
Citing this work:
A. Kline, Y. Luo, PsmPy: A Package for Retrospective Cohort Matching in Python, (under review at EMBC 2022)
- Integration with Jupyter Notebooks
- Additional plotting functionality to assess balance before and after
- A more modular, user-specified matching process
- Ability to define 1:1 or 1:many matching
Installation
Install the package through pip:
$ pip install psmpy
- Installation
- Data Preparation
- Predict Scores
- Matching algorithm
- Graphical Outputs
- Extra Attributes
- Conclusion
Data Prep
# import relevant libraries
import numpy as np
from scipy.special import logit, expit
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
import math
import pandas.api.types as ptypes
import seaborn as sns
sns.set(rc={'figure.figsize':(10,8)}, font_scale = 1.3)
# import the psmpy package
from psmpy import PsmPy
from psmpy.functions import cohenD
# read in your data (path is the location of your CSV file)
df = pd.read_csv(path)
Initialize PsmPy Class
Initialize the PsmPy class:
psm = PsmPy(df, outcome='treatment', indx='pat_id', exclude=[])
Note:
PsmPy - The class. It will use all covariates in the dataset unless they are formally excluded in the exclude argument.
df - The dataframe being passed to the class.
exclude - (optional) A list of strings naming covariates (columns) to ignore during the model fitting process. It is not necessary to pass the unique index column here; that is taken care of within the code once you specify your index column.
indx - (required) The column that holds a unique ID number for each case in the dataset.
outcome - (required) The binary variable of interest (0 or 1) that differentiates treatment from control, or success from failure, for each group.
Predict Scores
Calculate logistic propensity scores/logits:
psm.logistic_ps(balance = True)
Note:
balance - Whether the logistic regression will run in a balanced fashion, default = True.
There is often a significant class imbalance in the data. The software detects this automatically when the majority group has more records than the minority group. Setting balance=True when calling psm.logistic_ps() accounts for this: it tells PsmPy to sample from the majority group when fitting the logistic regression model so that the groups are of equal size. This process is repeated until all entries of the majority class have been regressed against the minority class in equal partitions. This step calculates both the logistic propensity score and the logit for each entry.
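To illustrate the balanced fitting described above, here is a minimal sketch in plain Python. It is not PsmPy's internal code: the helper names logit and balanced_partitions are hypothetical, and the shuffle-then-chunk strategy is only an assumption about how equal partitions of the majority class could be formed.

```python
import math
import random


def logit(p):
    """Convert a propensity score p in (0, 1) to its logit, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def balanced_partitions(major_ids, minor_ids, seed=0):
    """Split the majority class into chunks no larger than the minority class.

    Hypothetical helper: each chunk would be paired with the full minority
    class to fit one logistic regression on roughly balanced groups.
    """
    rng = random.Random(seed)
    shuffled = list(major_ids)
    rng.shuffle(shuffled)
    size = len(minor_ids)
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]


major = list(range(100, 110))   # 10 majority-class (e.g. control) IDs
minor = [1, 2, 3, 4]            # 4 minority-class (e.g. treated) IDs
chunks = balanced_partitions(major, minor)
print([len(c) for c in chunks])  # chunk sizes: [4, 4, 2]
print(logit(0.5))                # 0.0, a score of 0.5 maps to a logit of 0
```

Every majority-class ID lands in exactly one chunk, so each entry is scored once; the last chunk may be smaller when the class sizes do not divide evenly.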
Matching algorithm
Perform KNN matching
psm.knn_matched(matcher='propensity_logit', replacement=False, caliper=None)
Note:
matcher - propensity_logit (default), generated in the previous step; the alternative option is propensity_score. Specifies the quantity on which matching will proceed.
replacement - False (default). Determines whether matching happens with or without replacement. When replacement is False, matching is 1:1.
caliper - None (default). The user can specify a caliper size relative to the standard deviation of the control sample, restricting eligible neighbors to those within a certain distance.
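The roles of replacement and caliper can be sketched with a simple greedy nearest-neighbor matcher. This is an illustration only, not the KNN implementation inside PsmPy: the function greedy_match is hypothetical, it processes units in input order, and it treats the caliper as an absolute logit distance rather than a multiple of the control group's standard deviation.

```python
def greedy_match(minor, major, caliper=None, replacement=False):
    """Match each minority-class unit to its nearest majority-class unit.

    minor, major: dicts mapping ID -> propensity logit.
    caliper: maximum allowed absolute logit distance (None = no limit).
    replacement: if False, each majority unit is used at most once.
    Returns a list of (major_id, minor_id) pairs.
    """
    available = dict(major)
    pairs = []
    for minor_id, value in minor.items():
        if not available:
            break  # no majority units left to match against
        best_id = min(available, key=lambda k: abs(available[k] - value))
        if caliper is not None and abs(available[best_id] - value) > caliper:
            continue  # nearest neighbor lies outside the caliper
        pairs.append((best_id, minor_id))
        if not replacement:
            del available[best_id]
    return pairs


treated = {"t1": 0.10, "t2": 0.12, "t3": 2.00}
controls = {"c1": 0.11, "c2": 0.50, "c3": 0.15}

print(greedy_match(treated, controls))
print(greedy_match(treated, controls, replacement=True))
print(greedy_match(treated, controls, caliper=0.05))
```

Without replacement every control appears at most once; with replacement the control c1 is matched to both t1 and t2; with a caliper of 0.05 the distant unit t3 is left unmatched.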
Graphical Outputs
Plot the propensity score or propensity logits
Plot the distribution of the propensity scores (or logits) for the two groups side by side.
psm.plot_match(Title='Side by side matched controls', Ylabel='Number of patients', Xlabel='Propensity logit', names=['treatment', 'control'], save=True)
Note:
Title - 'Side by side matched controls' (default), creates the plot title
Ylabel - 'Number of patients' (default), string, label for y-axis
Xlabel - 'Propensity logit' (default), string, label for x-axis
names - ['treatment', 'control'] (default), list of strings for the legend
save - False (default), saves the generated figure to the current working directory if True
Plot the effect sizes
psm.effect_size_plot(save=False)
Note:
save - False (default), saves the generated figure to the current working directory if True
Extra Attributes
Other attributes available to user:
Matched IDs
psm.matched_ids
matched_ids - returns a dataframe of indices from the minor class and their associated matched indices from the major class.
Major_ID | Minor_ID |
---|---|
6781 | 9432 |
3264 | 7624 |
Note: Not all matches will be unique if replacement=True, because the same major-class record can then be matched more than once.
Effect sizes per variable
psm.effect_size
effect_size - returns a dataframe with columns 'variable', 'matching' (before or after), and 'effect_size'
variable | matching | effect_size |
---|---|---|
hypertension | before | 0.5 |
hypertension | after | 0.01 |
... | ... | ... |
Note: The thresholds for a small, medium and large effect size were characterized by Cohen in: J. Cohen, "A Power Primer", Psychological Bulletin, vol. 112, no. 1, pp. 155-159, 1992
Relative Size | Effect Size |
---|---|
small | ≤ 0.2 |
medium | ≤ 0.5 |
large | ≤ 0.8 |
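As a rough illustration of how an effect size is computed and read against the table above, the sketch below uses the textbook pooled-standard-deviation form of Cohen's d. Whether cohenD in PsmPy uses exactly this variant is not stated here, and both cohen_d and label_effect are hypothetical names. Labeling everything above 0.5 as large is also an assumption, since the table does not say how values above 0.8 are treated.

```python
import math
import statistics


def cohen_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def label_effect(d):
    """Label |d| as small, medium, or large (assumed cut-offs, see lead-in)."""
    size = abs(d)
    if size <= 0.2:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"


treated_values = [2, 4, 6, 8]   # toy covariate values, treatment group
control_values = [1, 3, 5, 7]   # toy covariate values, control group
d = cohen_d(treated_values, control_values)
print(round(d, 3), label_effect(d))
```

A well-matched cohort should show effect sizes that shrink toward zero after matching, as in the hypertension rows of the effect size table.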
Conclusion
This package offers a user-friendly propensity score matching protocol for Python. We have tried to provide automatic figure generation, contextualization of the results, and flexibility in the matching and modeling protocol to serve a wide user base.