Propensity score matching for Python, with graphical plots

PsmPy

Matching techniques for epidemiological observational studies, implemented in Python. Propensity score matching is a statistical matching technique for observational data that helps assess whether a causal link between a treatment or intervention and one or more outcomes of interest is plausible. Each subject is in one of two treatment states (as in a randomized controlled trial, either receiving the intervention or not), and matching on a set of covariates controls for potential confounding when outcome measures such as death or length of stay are compared between the treatment and control groups. Applying this technique to observational data gives insight into the effect, or lack of effect, of an intervention.
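In symbols, the two quantities the package works with are commonly written as follows. The notation is an assumption introduced here, not part of the original description: T is the binary treatment indicator and X the covariate vector.

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad
\operatorname{logit} e(x) = \ln \frac{e(x)}{1 - e(x)}
```

Here e(x) is the propensity score and logit e(x) is the propensity logit.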


Citing this work:

A. Kline, Y. Luo, PsmPy: A Package for Retrospective Cohort Matching in Python, (under review at EMBC 2022)


  • Integration with Jupyter Notebooks
  • Additional plotting functionality to assess balance before and after
  • A more modular, user-specified matching process
  • Ability to define 1:1 or 1:many matching

Installation

Install the package through pip:

$ pip install psmpy

Data Prep

# import relevant libraries
import pandas as pd
import seaborn as sns

sns.set(rc={'figure.figsize':(10,8)}, font_scale = 1.3)

# read in your data (path is the location of your CSV file)
df = pd.read_csv(path)

Import psmpy class and functions

from psmpy import PsmPy
from psmpy.functions import cohenD
from psmpy.plotting import *

Initialize PsmPy Class

Initialize the PsmPy class:

psm = PsmPy(df, treatment='treatment', indx='pat_id', exclude = [])

Note:

  • PsmPy - the class. It uses all covariates in the dataset unless they are explicitly excluded through the exclude argument.
  • df - the dataframe being passed to the class.
  • treatment - the name of the column holding the binary treatment indicator ('treatment' in the call above).
  • exclude - optional parameter; a list of strings naming covariates (columns) to ignore during model fitting. The unique index column does not need to be listed here; it is handled internally once indx is specified.
  • indx - required parameter; the name of the column holding a unique ID for each case in the dataset.

Predict Scores

Calculate logistic propensity scores/logits:

psm.logistic_ps(balance = True)

Note:

  • balance - whether the logistic regression is fit in a balanced fashion, default = True.

Observational data often has a significant class imbalance, where the majority group has more records than the minority group. The software detects this automatically. With balance=True in the call to psm.logistic_ps(), PsmPy samples from the majority group when fitting the logistic regression model so that the two groups are of equal size. This process is repeated until all entries of the major class have been regressed on the minor class in equal partitions. The call calculates both the logistic propensity score and the logit for each entry.
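One way to picture the balanced partitioning is sketched below in plain Python. This is an illustration under stated assumptions, not PsmPy's internal code: the helper names balanced_partitions and logit, the shuffling, and the stride-based split are all hypothetical choices made for this example.

```python
import math
import random


def balanced_partitions(major_ids, minor_ids, seed=0):
    """Yield (major_chunk, minor_ids) pairs of roughly equal size.

    Hypothetical helper: shuffles the majority IDs and splits them into
    ceil(len(major) / len(minor)) near-equal chunks, so every majority
    entry is paired with the full minority group exactly once.
    """
    if not minor_ids or len(minor_ids) > len(major_ids):
        raise ValueError("expected a non-empty minority group no larger than the majority")
    shuffled = list(major_ids)
    random.Random(seed).shuffle(shuffled)
    n_parts = math.ceil(len(shuffled) / len(minor_ids))
    # Stride slicing gives chunks whose sizes differ by at most one.
    for k in range(n_parts):
        yield shuffled[k::n_parts], list(minor_ids)


def logit(score):
    """Convert a propensity score in (0, 1) to its logit."""
    return math.log(score / (1.0 - score))


major = list(range(100, 110))   # 10 illustrative majority-class IDs
minor = [1, 2, 3]               # 3 illustrative minority-class IDs
parts = list(balanced_partitions(major, minor))
```

In this reading, a logistic regression would be fit once per partition, and the predicted score for each entry would then be converted to a logit as in logit().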

Review values in dataframe:

psm.predicted_data

Matching algorithm

Perform KNN matching

psm.knn_matched(matcher='propensity_logit', replacement=False, caliper=None)

Note:

  • matcher - propensity_logit (default), generated in the previous step; the alternative option is propensity_score. Specifies the quantity on which matching proceeds.
  • replacement - False (default). Determines whether matching happens with or without replacement; when replacement is False, matching is 1:1.
  • caliper - None (default). The user can specify a caliper size relative to the std. dev. of the control sample, restricting the neighbors eligible for matching to those within a certain distance.
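The interaction of the three options can be sketched with a greedy nearest-neighbor matcher in plain Python. This is a simplified illustration, not the package's implementation: the function name greedy_match, the dict inputs, and the greedy processing order are assumptions made for this example.

```python
import statistics


def greedy_match(treated, control, caliper=None, replacement=False):
    """Match each treated ID to its nearest control on the logit.

    treated, control: dicts mapping ID -> propensity logit.
    caliper: optional width in units of the control logits' std. dev.
    Returns a dict treated_id -> control_id (unmatched IDs are omitted).
    """
    max_dist = None
    if caliper is not None:
        max_dist = caliper * statistics.stdev(list(control.values()))
    available = dict(control)
    matches = {}
    for t_id, t_logit in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_logit))
        if max_dist is not None and abs(available[c_id] - t_logit) > max_dist:
            continue  # nearest neighbor lies outside the caliper
        matches[t_id] = c_id
        if not replacement:
            del available[c_id]  # each control is used at most once
    return matches


# Illustrative logits only
treated = {"t1": 0.05, "t2": 0.95, "t3": 3.00}
control = {"c1": 0.00, "c2": 1.00, "c3": 0.20, "c4": 0.80}
pairs = greedy_match(treated, control, caliper=0.5)
```

With these values t3 has no control within the caliper and stays unmatched, while t1 and t2 each receive a distinct control because replacement is False.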

Graphical Outputs

Plot the propensity score or propensity logits

Plot the distribution of the propensity scores (or logits) for the two groups side by side.

psm.plot_match(Title='Side by side matched controls', Ylabel='Number of patients', Xlabel='Propensity logit', names=['treatment', 'control'], save=True)

Note:

  • Title - 'Side by side matched controls' (default), string, plot title
  • Ylabel - 'Number of patients' (default), string, label for y-axis
  • Xlabel - 'Propensity logit' (default), string, label for x-axis
  • names - ['treatment', 'control'] (default), list of strings for legend
  • save - False (default), saves the generated figure to the current working directory if True

Plot the effect sizes

psm.effect_size_plot(save=False)

Note:

  • save - False (default), saves the generated figure to the current working directory if True

Extra Attributes

Other attributes available to user:

Matched IDs

psm.matched_ids
  • matched_ids - returns a dataframe of indices from the minor class and their associated matched indices from the major class.

Major_ID    Minor_ID
6781        9432
3264        7624

Note: Not all matches will be unique if replacement=True.
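To check whether a set of matches is unique, one can count how often each major-class ID appears. The sketch below assumes the pairs have already been pulled out of psm.matched_ids as (Major_ID, Minor_ID) tuples; the third pair is hypothetical and added only to show a repeated ID.

```python
from collections import Counter

# Illustrative (Major_ID, Minor_ID) pairs; the last one is made up
# to demonstrate a major-class ID that is matched twice.
pairs = [(6781, 9432), (3264, 7624), (6781, 5510)]

major_counts = Counter(major_id for major_id, _ in pairs)
# Keep only the major-class IDs that were matched more than once.
reused = {major_id: n for major_id, n in major_counts.items() if n > 1}
```

An empty reused dict means every major-class ID was matched at most once.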

Effect sizes per variable

psm.effect_size
  • effect_size - returns a dataframe with columns 'variable', 'matching' (before or after), and 'effect_size'.

variable        matching    effect_size
hypertension    before      0.5
hypertension    after       0.01
age             ...         ...
sex             ...         ...

Note: The thresholds for a small, medium and large effect size were characterized by Cohen in: J. Cohen, "A Power Primer", Psychological Bulletin, vol. 112, no. 1, pp. 155-159, 1992

Relative Size    Effect Size
small            ≤ 0.2
medium           ≤ 0.5
large            ≤ 0.8
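For reference, a standardized mean difference of the kind these thresholds apply to can be computed as below. This is a minimal sketch of Cohen's d using the pooled sample standard deviation; whether the package's cohenD function uses exactly this form is an assumption, and the function name cohen_d and the input values are illustrative.

```python
import math
import statistics


def cohen_d(treated, control):
    """Standardized mean difference using the pooled sample std. dev."""
    n1, n2 = len(treated), len(control)
    var1, var2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# Illustrative covariate values for the two groups
treated_values = [2.0, 4.0, 6.0]
control_values = [1.0, 3.0, 5.0]
d = cohen_d(treated_values, control_values)
```

Computing this per covariate before and after matching gives the kind of values shown in the effect_size table.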

Conclusion

This package offers a user-friendly propensity score matching protocol for the Python environment. We have tried to provide automatic figure generation, contextualization of the results, and flexibility in the matching and modeling protocol to serve a wide user base.
