Propensity Score Matching (PSM) in Python

pysmatch

Propensity Score Matching (PSM) helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.

pysmatch is an improved and extended version of pymatch, with modernized modeling, modularized matching utilities, and better support for reproducible workflows.

Highlights

  • Multiple score models: Logistic Regression, KNN, CatBoost
  • Flexible balancing: oversampling and undersampling (balance_strategy)
  • Standard and exhaustive matching workflows
  • Balance diagnostics for categorical and continuous covariates
  • Optional Optuna tuning for automated model search

Installation

Install from PyPI:

pip install pysmatch

Install optional extras:

pip install "pysmatch[tree]"   # CatBoost support
pip install "pysmatch[tune]"   # Optuna support
pip install "pysmatch[all]"    # all optional dependencies

Install from source:

git clone https://github.com/miaohancheng/pysmatch.git
cd pysmatch
pip install -e ".[all]"

Quickstart

This minimal example runs the core workflow (fit scores, predict scores, match) on the built-in demo dataset (misc/loan.csv).

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pysmatch.Matcher import Matcher

np.random.seed(42)
data = pd.read_csv("misc/loan.csv")

test = data[data.loan_status == "Default"].copy()
control = data[data.loan_status == "Fully Paid"].copy()

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

matcher.fit_scores(
    balance=True,
    balance_strategy="over",
    nmodels=10,
    model_type="linear",
    n_jobs=2,
)
matcher.predict_scores()
matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)

print(matcher.matched_data.head())

If this works, continue to the full workflow below.

End-to-End Workflow

Data Preparation

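Choose covariates that plausibly influence both treatment assignment and the outcome. Exclude post-treatment variables (anything measured after, or affected by, the treatment), because matching on them biases the effect estimate.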

import pandas as pd

fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status",
]

raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()

Initialize Matcher

from pysmatch.Matcher import Matcher

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)

Fit Propensity Score Models

fit_scores supports three model types:

  • linear (logistic regression)
  • knn
  • tree (CatBoost, requires pysmatch[tree])

matcher.fit_scores(
    balance=True,
    balance_strategy="over",   # "over" or "under"
    nmodels=10,
    model_type="linear",
    max_iter=200,
    n_jobs=2,
)

print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))

Optuna path (trains a single tuned model; requires pysmatch[tune]):

# matcher.fit_scores(
#     balance=True,
#     model_type="tree",
#     use_optuna=True,
#     n_trials=20,
# )

Predict and Plot Scores

matcher.predict_scores()
matcher.plot_scores()

matcher.data now contains a scores column.
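
When several models are fitted (nmodels > 1), they have to be combined into one score per row. The sketch below shows one simple way to do that: average the predicted probabilities. The original pymatch works this way; whether pysmatch combines its models identically is an assumption here. The helper average_scores and the toy models are hypothetical and not part of the pysmatch API.

```python
import math

def logistic(z):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def average_scores(models, rows):
    """Average the predicted probabilities of several models, row by row."""
    return [sum(model(row) for model in models) / len(models) for row in rows]

# Hypothetical toy models: each maps one covariate value to a propensity score.
toy_models = [
    lambda x: logistic(-1.0 + 2.0 * x),
    lambda x: logistic(-0.8 + 1.8 * x),
]
scores = average_scores(toy_models, [0.0, 0.5, 1.0])
print([round(s, 3) for s in scores])
```

Averaging smooths out the variation that comes from each model being trained on a different balanced sample.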

Tune Threshold

import numpy as np

matcher.tune_threshold(
    method="min",
    nmatches=1,
    rng=np.arange(0.0001, 0.0051, 0.0005),
)

Choose the threshold that gives the best trade-off between match quality (smaller score distance) and the number of retained units.
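
To build intuition for this trade-off, the following sketch computes, for a few thresholds, the share of treated scores that have at least one control score within reach. It is a plain-Python illustration with made-up scores, not what tune_threshold does internally, and retention_by_threshold is a hypothetical helper name. Because it ignores replacement=False, it gives an upper bound on retention.

```python
def retention_by_threshold(treated, controls, thresholds):
    """Share of treated scores with at least one control within each threshold."""
    result = {}
    for t in thresholds:
        kept = sum(
            1 for s in treated
            if any(abs(c - s) <= t for c in controls)
        )
        result[t] = kept / len(treated)
    return result

# Made-up propensity scores for illustration only.
treated = [0.20, 0.35, 0.60, 0.90]
controls = [0.21, 0.30, 0.58]

retention = retention_by_threshold(treated, controls, [0.005, 0.03, 0.1])
for t, share in retention.items():
    print(f"threshold={t}: {share:.0%} of treated units have a candidate match")
```

In this toy data the treated unit at 0.90 never finds a partner, which is the kind of weak overlap that no reasonable threshold can fix.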

Run Matching

Standard matching:

matcher.match(
    method="min",
    nmatches=1,
    threshold=0.001,
    replacement=False,
    exhaustive_matching=False,
)
matcher.plot_matched_scores()

Exhaustive matching:

matcher.match(
    threshold=0.001,
    nmatches=1,
    exhaustive_matching=True,
)

Review Matched Data and Weights

print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())
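
A record that appears more than once in the matched data (for example a control reused under replacement=True) is usually down-weighted. The sketch below uses inverse frequency, weight = 1 / (number of appearances), which is the convention in the original pymatch; that pysmatch's assign_weight_vector computes exactly this is an assumption. The helper name and the record ids are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(record_ids):
    """Weight each row by 1 / (number of times its record_id appears)."""
    freq = Counter(record_ids)
    return [1 / freq[r] for r in record_ids]

# Control record 7 was matched to two different treated units.
matched_record_ids = [1, 7, 2, 7, 3, 9]
weights = inverse_frequency_weights(matched_record_ids)
print(weights)  # [1.0, 0.5, 1.0, 0.5, 1.0, 1.0]
```

With these weights every distinct record contributes a total weight of 1 to downstream estimates, no matter how often it was reused.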

Matching Strategies

Standard vs Exhaustive Matching

  • Standard (exhaustive_matching=False): uses nearest-neighbor style control selection with configurable method/replacement behavior.
  • Exhaustive (exhaustive_matching=True): prioritizes wider control utilization while still respecting threshold constraints.

Key Parameters

  • threshold: max allowed score distance
  • nmatches: controls per treated unit
  • replacement: whether a control can be reused
  • method: "min" (closest) or "random" (random within threshold)
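
The interaction of these four parameters can be sketched with a small greedy matcher. This is an illustrative toy under simplifying assumptions (scores in plain lists, treated units processed in order), not pysmatch's implementation; match_scores is a hypothetical function.

```python
import random

def match_scores(treated, controls, threshold=0.001, nmatches=1,
                 replacement=False, method="min", seed=0):
    """Toy matcher; returns {treated_index: [control_index, ...]}."""
    rng = random.Random(seed)
    available = set(range(len(controls)))
    matches = {}
    for i, score in enumerate(treated):
        # Controls still available whose score is within the threshold.
        candidates = [j for j in sorted(available)
                      if abs(controls[j] - score) <= threshold]
        if not candidates:
            continue  # no control close enough: this treated unit is dropped
        if method == "min":
            # Closest controls first.
            candidates.sort(key=lambda j: abs(controls[j] - score))
            chosen = candidates[:nmatches]
        else:
            # "random": any controls within the threshold.
            chosen = rng.sample(candidates, min(nmatches, len(candidates)))
        matches[i] = chosen
        if not replacement:
            available -= set(chosen)  # a used control cannot be reused
    return matches

treated = [0.30, 0.31, 0.80]
controls = [0.28, 0.305, 0.50]

without = match_scores(treated, controls, threshold=0.025)
with_repl = match_scores(treated, controls, threshold=0.025, replacement=True)
print(without)    # {0: [1]}
print(with_repl)  # {0: [1], 1: [1]}
```

Without replacement, the second treated unit loses its only close control to the first one and is dropped; with replacement, both are matched to the same control. The treated unit at 0.80 has no control within the threshold in either case.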

Practical Guidance

  • Start with nmatches=1, replacement=False, and a moderate threshold.
  • If retention is too low, loosen threshold gradually.
  • If balance is weak after matching, tighten threshold or change model/balance strategy.
  • For severe class imbalance, test balance_strategy="under" as a sensitivity analysis.

Evaluation

After matching, evaluate covariate balance before causal analysis.

Categorical Covariates

cat_table = matcher.compare_categorical(return_table=True, plot_result=True)
print(cat_table)

Interpretation:

  • check before/after p-value shifts
  • look for reduced proportional differences after matching

Continuous Covariates

cont_table = matcher.compare_continuous(return_table=True, plot_result=True)
print(cont_table)

Interpretation:

  • compare KS statistics and grouped permutation test p-values
  • monitor standardized mean/median differences pre vs post matching
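
To make these two diagnostics concrete, here is a plain-Python sketch of a standardized mean difference (using a pooled standard deviation) and the two-sample KS statistic. The exact formulas pysmatch uses in compare_continuous are not documented here, so treat these definitions as common textbook versions and the function names as hypothetical.

```python
import math
import statistics

def standardized_mean_difference(a, b):
    """(mean(a) - mean(b)) divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

# Made-up covariate values for illustration only.
test_values = [10, 12, 14, 16]
control_before = [4, 6, 8, 10]
control_after = [10, 12, 13, 16]

smd_before = standardized_mean_difference(test_values, control_before)
smd_after = standardized_mean_difference(test_values, control_after)
ks_before = ks_statistic(test_values, control_before)
ks_after = ks_statistic(test_values, control_after)

print(f"SMD before={smd_before:.2f} after={smd_after:.2f}")
print(f"KS  before={ks_before:.2f} after={ks_after:.2f}")
```

Both numbers should move toward zero after matching; a covariate whose values stay large is a signal to revisit the threshold or the score model.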

Single Variable Proportion Test

print(matcher.prop_test("grade"))
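
A common choice for this kind of proportion test is Pearson's chi-square test of independence between group membership and the categorical variable. Whether prop_test uses exactly this statistic is an assumption; the sketch below only shows how the statistic itself is computed, with a hypothetical helper and made-up grades.

```python
from collections import Counter

def chi_square_statistic(test_values, control_values):
    """Pearson chi-square statistic for a 2 x k table of category counts."""
    t, c = Counter(test_values), Counter(control_values)
    n_t, n_c = len(test_values), len(control_values)
    total = n_t + n_c
    stat = 0.0
    for level in set(t) | set(c):
        level_total = t[level] + c[level]
        for observed, group_n in ((t[level], n_t), (c[level], n_c)):
            # Expected count if the category were independent of the group.
            expected = level_total * group_n / total
            stat += (observed - expected) ** 2 / expected
    return stat

test_grades = ["A"] * 2 + ["B"] * 8
control_before = ["A"] * 8 + ["B"] * 2
control_after = ["A"] * 2 + ["B"] * 8

stat_before = chi_square_statistic(test_grades, control_before)
stat_after = chi_square_statistic(test_grades, control_after)
print(round(stat_before, 2), round(stat_after, 2))  # 7.2 0.0
```

A statistic near zero means the category proportions in the two groups are nearly identical, which is what good matching should produce.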

Troubleshooting

ValueError: numpy.dtype size changed

This is usually a NumPy/Pandas binary compatibility issue.

pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"

Restart your Python kernel/session after reinstalling.

Scores column not found

Run predict_scores() before match().

matcher.fit_scores(...)
matcher.predict_scores()
matcher.match(...)

FileNotFoundError for dataset path

Run your script or notebook from the repository root so that the repo-relative path resolves:

pd.read_csv("misc/loan.csv")
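
If you run notebooks from a subdirectory, one option is to search the current directory and its parents for the file instead of hard-coding the path. This is a small standard-library sketch; find_upwards is a hypothetical helper, not part of pysmatch.

```python
from pathlib import Path

def find_upwards(relative_path, start=None):
    """Search start and each of its parents for relative_path.

    Returns the first existing match as a Path, or None if nothing is found.
    """
    start = Path(start or Path.cwd()).resolve()
    for folder in (start, *start.parents):
        candidate = folder / relative_path
        if candidate.exists():
            return candidate
    return None

dataset = find_upwards("misc/loan.csv")
if dataset is None:
    print("misc/loan.csv not found; are you inside the pysmatch repository?")
else:
    print("Found dataset at", dataset)
```

The returned path can be passed straight to pd.read_csv.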

No matches found

Usually the threshold is too strict or the score distributions of the two groups overlap only weakly.

  • increase threshold
  • try a different model_type
  • inspect score distributions with plot_scores()

Jupyter kernel issues in notebooks

If your notebook kernel name is unavailable, switch to an existing kernel (python3) and rerun cells.

FAQ

When should I use linear vs tree vs knn?

  • Start with linear as an interpretable baseline.
  • Use tree for nonlinear relationships and mixed feature types.
  • Use knn as a local-structure baseline and compare sensitivity.

Is high model accuracy always better for matching?

Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.
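
A quick way to see the overlap problem is to compute the range of scores that both groups share (the common support). The sketch below uses made-up scores and hypothetical helper names; it is a rough diagnostic, not a pysmatch feature.

```python
def common_support(test_scores, control_scores):
    """Return (low, high) of the shared score range, or None if the ranges are disjoint."""
    low = max(min(test_scores), min(control_scores))
    high = min(max(test_scores), max(control_scores))
    return (low, high) if low <= high else None

def share_in_support(scores, support):
    """Fraction of scores that fall inside the common support."""
    if support is None:
        return 0.0
    low, high = support
    return sum(low <= s <= high for s in scores) / len(scores)

# A very accurate model separates the groups completely: nothing can be matched.
separated = common_support([0.85, 0.90, 0.95], [0.05, 0.10, 0.20])

# A less accurate model leaves a shared range where matches are possible.
test_scores = [0.4, 0.6, 0.9]
overlapping = common_support(test_scores, [0.1, 0.5, 0.7])

print(separated)                                    # None
print(overlapping)                                  # (0.4, 0.7)
print(share_in_support(test_scores, overlapping))   # 2 of 3 test units
```

In practice, plot_scores() gives the same information visually.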

Should I use over- or under-sampling?

  • over: resamples the minority class up to the majority size, so all majority records are kept; good default.
  • under: faster/smaller training sets; useful for sensitivity checks.
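
The difference between the two strategies can be sketched in a few lines of standard-library Python. This illustrates the general idea of random over- and undersampling only; how pysmatch draws its samples internally is not shown here, and balance_groups is a hypothetical helper.

```python
import random

def balance_groups(majority, minority, strategy="over", seed=42):
    """Return (majority, minority) lists of equal length."""
    rng = random.Random(seed)
    if strategy == "over":
        # Keep every record; top up the minority by sampling it with replacement.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return list(majority), list(minority) + extra
    if strategy == "under":
        # Keep the minority as is; draw a random subset of the majority.
        return rng.sample(list(majority), len(minority)), list(minority)
    raise ValueError("strategy must be 'over' or 'under'")

majority = list(range(10))   # e.g. 10 control records
minority = [100, 101, 102]   # e.g. 3 test records

over_major, over_minor = balance_groups(majority, minority, "over")
under_major, under_minor = balance_groups(majority, minority, "under")
print(len(over_major), len(over_minor))    # 10 10
print(len(under_major), len(under_minor))  # 3 3
```

Oversampling trains on more rows (with duplicated minority records), while undersampling discards majority records, which is why comparing both is a reasonable sensitivity check.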

How do I make runs reproducible?

  • set np.random.seed(...)
  • pin package versions (for example in a requirements file)
  • record model/matching parameters in experiment logs
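
One lightweight way to follow these points is to write a small JSON record next to each run. The sketch below uses only the standard library; the record layout and the helper names are assumptions for illustration, not a pysmatch feature. Packages that are not installed are recorded as null.

```python
import json
import platform
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def build_run_record(seed, fit_params, match_params,
                     packages=("pysmatch", "numpy", "pandas")):
    """Collect what is needed to repeat a run into one JSON-friendly dict."""
    return {
        "seed": seed,
        "python": platform.python_version(),
        "packages": {name: installed_version(name) for name in packages},
        "fit_scores": fit_params,
        "match": match_params,
    }

record = build_run_record(
    seed=42,
    fit_params={"balance": True, "balance_strategy": "over",
                "nmodels": 10, "model_type": "linear"},
    match_params={"method": "min", "nmatches": 1,
                  "threshold": 0.001, "replacement": False},
)
print(json.dumps(record, indent=2))
```

Pass the same seed to np.random.seed(...) and unpack the two parameter dicts into fit_scores and match so the log and the run cannot drift apart.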

Additional resources

  • Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52.
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Contributing

Contributions are welcome. Please open an issue or pull request in this repository.

License

pysmatch is released under the MIT License.
