Propensity Score Matching (PSM) on Python
pysmatch
Propensity Score Matching (PSM) helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.
pysmatch is an improved and extended version of pymatch, with modernized modeling, modularized matching utilities, and better support for reproducible workflows.
Highlights
- Multiple score models: Logistic Regression, KNN, CatBoost
- Flexible balancing: oversampling and undersampling (balance_strategy)
- Standard and exhaustive matching workflows
- Balance diagnostics for categorical and continuous covariates
- Optional Optuna tuning for automated model search
Installation
Install from PyPI:
pip install pysmatch
Install optional extras:
pip install "pysmatch[tree]" # CatBoost support
pip install "pysmatch[tune]" # Optuna support
pip install "pysmatch[all]" # all optional dependencies
Install from source:
git clone https://github.com/miaohancheng/pysmatch.git
cd pysmatch
pip install -e ".[all]"
Quickstart
This minimal example runs the core workflow (fit scores, predict scores, match) on the bundled demo dataset (misc/loan.csv).
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from pysmatch.Matcher import Matcher
np.random.seed(42)
data = pd.read_csv("misc/loan.csv")
test = data[data.loan_status == "Default"].copy()
control = data[data.loan_status == "Fully Paid"].copy()
matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)
matcher.fit_scores(
    balance=True,
    balance_strategy="over",
    nmodels=10,
    model_type="linear",
    n_jobs=2,
)
matcher.predict_scores()
matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)
print(matcher.matched_data.head())
If this works, continue to the full workflow below.
End-to-End Workflow
Data Preparation
Use domain-relevant covariates and avoid leaking post-treatment variables into matching features.
import pandas as pd
fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status",
]
raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()
Initialize Matcher
from pysmatch.Matcher import Matcher
matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)
print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)
Fit Propensity Score Models
fit_scores supports three model types:
- linear (logistic regression)
- knn
- tree (CatBoost, requires pysmatch[tree])
matcher.fit_scores(
    balance=True,
    balance_strategy="over",  # "over" or "under"
    nmodels=10,
    model_type="linear",
    max_iter=200,
    n_jobs=2,
)
print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))
Optuna path (single tuned model):
# matcher.fit_scores(
#     balance=True,
#     model_type="tree",
#     use_optuna=True,
#     n_trials=20,
# )
Predict and Plot Scores
matcher.predict_scores()
matcher.plot_scores()
matcher.data now contains a scores column.
Tune Threshold
import numpy as np
matcher.tune_threshold(
    method="min",
    nmatches=1,
    rng=np.arange(0.0001, 0.0051, 0.0005),
)
Choose a threshold that balances quality and retained sample size.
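The trade-off can be illustrated without pysmatch. The sketch below uses invented propensity scores and a hypothetical helper, retention_by_threshold, to show how the share of treated units with at least one control inside the threshold grows as the threshold loosens. It is only an illustration and does not describe how tune_threshold is implemented.

```python
def retention_by_threshold(test_scores, control_scores, thresholds):
    """Share of treated units with at least one control within each threshold."""
    results = {}
    for t in thresholds:
        matched = sum(
            1 for s in test_scores
            if any(abs(s - c) <= t for c in control_scores)
        )
        results[t] = matched / len(test_scores)
    return results


# Toy scores, invented for illustration only.
test_scores = [0.62, 0.70, 0.81, 0.93]
control_scores = [0.30, 0.615, 0.68, 0.805]

retention = retention_by_threshold(test_scores, control_scores, [0.001, 0.01, 0.05])
for threshold, share in retention.items():
    print(f"threshold={threshold}: retained {share:.0%} of treated units")
```

A tighter threshold gives closer matches but drops treated units that have no nearby control, so it helps to look at retention alongside the balance diagnostics.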
Run Matching
Standard matching:
matcher.match(
    method="min",
    nmatches=1,
    threshold=0.001,
    replacement=False,
    exhaustive_matching=False,
)
matcher.plot_matched_scores()
Exhaustive matching:
matcher.match(
    threshold=0.001,
    nmatches=1,
    exhaustive_matching=True,
)
Review Matched Data and Weights
print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())
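To see why weights matter when a control is reused, the sketch below applies one common convention: each row is weighted by the inverse of how often its record appears in the matched set. Whether assign_weight_vector uses exactly this rule is an assumption here; the helper name and the record IDs are invented for illustration.

```python
from collections import Counter


def inverse_frequency_weights(record_ids):
    """Weight each row by 1 / (number of times its record_id appears)."""
    counts = Counter(record_ids)
    return [1 / counts[rid] for rid in record_ids]


# Invented record IDs: record 7 appears twice because it was matched twice.
matched_record_ids = [1, 7, 2, 7, 3, 9]
weights = inverse_frequency_weights(matched_record_ids)
print(list(zip(matched_record_ids, weights)))
```

Under this convention the weights of a repeated record sum to 1, so a reused control does not count more than a control used once.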
Matching Strategies
Standard vs Exhaustive Matching
- Standard (exhaustive_matching=False): uses nearest-neighbor style control selection with configurable method/replacement behavior.
- Exhaustive (exhaustive_matching=True): prioritizes wider control utilization while still respecting threshold constraints.
Key Parameters
- threshold: maximum allowed score distance
- nmatches: number of controls matched per treated unit
- replacement: whether a control can be reused
- method: "min" (closest) or "random" (random within threshold)
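How these four parameters interact is easier to see in a toy matcher. The function below, match_controls, is a hypothetical pure-Python sketch written for this explanation; it is not pysmatch's implementation, and the scores are invented.

```python
import random


def match_controls(test_scores, control_scores, threshold, nmatches=1,
                   replacement=False, method="min", seed=0):
    """Toy threshold matcher: returns {test_index: [control_index, ...]}."""
    rng = random.Random(seed)
    used = set()
    matches = {}
    for i, score in enumerate(test_scores):
        # Controls within the allowed score distance (and unused, if no replacement).
        candidates = [
            j for j, c in enumerate(control_scores)
            if abs(score - c) <= threshold and (replacement or j not in used)
        ]
        if method == "min":
            # Closest controls first.
            candidates.sort(key=lambda j: abs(score - control_scores[j]))
        elif method == "random":
            # Any control inside the threshold is equally eligible.
            rng.shuffle(candidates)
        else:
            raise ValueError(f"unknown method: {method}")
        chosen = candidates[:nmatches]
        if chosen:
            matches[i] = chosen
            used.update(chosen)
    return matches


test_scores = [0.50, 0.52, 0.90]
control_scores = [0.51, 0.54, 0.48, 0.10]

without_reuse = match_controls(test_scores, control_scores, threshold=0.05)
with_reuse = match_controls(test_scores, control_scores, threshold=0.05, replacement=True)
print("replacement=False:", without_reuse)
print("replacement=True: ", with_reuse)
```

In this toy data the third treated unit has no control within the threshold, so it is dropped in both runs; with replacement=True the first two treated units share the same closest control.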
Practical Guidance
- Start with nmatches=1, replacement=False, and a moderate threshold.
- If retention is too low, loosen threshold gradually.
- If balance is weak after matching, tighten threshold or change model/balance strategy.
- For severe class imbalance, test balance_strategy="under" as a sensitivity analysis.
Evaluation
After matching, evaluate covariate balance before causal analysis.
Categorical Covariates
cat_table = matcher.compare_categorical(return_table=True, plot_result=True)
print(cat_table)
Interpretation:
- check before/after p-value shifts
- look for reduced proportional differences after matching
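As a rough illustration of "reduced proportional differences", the sketch below compares category shares between groups before and after matching. The helper names and the grade values are invented; this is not how compare_categorical computes its table.

```python
from collections import Counter


def category_proportions(values):
    """Return {category: share of rows} for one group."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


def max_proportion_gap(test_values, control_values):
    """Largest absolute difference in category share between the two groups."""
    p_test = category_proportions(test_values)
    p_control = category_proportions(control_values)
    categories = set(p_test) | set(p_control)
    return max(abs(p_test.get(c, 0.0) - p_control.get(c, 0.0)) for c in categories)


# Invented "grade" values for the treated group and for controls before/after matching.
test_grades = ["A", "B", "B", "C"]
control_before = ["A", "A", "A", "B"]
control_after = ["A", "B", "B", "B"]

gap_before = max_proportion_gap(test_grades, control_before)
gap_after = max_proportion_gap(test_grades, control_after)
print("largest gap before matching:", gap_before)
print("largest gap after matching: ", gap_after)
```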
Continuous Covariates
cont_table = matcher.compare_continuous(return_table=True, plot_result=True)
print(cont_table)
Interpretation:
- compare KS statistics and grouped permutation test p-values
- monitor standardized mean/median differences pre vs post matching
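For reference, one common definition of the standardized mean difference divides the difference in group means by a pooled standard deviation. The sketch below uses that definition with invented values; the exact formula used by compare_continuous is not stated here, so treat this as an assumption.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(test_values, control_values):
    """(mean_test - mean_control) / pooled standard deviation."""
    pooled_sd = sqrt((variance(test_values) + variance(control_values)) / 2)
    return (mean(test_values) - mean(control_values)) / pooled_sd


# Invented covariate values for the treated group and for controls before/after matching.
test_values = [10.0, 12.0, 14.0]
control_before = [4.0, 6.0, 8.0]
control_after = [9.0, 12.0, 15.0]

smd_before = standardized_mean_difference(test_values, control_before)
smd_after = standardized_mean_difference(test_values, control_after)
print("SMD before matching:", smd_before)
print("SMD after matching: ", smd_after)
```

Values closer to zero after matching indicate better balance on that covariate.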
Single Variable Proportion Test
print(matcher.prop_test("grade"))
Troubleshooting
ValueError: numpy.dtype size changed
This is usually a NumPy/Pandas binary compatibility issue.
pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"
Restart your Python kernel/session after reinstalling.
Scores column not found
Run predict_scores() before match().
matcher.fit_scores(...)
matcher.predict_scores()
matcher.match(...)
FileNotFoundError for dataset path
Use a repo-relative path:
pd.read_csv("misc/loan.csv")
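If the working directory varies (for example, notebooks in a subfolder), a small helper can look for the file in the current directory and its parents. The function find_upwards below is a hypothetical convenience written for this README, not part of pysmatch.

```python
from pathlib import Path


def find_upwards(relative_path, start=None):
    """Search `start` and its parents for `relative_path`; return the first hit or None."""
    start = Path(start) if start is not None else Path.cwd()
    for directory in [start, *start.parents]:
        candidate = directory / relative_path
        if candidate.exists():
            return candidate
    return None


csv_path = find_upwards("misc/loan.csv")
if csv_path is None:
    print("misc/loan.csv not found; run from inside the cloned repository")
else:
    print("found dataset at", csv_path)
```

The returned path can then be passed to pd.read_csv.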
No matches found
Usually the threshold is too strict or the score distributions of the two groups overlap only weakly.
- increase threshold
- try a different model_type
- inspect score distributions with plot_scores()
Jupyter kernel issues in notebooks
If your notebook kernel name is unavailable, switch to an existing kernel (python3) and rerun cells.
FAQ
When should I use linear vs tree vs knn?
- Start with linear for a strong, interpretable baseline.
- Use tree for nonlinear relationships and mixed feature types.
- Use knn as a local-structure baseline and compare sensitivity.
Is high model accuracy always better for matching?
Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.
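A quick way to see the overlap problem is to compute the range of scores that both groups cover (often called common support). The sketch below does this with invented scores and hypothetical helper names; it is independent of pysmatch.

```python
def common_support(test_scores, control_scores):
    """Return the (low, high) score range covered by both groups, or None."""
    low = max(min(test_scores), min(control_scores))
    high = min(max(test_scores), max(control_scores))
    return (low, high) if low <= high else None


def share_in_support(scores, support):
    """Fraction of scores that fall inside the common-support range."""
    if support is None:
        return 0.0
    low, high = support
    return sum(low <= s <= high for s in scores) / len(scores)


# Invented scores: a very accurate classifier pushes the two groups apart.
test_scores = [0.70, 0.85, 0.95, 0.99]
control_scores = [0.01, 0.05, 0.20, 0.75]

support = common_support(test_scores, control_scores)
treated_share = share_in_support(test_scores, support)
print("common support:", support)
print("share of treated units inside it:", treated_share)
```

When only a small share of treated units falls inside the common range, few of them can be matched regardless of how accurate the score model is.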
Should I use over- or under-sampling?
- over: usually keeps more majority information; good default.
- under: faster/smaller training sets; useful for sensitivity checks.
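The difference between the two strategies can be sketched with generic random resampling. The function balance_indices below is a hypothetical illustration; how pysmatch implements balance_strategy internally is not described here.

```python
import random


def balance_indices(n_minority, n_majority, strategy="over", seed=42):
    """Return (minority_idx, majority_idx) index lists of equal length."""
    rng = random.Random(seed)
    minority = list(range(n_minority))
    majority = list(range(n_majority))
    if strategy == "over":
        # Sample minority rows with replacement up to the majority size.
        minority = [rng.choice(minority) for _ in range(n_majority)]
    elif strategy == "under":
        # Sample majority rows without replacement down to the minority size.
        majority = rng.sample(majority, n_minority)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return minority, majority


over_min, over_maj = balance_indices(3, 10, strategy="over")
under_min, under_maj = balance_indices(3, 10, strategy="under")
print("over: ", len(over_min), "minority rows vs", len(over_maj), "majority rows")
print("under:", len(under_min), "minority rows vs", len(under_maj), "majority rows")
```

Oversampling keeps every majority row and repeats minority rows; undersampling discards majority rows, which is why it trains faster but uses less information.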
How do I make runs reproducible?
- set np.random.seed(...)
- keep fixed package versions
- record model/matching parameters in experiment logs
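One lightweight way to record a run is to write the seed and parameters to JSON next to the results. The sketch below uses only the standard library; build_run_record is a hypothetical helper, and the parameter values are copied from the examples above.

```python
import json
import platform
import random


def build_run_record(seed, fit_params, match_params):
    """Collect what is needed to rerun an experiment into one dict."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "fit_params": fit_params,
        "match_params": match_params,
    }


seed = 42
random.seed(seed)  # pair this with np.random.seed(seed) in a real run

record = build_run_record(
    seed,
    fit_params={"balance_strategy": "over", "nmodels": 10, "model_type": "linear"},
    match_params={"method": "min", "nmatches": 1, "threshold": 0.001, "replacement": False},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

The same dicts can be unpacked into the calls, for example matcher.match(**record["match_params"]), so the logged values and the executed values cannot drift apart.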
Additional resources
- Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52.
- Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
Contributing
Contributions are welcome. Please open an issue or pull request in this repository.
License
pysmatch is released under the MIT License.