Propensity Score Matching (PSM) in Python

Project description

`pysmatch`

Propensity Score Matching (PSM) helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.

pysmatch is an improved and extended version of pymatch, with modernized modeling, modularized matching utilities, and better support for reproducible workflows.

Multilingual

English | 中文

Highlights

Multiple score models: Logistic Regression, KNN, CatBoost
Flexible balancing: oversampling and undersampling (balance_strategy)
Standard and exhaustive matching workflows
Balance diagnostics for categorical and continuous covariates
Optional Optuna tuning for automated model search

Installation

Install from PyPI:

pip install pysmatch

Install optional extras:

pip install "pysmatch[tree]"   # CatBoost support
pip install "pysmatch[tune]"   # Optuna support
pip install "pysmatch[all]"    # all optional dependencies

Install from source:

git clone https://github.com/miaohancheng/pysmatch.git
cd pysmatch
pip install -e ".[all]"

Quickstart

This minimal example runs the core path (fit, predict, match) on the demo dataset shipped in the repository (misc/loan.csv). Run it from the repository root so the relative path resolves.

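import warnings

import numpy as np
import pandas as pd
from pysmatch.Matcher import Matcher

# Silence library warnings to keep the demo output short; remove this line while debugging.
warnings.filterwarnings("ignore")

# Fix the NumPy seed so random sampling is repeatable across runs.
np.random.seed(42)
data = pd.read_csv("misc/loan.csv")

# Treated group: defaulted loans. Control group: fully paid loans.
test = data[data.loan_status == "Default"].copy()
control = data[data.loan_status == "Fully Paid"].copy()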

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

matcher.fit_scores(
    balance=True,
    balance_strategy="over",
    nmodels=10,
    model_type="linear",
    n_jobs=2,
)
matcher.predict_scores()
matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)

print(matcher.matched_data.head())

If this works, continue to the full workflow below.

End-to-End Workflow

Data Preparation

Use domain-relevant covariates and avoid leaking post-treatment variables into matching features.

import pandas as pd

fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status",
]

raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()
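
The advice above about post-treatment variables can be made explicit by keeping a block list and filtering the column names before they reach the matcher. This is a minimal sketch: the helper select_covariates and the column name "recoveries" are hypothetical and are not part of pysmatch.

```python
def select_covariates(columns, outcome, post_treatment):
    """Return the columns that are safe to use as matching features.

    Drops the outcome/label column and anything measured after treatment.
    """
    blocked = {outcome, *post_treatment}
    return [c for c in columns if c not in blocked]


columns = ["loan_amnt", "term", "int_rate", "grade", "loan_status", "recoveries"]
# "recoveries" is a hypothetical post-treatment column used only for illustration.
features = select_covariates(
    columns, outcome="loan_status", post_treatment=["recoveries"]
)
print(features)
```

Keeping the block list in one place makes it easy to review which variables were deliberately left out.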

Initialize Matcher

from pysmatch.Matcher import Matcher

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)

Fit Propensity Score Models

fit_scores supports three model types:

linear (logistic regression)
knn
tree (CatBoost, requires pysmatch[tree])

matcher.fit_scores(
    balance=True,
    balance_strategy="over",   # "over" or "under"
    nmodels=10,
    model_type="linear",
    max_iter=200,
    n_jobs=2,
)

print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))
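
To make the two balance_strategy values concrete, the sketch below resamples two groups of record ids to a common size using only the standard library. It illustrates the general idea of over- and undersampling; the function names are hypothetical and this is not pysmatch's internal code.

```python
import random


def resample(ids, size, rng):
    """Resize a list of ids: grow with replacement, shrink without."""
    if size == len(ids):
        return list(ids)
    if size > len(ids):
        return [rng.choice(ids) for _ in range(size)]
    return rng.sample(ids, size)


def balance_groups(test_ids, control_ids, strategy="over", rng=None):
    """Return (test_sample, control_sample) with equal sizes."""
    rng = rng or random.Random(0)
    sizes = (len(test_ids), len(control_ids))
    if strategy == "over":
        target = max(sizes)   # grow the smaller group
    elif strategy == "under":
        target = min(sizes)   # shrink the larger group
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return resample(test_ids, target, rng), resample(control_ids, target, rng)


test_ids = list(range(5))            # 5 treated records
control_ids = list(range(100, 140))  # 40 control records

over_test, over_control = balance_groups(test_ids, control_ids, "over", random.Random(42))
under_test, under_control = balance_groups(test_ids, control_ids, "under", random.Random(42))
print(len(over_test), len(over_control))    # 40 40
print(len(under_test), len(under_control))  # 5 5
```

Oversampling keeps every record of the larger group and repeats records of the smaller one; undersampling keeps the smaller group intact and discards part of the larger one.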

Optuna path (fits a single tuned model; requires pysmatch[tune], and pysmatch[tree] for model_type="tree"):

# matcher.fit_scores(
#     balance=True,
#     model_type="tree",
#     use_optuna=True,
#     n_trials=20,
# )

Predict and Plot Scores

matcher.predict_scores()
matcher.plot_scores()

matcher.data now contains a scores column.
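
When several models are fitted, one common way to get a single score per record is to average the predicted treatment probabilities. The sketch below shows that idea with hand-written logistic models; the coefficients are made up, and whether predict_scores() combines models in exactly this way is an assumption, not something this README states.

```python
import math


def logistic(z):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_score(x, models):
    """Average the predicted treatment probability over several models.

    Each model is an (intercept, weights) pair.
    """
    probs = [
        logistic(intercept + sum(w * xi for w, xi in zip(weights, x)))
        for intercept, weights in models
    ]
    return sum(probs) / len(probs)


# Three made-up logistic models over two covariates.
models = [(-1.0, [0.5, 0.2]), (-0.8, [0.4, 0.3]), (-1.2, [0.6, 0.1])]
score = ensemble_score([1.0, 2.0], models)
print(round(score, 4))
```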

Tune Threshold

import numpy as np

matcher.tune_threshold(
    method="min",
    nmatches=1,
    rng=np.arange(0.0001, 0.0051, 0.0005),
)

Choose a threshold that balances match quality (smaller score distance) against the share of treated units that still find a match.
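
The trade-off behind threshold tuning can be sketched without the library: for each candidate threshold, count the share of treated units that have at least one control within that score distance. The function name and the toy scores below are illustrative assumptions, not output from tune_threshold.

```python
def retention_by_threshold(test_scores, control_scores, thresholds):
    """Share of treated units with at least one control within each threshold."""
    result = {}
    for t in thresholds:
        kept = sum(
            any(abs(s - c) <= t for c in control_scores) for s in test_scores
        )
        result[t] = kept / len(test_scores)
    return result


test_scores = [0.30, 0.45, 0.62, 0.90]
control_scores = [0.3005, 0.445, 0.60, 0.20]
curve = retention_by_threshold(test_scores, control_scores, [0.001, 0.01, 0.05])
print(curve)  # {0.001: 0.25, 0.01: 0.5, 0.05: 0.75}
```

In this toy data the treated unit with score 0.90 has no nearby control, so no reasonable threshold retains it.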

Run Matching

Standard matching:

matcher.match(
    method="min",
    nmatches=1,
    threshold=0.001,
    replacement=False,
    exhaustive_matching=False,
)
matcher.plot_matched_scores()

Exhaustive matching:

matcher.match(
    threshold=0.001,
    nmatches=1,
    exhaustive_matching=True,
)

Review Matched Data and Weights

print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())
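
A weight vector is useful when a record appears more than once in the matched data, for example a control reused under matching with replacement. One simple scheme is to weight each row by the inverse of its record's frequency. The sketch below assumes that scheme; it is not taken from the pysmatch source, and the ids are made up.

```python
from collections import Counter


def inverse_frequency_weights(record_ids):
    """Weight each matched row by 1 / (number of times its record appears)."""
    counts = Counter(record_ids)
    return [1.0 / counts[r] for r in record_ids]


# Record 10 (a control) was matched to two different treated units.
matched_record_ids = [0, 1, 2, 10, 11, 10]
weights = inverse_frequency_weights(matched_record_ids)
print(weights)  # [1.0, 1.0, 1.0, 0.5, 1.0, 0.5]
```

With these weights, every distinct record contributes a total weight of 1 to downstream estimates.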

Matching Strategies

Standard vs Exhaustive Matching

Standard (exhaustive_matching=False): uses nearest-neighbor style control selection with configurable method/replacement behavior.
Exhaustive (exhaustive_matching=True): prioritizes wider control utilization while still respecting threshold constraints.
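
One way to read "wider control utilization" is to prefer, among the controls within the threshold, the one that has been used least so far, and to break ties by score distance. The sketch below implements that reading on toy scores. It is an illustration of the idea only; the actual exhaustive algorithm in pysmatch may differ.

```python
def exhaustive_style_match(test_scores, control_scores, threshold):
    """Pair each treated unit with the least-used control within the threshold."""
    usage = [0] * len(control_scores)
    pairs = []
    for ti, s in enumerate(test_scores):
        candidates = [
            ci for ci, c in enumerate(control_scores) if abs(s - c) <= threshold
        ]
        if not candidates:
            continue  # no control is close enough; this unit stays unmatched
        # Prefer controls used less often, then the closest score.
        best = min(candidates, key=lambda ci: (usage[ci], abs(s - control_scores[ci])))
        usage[best] += 1
        pairs.append((ti, best))
    return pairs


pairs = exhaustive_style_match([0.50, 0.51, 0.52], [0.505, 0.53], threshold=0.05)
print(pairs)  # [(0, 0), (1, 1), (2, 1)]
```

A pure nearest-neighbor rule with replacement would pair the first two treated units with control 0; the usage-aware rule spreads them across both controls.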

Key Parameters

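threshold: maximum allowed propensity-score distance between a treated unit and a matched control
nmatches: number of controls matched to each treated unit
replacement: whether a control can be reused for more than one treated unit
method: "min" (closest score) or "random" (random choice among controls within the threshold)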
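
How the four parameters interact is easiest to see in a small greedy matcher. The sketch below is a simplified stand-in written for this explanation; match_scores is a hypothetical function, not the pysmatch implementation.

```python
import random


def match_scores(test_scores, control_scores, threshold, nmatches=1,
                 replacement=False, method="min", rng=None):
    """Greedy score matching; returns {treated_index: [control_indices]}."""
    rng = rng or random.Random(0)
    available = set(range(len(control_scores)))
    matches = {}
    for ti, s in enumerate(test_scores):
        # Controls that are still available and within the score threshold.
        pool = [
            ci for ci in sorted(available)
            if abs(s - control_scores[ci]) <= threshold
        ]
        if method == "min":
            pool.sort(key=lambda ci: abs(s - control_scores[ci]))
            chosen = pool[:nmatches]
        elif method == "random":
            chosen = rng.sample(pool, min(nmatches, len(pool)))
        else:
            raise ValueError(f"unknown method: {method!r}")
        if chosen:
            matches[ti] = chosen
        if not replacement:
            available.difference_update(chosen)  # each control used at most once
    return matches


test_scores = [0.40, 0.41]
control_scores = [0.405, 0.42, 0.80]
no_reuse = match_scores(test_scores, control_scores, threshold=0.03)
with_reuse = match_scores(test_scores, control_scores, threshold=0.03, replacement=True)
print(no_reuse)    # {0: [0], 1: [1]}
print(with_reuse)  # {0: [0], 1: [0]}
```

Without replacement the second treated unit has to settle for a slightly more distant control; with replacement both treated units share the closest one.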

Practical Guidance

Start with nmatches=1, replacement=False, and a moderate threshold.
If too few treated units are retained, loosen the threshold gradually.
If balance is weak after matching, tighten the threshold or change the model or balance strategy.
For severe class imbalance, test balance_strategy="under" as a sensitivity analysis.

Evaluation

After matching, evaluate covariate balance before causal analysis.

Categorical Covariates

cat_table = matcher.compare_categorical(return_table=True, plot_result=True)
print(cat_table)

Interpretation:

check before/after p-value shifts
look for reduced proportional differences after matching
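
"Reduced proportional differences" can be checked by hand: compute the share of each category in both groups and take the largest gap, before and after matching. The sketch below does this on made-up grade values; the function names are hypothetical and the numbers are not results from compare_categorical.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of values."""
    counts = Counter(values)
    return {k: n / len(values) for k, n in counts.items()}


def max_proportion_gap(test_values, control_values):
    """Largest absolute difference in category shares between two groups."""
    p, q = proportions(test_values), proportions(control_values)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


test_grades = ["A", "B", "B", "C"]
control_before = ["A", "A", "A", "B"]   # all controls, before matching
control_after = ["A", "B", "B", "C"]    # matched controls only

gap_before = max_proportion_gap(test_grades, control_before)
gap_after = max_proportion_gap(test_grades, control_after)
print(gap_before, gap_after)  # 0.5 0.0
```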

Continuous Covariates

cont_table = matcher.compare_continuous(return_table=True, plot_result=True)
print(cont_table)

Interpretation:

compare KS statistics and grouped permutation test p-values
monitor standardized mean/median differences pre vs post matching
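
Two of the quantities named above have short definitions that are worth seeing once. The sketch below computes a standardized mean difference (using the average of the two group variances as the pooled variance, one common convention) and the two-sample KS statistic. It is a plain-Python illustration and may differ in detail from what compare_continuous reports.

```python
import statistics


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


test_values = [1, 2, 3, 4, 5]
control_values = [3, 4, 5, 6, 7]
smd = standardized_mean_difference(test_values, control_values)
ks = ks_statistic(test_values, control_values)
print(round(smd, 3), round(ks, 3))  # -1.265 0.4
```

Both values should move toward zero after matching if the matched groups are better balanced on the covariate.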

Single Variable Proportion Test

print(matcher.prop_test("grade"))
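
A proportion test on a categorical covariate is typically a chi-square test of independence on the group-by-category count table. The sketch below computes the Pearson statistic and degrees of freedom for such a table. That prop_test uses this exact test is an assumption here, and the counts are made up.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts (groups x categories).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows: test, control. Columns: three hypothetical grade levels.
stat, dof = chi_square_statistic([[10, 20, 30], [20, 20, 20]])
print(round(stat, 3), dof)  # 5.333 2
```

A smaller statistic after matching indicates that the category mix of the two groups has become more similar.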

Troubleshooting

`ValueError: numpy.dtype size changed`

This is usually a NumPy/pandas binary compatibility issue: a compiled extension was built against a different NumPy version than the one currently installed.

pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"

Restart your Python kernel/session after reinstalling.

`Scores column not found`

Run predict_scores() before match().

matcher.fit_scores(...)
matcher.predict_scores()
matcher.match(...)

`FileNotFoundError` for dataset path

Use a path relative to the repository root, and run the script from that directory:

pd.read_csv("misc/loan.csv")
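
If the script or notebook runs from a subdirectory, the relative path no longer resolves. One workaround is to walk up the directory tree until the file is found. The helper find_data_file below is a hypothetical utility, not part of pysmatch; the demo builds a throwaway directory layout so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_data_file(relative_path, start):
    """Walk up from `start` until `relative_path` exists; return the full path."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        candidate = folder / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative_path} not found in or above {start}")


# Demo: a fake repository with misc/loan.csv and a nested working directory.
with tempfile.TemporaryDirectory() as root:
    data_file = Path(root) / "misc" / "loan.csv"
    data_file.parent.mkdir()
    data_file.write_text("loan_status\n")
    nested = Path(root) / "notebooks" / "drafts"
    nested.mkdir(parents=True)

    found = find_data_file("misc/loan.csv", start=nested)
    found_ok = found == data_file.resolve()
    print(found_ok)  # True
```

In real use, pass Path.cwd() as start and hand the returned path to pd.read_csv.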

No matches found

Usually the threshold is too strict or the score distributions of the two groups barely overlap.

increase threshold
try a different model_type
inspect score distributions with plot_scores()
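
Weak overlap can also be checked numerically before plotting: find the score range covered by both groups and see how many treated units fall inside it. The sketch below uses made-up scores, and common_support is a hypothetical helper.

```python
def common_support(test_scores, control_scores):
    """Return the overlapping score range and the share of treated units in it."""
    low = max(min(test_scores), min(control_scores))
    high = min(max(test_scores), max(control_scores))
    if low > high:
        return None, 0.0  # the two score ranges do not overlap at all
    inside = sum(low <= s <= high for s in test_scores)
    return (low, high), inside / len(test_scores)


test_scores = [0.55, 0.70, 0.85, 0.95]
control_scores = [0.05, 0.20, 0.40, 0.60]
region, share = common_support(test_scores, control_scores)
print(region, share)  # (0.55, 0.6) 0.25
```

When only a small share of treated units lies in the common range, a larger threshold will not help much; a different model or covariate set is the more promising change.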

Jupyter kernel issues in notebooks

If the kernel name stored in the notebook is unavailable, switch to an existing kernel (for example, python3) and rerun the cells.

FAQ

When should I use `linear` vs `tree` vs `knn`?

Start with linear for an interpretable baseline.
Use tree for nonlinear relationships and mixed feature types.
Use knn as a local-structure baseline and compare how sensitive the matches are to the model choice.

Is high model accuracy always better for matching?

Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.

Should I use over- or under-sampling?

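over: grows the smaller group instead of discarding records from the larger one, so it usually keeps more majority information; a good default.
under: shrinks the larger group, giving smaller and faster training sets; useful for sensitivity checks.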

How do I make runs reproducible?

set np.random.seed(...) before fitting and matching
pin package versions
record model and matching parameters in experiment logs
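
The three habits above can be combined into one log record per run. The sketch below gathers the seed, the parameters, and the installed package versions into a JSON line. The function run_record and the field names are assumptions made for this example; a package that is not installed is recorded as null.

```python
import json
from importlib import metadata


def run_record(seed, params, packages=("numpy", "pandas", "pysmatch")):
    """Collect the seed, parameters, and package versions for one run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {"seed": seed, "params": params, "versions": versions}


# Use the same seed value in np.random.seed(...) before fitting.
record = run_record(
    seed=42,
    params={"model_type": "linear", "nmodels": 10, "threshold": 0.001},
)
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```

Appending one such line per run to a log file makes it possible to trace any matched dataset back to the settings that produced it.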

Additional resources

Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Contributing

Contributions are welcome. Please open an issue or pull request in this repository.

License

pysmatch is released under the MIT License.

Project details

Release history

2.1.2 (this version): Mar 8, 2026
2.1.1: Feb 25, 2026
2.1.0: Feb 25, 2026
2.0: May 29, 2025
1.9: May 25, 2025
1.8: May 25, 2025
1.6: Apr 2, 2025
1.5: Apr 2, 2025
1.4: Apr 2, 2025
1.3: Apr 2, 2025
1.2: Mar 9, 2025
1.1: Jan 16, 2025
1.0: Jan 6, 2025
0.9: Dec 18, 2024
0.8: Nov 24, 2024
0.7: Nov 13, 2024
0.6: Nov 10, 2024
0.5: Nov 8, 2024
0.4: Oct 31, 2024
0.3: Sep 19, 2024
0.2: Sep 19, 2024
0.1: Sep 18, 2024
0.0.0.6: Sep 18, 2024
0.0.0.5: Apr 19, 2021
0.0.0.4: Apr 19, 2021
0.0.0.3: Apr 19, 2021
0.0.0.2: Apr 19, 2021
0.0.0.1: Apr 19, 2021

Download files

Source Distribution

pysmatch-2.1.2.tar.gz (282.9 kB)

Uploaded Mar 8, 2026 Source

Built Distribution

pysmatch-2.1.2-py3-none-any.whl (28.1 kB)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file pysmatch-2.1.2.tar.gz.

File metadata

Download URL: pysmatch-2.1.2.tar.gz
Upload date: Mar 8, 2026
Size: 282.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for pysmatch-2.1.2.tar.gz
Algorithm	Hash digest
SHA256	`68c4299dda7d7ab60042536e2f39432b47b7fdc3f96c2744e89abfe0a5b0408e`
MD5	`ea3b138a47d0415b0390b3095645ffd0`
BLAKE2b-256	`5a18ff417414840779ea808f3d0b56eb626f037aa6dafd5d2d07263fc6274cc6`

File details

Details for the file pysmatch-2.1.2-py3-none-any.whl.

File metadata

Download URL: pysmatch-2.1.2-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 28.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for pysmatch-2.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0734e28defa2ce7ee60435fffb47cd6d878c87a89e8e190cae991b956512cee6`
MD5	`e3acd7f32afbb01e9b2af705230529a3`
BLAKE2b-256	`d481df32927ec7b1bff7aa786f95bfe96d70dc3ae46859096949ee0bef076a85`
