INPS: Inference from Non-Probability Samples

Python package for statistical inference from non-probability samples.

User guide

Installation

INPS is available on the Python Package Index (PyPI).

pip install inps

Running the examples

The code in this guide requires the following imports.

import inps
import pandas as pd
import numpy as np
from numpy.random import default_rng
from xgboost import XGBRegressor, XGBClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPRegressor

The following code creates the simulated data used in the examples.

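rng = default_rng(0)
pop_size = 10000  # population size
n = 1000  # non-probability sample size
N = 2000  # probability sample size
# Columns: covariates A, B, cat, plus target and target_cat for the non-probability sample
np_sample = rng.standard_normal(n * 5).reshape(-1, 5)
p_sample = rng.standard_normal(N * 3).reshape(-1, 3)
population = rng.standard_normal(pop_size * 3).reshape(-1, 3)
# Design weights for the probability sample; they sum to pop_size
weights = [pop_size / N * 0.8] * (N // 2) + [pop_size / N * 1.2] * (N // 2)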

def to_category(num_series):
    # Dichotomize a numeric column into a "Yes"/"No" categorical Series
    return pd.Series(np.where(num_series > 0, "Yes", "No"), dtype = "category", copy = False)

np_sample = pd.DataFrame(np_sample, columns = ["A", "B", "cat", "target", "target_cat"], copy = False)
p_sample = pd.DataFrame(p_sample, columns = ["A", "B", "cat"], copy = False)
population = pd.DataFrame(population, columns = ["A", "B", "cat"], copy = False)
np_sample["target_cat"] = to_category(np_sample["target_cat"])
np_sample["cat"] = to_category(np_sample["cat"])
p_sample["cat"] = to_category(p_sample["cat"])
population["cat"] = to_category(population["cat"])
p_sample["weights"] = weights

In general, np_sample and p_sample must be Pandas DataFrames. With real data, the user has to verify the covariates first: they must have the same names, the same data types and (if categorical) the same categories in both DataFrames.
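
The verification described above can be partly automated. The following is a minimal sketch, not part of INPS: check_covariates is a hypothetical helper that compares names, data types and categories using only pandas.

```python
import pandas as pd

def check_covariates(np_df, p_df, covariates):
    """Return a list of problems found; an empty list means the covariates look consistent."""
    problems = []
    for col in covariates:
        if col not in np_df.columns or col not in p_df.columns:
            problems.append(f"{col}: missing from at least one DataFrame")
            continue
        a, b = np_df[col].dtype, p_df[col].dtype
        both_categorical = isinstance(a, pd.CategoricalDtype) and isinstance(b, pd.CategoricalDtype)
        if both_categorical:
            # Compare the category labels themselves, ignoring their order
            if set(a.categories) != set(b.categories):
                problems.append(f"{col}: categories differ")
        elif not pd.api.types.is_dtype_equal(a, b):
            problems.append(f"{col}: dtype mismatch ({a} vs {b})")
    return problems

# Toy data: "B" is missing from the second frame and "cat" has different categories
first = pd.DataFrame({"A": [0.1, 0.2], "B": [1.0, 2.0], "cat": pd.Series(["Yes", "No"], dtype="category")})
second = pd.DataFrame({"A": [0.3, 0.4], "cat": pd.Series(["Yes", "Yes"], dtype="category")})
problems = check_covariates(first, second, ["A", "B", "cat"])
print(problems)
```

An empty list only means that the columns are formally compatible. Whether they measure the same thing in both samples still has to be judged by the user.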

All the code may be found in the test script. Also, check the guide script for code applying INPS to real data.

Calibration

Calibration requires a sample and its known population totals. population_totals must be a Pandas Series.

population_totals = pd.Series({"A": 10, "B": 5})

Additionally, the user must pass either the total population size...

calibration_weights = inps.calibration_weights(np_sample, population_totals, population_size = pop_size)

...or the initial weights column name.

calibration_weights2 = inps.calibration_weights(p_sample, population_totals, weights_column = "weights")
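
To illustrate what calibration does, here is a minimal sketch of linear calibration written with NumPy only. It is an assumption that this resembles what INPS computes internally; the function name linear_calibration is hypothetical, and the uniform base weights pop_size / n used when only the population size is known are also an assumption.

```python
import numpy as np

def linear_calibration(X, totals, base_weights):
    """Adjust base weights so that the weighted column totals of X match `totals`."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(base_weights, dtype=float)
    totals = np.asarray(totals, dtype=float)
    gap = totals - X.T @ d            # how far the base weights are from the known totals
    M = X.T @ (d[:, None] * X)        # weighted cross-product matrix of the covariates
    lam = np.linalg.solve(M, gap)     # coefficients of the linear adjustment
    return d * (1.0 + X @ lam)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))     # two calibration variables
totals = np.array([10.0, 5.0])        # their known population totals
base = np.full(200, 10000 / 200)      # assumed uniform base weights: pop_size / n
w = linear_calibration(X, totals, base)
print(X.T @ w)                        # weighted totals after calibration
```

Linear calibration can produce negative weights, and it only reproduces the population size if a constant column is included among the calibration variables.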

Helper methods are provided for obtaining estimates and 95% confidence intervals.

mean_estimation = inps.estimation(np_sample["target"], calibration_weights)
mean_interval = inps.confidence_interval(np_sample["target"], calibration_weights)
proportion_estimation = inps.estimation(np_sample["target_cat"] == "Yes", calibration_weights)
proportion_interval = inps.confidence_interval(np_sample["target_cat"] == "Yes", calibration_weights)
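
For intuition, a weighted estimate and its interval can be sketched by hand. The code below assumes a weighted (Hajek) mean and a linearization-based normal interval; the variance estimator INPS actually uses may differ, and both function names are hypothetical.

```python
import numpy as np

def weighted_mean(y, w):
    """Weighted mean; with a boolean y this is a weighted proportion."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

def weighted_mean_interval(y, w, z=1.96):
    """Approximate 95% interval using a linearization variance estimate."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    mean = weighted_mean(y, w)
    resid = w * (y - mean)
    var = n / (n - 1) * np.sum(resid ** 2) / np.sum(w) ** 2
    half = z * np.sqrt(var)
    return mean - half, mean + half

y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.ones(4)
mean = weighted_mean(y, w)
lower, upper = weighted_mean_interval(y, w)
proportion = weighted_mean(np.array([False, True, True, True]), [2, 1, 1, 0])
```

With equal weights the variance reduces to the familiar sample variance divided by n.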

Propensity Score Adjustment

PSA requires np_sample, p_sample and population_size.

psa_weights = inps.psa_weights(np_sample, p_sample, pop_size)

The user may also pass a weights column for the p_sample.

psa_weights = inps.psa_weights(np_sample, p_sample, pop_size, weights_column = "weights")

By default, columns with the same name in both samples are selected as covariates. This may be dangerous, since sharing a name does not guarantee that two columns are comparable. It is preferable to select the covariates manually after having verified them.

psa_weights = inps.psa_weights(np_sample, p_sample, pop_size, weights_column = "weights", covariates = ["A", "B", "cat"])

By default, regularized logistic regression is applied. However, the user may choose any model supporting sample weights and compatible with the scikit-learn API.

psa_weights2 = inps.psa_weights(np_sample, p_sample, pop_size, weights_column = "weights", model = XGBClassifier(enable_categorical = True, tree_method = "hist"))

For models requiring only numerical data without missing values, make_preprocess_estimator adds some default preprocessing.

psa_weights3 = inps.psa_weights(np_sample, p_sample, pop_size, weights_column = "weights", model = inps.make_preprocess_estimator(BernoulliNB()))

The result is a dictionary with the np_sample and p_sample PSA weights as numpy arrays. The weights for the np_sample may be used for estimation as usual.

mean_estimation = inps.estimation(np_sample["target"], psa_weights["np"])
mean_interval = inps.confidence_interval(np_sample["target"], psa_weights["np"])
proportion_estimation = inps.estimation(np_sample["target_cat"] == "Yes", psa_weights["np"])
proportion_interval = inps.confidence_interval(np_sample["target_cat"] == "Yes", psa_weights["np"])
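
This guide does not spell out how estimated propensities become weights. One common choice, shown here as a sketch under that assumption and not as the formula INPS uses, is inverse propensity weighting rescaled to the population size. The function name is hypothetical.

```python
import numpy as np

def propensities_to_weights(propensities, population_size):
    """Inverse propensity weights, rescaled so that they sum to population_size."""
    # Clip to avoid division by zero for extreme propensities
    pi = np.clip(np.asarray(propensities, dtype=float), 1e-6, 1 - 1e-6)
    raw = 1.0 / pi
    return raw * population_size / raw.sum()

# Units that were less likely to end up in the sample receive larger weights
example_weights = propensities_to_weights([0.8, 0.5, 0.1], population_size=100)
print(example_weights)
```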

Statistical Matching

Matching requires np_sample, p_sample and target_column (from np_sample).

matching_values = inps.matching_values(np_sample, p_sample, "target")

If the target variable is categorical, a target category is required and probabilities are returned.

cat_matching_values = inps.matching_values(np_sample, p_sample, "target_cat", "Yes")

By default, columns with the same name in both samples are selected as covariates. This may be dangerous, since sharing a name does not guarantee that two columns are comparable. It is preferable to select the covariates manually after having verified them.

matching_values = inps.matching_values(np_sample, p_sample, "target", covariates = ["A", "B", "cat"])

By default, ridge regression (or regularized logistic regression for categorical values) is applied. However, the user may choose any model compatible with the scikit-learn API.

matching_values2 = inps.matching_values(np_sample, p_sample, "target", model = XGBRegressor(enable_categorical = True, tree_method = "hist"))

For models requiring only numerical data without missing values, make_preprocess_estimator adds some default preprocessing.

matching_values3 = inps.matching_values(np_sample, p_sample, "target", model = inps.make_preprocess_estimator(MLPRegressor()))

The result is a dictionary with the p_sample and np_sample imputed values (or probabilities if categorical) as numpy arrays. The values for the p_sample may be used for estimation as usual.

mean_estimation = inps.estimation(matching_values["p"], p_sample["weights"])
mean_interval = inps.confidence_interval(matching_values["p"], p_sample["weights"])
proportion_estimation = inps.estimation(cat_matching_values["p"], p_sample["weights"])
proportion_interval = inps.confidence_interval(cat_matching_values["p"], p_sample["weights"])
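
The idea behind matching can be sketched without INPS: fit a model on the non-probability sample, predict the target for the probability sample, and average the predictions with the design weights. The closed-form ridge regression below is a stand-in chosen for brevity; the helper names, the simulated data and the penalty value are all assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(X1.shape[1])
    penalty[0, 0] = 0.0               # do not shrink the intercept
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)

def ridge_predict(beta, X):
    return beta[0] + X @ beta[1:]

rng = np.random.default_rng(0)
X_np = rng.standard_normal((500, 2))                  # covariates, non-probability sample
y_np = 1 + 2 * X_np[:, 0] - X_np[:, 1] + 0.1 * rng.standard_normal(500)
X_p = rng.standard_normal((300, 2))                   # covariates, probability sample
w_p = np.full(300, 10.0)                              # design weights

beta = ridge_fit(X_np, y_np)
imputed = ridge_predict(beta, X_p)                    # imputed target values for the p_sample
estimate = np.sum(w_p * imputed) / np.sum(w_p)        # weighted mean of the imputed values
```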

Doubly robust

The parameters are analogous to matching.

doubly_robust_estimation = inps.doubly_robust_estimation(np_sample, p_sample, "target", covariates = ["A", "B", "cat"])
cat_doubly_robust_estimation = inps.doubly_robust_estimation(np_sample, p_sample, "target_cat", "Yes", covariates = ["A", "B", "cat"])

Default models are applied. As usual, custom ones may be specified.

doubly_robust_estimation2 = inps.doubly_robust_estimation(np_sample, p_sample, "target", psa_model = XGBClassifier(enable_categorical = True, tree_method = "hist"), matching_model = XGBRegressor(enable_categorical = True, tree_method = "hist"))

The estimated mean/proportion is directly returned by the method.
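
A common form of the doubly robust estimator adds a weighted residual correction from the non-probability sample to the model-based mean over the probability sample. The sketch below assumes that form; it is not taken from the INPS source, and the function name is hypothetical.

```python
import numpy as np

def doubly_robust_mean(y_np, pred_np, w_np, pred_p, w_p):
    """Model-based mean over the p_sample plus a weighted residual correction from the np_sample."""
    y_np, pred_np, w_np = (np.asarray(v, dtype=float) for v in (y_np, pred_np, w_np))
    pred_p, w_p = (np.asarray(v, dtype=float) for v in (pred_p, w_p))
    model_part = np.sum(w_p * pred_p) / np.sum(w_p)
    correction = np.sum(w_np * (y_np - pred_np)) / np.sum(w_np)
    return float(model_part + correction)

y = [1.0, 2.0, 3.0]
psa_w = [1.0, 1.0, 2.0]   # e.g. PSA weights for the np_sample
# Perfect predictions: the correction vanishes and only the model part remains
perfect = doubly_robust_mean(y, y, psa_w, pred_p=[2.0, 4.0], w_p=[1.0, 3.0])
# Useless predictions (all zeros): the result falls back to the weighted np_sample mean
fallback = doubly_robust_mean(y, [0.0, 0.0, 0.0], psa_w, pred_p=[0.0, 0.0], w_p=[1.0, 3.0])
```

The two calls show why the estimator is called doubly robust: it stays reasonable when either the outcome model or the weights are good.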

Training

Training is the recommended method. The parameters and return values are analogous to matching, except that there are now two models the user may optionally specify.

training_values = inps.training_values(np_sample, p_sample, "target", psa_model = XGBClassifier(enable_categorical = True, tree_method = "hist"), matching_model = XGBRegressor(enable_categorical = True, tree_method = "hist"))

The imputed values for the p_sample may be used for estimation as usual.

Kernel weighting

Kernel weighting parameters are analogous to PSA.

kw_weights = inps.kw_weights(np_sample, p_sample, pop_size, weights_column = "weights", covariates = ["A", "B", "cat"])

A numpy array with the estimated weights for the np_sample is returned.

proportion_estimation = inps.estimation(np_sample["target_cat"] == "Yes", kw_weights)
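
As a rough illustration of the kernel weighting idea, the sketch below shares out each probability-sample unit's design weight among the non-probability units according to how close their propensity scores are. The Gaussian kernel, the bandwidth and the function name are assumptions; INPS may use a different kernel or distance.

```python
import numpy as np

def kernel_weights(score_np, score_p, w_p, bandwidth=0.1):
    """Distribute each p_sample design weight over np_sample units by kernel similarity."""
    score_np = np.asarray(score_np, dtype=float)
    score_p = np.asarray(score_p, dtype=float)
    w_p = np.asarray(w_p, dtype=float)
    diff = (score_np[:, None] - score_p[None, :]) / bandwidth
    K = np.exp(-0.5 * diff ** 2)                      # Gaussian kernel, shape (n_np, n_p)
    shares = K / K.sum(axis=0, keepdims=True)         # each column sums to 1
    return shares @ w_p                               # one weight per np_sample unit

rng = np.random.default_rng(0)
score_np = rng.uniform(0.2, 0.8, 50)                  # propensity scores, np_sample
score_p = rng.uniform(0.2, 0.8, 30)                   # propensity scores, p_sample
w_p = np.full(30, 100.0)                              # design weights
kw = kernel_weights(score_np, score_p, w_p)
```

Because every column of shares sums to one, the resulting weights add up to the total of the design weights.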

Working with census data

The exact same methods can be used when the "probabilistic sample" is in fact the whole population.

imputed_values = inps.training_values(np_sample, population, "target")
cat_imputed_values = inps.training_values(np_sample, population, "target_cat", "Yes")
mean_estimation = inps.estimation(imputed_values["p"])
mean_interval = inps.confidence_interval(imputed_values["p"])
proportion_estimation = inps.estimation(cat_imputed_values["p"])
proportion_interval = inps.confidence_interval(cat_imputed_values["p"])

Advanced models

inps.boosting_classifier() and inps.boosting_regressor() return preconfigured Gradient Boosting estimators that are ready to use.

Release history

Version	Release date
1.20	Mar 17, 2025
1.19	Nov 17, 2024
1.18	Nov 12, 2024
1.17	Nov 12, 2024
1.16 (this version)	Nov 10, 2024
1.15	Nov 10, 2024
1.14	Nov 8, 2024
1.13	Nov 8, 2024
1.12	Oct 29, 2024
1.11	Oct 3, 2024
1.10	Oct 3, 2024
1.9	Oct 3, 2024
1.8	Oct 3, 2024
1.7	Oct 2, 2024
1.6	Oct 1, 2024
1.5	Oct 1, 2024
1.4	Oct 1, 2024
1.3	Oct 1, 2024
1.2	Sep 30, 2024
1.1.3	Sep 30, 2024
1.1.2	Jul 29, 2024
1.1.1	Jul 29, 2024
1.1	May 21, 2024
1.0	Mar 11, 2024

Download files

Source Distribution

inps-1.16.tar.gz (290.6 kB)

Uploaded Nov 10, 2024 Source

Built Distribution

inps-1.16-py2.py3-none-any.whl (18.1 kB)

Uploaded Nov 10, 2024 Python 2, Python 3

File details

Details for the file inps-1.16.tar.gz.

File metadata

Download URL: inps-1.16.tar.gz
Upload date: Nov 10, 2024
Size: 290.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for inps-1.16.tar.gz
Algorithm	Hash digest
SHA256	`419555797bfcbaf1bf0ba249b5408816967cedb59588b0aede94741b7d6370c1`
MD5	`0213266aa2b1f883807b177b21fd52e8`
BLAKE2b-256	`b36364f39f15faf5d788bdaf9d68533d1514ac3adc5d059106a408d7c2a4ce47`

File details

Details for the file inps-1.16-py2.py3-none-any.whl.

File metadata

Download URL: inps-1.16-py2.py3-none-any.whl
Upload date: Nov 10, 2024
Size: 18.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for inps-1.16-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e0c7ef3e7d0820b6a3bfd21f0f1a99363e317067bb52f98fdb6db782d1a5379`
MD5	`f4872a575ad51202f11f42d0f84cd9a4`
BLAKE2b-256	`c82ecf7d7c67d70f5928d33c5066cc7b923956e68e25937a50b5b6ef834ed890`
