Select, weight and analyze complex sample data

Sample Analytics

In large-scale surveys, samples are often selected through complex random mechanisms, and estimates derived from such samples must reflect those mechanisms. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four subpackages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

Sample size calculation and allocation: Wald and Fleiss methods for proportions.
Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS)
Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

Weight adjustment due to nonresponse
Weight poststratification, calibration and normalization
Weight replication, i.e., Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

Taylor-based methods, also called linearization methods
Replication-based estimation, i.e., Bootstrap, BRR, and Jackknife
Regression-based estimation, e.g., generalized regression (GREG)

Small Area Estimation (SAE). When the sample size is too small to produce reliable, stable domain-level estimates, SAE techniques can be used to model the variable of interest and produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics-org.github.io/samplics/

Usage

Let's assume that we have a population and would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half-width of the confidence interval) of 0.10.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter = "proportion")
sample_size.calculate(target=0.80, half_ci=0.10)
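For intuition, the classical Wald sample size for a proportion can be worked out by hand. The sketch below uses the textbook formula and is not samplics' internal code: the helper name `wald_sample_size` and its `alpha` and `deff` parameters are assumptions made for illustration, and samplics may round or adjust the result differently.

```python
import math
from statistics import NormalDist


def wald_sample_size(p, half_ci, alpha=0.05, deff=1.0):
    """Textbook Wald sample size for a proportion (hypothetical helper)."""
    # Two-sided normal critical value, about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # n = deff * z^2 * p * (1 - p) / e^2, rounded up to a whole unit
    return math.ceil(deff * z**2 * p * (1 - p) / half_ci**2)


n = wald_sample_size(p=0.80, half_ci=0.10)
print(n)  # 62
```

A smaller half-width or a proportion closer to 0.5 increases the required sample size.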

Furthermore, the population is located in four natural regions, i.e., North, South, East, and West. We may want to calculate sample sizes from region-specific requirements, e.g., expected proportions, desired precisions, and associated design effects.

from samplics.sampling import SampleSize

# Region-specific requirements
expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

# Stratified calculation; method can be "wald" or "Fleiss"
sample_size = SampleSize(parameter="proportion", method="Fleiss", stratification=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
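To see how the region-specific inputs interact, the sketch below applies the textbook Wald formula stratum by stratum and inflates each result by its design effect. It deliberately uses the Wald formula rather than the Fleiss method, so the numbers need not match what samplics returns for the call above; the name `wald_by_region` is an assumption made for illustration.

```python
import math
from statistics import NormalDist

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

# Wald formula per stratum: deff * z^2 * p * (1 - p) / e^2, rounded up
wald_by_region = {
    region: math.ceil(deff[region] * z**2 * p * (1 - p) / half_ci[region] ** 2)
    for region, p in expected_proportions.items()
}
print(wald_by_region)
```

The West region needs the largest sample here because it combines the most variable proportion (0.50) with a tight precision and a design effect of 2.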

To select a sample of primary sampling units (PSUs) using the PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.

# First we import the example dataset
from samplics.datasets import load_psu_frame
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
    method="pps-sys",
    stratification=True,
    with_replacement=False,
)

psu_frame["psu_prob"] = pps_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
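As a rough illustration of what systematic PPS selection does, the sketch below works on a single stratum with plain Python lists. The helpers `pps_inclusion_probs` and `pps_systematic` and the size values are hypothetical and are not part of samplics; the sketch also assumes no unit is larger than the sampling interval, since such a unit could be selected more than once.

```python
import random


def pps_inclusion_probs(sizes, n):
    """Inclusion probability of each unit: n * size / total size."""
    total = sum(sizes)
    return [n * s / total for s in sizes]


def pps_systematic(sizes, n, rng):
    """Select n unit indices with systematic PPS (hypothetical helper)."""
    total = sum(sizes)
    step = total / n                 # sampling interval
    start = rng.random() * step      # random start in [0, step)
    points = [start + k * step for k in range(n)]

    selected, cumulative, k = [], 0, 0
    for index, size in enumerate(sizes):
        cumulative += size
        # A unit is picked once for every point that falls in its size range
        while k < n and points[k] < cumulative:
            selected.append(index)
            k += 1
    return selected


sizes = [120, 80, 200, 150, 50, 100, 300]  # made-up household counts
probs = pps_inclusion_probs(sizes, n=2)
sample = pps_systematic(sizes, n=2, rng=random.Random(7))
print(probs, sample)
```

Larger units cover a wider stretch of the cumulative size scale, which is why their chance of containing a selection point is proportional to their size.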

The initial weighting step is to obtain the design sample weights. This example shows a simple two-stage sampling design.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

To adjust the design sample weight for nonresponse, we can use code similar to:

import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)
# Map custom response statuses to the generic samplics statuses
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown",
}
# Adjust sample weights
full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adjust_class=full_sample["region"],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)
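The idea behind a class-based nonresponse adjustment can be sketched in a few lines. This simplified version only distinguishes respondents from non-respondents and ignores the ineligible and unknown statuses that samplics also handles, so it is not the package's algorithm; the helper `adjust_for_nonresponse` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def adjust_for_nonresponse(weights, classes, statuses):
    """Shift non-respondent weight onto respondents within each class."""
    resp_total = defaultdict(float)   # weight of respondents per class
    elig_total = defaultdict(float)   # respondents + non-respondents per class
    for w, c, s in zip(weights, classes, statuses):
        if s in ("respondent", "non-respondent"):
            elig_total[c] += w
        if s == "respondent":
            resp_total[c] += w

    adjusted = []
    for w, c, s in zip(weights, classes, statuses):
        if s == "respondent":
            adjusted.append(w * elig_total[c] / resp_total[c])
        else:
            adjusted.append(0.0)  # non-respondents carry no weight afterwards
    return adjusted


weights = [10.0, 10.0, 20.0, 5.0, 15.0]
classes = ["North", "North", "North", "South", "South"]
statuses = ["respondent", "non-respondent", "respondent",
            "respondent", "non-respondent"]
adjusted = adjust_for_nonresponse(weights, classes, statuses)
print(adjusted)
```

Within each class the total weight is preserved, so respondents stand in for the non-respondents that resemble them.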

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

# Taylor-based
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
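For readers who want to see the arithmetic behind these two approaches, the sketch below computes a weighted mean with its linearization variance and a basic BRR variance on toy data. It is a simplified illustration, not samplics' implementation: the helpers `taylor_mean` and `brr_variance` are hypothetical, there is no finite population correction or Fay coefficient, and every stratum is assumed to contain at least two PSUs.

```python
from collections import defaultdict


def taylor_mean(y, w, stratum, psu):
    """Weighted mean and its linearization variance (hypothetical helper)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized score of each observation, summed to the PSU level
    psu_scores = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_scores[h][j] += wi * (yi - mean) / total_w

    # Between-PSU variability within each stratum
    variance = 0.0
    for scores in psu_scores.values():
        values = list(scores.values())
        n_h = len(values)
        center = sum(values) / n_h
        variance += n_h / (n_h - 1) * sum((z - center) ** 2 for z in values)
    return mean, variance


def brr_variance(full_estimate, replicate_estimates):
    """Basic BRR variance: mean squared deviation of replicate estimates."""
    r = len(replicate_estimates)
    return sum((t - full_estimate) ** 2 for t in replicate_estimates) / r


y = [1, 3, 5, 2, 4]
w = [1, 1, 2, 1, 1]
stratum = [1, 1, 1, 2, 2]
psu = ["a", "a", "b", "c", "d"]
mean, variance = taylor_mean(y, w, stratum, psu)
print(mean, variance)
```

Both approaches measure how much the estimate varies across PSUs within strata; linearization does so analytically, while replication recomputes the estimate under each set of replicate weights.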

To predict small area parameters, we can use code similar to:

import numpy as np
import pandas as pd

# Area-level basic method
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]
# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    countycrop["corn_area"],
    countycrop[["corn_pixel", "soybeans_pixel"]],
    countycrop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=countycrop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
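The core of the area-level (Fay-Herriot) prediction is a shrinkage step, sketched below under strong simplifications. The helper `fay_herriot_eblup` and its inputs are hypothetical: the regression predictions and the area random-effect variance are taken as given, whereas a real fit would estimate them from the data (for example by REML). Here `error_var` stands for the squared standard errors of the direct estimates.

```python
def fay_herriot_eblup(direct_est, synthetic_est, error_var, sigma2_v):
    """Shrink each direct estimate toward its regression prediction."""
    predictions = []
    for direct, synthetic, psi in zip(direct_est, synthetic_est, error_var):
        # gamma near 1 trusts the direct estimate; near 0 trusts the model
        gamma = sigma2_v / (sigma2_v + psi)
        predictions.append(gamma * direct + (1 - gamma) * synthetic)
    return predictions


eblup = fay_herriot_eblup(
    direct_est=[10.0, 10.0],
    synthetic_est=[20.0, 20.0],
    error_var=[4.0, 0.0],  # sampling variances of the direct estimates
    sigma2_v=4.0,          # area random-effect variance, assumed known here
)
print(eblup)  # [15.0, 10.0]
```

Areas with noisy direct estimates are pulled toward the model prediction, while an area measured without sampling error keeps its direct estimate.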

Installation

pip install samplics

Python 3.7 or newer is required. The main dependencies are numpy, pandas, scipy, and statsmodels.

Contribution

If you would like to contribute to the project, please read the guide on contributing to samplics.

License

MIT

Contact

Created by Mamadou S. Diallo. Feel free to contact me!
