Fast implementation of the stratified cluster bootstrap sampling algorithm.

FastScboot is a statistics tool that performs stratified cluster bootstrap sampling on given data. The algorithm is fast in the sense that the only remaining bottleneck is the speed of memory access during the in-place fancy indexing operation.

Install

pip install fast-scboot

Getting started

First import the package and initialize the Sampler object.

from fast_scboot import Sampler

s = Sampler()

Let’s create some sample data.

import numpy as np
import pandas as pd

clusts = np.asarray([0, 1, 1, 2, 0, 1, 1, 0, 2, 2])
strats = np.asarray([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
data = np.squeeze(np.dstack([strats, clusts])).astype(np.double)
data = pd.DataFrame(data, columns=['strat', 'clust'])

Two preparatory steps are required: preparing the data and creating a data cache:

s.prepare_data(data, 'strat', 'clust')
s.setup_cache()

After that, you can start drawing samples:

for i in range(100):
    sampled = s.sample_data(seed=i)

How does it work?

https://github.com/mozjay0619/fast-scboot/blob/master/media/image1.png

When the prepare_data method is invoked, the original data is first sorted by stratum and cluster levels, and make_index_matrix then creates three auxiliary arrays: idx_mtx, strat_arr, and clust_arr. The idx_mtx array stores where each cluster begins and how many rows it occupies, as well as the actual cluster value. The strat_arr array is an index array that gives the stratum level of each cluster. The clust_arr array does the same for the cluster levels. In this example the values of clust_arr are not monotonically increasing like those of strat_arr because, internally, the unique indices are created using the Cantor pairing function for speed (and then re-cast into integers using the Pandas “category” type).
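The bookkeeping described above can be sketched in plain NumPy and pandas. This is only an illustration under stated assumptions, not the library's implementation: the helper name cantor_pair, the variable names, and the (start, count, cluster value) column layout of idx_mtx_sketch are all hypothetical.

```python
import numpy as np
import pandas as pd

# Example data from "Getting started", already sorted by stratum and cluster.
strats = np.asarray([0, 0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=np.int64)
clusts = np.asarray([0, 1, 1, 2, 0, 1, 1, 0, 2, 2], dtype=np.int64)


def cantor_pair(a, b):
    """Map each pair of non-negative integers to a unique integer."""
    total = a + b
    return total * (total + 1) // 2 + b


# One unique value per (stratum, cluster) pair, then compact integer codes
# via the pandas category type.
paired = cantor_pair(strats, clusts)
clust_codes = np.asarray(pd.Categorical(paired).codes)

# A row starts a new cluster when its code differs from the previous row's.
is_start = np.r_[True, clust_codes[1:] != clust_codes[:-1]]
starts = np.flatnonzero(is_start)
counts = np.diff(np.r_[starts, len(clust_codes)])

# Hypothetical layout: one row per cluster, columns (start, count, cluster value).
idx_mtx_sketch = np.column_stack([starts, counts, clusts[starts]])

# Per-cluster codes; they are unique but not monotonically increasing.
print(clust_codes[starts])
```

The category codes follow the sorted order of the Cantor values rather than the row order, which is why the per-cluster codes jump around even though the data itself is sorted.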

When the sample_data method is invoked, three additional auxiliary values are created. The clust_cnt_arr array stores the number of unique cluster values in each stratum, in this case [3, 2, 2]. The total number of unique strata values is stored in the num_strats variable (3 in this case), and the total number of unique clusters is stored in the num_clusts variable (7 in this case).
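To check these numbers against the example data, the same three quantities can be computed with a pandas groupby. This is a sketch for verification only; the library computes them internally, and its actual code may differ.

```python
import pandas as pd

data = pd.DataFrame({
    "strat": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "clust": [0, 1, 1, 2, 0, 1, 1, 0, 2, 2],
})

# Number of distinct clusters inside each stratum.
clust_cnt_arr = data.groupby("strat")["clust"].nunique().to_numpy()

num_strats = len(clust_cnt_arr)         # number of distinct strata
num_clusts = int(clust_cnt_arr.sum())   # number of (stratum, cluster) pairs

print(clust_cnt_arr, num_strats, num_clusts)
```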

https://github.com/mozjay0619/fast-scboot/blob/master/media/imageB.png

We draw a random array of size num_clusts from the uniform distribution on [0, 1]. It’s important that we invoke the random sampling function only once, because calling it repeatedly is usually very expensive. Then we loop through the uniform random numbers (vectorized using Cython), multiply each one by the corresponding value in clust_cnt_arr, and cast the result to an integer datatype. We are effectively mapping the uniform random values from [0, 1] to the appropriate range of integer values, which can be used as randomly bootstrap sampled indices (stored in the s variable) for the idx_mtx array.
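The mapping from uniform draws to per-stratum cluster indices can be sketched in NumPy as follows. This is not the library's Cython loop. It assumes the draws fall in the half-open interval [0, 1), as NumPy's generator produces, and that a per-stratum offset is added so each index points into the full idx_mtx; the names counts_per_slot and offsets are hypothetical.

```python
import numpy as np

clust_cnt_arr = np.array([3, 2, 2])
num_clusts = int(clust_cnt_arr.sum())

# A single call produces all the random numbers needed for one bootstrap draw.
rng = np.random.default_rng(0)
u = rng.random(num_clusts)  # values in [0, 1)

# Each stratum draws as many clusters as it contains, so repeat its cluster
# count once per draw: [3, 3, 3, 2, 2, 2, 2].
counts_per_slot = np.repeat(clust_cnt_arr, clust_cnt_arr)

# Position of each stratum's first cluster within idx_mtx: [0, 0, 0, 3, 3, 5, 5].
first_clust = np.cumsum(clust_cnt_arr) - clust_cnt_arr
offsets = np.repeat(first_clust, clust_cnt_arr)

# Scale each draw to [0, count) and truncate to get a within-stratum index,
# then shift it into the stratum's block of idx_mtx rows.
s = offsets + (u * counts_per_slot).astype(np.int64)
```

Because truncation maps [0, 1) evenly onto the integers 0 to count - 1, every cluster in a stratum is equally likely to be picked on each draw.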

https://github.com/mozjay0619/fast-scboot/blob/master/media/image5.png

The s array is used to index idx_mtx, which effectively samples clusters with replacement from each stratum (i.e. from each colored area). Once we have the cluster bootstrap sampled idx_mtx, we can use the information stored in that matrix to construct the sampled_idxs array, which records the rows of the sampled data in terms of the indices of the original data. The final return value is produced by fancy indexing the original data using sampled_idxs. Native NumPy fancy indexing is somewhat costly due to the data copy, so we provide our own in-place version of fancy indexing.
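Expanding sampled clusters into row indices can be sketched in NumPy as follows. The starts and counts arrays stand in for the corresponding idx_mtx columns, and the values in s are an arbitrary example draw; all of these names and values are assumptions. The call to np.take with a preallocated out buffer only approximates the idea of writing into existing memory and is not the library's own in-place routine.

```python
import numpy as np

# Example data, sorted by stratum and cluster (columns: strat, clust).
strats = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=np.double)
clusts = np.array([0, 1, 1, 2, 0, 1, 1, 0, 2, 2], dtype=np.double)
data = np.column_stack([strats, clusts])

# Stand-ins for idx_mtx columns: first row and row count of each cluster.
starts = np.array([0, 1, 3, 4, 5, 7, 8])
counts = np.array([1, 2, 1, 1, 2, 1, 2])

# An example draw of cluster positions, sampled with replacement per stratum.
s = np.array([1, 1, 2, 4, 3, 6, 6])

sampled_starts = starts[s]
sampled_counts = counts[s]
total = int(sampled_counts.sum())

# Expand every (start, count) pair into consecutive row indices without a
# Python loop: out_offsets marks where each cluster begins in the output.
out_offsets = np.cumsum(sampled_counts) - sampled_counts
sampled_idxs = np.repeat(sampled_starts - out_offsets, sampled_counts) + np.arange(total)

# Gather the rows into a preallocated buffer.
buf = np.empty((total, data.shape[1]), dtype=data.dtype)
sampled = np.take(data, sampled_idxs, axis=0, out=buf)

print(sampled_idxs)
```

Note that the number of output rows varies from draw to draw when clusters have different sizes, so a reusable buffer would need to be sized for the largest possible draw.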
