Fast implementation of the stratified cluster bootstrap sampling algorithm.
FastScboot is a statistics tool that performs stratified cluster bootstrap sampling on given data. The algorithm is fast in the sense that the only remaining bottleneck is the speed of memory access during the in-place fancy indexing operation.
Install
pip install fast-scboot
Getting started
First import the package and initialize the Sampler object.
from fast_scboot import Sampler
s = Sampler()
Let’s create some sample data.
import numpy as np
import pandas as pd
clusts = np.asarray([0, 1, 1, 2, 0, 1, 1, 0, 2, 2])
strats = np.asarray([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
data = np.squeeze(np.dstack([strats, clusts])).astype(np.double)
data = pd.DataFrame(data, columns=['strat', 'clust'])
Two preparatory steps are required: preparing the data and setting up the cache.
s.prepare_data(data, 'strat', 'clust')
s.setup_cache()
After that, you can start drawing samples:
for i in range(100):
    sampled = s.sample_data(seed=i)
How does it work?
When the prepare_data method is invoked, the original data is first sorted by stratum and cluster levels, and then make_index_matrix creates three auxiliary arrays: idx_mtx, strat_arr, and clust_arr. The idx_mtx array stores where each cluster begins and how many rows it occupies, as well as the actual cluster value. The strat_arr array indexes the stratum level of each cluster, and clust_arr does the same for the cluster levels. The values of clust_arr are not uniformly increasing like those of strat_arr in this example because, internally, the unique indices are created using the Cantor pairing function for speed (and then re-cast into integers using the Pandas “category” type).
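The indexing step can be illustrated in plain Python. This is only a minimal sketch, not the package's Cython code: the names cantor_pair and build_index are hypothetical, the (start, count, cluster value) column order is an assumption, and the rows are assumed to be sorted by stratum and then by cluster already.

```python
def cantor_pair(a, b):
    """Cantor pairing: maps a pair of non-negative integers to a unique integer."""
    return (a + b) * (a + b + 1) // 2 + b


def build_index(strats, clusts):
    """Return [start, count, cluster value] rows plus dense cluster codes.

    Hypothetical helper. Assumes rows are sorted by stratum, then by cluster.
    """
    keys = [cantor_pair(s, c) for s, c in zip(strats, clusts)]
    # Dense codes in sorted key order, similar in spirit to pandas category codes.
    code_of = {k: i for i, k in enumerate(sorted(set(keys)))}
    codes = [code_of[k] for k in keys]

    idx_rows = []
    for pos, (key, c) in enumerate(zip(keys, clusts)):
        if pos == 0 or key != keys[pos - 1]:
            idx_rows.append([pos, 1, c])   # a new cluster starts at this row
        else:
            idx_rows[-1][1] += 1           # same cluster: extend its row count
    return idx_rows, codes


# The example data from "Getting started", already sorted by (strat, clust).
strats = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
clusts = [0, 1, 1, 2, 0, 1, 1, 0, 2, 2]
idx_rows, codes = build_index(strats, clusts)
```

Because the pairing function interleaves stratum and cluster values, the resulting codes need not increase from row to row, which is consistent with the non-monotonic clust_arr described above.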
When the sample_data method is invoked, three additional auxiliary values are created. The clust_cnt_arr array stores the number of unique cluster values in each stratum, in this case [3, 2, 2]. The total number of unique stratum values is stored in the num_strats variable (3 in this case), and the total number of unique clusters is stored in the num_clusts variable (7 in this case).
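These counts can be checked by hand on the example data. The snippet below is a plain-Python sketch for illustration only (the package computes them internally); the rows list is simply the (strat, clust) pairs from "Getting started".

```python
from itertools import groupby

# (stratum, cluster) for each row of the example data, sorted by stratum then cluster.
rows = [(0, 0), (0, 1), (0, 1), (0, 2),
        (1, 0), (1, 1), (1, 1),
        (2, 0), (2, 2), (2, 2)]

# Count the distinct cluster values inside each stratum.
clust_cnt_arr = [
    len({clust for _, clust in group})
    for _, group in groupby(rows, key=lambda r: r[0])
]
num_strats = len(clust_cnt_arr)   # number of strata
num_clusts = sum(clust_cnt_arr)   # number of (stratum, cluster) combinations

print(clust_cnt_arr, num_strats, num_clusts)   # [3, 2, 2] 3 7
```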
We draw an array of random numbers from the uniform distribution on [0, 1], with size equal to num_clusts. It is important to invoke the random sampling function only once, because calling it repeatedly is usually very expensive. We then loop through the uniform random numbers (vectorized using Cython), multiply each one by the corresponding value in clust_cnt_arr, and cast the result to an integer data type. This effectively maps the uniform random values to the appropriate range of integers, which can be used as the bootstrap-sampled indices (stored in the s variable) into the idx_mtx array.
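The scaling trick might look like the following in plain Python. This is a sketch under assumptions rather than the package's implementation: draw_cluster_indices is a hypothetical name, Python's random module stands in for the actual generator, and adding a per-stratum offset so that the result indexes directly into idx_mtx is an assumption about how the indices are laid out.

```python
import random


def draw_cluster_indices(clust_cnt_arr, seed=None):
    """Draw, for every cluster slot, the position of a cluster from the same stratum."""
    rng = random.Random(seed)
    num_clusts = sum(clust_cnt_arr)
    # One batch of uniform draws in [0, 1), one per cluster slot.
    u = [rng.random() for _ in range(num_clusts)]

    sampled = []
    offset = 0   # assumed position of the stratum's first cluster in idx_mtx
    pos = 0
    for cnt in clust_cnt_arr:
        for _ in range(cnt):
            # Scale to [0, cnt) and truncate to get a within-stratum index.
            sampled.append(offset + int(u[pos] * cnt))
            pos += 1
        offset += cnt
    return sampled


s = draw_cluster_indices([3, 2, 2], seed=0)
```

Each stratum keeps its original number of clusters, and every drawn index stays inside the stratum it was drawn for, so the stratification is preserved.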
The s array is then applied to idx_mtx, which amounts to sampling clusters with replacement within each stratum. Once we have the cluster-bootstrap-sampled idx_mtx, we use the information stored in that matrix to construct the sampled_idxs array, which records the indices of the sampled rows in terms of the indices of the original data. The final return value is produced by fancy indexing the original data with sampled_idxs. Native NumPy fancy indexing is somewhat costly because it copies data, so we provide our own in-place version of fancy indexing.
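The expansion from sampled clusters to row indices can be sketched in plain Python as follows. The helper name expand_sampled_rows, the (start, count, cluster value) column order, and the fixed draw s are all assumptions made for illustration; the list comprehension at the end is an ordinary copying gather, not the package's in-place indexing.

```python
def expand_sampled_rows(idx_mtx, s):
    """Turn sampled cluster positions into row indices of the original data."""
    sampled_idxs = []
    for i in s:
        start, count = idx_mtx[i][0], idx_mtx[i][1]
        # Every row of the chosen cluster is included, in its original order.
        sampled_idxs.extend(range(start, start + count))
    return sampled_idxs


# (start, count, cluster value) per cluster, for the example data.
idx_mtx = [(0, 1, 0), (1, 2, 1), (3, 1, 2),
           (4, 1, 0), (5, 2, 1),
           (7, 1, 0), (8, 2, 2)]
s = [1, 1, 2, 3, 3, 6, 5]   # an arbitrary draw, fixed here for illustration

sampled_idxs = expand_sampled_rows(idx_mtx, s)
data = list("abcdefghij")   # stand-in for the ten original rows
sampled = [data[i] for i in sampled_idxs]
```

Since clusters differ in size, the number of sampled rows generally varies from draw to draw; it only happens to equal the original ten rows for this particular s.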
Source Distribution
File details
Details for the file fast-scboot-0.1b6.tar.gz.
File metadata
- Download URL: fast-scboot-0.1b6.tar.gz
- Upload date:
- Size: 183.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 431d46b310bdef2c5a9011173d5b3249d4f40dedef5dfb835f3aee76b7fe4293
MD5 | 007c8cf4007b499d127744698d2dfee0
BLAKE2b-256 | 9da6306eb58cb963b35581bf1ec16b8053d94dea20c70dca998b3053b5b75c9f