Sampling library for Python
csample provides pseudo-random sampling methods applicable when the size of the population is unknown: hash-based sampling and reservoir sampling.
Hash-based sampling is a filtering method that tries to approximate random sampling by using a hash function as a selection criterion.
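As an illustration of the idea, here is a minimal sketch in plain Python. It is not csample's implementation: the function name keep() and the choice of SHA-256 are assumptions made for this example only. An item is selected when its hash value falls below a threshold proportional to the sampling rate.

```python
import hashlib


def keep(key: str, rate: float) -> bool:
    """Hypothetical helper: True if `key` falls inside the sampled fraction."""
    # Hash the key to a 64-bit integer; the same key always gives the same value.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    # Select the key when its hash lies below rate * 2**64.
    return value < int(rate * 2**64)


names = ["alan", "brad", "cate", "david"]
first = [n for n in names if keep(n, 0.5)]
second = [n for n in names if keep(n, 0.5)]   # same selection as `first`
```

Because the decision depends only on the key, repeated runs select the same items, and anything selected at a lower rate is also selected at a higher rate.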
The following list describes some features of the method:
Here are some real and hypothetical applications:
For convenience, csample provides two sampling functions.
sample_line() accepts an iterable of strs:
import csample

data = [
    'alan',
    'brad',
    'cate',
    'david',
]
samples = csample.sample_line(data, 0.5)
sample_tuple() expects an iterable of tuples instead of strs. The third argument, 0, is the index of the column used as the sampling key:
data = [
    ('alan', 10, 5),
    ('brad', 53, 7),
    ('cate', 12, 6),
    ('david', 26, 5),
]
samples = csample.sample_tuple(data, 0.5, 0)
In both cases, the function returns immediately with an iterable of the sampled items.
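To show what "returns immediately" can look like, the sketch below filters tuples lazily with a generator. The name hash_sample_tuples(), the SHA-256 hash and the exact selection rule are assumptions for illustration, not csample's actual code.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def hash_sample_tuples(rows: Iterable[Tuple], rate: float, col: int) -> Iterator[Tuple]:
    """Hypothetical sketch: lazily yield rows whose key column hashes below the threshold."""
    threshold = int(rate * 2**64)
    for row in rows:
        key = str(row[col]).encode("utf-8")
        value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        if value < threshold:
            yield row


rows = [("alan", 10, 5), ("brad", 53, 7), ("alan", 99, 1), ("cate", 12, 6)]
sampled = hash_sample_tuples(rows, 0.5, 0)   # returns at once; nothing is hashed yet
kept = list(sampled)                          # rows are hashed only while iterating
```

Since the decision depends only on the key column, both "alan" rows are either kept together or dropped together.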
Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number.
You can specify a random seed to make the sampling reproducible.
For more information, see the Wikipedia article on reservoir sampling.
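For readers who want to see the idea in code, here is a minimal sketch of one classic member of the family, often called Algorithm R. The function name reservoir_sketch() and its seed parameter are assumptions for this example and do not describe how csample is implemented.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sketch(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Hypothetical sketch: pick k items uniformly from a stream of unknown length."""
    rng = random.Random(seed)          # private generator, so a fixed seed fixes the result
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # the first k items fill the reservoir
        else:
            j = rng.randint(0, i)      # uniform over 0..i, both ends included
            if j < k:
                reservoir[j] = item    # item i replaces a slot with probability k / (i + 1)
    return reservoir


a = reservoir_sketch(iter(range(1000)), 5, seed=42)
b = reservoir_sketch(iter(range(1000)), 5, seed=42)   # same seed, same sample
```

The stream is read once and only k items are held in memory, which is why the population size does not need to be known in advance.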
csample provides a single function for reservoir sampling:
import csample

data = [
    'alan',
    'brad',
    'cate',
    'david',
]
samples = csample.reservoir(data, 2)
The resulting samples contain two elements randomly chosen from the given data.
Note that the function returns a list rather than a generator, and it does not finish until it has consumed the entire input stream.
Also note that, by default, reservoir sampling does not preserve the order of the original list, so the following assertion usually holds (although a sample can come out sorted by chance):
population = [0, 1, 2, 3, 4, 5]
samples = csample.reservoir(population, 3)
assert sorted(samples) != samples
To maintain the original order, pass the keep_order=True parameter:
population = [0, 1, 2, 3, 4, 5]
samples = csample.reservoir(population, 3, keep_order=True)
assert sorted(samples) == samples
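One way order can be preserved is to remember each item's position in the stream and sort by it at the end. The sketch below shows this under stated assumptions: ordered_reservoir_sketch() is a hypothetical name, and csample may implement keep_order differently.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def ordered_reservoir_sketch(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Hypothetical sketch: reservoir sampling that returns items in stream order."""
    rng = random.Random(seed)
    reservoir: List[Tuple[int, T]] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append((i, item))       # store the position next to the item
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (i, item)
    reservoir.sort(key=lambda pair: pair[0])  # restore the original stream order
    return [item for _, item in reservoir]


population = list(range(100))
ordered = ordered_reservoir_sketch(population, 10, seed=7)
```

The extra sort costs O(k log k) and does not change which items are selected, only the order in which they are returned.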
Read the full API documentation.
csample also provides a command-line interface.
The following command prints a 50% sample of the integers from 1 to 100:
> seq 100 | csample -r 0.5
To see more options, use the --help command-line argument:
> csample --help
To obtain a fairly random, unbiased sample, it is critical to use a suitable hash function.
There are many criteria for suitability, such as the avalanche effect. For those who are interested, see the link below:
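The avalanche effect can be checked empirically: flipping a single input bit should change about half of the output bits. The sketch below measures this for SHA-256, which was chosen only for illustration; avalanche_ratio() is a hypothetical helper and not part of csample.

```python
import hashlib


def avalanche_ratio(data: bytes) -> float:
    """Average fraction of SHA-256 output bits that change when one input bit is flipped."""
    if not data:
        raise ValueError("data must not be empty")
    base = int.from_bytes(hashlib.sha256(data).digest(), "big")
    out_bits = 256                               # SHA-256 digests are 256 bits long
    n_flips = len(data) * 8
    total_changed = 0
    for bit in range(n_flips):
        flipped = bytearray(data)
        flipped[bit // 8] ^= 1 << (bit % 8)      # flip exactly one input bit
        other = int.from_bytes(hashlib.sha256(bytes(flipped)).digest(), "big")
        total_changed += bin(base ^ other).count("1")
    return total_changed / (n_flips * out_bits)


ratio = avalanche_ratio(b"csample!")   # a value near 0.5 indicates good avalanche behaviour
```

A hash whose ratio stays far from 0.5 maps similar keys to similar values, which can bias a hash-based sample when keys are sequential or share prefixes.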
Installing csample is easy:
pip install csample
or download the source and run:
python setup.py install