choosy

Installation

pip install choosy

Usage

Sampling

choosy makes it easy to do selective sampling from a pandas DataFrame. The main work is handled by the StructuredSampler class, which holds a dataframe and optional information about which columns to bin, count, or weight by.

import pandas as pd
import choosy

df = pd.DataFrame(
        {
            "category_col": ["a"]*5 + ["b"]*5,
            "value_col": [1, 2, 3, 4, 5]*2,
        }
)

sampler = choosy.StructuredSampler(
    df,
)

The simplest case is sampling n rows from a dataframe. Frankly, in that case you should just use the native df.sample() method, but choosy can do that too:

sampler.sample_data(n_sample=3)

will return 3 random rows.

The real use of choosy comes from the ability to sample rows based on values in a column. Let's say we want to sample 2 rows with value a in category_col and 1 row with value b. Instead of an integer for n_sample, we use a dictionary mapping values to the number of samples from rows with that value. In addition, we define a bin_column that will be the column in which to find those values.

sampler.sample_data(n_sample={'a': 2, 'b': 1}, bin_column="category_col")

Now you will get three rows, one with value b and two with value a. Note that choosy samples only from the values you specify, so rows whose value is not a key in the dictionary can never be sampled. If you specify a bin_column but n_sample is an integer, it will sample that many rows for each unique value in the bin_column.
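To make the per-value behavior concrete, here is a pandas-only sketch of the same idea. It is not choosy's implementation; the helper name sample_per_value and its seed argument are made up for illustration, and it assumes sampling with replacement as noted in the Notes section.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category_col": ["a"] * 5 + ["b"] * 5,
        "value_col": [1, 2, 3, 4, 5] * 2,
    }
)


def sample_per_value(frame, n_sample, bin_column, seed=None):
    """Hypothetical helper: draw rows (with replacement) for each requested value."""
    if isinstance(n_sample, int):
        # An integer means "this many rows for every unique value".
        n_sample = {value: n_sample for value in frame[bin_column].unique()}
    parts = [
        frame[frame[bin_column] == value].sample(n=n, replace=True, random_state=seed)
        for value, n in n_sample.items()
    ]
    return pd.concat(parts)


picked = sample_per_value(df, {"a": 2, "b": 1}, "category_col", seed=0)
counts = picked["category_col"].value_counts().to_dict()

# Integer form: 2 rows for each of the two unique values.
picked_each = sample_per_value(df, 2, "category_col", seed=0)
```

Values that are not keys in the dictionary are simply never visited, which is why those rows cannot appear in the result.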

If you want to sample based on the unique combination of values in multiple columns, you can pass a list of columns as bin_column and use tuples of values, in the same column order, as the dictionary keys. Alternatively, you can use a pandas Series whose index holds the selection values and whose elements are the number of samples to take for each value. The second form is useful if you have, for example, a dataframe of observations and a second, baseline dataframe that you want to sample with the same frequencies of some value. Here, you can use the groupby method to get the number of samples to take for each value.

n_sample = df_obs.groupby(['selection_column_1', 'selection_column_2'])['selection_column_2'].count()
sampler = choosy.StructuredSampler(df_baseline)
df_sample = sampler.sample_data(
    n_sample=n_sample,
    bin_column=['selection_column_1', 'selection_column_2'],
)
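The frequency-matching idea above can be checked with plain pandas. The sketch below uses toy data and does not call choosy; df_obs, df_baseline and the payload column are invented here, and the per-group draw stands in for whatever choosy does internally.

```python
import pandas as pd

# Toy data; the column names mirror the example above.
df_obs = pd.DataFrame(
    {
        "selection_column_1": ["x", "x", "x", "y"],
        "selection_column_2": [1, 1, 2, 1],
    }
)
df_baseline = pd.DataFrame(
    {
        "selection_column_1": ["x"] * 4 + ["y"] * 4,
        "selection_column_2": [1, 2] * 4,
        "payload": range(8),
    }
)

bin_columns = ["selection_column_1", "selection_column_2"]

# Series indexed by (column_1, column_2) tuples; values are observed row counts.
n_sample = df_obs.groupby(bin_columns)["selection_column_2"].count()

# Draw that many baseline rows (with replacement) from each matching group.
baseline_groups = df_baseline.groupby(bin_columns)
parts = [
    baseline_groups.get_group(key).sample(n=int(n), replace=True, random_state=0)
    for key, n in n_sample.items()
]
df_sample = pd.concat(parts)

# The sample reproduces the observed frequencies for every combination.
sample_counts = df_sample.groupby(bin_columns)["selection_column_2"].count()
```

Combinations that never occur in df_obs, such as ("y", 2) here, are absent from n_sample and therefore never sampled from the baseline.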

Counting and Repeat Sampling

In all cases above, the sample_data method returns a dataframe with the sampled rows. In many applications, what you actually want is the number of times each value of some other column appears in the sample. Add a count_column argument to any of the calls described above to get a dataframe with the sampled counts of the values in that column. This is equivalent to calling groupby().count() on the result of sample_data.

```python
df = pd.DataFrame(
        {
            "category_col": ["a"]*5 + ["b"]*5,
            "value_col": [1, 2, 3, 4, 5]*2,
        }
)

sampler = choosy.StructuredSampler(
    df,
    count_column='value_col',
)

# Sample by category_col and get the counts of each value in value_col.
sampler.sample_data(
    n_sample={'a': 2, 'b': 1},
    count_column='value_col',
    bin_column='category_col',
)
```

To get n_repeat sample distributions of a given count variable, you can use the sample_repeat method. You are required to specify a count_column for this method, and you can specify a bin_column and n_sample as described above. For example, to do 1000 repeats of the sampling above:

sampler.sample_repeat(
    n_repeat=1000,
    n_sample={'a': 2, 'b': 1},
    count_column='value_col',
    bin_column='category_col',
)
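As a rough picture of what repeated sampling produces, the pandas-only sketch below repeats the per-value draw and records the counts of value_col each time. It does not call choosy; the helper sample_counts is hypothetical, and the one-row-per-repeat layout is an assumption, since the shape that sample_repeat returns is not described here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "category_col": ["a"] * 5 + ["b"] * 5,
        "value_col": [1, 2, 3, 4, 5] * 2,
    }
)


def sample_counts(frame, n_sample, bin_column, count_column, rng):
    """Hypothetical helper: one per-value draw, reduced to counts of count_column."""
    parts = [
        frame[frame[bin_column] == value].sample(n=n, replace=True, random_state=rng)
        for value, n in n_sample.items()
    ]
    counts = pd.concat(parts)[count_column].value_counts()
    # Report every possible value, including those not drawn this time.
    return counts.reindex(sorted(frame[count_column].unique()), fill_value=0)


rng = np.random.default_rng(0)  # one generator shared by all repeats
n_repeat = 100
repeats = pd.DataFrame(
    [
        sample_counts(df, {"a": 2, "b": 1}, "category_col", "value_col", rng)
        for _ in range(n_repeat)
    ]
).reset_index(drop=True)
```

Each row of repeats is one simulated draw, so column-wise summaries such as repeats.mean() describe the sampling distribution of each count.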

Notes

  • If you plan to always sample using the same bin_column, you can also specify this value in the StructuredSampler constructor. This will save you from having to specify it in every call to sample_data.

  • A count column can now be specified in the StructuredSampler initialization for simplicity.

  • You can also specify a weight_column in either the constructor or the sample_data/sample_repeat methods to weight the sampling by the values in that column, which is passed through to the pandas sample method.

  • choosy currently requires sampling with replacement.

  • A seed can be specified in the StructuredSampler initialization to make sampling reproducible.

  • As of version 0.0.4, sampling has been optimized for speed by using numpy.random.choice on an array of indices.
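The index-based approach in the last note, and the weighting in the weight_column note, can be sketched as follows. This is an illustration under assumptions, not choosy's code: it uses NumPy's Generator.choice (the newer counterpart of numpy.random.choice) and treats value_col as a stand-in weight column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "category_col": ["a"] * 5 + ["b"] * 5,
        "value_col": [1, 2, 3, 4, 5] * 2,
    }
)

rng = np.random.default_rng(0)  # a fixed seed makes the draw reproducible

# Integer positions of the rows whose category is "a".
positions = np.flatnonzero(df["category_col"].to_numpy() == "a")

# Draw 3 positions with replacement, then materialize the rows once.
chosen = rng.choice(positions, size=3, replace=True)
picked = df.iloc[chosen]

# Weighted variant: probabilities proportional to a (stand-in) weight column.
weights = df["value_col"].to_numpy()[positions].astype(float)
chosen_weighted = rng.choice(
    positions, size=3, replace=True, p=weights / weights.sum()
)
picked_weighted = df.iloc[chosen_weighted]
```

Sampling positions rather than rows keeps the random draw on a small integer array, and the dataframe is indexed only once per sample.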

License

choosy is distributed under the terms of the BSD 3-Clause license.
