Distribution-based sampling for multiple categorical columns
Project description
MULTI COLUMN DISTRIBUTION SAMPLER
Provides a function that draws a sample from a given dataframe while preserving the distribution of the columns of interest, and a function that determines the minimum sample size needed to preserve the distribution of those columns.
Overview
To draw a sample from a pandas dataframe, we can use the sample function. However, this does not ensure that the sample is representative of the distributions of the columns of interest. Although train_test_split with stratify from sklearn.model_selection supports stratified sampling, things get complex once the sample must follow the exact distribution of multiple columns. This package abstracts away that logic and provides functions that determine the minimum sample size required for a representative sample and draw the sample itself when multiple columns are involved. Because the package ensures a perfectly representative sample, the resulting sample will probably be somewhat larger than the desired sample size. The increase depends on the distributions of the columns in the original dataframe and on the number of columns factored into the sampling.
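To make the idea concrete, here is a minimal sketch of exact proportional sampling over several columns using only pandas. It is not this package's API: the names min_exact_sample_size and proportional_sample are hypothetical, and the approach (dividing the joint group counts by their greatest common divisor) is one assumed way to obtain a sample whose joint distribution matches the dataframe exactly. It also shows why the returned sample can be larger than the size requested.

```python
from functools import reduce
from math import gcd

import pandas as pd


def min_exact_sample_size(df, columns):
    """Smallest sample size that can reproduce the joint distribution exactly.

    Hypothetical helper, not part of the package.
    """
    counts = df.groupby(columns).size()
    # Dividing every group count by the greatest common divisor yields the
    # smallest whole-number counts with the same proportions.
    divisor = reduce(gcd, counts.tolist())
    return int(counts.sum()) // divisor


def proportional_sample(df, columns, desired_size, random_state=None):
    """Draw a sample whose joint distribution over `columns` matches `df`.

    Hypothetical helper, not part of the package. The result may be larger
    than `desired_size`, because only multiples of the minimum exact sample
    size keep the proportions exact.
    """
    counts = df.groupby(columns).size()
    divisor = reduce(gcd, counts.tolist())
    unit = int(counts.sum()) // divisor  # minimum exact sample size
    # Round the desired size up to a whole number of units, capped at the
    # size of the full dataframe.
    multiples = min(max(1, -(-desired_size // unit)), divisor)
    parts = [
        # Every group size is divisible by `divisor`, so this is exact.
        group.sample(n=len(group) * multiples // divisor, random_state=random_state)
        for _, group in df.groupby(columns)
    ]
    return pd.concat(parts)


df = pd.DataFrame({
    "region": ["north"] * 60 + ["south"] * 40,
    "tier": ["a"] * 30 + ["b"] * 30 + ["a"] * 10 + ["b"] * 30,
})
sample = proportional_sample(df, ["region", "tier"], desired_size=25, random_state=0)
```

In this example the joint counts share a greatest common divisor of 10, so the proportions can only be kept exact at sample sizes that are multiples of 10, and a request for 25 rows is rounded up to 30. When the group counts share no common divisor, the only exact sample under this scheme is the whole dataframe.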
Features
- Multi column based representative sampling
- Handles both continuous and categorical features
- Uses Gini index to measure impurity of partitions
- Includes basic tree printing functionality for tree visualization
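As a rough illustration of the Gini-based impurity measure mentioned in the list above, the sketch below shows how Gini impurity is commonly computed for a set of labels and for a partition. The function names are hypothetical, and how or where the package applies this measure is not described here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def partition_gini(groups):
    """Size-weighted average Gini impurity over the groups of a partition."""
    total = sum(len(group) for group in groups)
    if total == 0:
        return 0.0
    return sum(len(group) / total * gini_impurity(group) for group in groups)
```

A value of 0 means every group contains a single label; larger values indicate more mixed groups.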
Requirements
- Python 3.x
- [Optional] Any text editor or IDE of your choice for editing the code
Installation
multi_column_distribution_sampler can be installed using the following command:
pip install multi_column_distribution_sampler
or
pip3 install multi_column_distribution_sampler
Dependencies
multi_column_distribution_sampler depends on the following packages:
- numpy
- pandas
License
MIT License
Source Distribution
Hashes for multi_column_distribution_sampler-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 789713d49239424ad78f4d40ae56a86667c14db13c06645720d1794dacb6667c
MD5 | 61c5c3357a25ef61948e852062c0ff3c
BLAKE2b-256 | c30304aeba897a28d9f71ed553884aa819245f67ab393cf7ab8e4789ceef5b0c
Built Distribution
Hashes for multi_column_distribution_sampler-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 76a8db0333fb3e8e8e8d69a32b03f68ee1a7f1653aaff90b080f677ac48f6e74
MD5 | d4c5d9d82f4d0eccce6abb3a0dc21bd0
BLAKE2b-256 | ea68b087f0fef34127dec308c9e19a45f933925382d3bf9f50b37afdf66e3589