Categorical Classification

A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.

Usage

Creating a simple dataset

# cc is a CategoricalClassification instance
# Creates a simple dataset of 9 features and 10k samples, with every feature having a cardinality of 35
X = cc.generate_data(9, 
                     10000, 
                     cardinality=35, 
                     ensure_rep=True, 
                     random_values=True, 
                     low=0, 
                     high=40)

# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

Documentation

CategoricalClassification.dataset_info

print(CategoricalClassification.dataset_info)

Stores a formatted dictionary describing the operations performed on the dataset. Calling CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.

CategoricalClassification.generate_data

CategoricalClassification.generate_data(n_features, 
                                        n_samples, 
                                        cardinality=5, 
                                        structure=None, 
                                        ensure_rep=False, 
                                        random_values=False, 
                                        low=0, 
                                        high=1000,
                                        k=10,
                                        seed=42)

Generates dataset of shape (n_samples, n_features), based on given parameters.

n_features: int. The number of features in the generated dataset.
n_samples: int. The number of samples in the generated dataset.
cardinality: int, default=5. Sets the default cardinality of a generated dataset.
structure: list or numpy.ndarray, default=None. Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions. Follows the format [tuple, tuple, ...], where:
- each tuple can be either:
  - (int or list, int): the first element is the index or list of indices of features; the second element is their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as the peak. The feature values will be integers in the range [0, second element of tuple].
  - (int or list, list): the first element is the index or list of indices of features. The second element offers two options:
    - list: a list of values to be used in the feature or features,
    - [list, list]: the first list is the set of values the feature or features possess, the second the frequencies or probabilities of the individual values.
ensure_rep: bool, default=False. Control flag. If True, all possible values will appear in the feature.
random_values: bool, default=False. Control flag. If True, the value domain of the feature will be random on the interval [low, high].
low: int, default=0. Sets the lower bound of the value domain of the feature.
high: int, default=1000. Sets the upper bound of the value domain of the feature. Only used when random_values is True.
k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
seed: int, default=42. Seed passed to numpy.random.seed.

Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
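The structure format described above can be illustrated with a small sketch that does not depend on the library. The helper expand_structure is hypothetical (it is not part of CategoricalClassification); it only shows how each (indices, spec) tuple could map to per-feature attributes, assuming that features not listed in structure fall back to the default cardinality.

```python
def expand_structure(n_features, structure, default_cardinality=5):
    """Resolve a structure spec to one attribute entry per feature index."""
    # Assumption: every feature starts with the default cardinality.
    attributes = {i: default_cardinality for i in range(n_features)}
    for indices, spec in structure:
        # The first tuple element may be a single index or a list of indices.
        if isinstance(indices, int):
            indices = [indices]
        for i in indices:
            attributes[i] = spec
    return attributes


structure = [
    (0, 3),                                # feature 0: cardinality 3
    ([1, 2], [7, 8, 9]),                   # features 1 and 2: explicit value domain
    (3, [[10, 20, 30], [0.5, 0.3, 0.2]]),  # feature 3: values with probabilities
]
attributes = expand_structure(5, structure)
for index, spec in attributes.items():
    print(index, spec)
```

Feature 4 is not mentioned in structure, so in this sketch it keeps the default cardinality of 5.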

CategoricalClassification._configure_generate_feature

CategoricalClassification._configure_generate_feature(feature_attributes, 
                                                      n_samples, 
                                                      ensure_rep=False, 
                                                      random_values=False, 
                                                      low=0, 
                                                      high=1000,
                                                      k=10)

Helper function used to configure _generate_feature() with the proper parameters, based on feature_attributes.

feature_attributes: int, list, or numpy.ndarray. Attributes of the feature. Can be just the cardinality (int), the value domain (list), or the value domain and its respective probabilities (list).
n_samples: int. Number of samples in the dataset. Determines the size of the generated feature vector.
ensure_rep: bool, default=False. Control flag. If True, all possible values will appear in the feature.
random_values: bool, default=False. Control flag. If True, the value domain of the feature will be random on the interval [low, high].
low: int, default=0. Sets the lower bound of the value domain of the feature.
high: int, default=1000. Sets the upper bound of the value domain of the feature. Only used when random_values is True.
k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.

Returns: a numpy.ndarray feature array.
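The three accepted shapes of feature_attributes can be told apart with a few type checks. The function below is a hypothetical illustration, not the library's code, and it handles plain Python lists only (numpy.ndarray input is left out to keep the sketch dependency-free).

```python
def classify_attributes(feature_attributes):
    """Return (cardinality, values, probabilities) for one feature spec."""
    if isinstance(feature_attributes, int):
        # Cardinality only: the value domain is generated later.
        return feature_attributes, None, None
    if feature_attributes and isinstance(feature_attributes[0], (list, tuple)):
        # [values, probabilities]: both lists must line up one-to-one.
        values, probabilities = feature_attributes
        if len(values) != len(probabilities):
            raise ValueError("values and probabilities must have equal length")
        return len(values), list(values), list(probabilities)
    # Plain list: an explicit value domain with no probabilities.
    return len(feature_attributes), list(feature_attributes), None


only_cardinality = classify_attributes(4)
only_values = classify_attributes([7, 8, 9])
values_and_probs = classify_attributes([[7, 8, 9], [0.2, 0.3, 0.5]])
print(only_cardinality, only_values, values_and_probs)
```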

CategoricalClassification._generate_feature

CategoricalClassification._generate_feature(size, 
                                            vec=None, 
                                            cardinality=5, 
                                            ensure_rep=False, 
                                            random_values=False, 
                                            low=0, 
                                            high=1000,
                                            k=10,
                                            p=None)

Generates a feature array of length size using numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a randomly chosen peak. The peak will be chosen from the value array.

size: int. Length of the generated feature array.
vec: list or numpy.ndarray, default=None. List of feature values, the value domain of the feature.
cardinality: int, default=5. Cardinality of the feature to use when generating its value domain. If vec is not None, vec is used instead.
ensure_rep: bool, default=False. Control flag. If True, all possible values will appear in the feature array.
random_values: bool, default=False. Control flag. If True, the value domain of the feature will be random on the interval [low, high].
low: int, default=0. Sets the lower bound of the value domain of the feature.
high: int, default=1000. Sets the upper bound of the value domain of the feature. Only used when random_values is True.
k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
p: list or numpy.ndarray, default=None. Array of frequencies or probabilities. Must be the same length as vec.

Returns: a numpy.ndarray feature array.
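To make the "roughly normal density with a randomly chosen peak" idea concrete, here is a minimal standard-library sketch. It is not the library's implementation: the function name sketch_feature, the Gaussian-shaped weighting formula, the value domain 0..cardinality-1, and the way ensure_rep is enforced are all assumptions chosen for illustration.

```python
import math
import random


def sketch_feature(size, cardinality=5, k=10, ensure_rep=False, seed=42):
    """Draw `size` categorical values whose density peaks at a random value."""
    if ensure_rep and size < cardinality:
        raise ValueError("size must be at least cardinality when ensure_rep=True")
    rng = random.Random(seed)
    values = list(range(cardinality))
    peak = rng.randrange(cardinality)
    # Assumed weighting: Gaussian-shaped around the peak; larger k narrows it.
    weights = [math.exp(-k * ((v - peak) / cardinality) ** 2) for v in values]
    feature = rng.choices(values, weights=weights, k=size)
    if ensure_rep:
        # Overwrite the first positions so every value appears at least once,
        # then shuffle so those values are not clustered at the start.
        feature[:cardinality] = values
        rng.shuffle(feature)
    return feature


feature = sketch_feature(1000, cardinality=35, ensure_rep=True)
print(len(feature), len(set(feature)))
```

With ensure_rep=True the result contains every value of the domain, at the cost of slightly distorting the sampled distribution.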

CategoricalClassification.generate_combinations

CategoricalClassification.generate_combinations(X, 
                                                feature_indices, 
                                                combination_function=None, 
                                                combination_type='linear')

Generates and adds a new column to given dataset X. The column is the result of a combination of features selected with feature_indices. Combinations can be linear, nonlinear, or custom defined functions.

X: list or numpy.ndarray: Dataset to perform the combinations on.
feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
combination_type: str, either 'linear' or 'nonlinear', default='linear': Selects which built-in combination type is used.
- If 'linear', the combination is a sum of selected features.
- If 'nonlinear', the combination is the sine value of the sum of selected features.

Returns: a numpy.ndarray dataset X with added feature combinations.
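The two built-in combination types described above (sum, and sine of the sum) can be sketched in plain Python. The function combine is a hypothetical stand-in that works on lists of rows rather than numpy arrays; it mirrors the description, not the library's source.

```python
import math


def combine(X, feature_indices, combination_type="linear"):
    """Append one combined column to a copy of X (a list of rows)."""
    combined = []
    for row in X:
        total = sum(row[i] for i in feature_indices)
        if combination_type == "linear":
            value = total            # linear: sum of the selected features
        elif combination_type == "nonlinear":
            value = math.sin(total)  # nonlinear: sine of that sum
        else:
            raise ValueError("combination_type must be 'linear' or 'nonlinear'")
        combined.append(list(row) + [value])
    return combined


X = [[1, 2, 3], [4, 5, 6]]
linear = combine(X, [0, 2])
nonlinear = combine(X, [0, 2], combination_type="nonlinear")
print(linear)
print(nonlinear)
```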

CategoricalClassification._xor

CategoricalClassification._xor(arr)

Performs bitwise XOR on given vectors and returns result.

arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.

CategoricalClassification._and

CategoricalClassification._and(arr)

Performs bitwise AND on given vectors and returns result.

arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.

CategoricalClassification._or

CategoricalClassification._or(arr)

Performs bitwise OR on given vectors and returns result.

arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
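The three bitwise helpers share one pattern: apply a bitwise operator element-wise across the given columns. The sketch below uses functools.reduce with the operator module instead of numpy; bitwise_combine is a hypothetical name, and folding over more than two columns is an assumption about how longer inputs would be handled.

```python
import operator
from functools import reduce


def bitwise_combine(columns, op):
    """Fold a bitwise operator element-wise across integer columns."""
    return [reduce(op, values) for values in zip(*columns)]


a = [1, 2, 3]
b = [3, 2, 1]
xor_col = bitwise_combine([a, b], operator.xor)
and_col = bitwise_combine([a, b], operator.and_)
or_col = bitwise_combine([a, b], operator.or_)
print(xor_col, and_col, or_col)
```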

CategoricalClassification.generate_correlated

CategoricalClassification.generate_correlated(X, 
                                              feature_indices, 
                                              r=0.8)

Generates new columns that are correlated with the selected features at a Pearson correlation coefficient of r, and adds them to the given dataset X. For zero-mean vectors, the correlation equals the cosine of the angle between them.

X: list or numpy.ndarray: Dataset to add correlated features to.
feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
r: float, default=0.8: Desired correlation coefficient.

Returns: a numpy.ndarray dataset X with added correlated features.
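The cosine relationship suggests one way to build a vector with a chosen correlation: center the feature, take a random direction orthogonal to it, and mix the two with weights r and sqrt(1 - r^2). The sketch below follows that construction under stated assumptions; correlated_vector and pearson are hypothetical helpers, the output is continuous rather than categorical, and the library may take further steps (such as discretizing) that are not shown here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_vector(x, r=0.8, seed=42):
    """Return a zero-mean vector whose Pearson correlation with x is r."""
    rng = random.Random(seed)
    n = len(x)
    mean_x = sum(x) / n
    xc = [v - mean_x for v in x]                      # centered feature
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean_n = sum(noise) / n
    nc = [v - mean_n for v in noise]                  # centered noise
    # Gram-Schmidt step: remove the part of the noise that lies along xc.
    proj = sum(a * b for a, b in zip(nc, xc)) / sum(b * b for b in xc)
    e = [a - proj * b for a, b in zip(nc, xc)]
    norm_x = math.sqrt(sum(b * b for b in xc))
    norm_e = math.sqrt(sum(a * a for a in e))
    # cos(angle) = r between the result and xc, by construction.
    w = math.sqrt(1.0 - r * r)
    return [r * b / norm_x + w * a / norm_e for a, b in zip(e, xc)]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
y = correlated_vector(x, r=0.8)
print(round(pearson(x, y), 6))
```

The construction needs a non-constant x with at least three entries, otherwise there is no orthogonal direction left to mix in.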

CategoricalClassification.generate_duplicates

CategoricalClassification.generate_duplicates(X, 
                                              feature_indices)

Duplicates the selected features (columns) and adds the duplicated columns to the given dataset X.

X: list or numpy.ndarray: Dataset to duplicate features in.
feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.

Returns: a numpy.ndarray dataset X with added duplicated features.

CategoricalClassification.generate_labels

CategoricalClassification.generate_labels(X, 
                                          n=2, 
                                          p=0.5, 
                                          k=2, 
                                          decision_function=None, 
                                          class_relation='linear', 
                                          balance=False)

Generates a vector of labels. Labels are (currently) generated from a linear, nonlinear, or custom-defined decision function, or by clustering the samples. For the function-based options, classes are assigned using the decision boundary that the function produces.

X: list or numpy.ndarray: Dataset to generate labels for.
n: int, default=2: Number of classes.
p: float or list, default=0.5: Class distribution.
k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
decision_function: function, default=None: Custom-defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
class_relation: str, either 'linear', 'nonlinear', or 'cluster', default='linear': Sets the relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.
balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray y of class labels.
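As a rough illustration of labeling by a decision boundary, the sketch below scores each sample with a linear combination (a plain sum of its features) and splits the samples into two classes so that a fraction p falls into class 0. Everything here is an assumption made for illustration: the name linear_labels, the choice of score, and the interpretation of p are not taken from the library, and the parameter k is not modeled.

```python
def linear_labels(X, p=0.5):
    """Assign class 0 to the fraction p of samples with the lowest scores."""
    scores = [sum(row) for row in X]                 # linear decision function
    order = sorted(range(len(X)), key=scores.__getitem__)
    n_zero = int(p * len(X))                         # size of class 0
    labels = [1] * len(X)
    for i in order[:n_zero]:
        labels[i] = 0
    return labels


X = [[0, 1], [5, 5], [2, 2], [9, 9]]
y = linear_labels(X, p=0.5)
print(y)
```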

CategoricalClassification._cluster_data

CategoricalClassification._cluster_data(X, 
                                        n, 
                                        p=1.0, 
                                        balance=False)

Clusters given data using KMeans clustering.

X: list or numpy.ndarray: Dataset to cluster.
n: int: Number of clusters.
p: float or list or numpy.ndarray: To be used when balance=True, sets class distribution - number of samples per cluster.
balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray cluster_labels of clustering labels.

CategoricalClassification.generate_noise

CategoricalClassification.generate_noise(X, 
                                         y, 
                                         p=0.2, 
                                         type="categorical", 
                                         missing_val=float('-inf'))

Generates categorical noise or simulates missing data on a given dataset.

X: list or numpy.ndarray: Dataset to generate noise for.
y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
p: float, p <=1.0, default=0.2: Amount of noise to generate.
type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.

Returns: numpy.ndarray X with added noise.
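The "missing" noise type can be pictured as overwriting a share of the dataset's entries with missing_val. The sketch below is hypothetical: add_missing is not a library function, and it assumes p is the fraction of individual cells to replace, which the description above does not specify.

```python
import random


def add_missing(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Return a copy of X with a fraction p of its cells set to missing_val."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(X)) for j in range(len(X[0]))]
    # Assumption: p is applied to the total number of cells.
    chosen = rng.sample(cells, int(p * len(cells)))
    noisy = [list(row) for row in X]
    for i, j in chosen:
        noisy[i][j] = missing_val
    return noisy


X = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
noisy = add_missing(X, p=0.2)
print(noisy)
```

Using float('-inf') as the marker keeps the array numeric, which matches the note above about non-numerical values causing issues for some algorithms.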

CategoricalClassification.downsample_dataset

CategoricalClassification.downsample_dataset(X, 
                                             y, 
                                             n=None, 
                                             seed=42, 
                                             reshuffle=False)

Downsamples the given dataset to n samples per class or, if n is not given, to the number of samples in the minority class, resulting in a balanced dataset.

X: list or numpy.ndarray: Dataset to downsample.
y: list or numpy.ndarray: Labels corresponding to X.
n: int, optional: Optional number of samples per class to downsample to.
seed: int, default=42: Seed for random state of resample function.
reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.

Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
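Balanced downsampling amounts to drawing the same number of samples from every class. The following standard-library sketch shows the idea; the function downsample is a hypothetical stand-in that uses random.Random instead of a resample utility, and capping n at the minority class size is an assumption.

```python
import random
from collections import defaultdict


def downsample(X, y, n=None, seed=42, reshuffle=False):
    """Keep the same number of samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    minority = min(len(indices) for indices in by_class.values())
    # Assumption: n cannot exceed the size of the smallest class.
    per_class = minority if n is None else min(n, minority)
    keep = []
    for label in sorted(by_class):
        keep.extend(rng.sample(by_class[label], per_class))
    if reshuffle:
        rng.shuffle(keep)
    else:
        keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]
X_down, y_down = downsample(X, y)
print(X_down, y_down)
```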

CategoricalClassification.print_dataset

CategoricalClassification.print_dataset(X, y)

Prints given dataset in a readable format.

X: list or numpy.ndarray: Dataset to print.
y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.

CategoricalClassification.summarize

CategoricalClassification.summarize()

Prints stored dataset information dictionary in a digestible manner.

Release history for catclass

0.1.3: Jul 30, 2024
0.1.2: Jul 25, 2024
0.1.1: Jul 24, 2024
0.1.0: Jul 24, 2024 (this version)
