Categorical Classification

A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.

Usage


Creating a simple dataset

# Assumes cc is an existing CategoricalClassification instance.
# Creates a simple dataset of 9 features and 10k samples, with every feature having cardinality 35
X = cc.generate_data(9, 
                     10000, 
                     cardinality=35, 
                     ensure_rep=True, 
                     random_values=True, 
                     low=0, 
                     high=40)

# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

Documentation


CategoricalClassification.dataset_info

print(CategoricalClassification.dataset_info)

Stores a formatted dictionary of the operations performed so far. Calling CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.


CategoricalClassification.generate_data

CategoricalClassification.generate_data(n_features, 
                                        n_samples, 
                                        cardinality=5, 
                                        structure=None, 
                                        ensure_rep=False, 
                                        random_values=False, 
                                        low=0, 
                                        high=1000,
                                        k=10,
                                        seed=42)

Generates a dataset of shape (n_samples, n_features), based on the given parameters.

  • n_features: int The number of features in a generated dataset.
  • n_samples: int The number of samples in a generated dataset.
  • cardinality: int, default=5. Sets the default cardinality of a generated dataset.
  • structure: list or numpy.ndarray, default=None. Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions. Follows the format [tuple, tuple, ...], where:
    • tuple can either be:
      • (int or list, int): the first element represents the index or list of indexes of features. The second element their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as a peak. The feature values will be integers, in range [0, second element of tuple].
      • (int or list, list): the first element represents the index or list of indexes of features. The second element offers two options:
        • list: a list of values to be used in the feature or features,
        • [list, list]: the first list is the set of values the feature or features possess, the second the frequencies or probabilities of the individual values.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of a feature.
  • high: int, default=1000. Sets the upper bound of the value domain of a feature. Only used when random_values is True.
  • k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
  • seed: int, default=42. Controls numpy.random.seed

Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
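
The structure format described above can be hard to picture from prose alone. The following is a minimal sketch of one possible structure list covering the three tuple forms. The helper describe_entry is hypothetical and only classifies each entry for illustration; it is not part of the library, and the commented-out generate_data call is an assumption about how such a list would be passed.

```python
# Illustrative `structure` list, one entry per tuple form described above.
structure = [
    (0, 3),                                  # feature 0: cardinality 3
    ([1, 2], 5),                             # features 1 and 2: cardinality 5 each
    (3, [10, 20, 30]),                       # feature 3: explicit value domain
    ([4, 5], [[0, 1, 2], [0.2, 0.5, 0.3]]),  # features 4 and 5: values with probabilities
]


def describe_entry(entry):
    """Hypothetical helper (not part of the library): classify one tuple."""
    indices, attributes = entry
    # Normalise a single index to a list of indices.
    indices = [indices] if isinstance(indices, int) else list(indices)
    if isinstance(attributes, int):
        kind = "cardinality"
    elif len(attributes) == 2 and all(isinstance(a, (list, tuple)) for a in attributes):
        kind = "values+probabilities"
    else:
        kind = "values"
    return indices, kind


summary = [describe_entry(entry) for entry in structure]

# Assumed usage, with cc being a CategoricalClassification instance:
# X = cc.generate_data(6, 1000, structure=structure)
```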


CategoricalClassification._configure_generate_feature

CategoricalClassification._configure_generate_feature(feature_attributes,
                                                      n_samples,
                                                      ensure_rep=False,
                                                      random_values=False,
                                                      low=0,
                                                      high=1000,
                                                      k=10)

Helper function used to configure _generate_feature() with the proper parameters based on feature_attributes.

  • feature_attributes: int or list or numpy.ndarray Attributes of feature. Can be just cardinality (int), value domain (list), or value domain and their respective probabilities (list).
  • n_samples: int Number of samples in dataset. Determines generated feature vector size.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of a feature.
  • high: int, default=1000. Sets the upper bound of the value domain of a feature. Only used when random_values is True.
  • k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.

Returns: a numpy.ndarray feature array.


CategoricalClassification._generate_feature

CategoricalClassification._generate_feature(size, 
                                            vec=None, 
                                            cardinality=5, 
                                            ensure_rep=False, 
                                            random_values=False, 
                                            low=0, 
                                            high=1000,
                                            k=10,
                                            p=None)

Generates a feature array of length size using numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a peak chosen at random from the value array.

  • size: int Length of generated feature array.
  • vec: list or numpy.ndarray, default=None List of feature values, value domain of feature.
  • cardinality: int, default=5 Cardinality of feature to use when generating its value domain. If vec is not None, vec is used instead.
  • ensure_rep: bool, default=False Control flag. If True, all possible values will appear in the feature array.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of a feature.
  • high: int, default=1000. Sets the upper bound of the value domain of a feature. Only used when random_values is True.
  • k: int or float, default=10. Constant that sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
  • p: list or numpy.ndarray, default=None. Array of frequencies or probabilities. Must have the same length as the value domain (vec).

Returns: a numpy.ndarray feature array.
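
To make the "roughly normal density with a random peak" behaviour concrete, here is a minimal sketch of how such a feature could be sampled. It is not the library's implementation: the function name sketch_generate_feature, the rule sigma = cardinality / k, and the way ensure_rep is enforced are all assumptions chosen only so that a higher k gives a narrower peak, as described above.

```python
import numpy as np


def sketch_generate_feature(size, cardinality=5, k=10, ensure_rep=False, seed=42):
    """Illustrative only: sample `size` values from range(cardinality) with a
    bell-shaped probability profile centred on a randomly chosen peak."""
    rng = np.random.default_rng(seed)
    values = np.arange(cardinality)
    peak = rng.choice(values)
    # Assumed width rule: larger k -> smaller sigma -> narrower peak.
    sigma = max(cardinality / k, 1e-6)
    weights = np.exp(-0.5 * ((values - peak) / sigma) ** 2)
    p = weights / weights.sum()
    feature = rng.choice(values, size=size, p=p)
    if ensure_rep and size >= cardinality:
        # Overwrite distinct random positions so every value occurs at least once.
        slots = rng.choice(size, size=cardinality, replace=False)
        feature[slots] = values
    return feature


feature = sketch_generate_feature(1000, cardinality=8, ensure_rep=True)
```

With ensure_rep=True every value in the domain is guaranteed to appear, at the cost of slightly distorting the sampled distribution.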


CategoricalClassification.generate_combinations

CategoricalClassification.generate_combinations(X, 
                                                feature_indices, 
                                                combination_function=None, 
                                                combination_type='linear')

Generates and adds a new column to given dataset X. The column is the result of a combination of features selected with feature_indices. Combinations can be linear, nonlinear, or custom defined functions.

  • X: list or numpy.ndarray: Dataset to perform the combinations on.
  • feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
  • combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
  • combination_type: str, either 'linear' or 'nonlinear', default='linear': Selects which built-in combination type is used.
    • If 'linear', the combination is a sum of selected features.
    • If 'nonlinear', the combination is the sine value of the sum of selected features.

Returns: a numpy.ndarray dataset X with added feature combinations.
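
The two built-in combination types described above can be sketched directly in NumPy. This reproduces the documented behaviour (sum, and sine of the sum, appended with numpy.column_stack) on a toy array; it does not call the library, and the variable names are illustrative.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
feature_indices = [0, 2]

selected = X[:, feature_indices]

# 'linear': the new column is the sum of the selected features.
linear = selected.sum(axis=1)

# 'nonlinear': the new column is the sine of that sum.
nonlinear = np.sin(selected.sum(axis=1))

# The new column is appended to the dataset.
X_linear = np.column_stack((X, linear))
X_nonlinear = np.column_stack((X, nonlinear))
```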


CategoricalClassification._xor

CategoricalClassification._xor(arr)

Performs bitwise XOR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.


CategoricalClassification._and

CategoricalClassification._and(arr)

Performs bitwise AND on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.


CategoricalClassification._or

CategoricalClassification._or(arr)

Performs bitwise OR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
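
The three bitwise helpers above share the same shape, so one sketch covers them. It folds numpy.bitwise_xor, numpy.bitwise_and, and numpy.bitwise_or over a list of columns; applying the two-argument operation left to right when more than two columns are given is an assumption, not confirmed library behaviour.

```python
from functools import reduce

import numpy as np

# Three toy binary feature columns.
columns = [
    np.array([1, 0, 1, 1, 0]),
    np.array([1, 1, 0, 1, 0]),
    np.array([0, 1, 0, 1, 0]),
]

# Fold each two-argument bitwise operation over all columns, left to right.
xor_result = reduce(np.bitwise_xor, columns)
and_result = reduce(np.bitwise_and, columns)
or_result = reduce(np.bitwise_or, columns)
```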


CategoricalClassification.generate_correlated

CategoricalClassification.generate_correlated(X, 
                                              feature_indices, 
                                              r=0.8)

Generates new columns that are correlated with the selected features at a Pearson correlation coefficient of r, and adds them to the given dataset X. The construction relies on the fact that, for vectors with mean 0, the correlation equals the cosine of the angle between them.

  • X: list or numpy.ndarray: Dataset to add correlated features to.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
  • r: float, default=0.8: Desired correlation coefficient.

Returns: a numpy.ndarray dataset X with added correlated features.
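
The cosine relationship mentioned above suggests a standard construction, sketched here under stated assumptions: centre the feature, draw random noise, remove the part of the noise parallel to the feature, and mix the two unit vectors so that the angle between the result and the feature has cosine r. The function name sketch_correlated is hypothetical, and the library may differ, for example in how it maps the result back to categorical values.

```python
import numpy as np


def sketch_correlated(x, r=0.8, seed=42):
    """Illustrative only: return a vector whose Pearson correlation with x is r."""
    rng = np.random.default_rng(seed)
    x_c = x - x.mean()                      # centred copy of the feature
    noise = rng.normal(size=x.shape[0])
    noise -= noise.mean()
    # Remove the component of the noise that is parallel to x_c.
    noise -= (noise @ x_c) / (x_c @ x_c) * x_c
    x_unit = x_c / np.linalg.norm(x_c)
    e_unit = noise / np.linalg.norm(noise)
    # cos(angle between result and x_c) = r by construction.
    return r * x_unit + np.sqrt(1 - r ** 2) * e_unit


x = np.random.default_rng(0).integers(0, 10, size=500).astype(float)
y = sketch_correlated(x, r=0.8)
corr = np.corrcoef(x, y)[0, 1]
```

Note that the vector produced this way is continuous rather than categorical.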


CategoricalClassification.generate_duplicates

CategoricalClassification.generate_duplicates(X, 
                                              feature_indices)

Duplicates selected feature (column) indices, and adds the duplicated columns to the given dataset X.

  • X: list or numpy.ndarray: Dataset to add duplicated features to.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.

Returns: a numpy.ndarray dataset X with added duplicated features.
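
As a rough NumPy equivalent of the behaviour described above (a sketch, not the library's code), duplicating columns amounts to selecting them and stacking them onto the dataset. Appending the copies at the end is an assumption.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

# Accept either a single index or a list of indices.
feature_indices = np.atleast_1d([0, 2])

# Append copies of the selected columns to the right of X.
X_dup = np.column_stack((X, X[:, feature_indices]))
```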


CategoricalClassification.generate_labels

CategoricalClassification.generate_labels(X,
                                          n=2,
                                          p=0.5,
                                          k=2,
                                          decision_function=None,
                                          class_relation='linear',
                                          balance=False)

Generates a vector of labels. Labels are (currently) generated from a linear, nonlinear, or custom-defined decision function, or by clustering. For the function-based options, classes are assigned using the decision boundary that the chosen function produces.

  • X: list or numpy.ndarray: Dataset to generate labels for.
  • n: int, default=2: Number of classes.
  • p: float or list, default=0.5: Class distribution.
  • k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
  • decision_function: function, default=None: Custom-defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
  • class_relation: str, either 'linear', 'nonlinear', or 'cluster', default='linear': Sets the relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray y of class labels.
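
To illustrate how a decision boundary and a class distribution p can interact in the two-class case, here is a minimal sketch. It assumes the linear decision function is a plain sum of features and that the boundary is placed at the p-quantile of the resulting scores; both choices, and the name sketch_linear_labels, are assumptions for illustration and may not match what the library does with k or with n > 2.

```python
import numpy as np


def sketch_linear_labels(X, p=0.5):
    """Illustrative only: binary labels from a linear score and a quantile threshold."""
    scores = X.sum(axis=1)               # assumed linear decision function
    threshold = np.quantile(scores, p)   # roughly a fraction p of scores fall at or below it
    return (scores > threshold).astype(int)


rng = np.random.default_rng(0)
X = rng.integers(0, 35, size=(1000, 9))
y = sketch_linear_labels(X, p=0.5)
```

Because the scores are sums of integers, ties at the threshold mean the resulting split is only approximately p.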


CategoricalClassification._cluster_data

CategoricalClassification._cluster_data(X, 
                                        n, 
                                        p=1.0, 
                                        balance=False)

Clusters given data using KMeans clustering.

  • X: list or numpy.ndarray: Dataset to cluster.
  • n: int: Number of clusters.
  • p: float or list or numpy.ndarray, default=1.0: Used when balance=True. Sets the class distribution, that is, the number of samples per cluster.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray cluster_labels of clustering labels.


CategoricalClassification.generate_noise

CategoricalClassification.generate_noise(X, 
                                         y, 
                                         p=0.2, 
                                         type="categorical", 
                                         missing_val=float('-inf'))

Generates categorical noise or simulates missing data on a given dataset.

  • X: list or numpy.ndarray: Dataset to generate noise for.
  • y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
  • p: float, p <=1.0, default=0.2: Amount of noise to generate.
  • type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
  • missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.

Returns: numpy.ndarray X with added noise.
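
The "missing" noise type can be sketched as replacing a fraction p of entries with missing_val. This is an illustration under assumptions: the name sketch_missing_noise is hypothetical, and whether the library picks individual cells (as here) or whole samples is not stated above. Categorical noise, which depends on y, is not covered by this sketch.

```python
import numpy as np


def sketch_missing_noise(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Illustrative only: overwrite a fraction p of the cells of X with missing_val."""
    rng = np.random.default_rng(seed)
    X_noisy = np.array(X, dtype=float)        # float copy so that -inf can be stored
    n_noisy = int(round(p * X_noisy.size))
    # Choose distinct flat positions to overwrite.
    flat_idx = rng.choice(X_noisy.size, size=n_noisy, replace=False)
    X_noisy.flat[flat_idx] = missing_val
    return X_noisy


X = np.arange(50).reshape(10, 5)
X_noisy = sketch_missing_noise(X, p=0.2)
```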


CategoricalClassification.downsample_dataset

CategoricalClassification.downsample_dataset(X, 
                                             y, 
                                             n=None, 
                                             seed=42, 
                                             reshuffle=False)

Downsamples the given dataset to n samples per class or, if n is not given, to the number of samples in the minority class, resulting in a balanced dataset.

  • X: list or numpy.ndarray: Dataset to downsample.
  • y: list or numpy.ndarray: Labels corresponding to X.
  • n: int, default=None: Optional number of samples per class to downsample to.
  • seed: int, default=42: Seed for random state of resample function.
  • reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.

Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
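
The balancing step described above can be sketched in plain NumPy as follows. This is not the library's implementation (the parameter description mentions a resample function, which is not used here); sketch_downsample is a hypothetical name, and capping n at the minority class size is an assumption.

```python
import numpy as np


def sketch_downsample(X, y, n=None, seed=42, reshuffle=False):
    """Illustrative only: keep the same number of randomly chosen samples per class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Default to the minority class size; never ask for more than it contains.
    per_class = counts.min() if n is None else min(n, counts.min())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
        for c in classes
    ])
    if reshuffle:
        rng.shuffle(keep)                 # otherwise samples stay grouped by class
    return X[keep], y[keep]


y = np.array([0] * 70 + [1] * 30)
X = np.arange(200).reshape(100, 2)        # row i is [2i, 2i + 1]
X_bal, y_bal = sketch_downsample(X, y)
```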


CategoricalClassification.print_dataset

CategoricalClassification.print_dataset(X, y)

Prints given dataset in a readable format.

  • X: list or numpy.ndarray: Dataset to print.
  • y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.

CategoricalClassification.summarize

CategoricalClassification.summarize()

Prints stored dataset information dictionary in a digestible manner.
