A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.

Project description

Categorical Classification

Usage


Creating a simple dataset

# Creates a simple dataset of 9 features and 10k samples, with every feature having a cardinality of 35
# (cc is assumed to be a CategoricalClassification instance)
X = cc.generate_data(9, 
                     10000, 
                     cardinality=35, 
                     ensure_rep=True, 
                     random_values=True, 
                     low=0, 
                     high=40)

# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

Documentation


CategoricalClassification.dataset_info

print(CategoricalClassification.dataset_info)

Stores a formatted dictionary describing the operations performed so far. Calling CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.


CategoricalClassification.generate_data

CategoricalClassification.generate_data(n_features, 
                                        n_samples, 
                                        cardinality=5, 
                                        structure=None, 
                                        ensure_rep=False, 
                                        random_values=False, 
                                        low=0, 
                                        high=1000,
                                        k=10,
                                        seed=42)

Generates a dataset of shape (n_samples, n_features), based on the given parameters.

  • n_features: int. The number of features in the generated dataset.
  • n_samples: int. The number of samples in the generated dataset.
  • cardinality: int, default=5. Sets the default cardinality of features in the generated dataset.
  • structure: list, numpy.ndarray, default=None. Sets the structure of the generated dataset. Offers more control over feature value domains and value distributions. Follows the format [tuple, tuple, ...], where:
    • tuple can either be:
      • (int or list, int): the first element is the index or list of indices of features, the second element is their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as the peak. The feature values will be integers, in range [0, second element of tuple].
      • (int or list, list): the first element is the index or list of indices of features. The second element offers two options:
        • list: a list of values to be used in the feature or features,
        • [list, list]: where the first list is the set of values the feature or features possess, and the second the frequencies or probabilities of the individual values.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, the value domain of a feature will be random on the interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of a feature.
  • high: int, default=1000. Sets the upper bound of the value domain of a feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
  • seed: int, default=42. Controls numpy.random.seed.

Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
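
The structure format described above may be easier to read as a literal. The following is a minimal sketch in plain Python, not the library's implementation: the helper expand_structure and the example values are hypothetical, and it only shows how each tuple maps feature indices to their attributes.

```python
# A structure literal in the documented [tuple, tuple, ...] format.
structure = [
    (0, 4),                              # feature 0: cardinality 4
    ([1, 2], [10, 20, 30]),              # features 1 and 2: explicit value domain
    (3, [[0, 1, 2], [0.7, 0.2, 0.1]]),   # feature 3: values with their probabilities
]


def expand_structure(structure):
    """Hypothetical helper: map each feature index to its attributes."""
    per_feature = {}
    for indices, attributes in structure:
        # The first tuple element may be a single index or a list of indices.
        if isinstance(indices, int):
            indices = [indices]
        for index in indices:
            per_feature[index] = attributes
    return per_feature


per_feature = expand_structure(structure)
for index in sorted(per_feature):
    print(index, per_feature[index])
```

Features that are not listed in structure would presumably fall back to the default cardinality; that fallback is not modeled here.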


CategoricalClassification._configure_generate_feature

CategoricalClassification._configure_generate_feature(feature_attributes, 
                                                      n_samples, 
                                                      ensure_rep=False, 
                                                      random_values=False, 
                                                      low=0, 
                                                      high=1000,
                                                      k=10)

Helper function used to configure _generate_feature() with the proper parameters, based on feature_attributes.

  • feature_attributes: int or list or numpy.ndarray. Attributes of the feature. Can be just the cardinality (int), the value domain (list), or the value domain and the respective probabilities (list).
  • n_samples: int. Number of samples in the dataset. Determines the size of the generated feature vector.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, the value domain of the feature will be random on the interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of the feature.
  • high: int, default=1000. Sets the upper bound of the value domain of the feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.

Returns: a numpy.ndarray feature array.

CategoricalClassification._generate_feature

CategoricalClassification._generate_feature(size, 
                                            vec=None, 
                                            cardinality=5, 
                                            ensure_rep=False, 
                                            random_values=False, 
                                            low=0, 
                                            high=1000,
                                            k=10,
                                            p=None)

Generates a feature array of length size using numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a randomly chosen peak. The peak is chosen from the value array.

  • size: int. Length of the generated feature array.
  • vec: list or numpy.ndarray, default=None. List of feature values, the value domain of the feature.
  • cardinality: int, default=5. Cardinality of the feature, used when generating its value domain. If vec is not None, vec is used instead.
  • ensure_rep: bool, default=False. Control flag. If True, all possible values will appear in the feature array.
  • random_values: bool, default=False: Control flag. If True, the value domain of the feature will be random on the interval [low, high].
  • low: int, default=0. Sets the lower bound of the value domain of the feature.
  • high: int, default=1000. Sets the upper bound of the value domain of the feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets the width of the feature's (normal) distribution peak. The higher the value, the narrower the peak.
  • p: list or numpy.ndarray, default=None. Array of frequencies or probabilities. Must have the same length as vec.

Returns: a numpy.ndarray feature array.
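
To make the roles of k, p and ensure_rep more concrete, here is a rough sketch in plain Python. It is not the library's code: sketch_feature is a hypothetical name, the bell-shaped weighting formula is an assumption chosen so that a larger k gives a narrower peak, and the standard random module stands in for numpy.random.choice.

```python
import math
import random


def sketch_feature(size, cardinality=5, ensure_rep=False, k=10, p=None, seed=42):
    """Illustrative only: sample `size` values from range(cardinality)."""
    rng = random.Random(seed)
    values = list(range(cardinality))
    if p is None:
        # Assumed weighting: a bell curve centred on a randomly chosen peak value.
        peak = rng.choice(values)
        p = [math.exp(-k * ((v - peak) / cardinality) ** 2) for v in values]
    feature = rng.choices(values, weights=p, k=size)
    if ensure_rep:
        # Overwrite distinct random positions so that every value appears
        # at least once (requires size >= cardinality).
        positions = rng.sample(range(size), len(values))
        for position, value in zip(positions, values):
            feature[position] = value
    return feature


feature = sketch_feature(200, cardinality=35, ensure_rep=True)
print(len(feature), len(set(feature)))
```

With ensure_rep=True the sketch guarantees full coverage of the value domain at the cost of slightly distorting the sampled distribution.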


CategoricalClassification.generate_combinations

CategoricalClassification.generate_combinations(X, 
                                                feature_indices, 
                                                combination_function=None, 
                                                combination_type='linear')

Generates and adds a new column to given dataset X. The column is the result of a combination of features selected with feature_indices. Combinations can be linear, nonlinear, or custom defined functions.

  • X: list or numpy.ndarray: Dataset to perform the combinations on.
  • feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
  • combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
  • combination_type: str either linear or nonlinear, default='linear': Selects which built-in combination type is used.
    • If 'linear', the combination is a sum of selected features.
    • If 'nonlinear', the combination is the sine value of the sum of selected features.

Returns: a numpy.ndarray dataset X with added feature combinations.
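
The three combination modes can be sketched as follows. This is an illustration in plain Python on a list-of-rows dataset, not the library's implementation (which works on numpy arrays and uses numpy.column_stack); the function name combine is hypothetical.

```python
import math


def combine(X, feature_indices, combination_function=None, combination_type="linear"):
    """Illustrative only: append one combined column to a list-of-rows dataset."""
    # Pull the selected features out as a list of columns.
    columns = [[row[i] for row in X] for i in feature_indices]
    if combination_function is not None:
        # A custom function receives the list of columns and returns a new column.
        new_column = combination_function(columns)
    else:
        sums = [sum(values) for values in zip(*columns)]
        if combination_type == "linear":
            new_column = sums
        else:
            new_column = [math.sin(s) for s in sums]
    return [row + [value] for row, value in zip(X, new_column)]


X = [[1, 2, 3], [4, 5, 6]]
print(combine(X, [0, 2]))  # [[1, 2, 3, 4], [4, 5, 6, 10]]
product = lambda cols: [a * b for a, b in zip(*cols)]
print(combine(X, [0, 1], combination_function=product))  # [[1, 2, 3, 2], [4, 5, 6, 20]]
```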


CategoricalClassification._xor

CategoricalClassification._xor(arr)

Performs bitwise XOR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.


CategoricalClassification._and

CategoricalClassification._and(arr)

Performs bitwise AND on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.


CategoricalClassification._or

CategoricalClassification._or(arr)

Performs bitwise OR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
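
The three bitwise helpers share one pattern: apply a binary operator element-wise across the given columns. The sketch below shows that pattern in plain Python. Folding the operator over more than two columns is an assumption, and bitwise_columns is a hypothetical name rather than part of the library.

```python
from functools import reduce
from operator import and_, or_, xor


def bitwise_columns(op, arr):
    """Illustrative only: fold `op` element-wise across the columns in arr."""
    return [reduce(op, values) for values in zip(*arr)]


arr = [[1, 2, 3], [3, 3, 3]]  # two integer feature columns
print(bitwise_columns(xor, arr))   # [2, 1, 0]
print(bitwise_columns(and_, arr))  # [1, 2, 3]
print(bitwise_columns(or_, arr))   # [3, 3, 3]
```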


CategoricalClassification.generate_correlated

CategoricalClassification.generate_correlated(X, 
                                              feature_indices, 
                                              r=0.8)

Generates and adds new columns to the given dataset X, each correlated to a selected feature with a Pearson correlation coefficient of r. For vectors with mean 0, the correlation equals the cosine of the angle between them.

  • X: list or numpy.ndarray: Dataset to add correlated features to.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
  • r: float, default=0.8: Desired correlation coefficient.

Returns: a numpy.ndarray dataset X with added correlated features.
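
The cosine relationship suggests one way to construct a vector with a prescribed correlation: mix the centered feature with noise that is orthogonal to it. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not the library's code. The names are hypothetical, and the result is real-valued rather than categorical.

```python
import math
import random


def centered(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    """Pearson correlation, computed as the cosine between the centered vectors."""
    uc, vc = centered(u), centered(v)
    return dot(uc, vc) / math.sqrt(dot(uc, uc) * dot(vc, vc))


def correlated_to(x, r=0.8, seed=42):
    """Illustrative only: build a vector whose correlation with x is r."""
    rng = random.Random(seed)
    xc = centered(x)
    noise = centered([rng.gauss(0, 1) for _ in x])
    # Remove the part of the noise that points along xc (a Gram-Schmidt step).
    scale = dot(noise, xc) / dot(xc, xc)
    perp = [n - scale * a for n, a in zip(noise, xc)]
    x_norm = math.sqrt(dot(xc, xc))
    perp_norm = math.sqrt(dot(perp, perp))
    # Unit vector at an angle whose cosine is r relative to xc.
    return [r * a / x_norm + math.sqrt(1 - r * r) * b / perp_norm
            for a, b in zip(xc, perp)]


x = [0, 1, 2, 3, 4, 5, 3, 1]
y = correlated_to(x, r=0.8)
print(round(pearson(x, y), 6))  # 0.8
```

The construction needs a non-constant x with at least three samples, otherwise there is no direction left for the orthogonal noise.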


CategoricalClassification.generate_duplicates

CategoricalClassification.generate_duplicates(X, 
                                              feature_indices)

Duplicates the selected features (columns) and adds the duplicated columns to the given dataset X.

  • X: list or numpy.ndarray: Dataset to add duplicated features to.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.

Returns: a numpy.ndarray dataset X with added duplicated features.


CategoricalClassification.generate_labels

CategoricalClassification.generate_labels(X, 
                                          n=2, 
                                          p=0.5, 
                                          k=2, 
                                          decision_function=None, 
                                          class_relation='linear', 
                                          balance=False)

Generates a vector of labels. Classes are (currently) assigned using a decision boundary computed by a linear, nonlinear, or custom-defined function of the features, or by clustering the samples.

  • X: list or numpy.ndarray: Dataset to generate labels for.
  • n: int, default=2: Number of classes.
  • p: float or list, default=0.5: Class distribution.
  • k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
  • decision_function: function, default=None: Custom-defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
  • class_relation: str, either 'linear', 'nonlinear', or 'cluster', default='linear': Sets the relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray y of class labels.
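
As an illustration of the decision_function contract (accept X, return one decision value per sample), the sketch below labels two classes by thresholding those values. How the library turns decision values and p into classes is not specified here, so the quantile cutoff is an assumption, and sketch_labels and weighted_sum are hypothetical names.

```python
def sketch_labels(X, decision_function, p=0.5):
    """Illustrative only: two classes split at the p-quantile of the decision values."""
    values = decision_function(X)
    # Pick the cutoff so that roughly a fraction p of the samples falls below it.
    cutoff_index = min(int(p * len(values)), len(values) - 1)
    cutoff = sorted(values)[cutoff_index]
    return [0 if value < cutoff else 1 for value in values]


def weighted_sum(X):
    """A custom decision function: one value per sample (row)."""
    return [2 * row[0] + row[1] for row in X]


X = [[0, 1], [1, 0], [2, 2], [3, 1]]
labels = sketch_labels(X, weighted_sum)
print(labels)  # [0, 0, 1, 1]
```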


CategoricalClassification._cluster_data

CategoricalClassification._cluster_data(X, 
                                        n, 
                                        p=1.0, 
                                        balance=False)

Clusters given data using KMeans clustering.

  • X: list or numpy.ndarray: Dataset to cluster.
  • n: int: Number of clusters.
  • p: float or list or numpy.ndarray, default=1.0: To be used when balance=True. Sets the class distribution, i.e. the number of samples per cluster.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray cluster_labels of clustering labels.


CategoricalClassification.generate_noise

CategoricalClassification.generate_noise(X, 
                                         y, 
                                         p=0.2, 
                                         type="categorical", 
                                         missing_val=float('-inf'))

Generates categorical noise or simulates missing data on a given dataset.

  • X: list or numpy.ndarray: Dataset to generate noise for.
  • y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
  • p: float, p <=1.0, default=0.2: Amount of noise to generate.
  • type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
  • missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.

Returns: numpy.ndarray X with added noise.
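
The "missing" noise type can be pictured as overwriting a share of the dataset with missing_val. The sketch below shows one plausible reading in plain Python, where p is taken as the fraction of individual cells to replace; that interpretation and the name sketch_missing are assumptions, not the library's documented behavior.

```python
import random


def sketch_missing(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Illustrative only: replace a fraction p of all cells with missing_val."""
    rng = random.Random(seed)
    noisy = [list(row) for row in X]  # copy so that X is left untouched
    cells = [(i, j) for i, row in enumerate(noisy) for j in range(len(row))]
    for i, j in rng.sample(cells, int(p * len(cells))):
        noisy[i][j] = missing_val
    return noisy


X = [[1, 2, 3, 4, 5] for _ in range(4)]
noisy = sketch_missing(X, p=0.2)
n_missing = sum(value == float("-inf") for row in noisy for value in row)
print(n_missing)  # 4
```

Because float('-inf') is still a float, downstream code that expects integers may need an explicit check for it.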


CategoricalClassification.downsample_dataset

CategoricalClassification.downsample_dataset(X, 
                                             y, 
                                             n=None, 
                                             seed=42, 
                                             reshuffle=False)

Downsamples the given dataset to n samples per class, or to the number of samples in the minority class if n is not given, resulting in a balanced dataset.

  • X: list or numpy.ndarray: Dataset to downsample.
  • y: list or numpy.ndarray: Labels corresponding to X.
  • n: int, optional, default=None: Number of samples per class to downsample to.
  • seed: int, default=42: Seed for random state of resample function.
  • reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.

Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
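
Balanced downsampling as described above can be sketched as follows. This is a plain-Python illustration, not the library's implementation; sketch_downsample is a hypothetical name and random.Random.sample stands in for the resample function mentioned under seed.

```python
import random
from collections import defaultdict


def sketch_downsample(X, y, n=None, seed=42):
    """Illustrative only: keep the same number of samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    # Default to the minority class size when n is not given.
    per_class = n if n is not None else min(len(ix) for ix in by_class.values())
    # Sample without replacement inside each class, then restore the original order.
    keep = sorted(i for ix in by_class.values() for i in rng.sample(ix, per_class))
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[i] for i in range(6)]
y = [0, 0, 0, 0, 1, 1]
X_down, y_down = sketch_downsample(X, y)
print(len(X_down), y_down.count(0), y_down.count(1))  # 4 2 2
```

Sorting the kept indices preserves the original sample order, which corresponds to leaving reshuffle at False.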


CategoricalClassification.print_dataset

CategoricalClassification.print_dataset(X, y)

Prints given dataset in a readable format.

  • X: list or numpy.ndarray: Dataset to print.
  • y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.

CategoricalClassification.summarize

CategoricalClassification.summarize()

Prints the stored dataset information dictionary in a readable format.
