Categorical Classification
A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.
Usage
Creating a simple dataset
# Assumes cc is an instance of CategoricalClassification
# Creates a simple dataset of 9 features, 10k samples, with every feature having a cardinality of 35
X = cc.generate_data(9,
                     10000,
                     cardinality=35,
                     ensure_rep=True,
                     random_values=True,
                     low=0,
                     high=40)
# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')
Documentation
CategoricalClassification.dataset_info
print(CategoricalClassification.dataset_info)
Stores a formatted dictionary describing the operations performed so far. Calling CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.
CategoricalClassification.generate_data
CategoricalClassification.generate_data(n_features,
n_samples,
cardinality=5,
structure=None,
ensure_rep=False,
random_values=False,
low=0,
high=1000,
k=10,
seed=42)
Generates dataset of shape (n_samples, n_features), based on given parameters.
- n_features: int The number of features in a generated dataset.
- n_samples: int The number of samples in a generated dataset.
- cardinality: int, default=5. Sets the default cardinality of a generated dataset.
- structure: list or numpy.ndarray, default=None.
Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions.
Follows the format [tuple, tuple, ...], where each tuple can be either:
  - (int or list, int): the first element is the index or list of indices of features; the second element is their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as the peak. The feature values will be integers in the range [0, second element of tuple].
  - (int or list, list): the first element is the index or list of indices of features. The second element offers two options:
    - list: a list of values to be used in the feature or features,
    - [list, list]: the first list is the set of values the feature or features possess, the second the frequencies or probabilities of the individual values.
- ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
- seed: int, default=42. Controls numpy.random.seed
Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
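The structure format described above can be illustrated with a small sketch. The feature indices, values, and the describe_entry helper are hypothetical and exist only for illustration; the helper is not part of the library and merely classifies each tuple by the forms listed above.

```python
import numpy as np

# Hypothetical structure: indices and attributes are illustrative only.
structure = [
    (0, 5),                             # feature 0: cardinality 5
    ([1, 2], 3),                        # features 1 and 2: cardinality 3
    (3, [10, 20, 30]),                  # feature 3: explicit value domain
    (4, [[0, 1, 2], [0.6, 0.3, 0.1]]),  # feature 4: values with probabilities
]


def describe_entry(entry):
    """Hypothetical helper: classify one structure tuple by its documented form."""
    indices, attributes = entry
    indices = [indices] if isinstance(indices, int) else list(indices)
    if isinstance(attributes, (int, np.integer)):
        kind = "cardinality"
    elif (len(attributes) == 2
          and all(isinstance(a, (list, np.ndarray)) for a in attributes)):
        kind = "values+probabilities"
    else:
        kind = "values"
    return indices, kind


descriptions = [describe_entry(e) for e in structure]
for indices, kind in descriptions:
    print(indices, kind)
```

A list like this would be passed as the structure argument of generate_data; features not covered by it presumably fall back to the default cardinality.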
CategoricalClassification._configure_generate_feature
CategoricalClassification._configure_generate_feature(feature_attributes,
n_samples,
ensure_rep=False,
random_values=False,
low=0,
high=1000,
k=10)
Helper function used to configure _generate_feature() with the proper parameters based on feature_attributes.
- feature_attributes: int or list or numpy.ndarray Attributes of feature. Can be just cardinality (int), value domain (list), or value domain and their respective probabilities (list).
- n_samples: int Number of samples in dataset. Determines generated feature vector size.
- ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
Returns: a numpy.ndarray feature array.
CategoricalClassification._generate_feature
CategoricalClassification._generate_feature(size,
vec=None,
cardinality=5,
ensure_rep=False,
random_values=False,
low=0,
high=1000,
k=10,
p=None)
Generates a feature array of length size using numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a peak chosen at random from the value array.
- size: int Length of generated feature array.
- vec: list or numpy.ndarray, default=None List of feature values, value domain of feature.
- cardinality: int, default=5 Cardinality of feature to use when generating its value domain. If vec is not None, vec is used instead.
- ensure_rep: bool, default=False Control flag. If True, all possible values will appear in the feature array.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
- p: list or numpy.ndarray, default=None Array of frequencies or probabilities. Must have the same length as vec.
Returns: a numpy.ndarray feature array.
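The behaviour described above can be approximated with a short sketch. This is not the library's implementation: the Gaussian-shaped weighting, the way k enters it, the use of numpy.random.default_rng instead of numpy.random.seed, and the sketch_feature name are all assumptions made for illustration.

```python
import numpy as np


def sketch_feature(size, cardinality=5, k=10, ensure_rep=False, seed=42):
    """Illustrative only: sample a categorical feature with a roughly normal value density."""
    rng = np.random.default_rng(seed)
    values = np.arange(cardinality)
    peak = rng.choice(values)  # randomly chosen peak value
    # Assumed weighting: Gaussian-shaped weights; a larger k gives a narrower peak.
    spread = max(cardinality - 1, 1)
    weights = np.exp(-k * ((values - peak) / spread) ** 2)
    p = weights / weights.sum()
    feature = rng.choice(values, size=size, p=p)
    if ensure_rep and size >= cardinality:
        # Overwrite distinct random positions so every value appears at least once.
        positions = rng.choice(size, size=cardinality, replace=False)
        feature[positions] = values
    return feature


feature = sketch_feature(1000, cardinality=8, k=10, ensure_rep=True)
print(np.unique(feature, return_counts=True))
```

With ensure_rep=True every value of the domain is present, even those far from the peak that would otherwise rarely be sampled.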
CategoricalClassification.generate_combinations
CategoricalClassification.generate_combinations(X,
feature_indices,
combination_function=None,
combination_type='linear')
Generates and adds a new column to given dataset X. The column is the result of a combination of features selected with feature_indices. Combinations can be linear, nonlinear, or custom defined functions.
- X: list or numpy.ndarray: Dataset to perform the combinations on.
- feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
- combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
- combination_type: str either linear or nonlinear, default='linear':
Selects which built-in combination type is used.
- If 'linear', the combination is a sum of selected features.
- If 'nonlinear', the combination is the sine value of the sum of selected features.
Returns: a numpy.ndarray dataset X with added feature combinations.
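The built-in combination types and the contract for a custom function can be sketched in plain NumPy. The data and the product_combination function below are hypothetical; the sketch reproduces the documented behaviour (sum, sine of the sum, numpy.column_stack) without calling the library.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
feature_indices = [0, 2]
selected = X[:, feature_indices]

linear = selected.sum(axis=1)              # 'linear': sum of selected features
nonlinear = np.sin(selected.sum(axis=1))   # 'nonlinear': sine of that sum


def product_combination(features):
    """Hypothetical custom function: element-wise product of the given feature columns."""
    return np.prod(np.asarray(features), axis=0)


# The custom function receives a list of feature columns and returns one new column.
custom = product_combination([X[:, i] for i in feature_indices])

X_linear = np.column_stack((X, linear))
X_custom = np.column_stack((X, custom))
print(X_linear)
print(X_custom)
```

A function shaped like product_combination is the kind of callable that the combination_function parameter expects.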
CategoricalClassification._xor
CategoricalClassification._xor(arr)
Performs bitwise XOR on given vectors and returns result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.
CategoricalClassification._and
CategoricalClassification._and(arr)
Performs bitwise AND on given vectors and returns result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.
CategoricalClassification._or
CategoricalClassification._or(arr)
Performs bitwise OR on given vectors and returns result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
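As a rough illustration of the three bitwise helpers, the sketch below folds the NumPy bitwise functions over a list of columns. Folding pairwise with functools.reduce when more than two columns are given is an assumption; the library may handle that case differently.

```python
from functools import reduce

import numpy as np

# Three hypothetical binary feature columns.
arr = [np.array([1, 0, 1, 1]),
       np.array([1, 1, 0, 1]),
       np.array([0, 1, 0, 1])]

xor_result = reduce(np.bitwise_xor, arr)  # pairwise XOR across all columns
and_result = reduce(np.bitwise_and, arr)  # pairwise AND across all columns
or_result = reduce(np.bitwise_or, arr)    # pairwise OR across all columns

print(xor_result, and_result, or_result)
```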
CategoricalClassification.generate_correlated
CategoricalClassification.generate_correlated(X,
feature_indices,
r=0.8)
Generates and adds new columns to given dataset X, correlated to the selected features, by a Pearson correlation coefficient of r. For vectors with mean 0, their correlation equals the cosine of their angle.
- X: list or numpy.ndarray: Dataset to add correlated features to.
- feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
- r: float, default=0.8: Desired correlation coefficient.
Returns: a numpy.ndarray dataset X with added correlated features.
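The remark about mean-zero vectors suggests one way to build a column with a chosen correlation: centre the feature, draw noise orthogonal to it, and mix the two so that the cosine of their angle equals r. The sketch below follows that idea; sketch_correlated is a hypothetical name, and the library's actual procedure may differ. Note that the result here is continuous rather than categorical.

```python
import numpy as np


def sketch_correlated(x, r=0.8, seed=42):
    """Illustrative only: return a vector whose Pearson correlation with x is r."""
    rng = np.random.default_rng(seed)
    x_c = np.asarray(x, dtype=float)
    x_c = x_c - x_c.mean()                     # centre x so correlation equals cosine
    noise = rng.normal(size=x_c.shape)
    noise = noise - noise.mean()               # centre the noise as well
    # Remove the component of the noise that lies along x (Gram-Schmidt step).
    noise = noise - (noise @ x_c) / (x_c @ x_c) * x_c
    x_unit = x_c / np.linalg.norm(x_c)
    noise_unit = noise / np.linalg.norm(noise)
    # cos(angle) between the result and x is r by construction.
    return r * x_unit + np.sqrt(1 - r ** 2) * noise_unit


x = np.array([0, 1, 2, 3, 4, 0, 2, 1, 3, 4])
y = sketch_correlated(x, r=0.8)
corr = np.corrcoef(x, y)[0, 1]
print(round(corr, 6))
```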
CategoricalClassification.generate_duplicates
CategoricalClassification.generate_duplicates(X,
feature_indices)
Duplicates selected feature (column) indices, and adds the duplicated columns to the given dataset X.
- X: list or numpy.ndarray: Dataset to add duplicated features to.
- feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.
Returns: a numpy.ndarray dataset X with added duplicated features.
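Duplication amounts to stacking copies of the selected columns onto the dataset. A minimal NumPy sketch, assuming the copies are appended at the end in the order given by feature_indices:

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])
feature_indices = [0, 2]

# Append copies of columns 0 and 2 to the right of the dataset.
X_dup = np.column_stack((X, X[:, feature_indices]))
print(X_dup)
```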
CategoricalClassification.generate_labels
CategoricalClassification.generate_labels(X,
n=2,
p=0.5,
k=2,
decision_function=None,
class_relation='linear',
balance=False)
Generates a vector of labels. Class values are set by a decision boundary computed from a linear, nonlinear, or custom-defined function of the features, or by clustering the samples.
- X: list or numpy.ndarray: Dataset to generate labels for.
- n: int, default=2: Number of classes.
- p: float or list, default=0.5: Class distribution.
- k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
- decision_function: function, default: None Custom defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
- class_relation: str, either 'linear', 'nonlinear', or 'cluster' default='linear': Sets relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.
- balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.
Returns: numpy.ndarray y of class labels.
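To make the decision-boundary idea concrete, the sketch below labels samples by thresholding a linear decision value at quantiles derived from the class distribution p. The choice of a plain row sum as the decision value, the quantile thresholds, and the sketch_labels name are assumptions for illustration, not the library's actual formula.

```python
import numpy as np


def sketch_labels(X, p=(0.5, 0.5)):
    """Illustrative only: label samples by thresholding a linear decision value."""
    X = np.asarray(X, dtype=float)
    decision = X.sum(axis=1)  # assumed linear combination of all features
    # Cumulative class shares give the quantile cut points between classes.
    cuts = np.quantile(decision, np.cumsum(p)[:-1])
    return np.searchsorted(cuts, decision, side="right")


rng = np.random.default_rng(0)
X = rng.integers(0, 35, size=(1000, 5))
y = sketch_labels(X, p=(0.7, 0.3))
print(np.bincount(y))
```

A custom decision_function would take the place of the row sum here: it receives X and returns one decision value per sample.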
CategoricalClassification._cluster_data
CategoricalClassification._cluster_data(X,
n,
p=1.0,
balance=False)
Clusters given data using KMeans clustering.
- X: list or numpy.ndarray: Dataset to cluster.
- n: int: Number of clusters.
- p: float or list or numpy.ndarray: To be used when balance=True, sets class distribution - number of samples per cluster.
- balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.
Returns: numpy.ndarray cluster_labels of clustering labels.
CategoricalClassification.generate_noise
CategoricalClassification.generate_noise(X,
y,
p=0.2,
type="categorical",
missing_val=float('-inf'))
Generates categorical noise or simulates missing data on a given dataset.
- X: list or numpy.ndarray: Dataset to generate noise for.
- y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
- p: float, p <=1.0, default=0.2: Amount of noise to generate.
- type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
- missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.
Returns: numpy.ndarray X with added noise.
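The "missing" noise type can be pictured as masking a fraction p of the entries with missing_val. The sketch below assumes p is applied independently per cell, which the description does not state; sketch_missing_noise is a hypothetical name.

```python
import numpy as np


def sketch_missing_noise(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Illustrative only: replace roughly a fraction p of cells with missing_val."""
    rng = np.random.default_rng(seed)
    X_noisy = np.asarray(X, dtype=float).copy()  # float dtype so -inf can be stored
    mask = rng.random(X_noisy.shape) < p         # assumed: independent per-cell masking
    X_noisy[mask] = missing_val
    return X_noisy, mask


X = np.arange(2000).reshape(400, 5)
X_noisy, mask = sketch_missing_noise(X, p=0.2)
print(mask.mean())
```

The cast to float matters: an integer array cannot hold float('-inf'), so the noisy dataset no longer has an integer dtype.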
CategoricalClassification.downsample_dataset
CategoricalClassification.downsample_dataset(X,
y,
n=None,
seed=42,
reshuffle=False)
Downsamples the given dataset to n samples per class, or to the number of samples in the minority class if n is not given, resulting in a balanced dataset.
- X: list or numpy.ndarray: Dataset to downsample.
- y: list or numpy.ndarray: Labels corresponding to X.
- n: int, default=None: Optional number of samples per class to downsample to.
- seed: int, default=42: Seed for random state of resample function.
- reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.
Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
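A minimal NumPy sketch of class-balanced downsampling follows. It samples indices with numpy.random.default_rng rather than a resample function, and capping n at the minority class size is an assumption; sketch_downsample is a hypothetical name.

```python
import numpy as np


def sketch_downsample(X, y, n=None, seed=42, reshuffle=False):
    """Illustrative only: keep the same number of samples from every class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Assumed: n cannot exceed the size of the minority class.
    n = counts.min() if n is None else min(n, counts.min())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    if reshuffle:
        rng.shuffle(keep)  # mix classes instead of leaving them grouped
    return X[keep], y[keep]


y = np.array([0] * 8 + [1] * 3)
X = np.arange(22).reshape(11, 2)
X_bal, y_bal = sketch_downsample(X, y)
print(np.bincount(y_bal))
```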
CategoricalClassification.print_dataset
CategoricalClassification.print_dataset(X, y)
Prints given dataset in a readable format.
- X: list or numpy.ndarray: Dataset to print.
- y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.
CategoricalClassification.summarize
CategoricalClassification.summarize()
Prints stored dataset information dictionary in a digestible manner.