rdga_4k: random data generator algorithm for clustering
The rdga_4k library generates synthetic labeled datasets for testing clustering algorithms. It provides two core functions: catbird, which generates binary features, and canard, which generates categorical features. A helper, get_rate, builds the per-cluster example counts that both functions take.
🔥 Features
- Synthetic Data for Clustering: Tailored datasets for clustering algorithm research and testing.
- Flexible Configurations: Supports binary and categorical feature generation.
- Noise and Intersection Control: Fine-tune feature noise and cluster intersections.
- Reproducible Results: Ensure consistency with random seed support.
🛠 Installation
Install using pip:
pip install rdga_4k
🚀 Usage
Import the library and use the catbird or canard functions to generate datasets:
from rdga_4k import catbird, canard
# Example using catbird
X, y = catbird(
    n_feat=10,
    feat_sig=[3, 2],
    rate=[50, 50],
    lmbd=0.7,
    eps=0.1,
    random_state=42
)
# Example using canard
X, y = canard(
    n_feat=10,
    n_cat=3,
    rate=[50, 50],
    lmbd=5,
    eps=0.2,
    random_state=42
)
📜 Functions Overview
catbird
Generates a labeled dataset with binary features based on feature clustering.
Parameters
- n_feat (int): Number of total features. Must be greater than 1.
- feat_sig (list[int]): List of the number of significant features per cluster.
- rate (list[int]): Number of examples per cluster.
- lmbd (float): Intersection factor between features. Default is 0.8.
- eps (float): Noise rate for feature generation. Default is 0.2.
- random_state (int or RandomState, optional): Seed for reproducibility.
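The constraints in the parameter list above can be checked before calling catbird. This is a minimal sketch: check_catbird_args is a hypothetical helper, not a library function. Only the rule that n_feat must be greater than 1 comes from the documentation; the length check is an assumption that follows from feat_sig and rate both being described as per-cluster lists.

```python
def check_catbird_args(n_feat, feat_sig, rate):
    """Raise ValueError if the arguments break the constraints described above."""
    if n_feat <= 1:
        raise ValueError("n_feat must be greater than 1")
    # Assumption: feat_sig and rate each hold one entry per cluster.
    # The library may or may not enforce this itself.
    if len(feat_sig) != len(rate):
        raise ValueError("feat_sig and rate need one entry per cluster")
    return len(rate)  # number of clusters


n_clusters = check_catbird_args(n_feat=10, feat_sig=[3, 2], rate=[50, 50])
print(n_clusters)  # 2
```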
Returns
- X (np.ndarray): Binary matrix representing the features.
- y (np.ndarray): Array of cluster labels.
Example
X, y = catbird(n_feat=10, feat_sig=[3, 2], rate=[50, 50], lmbd=0.7, eps=0.1, random_state=42)
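Because rate gives the number of examples per cluster, the label array can be checked against it by counting labels. The sketch below assumes y holds one label per row and that noise does not alter labels; cluster_sizes is a hypothetical helper, and the sample y is hand-written rather than library output.

```python
from collections import Counter


def cluster_sizes(y):
    """Return the cluster sizes found in y, largest first."""
    labels = y.tolist() if hasattr(y, "tolist") else list(y)
    return sorted(Counter(labels).values(), reverse=True)


# Hand-written labels standing in for the y returned by catbird.
y = [0] * 50 + [1] * 30
print(cluster_sizes(y))  # [50, 30]
```

Comparing the result with sorted(rate, reverse=True) then shows whether each cluster received the requested number of examples.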
canard
Generates a labeled dataset with categorical features divided into multiple categories.
Parameters
- n_feat (int): Number of total features. Must be greater than 1.
- n_cat (int): Number of categories for each feature. Must be greater than 1.
- rate (list[int]): Number of examples per cluster.
- lmbd (int): Intersection factor between features. Default is 10.
- eps (float): Noise rate for feature generation. Default is 0.3.
- random_state (int or RandomState, optional): Seed for reproducibility.
Returns
- X (np.ndarray): Matrix of categorical features.
- y (np.ndarray): Array of cluster labels.
Example
X, y = canard(n_feat=10, n_cat=3, rate=[50, 50], lmbd=5, eps=0.2, random_state=42)
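With n_cat categories per feature, each column of X would be expected to contain at most n_cat distinct values. The sketch below checks that on hand-written data; distinct_per_column is a hypothetical helper, the integer encoding of categories is an assumption, and so is the idea that noise never introduces values outside the n_cat categories.

```python
def distinct_per_column(X):
    """Count the distinct values in each column of a 2-D array or list of rows."""
    rows = X.tolist() if hasattr(X, "tolist") else [list(r) for r in X]
    return [len(set(col)) for col in zip(*rows)]


# Hand-written rows standing in for the X returned by canard with n_cat=3.
X = [[0, 2, 1], [1, 2, 0], [2, 2, 1], [0, 1, 1]]
counts = distinct_per_column(X)
print(counts)                       # [3, 2, 2]
print(all(c <= 3 for c in counts))  # True
```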
get_rate
Helper function that generates balanced and unbalanced rate lists for use with catbird or canard.
Parameters
- N (int): Approximate total number of examples. Must be greater than 1.
- k (int): Number of clusters. Must be greater than 1.
- n_min (int): Minimum number of examples per cluster.
Returns
list: A list with two elements:
- rate[0]: Balanced rate (equal distribution across clusters).
- rate[1]: Unbalanced rate (decreasing distribution across clusters).
Example
from rdga_4k import get_rate, catbird
rate = get_rate(N=500, k=3, n_min=20)
# Balanced
X, y = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[0], random_state=42)
# Unbalanced
X, y = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[1], random_state=42)
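To make the two kinds of rate list concrete, the sketch below builds one balanced and one decreasing allocation. It is not the library's algorithm: sketch_rates is a hypothetical function, the linear weighting is an arbitrary choice for illustration, and the actual output of get_rate may differ.

```python
def sketch_rates(N, k, n_min):
    """Illustrative only: a balanced and a decreasing split of about N examples."""
    balanced = [N // k] * k
    # Linearly decreasing weights k, k-1, ..., 1, floored at n_min per cluster.
    total_weight = k * (k + 1) // 2
    unbalanced = [max(n_min, N * w // total_weight) for w in range(k, 0, -1)]
    return [balanced, unbalanced]


rates = sketch_rates(N=500, k=3, n_min=20)
print(rates[0])  # [166, 166, 166]
print(rates[1])  # [250, 166, 83]
```

Both lists sum to slightly less than N because of integer division, which is consistent with N being described as an approximate total.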
📄 License
This project is licensed under the MIT License.
🤝 Contributing
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch.
- Commit your changes.
- Push to the branch.
- Open a pull request.
For questions or information, feel free to reach out at: aquinordga@gmail.com.
👨💻 Author
💬 Feedback
Feel free to open an issue or contact me for feedback or feature requests. Your input is highly appreciated!