rdga_4k: random data generator algorithm for clustering

rdga_4k (Random Data Generator Algorithm for Clustering)

The rdga_4k library generates synthetic labeled datasets for developing and testing clustering algorithms. It provides two core functions: catbird, which generates binary features, and canard, which generates categorical features. A helper, get_rate, builds the per-cluster example counts that both functions accept.


🔥 Features

  • Synthetic Data for Clustering: Tailored datasets for clustering algorithm research and testing.
  • Flexible Configurations: Supports binary and categorical feature generation.
  • Noise and Intersection Control: Fine-tune feature noise and cluster intersections.
  • Reproducible Results: Pass random_state to regenerate the same dataset across runs.

🛠 Installation

Install using pip:

pip install rdga_4k
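
To confirm that the install succeeded, one option is to query the installed distribution's metadata from Python. The sketch below uses only the standard library; the helper name installed_version is hypothetical and not part of rdga_4k.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="rdga_4k"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string when rdga_4k is installed, otherwise None.
print(installed_version())
```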

🚀 Usage

Import the library and use the catbird or canard functions to generate datasets:

from rdga_4k import catbird, canard

# Example using catbird
X, y = catbird(
    n_feat=10,
    feat_sig=[3, 2],
    rate=[50, 50],
    lmbd=0.7,
    eps=0.1,
    random_state=42
)

# Example using canard
X, y = canard(
    n_feat=10,
    n_cat=3,
    rate=[50, 50],
    lmbd=5,
    eps=0.2,
    random_state=42
)

📜 Functions Overview

catbird

Generates a labeled dataset with binary features based on feature clustering.

Parameters

  • n_feat (int): Number of total features. Must be greater than 1.
  • feat_sig (list[int]): List of the number of significant features per cluster.
  • rate (list[int]): Number of examples per cluster.
  • lmbd (float): Intersection factor between features. Default is 0.8.
  • eps (float): Noise rate for feature generation. Default is 0.2.
  • random_state (int or RandomState, optional): Seed for reproducibility.
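
Because feat_sig and rate are both per-cluster lists, it can help to check them for consistency before generating data. The following is a minimal sketch, not part of rdga_4k: the helper check_catbird_args is hypothetical, and apart from the documented n_feat > 1 rule, the checks (equal list lengths, feat_sig entries no larger than n_feat, positive rate entries) are assumptions about what a sensible call looks like.

```python
def check_catbird_args(n_feat, feat_sig, rate):
    """Raise ValueError for argument combinations that look inconsistent."""
    if n_feat <= 1:
        # Documented constraint: n_feat must be greater than 1.
        raise ValueError("n_feat must be greater than 1")
    if len(feat_sig) != len(rate):
        # Assumption: one feat_sig entry and one rate entry per cluster.
        raise ValueError("feat_sig and rate must describe the same number of clusters")
    if any(s < 1 or s > n_feat for s in feat_sig):
        # Assumption: a cluster cannot have more significant features than n_feat.
        raise ValueError("each feat_sig entry must be between 1 and n_feat")
    if any(r < 1 for r in rate):
        # Assumption: every cluster needs at least one example.
        raise ValueError("each rate entry must be a positive count")
    return len(rate)  # number of clusters implied by the arguments


n_clusters = check_catbird_args(n_feat=10, feat_sig=[3, 2], rate=[50, 50])
print(n_clusters)  # 2
```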

Returns

  • X (np.ndarray): Binary matrix representing the features.
  • y (np.ndarray): Array of cluster labels.
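
A quick sanity check on the returned labels is to count the examples in each cluster and compare the counts with the rate argument. The sketch below uses stand-in labels rather than a real catbird call; the helper cluster_sizes is hypothetical, and the integer labels 0 and 1 are an assumption about how the library numbers clusters.

```python
from collections import Counter


def cluster_sizes(y):
    """Return {label: number of examples}, ordered by label."""
    counts = Counter(int(label) for label in y)
    return dict(sorted(counts.items()))


# Stand-in for the y returned with rate=[50, 50] (label values are assumed).
y_demo = [0] * 50 + [1] * 50
print(cluster_sizes(y_demo))  # {0: 50, 1: 50}
```

The same helper accepts a NumPy array, since it only iterates over the labels.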

Example

X, y = catbird(n_feat=10, feat_sig=[3, 2], rate=[50, 50], lmbd=0.7, eps=0.1, random_state=42)

canard

Generates a labeled dataset with categorical features divided into multiple categories.

Parameters

  • n_feat (int): Number of total features. Must be greater than 1.
  • n_cat (int): Number of categories for each feature. Must be greater than 1.
  • rate (list[int]): Number of examples per cluster.
  • lmbd (int): Intersection factor between features. Default is 10.
  • eps (float): Noise rate for feature generation. Default is 0.3.
  • random_state (int or RandomState, optional): Seed for reproducibility.

Returns

  • X (np.ndarray): Matrix of categorical features.
  • y (np.ndarray): Array of cluster labels.

Example

X, y = canard(n_feat=10, n_cat=3, rate=[50, 50], lmbd=5, eps=0.2, random_state=42)

get_rate

Helper function that generates balanced and unbalanced rate lists for use with catbird or canard.

Parameters

  • N (int): Approximate total number of examples. Must be greater than 1.
  • k (int): Number of clusters. Must be greater than 1.
  • n_min (int): Minimum number of examples per cluster.

Returns

  • list: A list with two elements:
    • rate[0]: Balanced rate — equal distribution across clusters.
    • rate[1]: Unbalanced rate — decreasing distribution across clusters.
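
To make the two shapes concrete, here is an illustrative sketch of what a balanced and a decreasing rate list could look like. It is not the algorithm used by get_rate: the function sketch_rates, the linear weighting, and the way n_min is applied as a floor are all assumptions chosen for illustration, so the numbers get_rate actually returns may differ.

```python
def sketch_rates(N, k, n_min):
    """Illustrative only: return [balanced, unbalanced] per-cluster counts."""
    if N <= 1 or k <= 1:
        raise ValueError("N and k must both be greater than 1")
    # Balanced: every cluster gets the same share of N.
    balanced = [N // k] * k
    # Unbalanced (assumed scheme): linear weights k, k-1, ..., 1,
    # with each cluster kept at or above n_min.
    weights = list(range(k, 0, -1))
    total = sum(weights)
    unbalanced = [max(n_min, N * w // total) for w in weights]
    return [balanced, unbalanced]


rates = sketch_rates(N=500, k=3, n_min=20)
print(rates[0])  # [166, 166, 166]
print(rates[1])  # [250, 166, 83]
```

Integer division means the totals only approximate N, which matches the description of N as an approximate total.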

Example

from rdga_4k import get_rate, catbird

rate = get_rate(N=500, k=3, n_min=20)

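# Balanced: every cluster gets the same number of examples
X_bal, y_bal = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[0], random_state=42)

# Unbalanced: cluster sizes decrease from one cluster to the next
X_unbal, y_unbal = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[1], random_state=42)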

📄 License

This project is licensed under the MIT License.


🤝 Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch.
  3. Commit your changes.
  4. Push to the branch.
  5. Open a pull request.

For questions or information, feel free to reach out at: aquinordga@gmail.com.


👨‍💻 Author

Developed by AQUINO, R. D. G. (profiles: Lattes, ORCID, Google Scholar)


💬 Feedback

Feel free to open an issue or contact me for feedback or feature requests. Your input is highly appreciated!

