rdga_4k: random data generator algorithm for clustering

rdga_4k (Random Data Generator Algorithm for Clustering)

The rdga_4k library generates synthetic, labeled datasets for developing and testing clustering algorithms. It provides two core functions: catbird, which generates binary features, and canard, which generates categorical features. A helper, get_rate, builds the per-cluster example counts that both functions accept.


🔥 Features

  • Synthetic Data for Clustering: Tailored datasets for clustering algorithm research and testing.
  • Flexible Configurations: Supports binary and categorical feature generation.
  • Noise and Intersection Control: Fine-tune feature noise and cluster intersections.
  • Reproducible Results: Ensure consistency with random seed support.

🛠 Installation

Install using pip:

pip install rdga_4k

Or directly from the source:

pip install git+https://github.com/aquinordg/rdga_4k.git
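
To confirm that the install succeeded, the installed version can be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of rdga_4k.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "rdga_4k" is the distribution name used in the pip command above.
found = installed_version("rdga_4k")
print("rdga_4k version:", found if found is not None else "not installed")
```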

🚀 Usage

Import the library and use the catbird or canard functions to generate datasets:

from rdga_4k import catbird, canard

# catbird: binary features; two clusters of 50 examples, with 3 and 2 significant features
X, y = catbird(
    n_feat=10,
    feat_sig=[3, 2],
    rate=[50, 50],
    lmbd=0.7,
    eps=0.1,
    random_state=42
)

# canard: categorical features with 3 categories each; two clusters of 50 examples
X, y = canard(
    n_feat=10,
    n_cat=3,
    rate=[50, 50],
    lmbd=5,
    eps=0.2,
    random_state=42
)
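
After generating a dataset, it can be useful to check that its size and cluster sizes match the requested rate. The sketch below assumes that each cluster label appears exactly as many times as its entry in rate, which follows from the parameter description but is not verified against the library here. The helper check_dataset is hypothetical, and the demo data is built by hand rather than produced by rdga_4k.

```python
from collections import Counter


def check_dataset(X, y, rate):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    n_rows = len(X)
    if n_rows != len(y):
        problems.append(f"X has {n_rows} rows but y has {len(y)} labels")
    if n_rows != sum(rate):
        problems.append(f"expected {sum(rate)} rows from rate, got {n_rows}")
    # Compare cluster sizes without assuming how the labels are numbered.
    sizes = sorted(Counter(list(y)).values())
    if sizes != sorted(rate):
        problems.append(f"cluster sizes {sizes} do not match rate {sorted(rate)}")
    return problems


# Hand-built stand-in data (not produced by rdga_4k): 3 + 2 rows, 4 features.
X_demo = [[1, 0, 1, 0]] * 3 + [[0, 1, 0, 1]] * 2
y_demo = [0, 0, 0, 1, 1]

ok_report = check_dataset(X_demo, y_demo, rate=[3, 2])
bad_report = check_dataset(X_demo, y_demo, rate=[4, 2])
print(ok_report)   # → []
print(bad_report)  # lists the row-count and cluster-size mismatches
```

The same function accepts NumPy arrays, since it relies only on len() and iteration.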

📜 Functions Overview

catbird

Generates a labeled dataset with binary features based on feature clustering.

Parameters

  • n_feat (int): Total number of features. Must be greater than 1.
  • feat_sig (list[int]): Number of significant features for each cluster, one entry per cluster.
  • rate (list[int]): Number of examples in each cluster, one entry per cluster.
  • lmbd (float): Intersection factor between features. Default is 0.8.
  • eps (float): Noise rate for feature generation. Default is 0.2.
  • random_state (int or RandomState, optional): Seed for reproducibility.

Returns

  • X (np.ndarray): Binary matrix representing the features.
  • y (np.ndarray): Array of cluster labels.
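
One way to inspect a binary dataset like this is to compute, for each cluster, the fraction of ones in every feature column; features whose fraction differs sharply between clusters are the ones that separate them. This is a plain-Python sketch of that inspection step, not part of the library. The helper feature_rates_by_cluster is hypothetical and the demo matrix is hand-built.

```python
from collections import defaultdict


def feature_rates_by_cluster(X, y):
    """Map each label to the fraction of ones in every feature column."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0] * len(row)
        # Accumulate column-wise sums for this label.
        sums[label] = [s + int(v) for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in col_sums]
        for label, col_sums in sums.items()
    }


# Hand-built stand-in data (not produced by rdga_4k).
X_demo = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y_demo = [0, 0, 1, 1]

rates = feature_rates_by_cluster(X_demo, y_demo)
print(rates)  # → {0: [1.0, 0.5, 0.0], 1: [0.0, 0.5, 1.0]}
```

In this toy data the first and third columns separate the two clusters, while the second column carries no cluster information.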

Example

X, y = catbird(n_feat=10, feat_sig=[3, 2], rate=[50, 50], lmbd=0.7, eps=0.1, random_state=42)

canard

Generates a labeled dataset with categorical features divided into multiple categories.

Parameters

  • n_feat (int): Total number of features. Must be greater than 1.
  • n_cat (int): Number of categories for each feature. Must be greater than 1.
  • rate (list[int]): Number of examples per cluster.
  • lmbd (int): Intersection factor between features. Default is 10.
  • eps (float): Noise rate for feature generation. Default is 0.3.
  • random_state (int or RandomState, optional): Seed for reproducibility.

Returns

  • X (np.ndarray): Matrix of categorical features.
  • y (np.ndarray): Array of cluster labels.
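
Many distance-based clustering algorithms expect numeric input, so a categorical matrix is often one-hot encoded first. The sketch below is a generic, plain-Python encoder and not something rdga_4k provides. It makes no assumption about how canard encodes its categories: the category set of each column is taken from the values actually observed. The helper one_hot is hypothetical, the demo matrix is hand-built, and a non-empty matrix is assumed.

```python
def one_hot(X):
    """One-hot encode a non-empty matrix of categorical values, column by column."""
    n_cols = len(X[0])
    # Observed categories per column, sorted so the output layout is stable.
    categories = [sorted({row[j] for row in X}) for j in range(n_cols)]
    encoded = []
    for row in X:
        out = []
        for j, value in enumerate(row):
            # One indicator per category observed in column j.
            out.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(out)
    return encoded, categories


# Hand-built stand-in data (not produced by rdga_4k): 3 rows, 2 features.
X_demo = [[0, 2], [1, 2], [2, 0]]

encoded, categories = one_hot(X_demo)
print(categories)  # → [[0, 1, 2], [0, 2]]
print(encoded)     # → [[1, 0, 0, 0, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]]
```

A category that never occurs in a column gets no indicator column here; if a fixed width of n_cat per feature is needed, the category lists would have to be supplied explicitly instead.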

Example

X, y = canard(n_feat=10, n_cat=3, rate=[50, 50], lmbd=5, eps=0.2, random_state=42)

get_rate

Helper function that generates balanced and unbalanced rate lists for use with catbird or canard.

Parameters

  • N (int): Approximate total number of examples. Must be greater than 1.
  • k (int): Number of clusters. Must be greater than 1.
  • n_min (int): Minimum number of examples per cluster.

Returns

  • list: A list with two elements:
    • rate[0]: Balanced rate — equal distribution across clusters.
    • rate[1]: Unbalanced rate — decreasing distribution across clusters.
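
To make the difference between the two lists concrete, here is an illustrative sketch of one way a balanced and a decreasing list of cluster sizes could be built. This is not the algorithm used by get_rate, whose implementation is not described here; the function sketch_rates and its linear weighting scheme are assumptions made purely for illustration.

```python
def sketch_rates(N, k, n_min):
    """Illustrative only: build a balanced and a decreasing list of cluster sizes."""
    if k < 2 or N < k * n_min:
        raise ValueError("need k >= 2 and N >= k * n_min")
    # Balanced: every cluster gets the same share of N (rounded down).
    balanced = [N // k] * k
    # Unbalanced: every cluster gets n_min, and the remainder is split
    # with linearly decreasing weights k, k-1, ..., 1 (an assumed scheme).
    remaining = N - k * n_min
    weights = list(range(k, 0, -1))
    total_weight = sum(weights)
    unbalanced = [n_min + remaining * w // total_weight for w in weights]
    return [balanced, unbalanced]


rates = sketch_rates(N=500, k=3, n_min=20)
print(rates[0])  # → [166, 166, 166]
print(rates[1])  # → [240, 166, 93]
```

Because of integer rounding, both lists sum to slightly less than N in this example, which matches the description of N as an approximate total.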

Example

from rdga_4k import get_rate, catbird

rate = get_rate(N=500, k=3, n_min=20)

# Balanced
X, y = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[0], random_state=42)

# Unbalanced
X, y = catbird(n_feat=10, feat_sig=[3, 2, 2], rate=rate[1], random_state=42)

📄 License

This project is licensed under the MIT License.


🤝 Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch.
  3. Commit your changes.
  4. Push to the branch.
  5. Open a pull request.

For questions or information, feel free to reach out at: aquinordga@gmail.com.


👨‍💻 Author

Developed by AQUINO, R. D. G. (profiles: Lattes, ORCID, Google Scholar)


💬 Feedback

Feel free to open an issue or contact me for feedback or feature requests. Your input is highly appreciated!
