Generate synthetic data with clusters.
repliclust
High-Level Synthetic Data Generation with Data Set Archetypes
repliclust is a Python package for generating synthetic datasets with clusters based on high-level descriptions. Instead of manually setting low-level parameters like cluster centroids or covariance matrices, you can simply describe the desired characteristics of your data, and repliclust will automatically generate datasets that match those specifications.
What can this software do for you?
- Simplify Synthetic Data Generation: Eliminate the need to fine-tune low-level simulation parameters. Describe your desired scenario, and let repliclust handle the rest.
- Enhance Benchmark Quality: By controlling high-level aspects of the data, you can create more informative benchmarks that reveal the strengths and weaknesses of clustering algorithms under various conditions.
- Accelerate Research: Quickly generate diverse datasets to test hypotheses, validate models, and perform robustness checks.
Key Features
- Generate Data from High-Level Descriptions: Create datasets by specifying scenarios such as "clusters with very different shapes and sizes" or "highly overlapping oblong clusters."
- Data Set Archetypes: Use archetypes to define the overall geometry of your datasets with intuitive parameters that summarize cluster overlaps, shapes, sizes, and distributions.
- Flexible Cluster Shapes: Go beyond convex, blob-like clusters by applying nonlinear transformations, such as random neural networks for distortion or stereographic projections to create directional data.
- Reproducible and Informative Benchmarks: Independently manipulate different aspects of the data to create benchmarks that effectively evaluate and compare clustering algorithms under various conditions.
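To make the benchmarking use case concrete, here is a minimal sketch of how a clustering result could be scored against the ground-truth labels of a synthetic dataset. It uses the plain Rand index, written with the standard library only. The function name rand_index is a hypothetical helper for illustration and is not part of repliclust.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two
    points in the same cluster, or both put them in different clusters.
    Hypothetical helper for illustration; not a repliclust function.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        # Fewer than two points: there is nothing to disagree about.
        return 1.0
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Here labels_true would be the y returned alongside a generated dataset, and labels_pred the output of the clustering algorithm under test. The score is invariant to how cluster labels are named, so a perfect clustering scores 1.0 even if its label numbers differ from y.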
Demo
Try our demo here!
Installation
pip install repliclust
Quickstart
The easiest way to get started with repliclust is to create synthetic datasets from high-level descriptions in English. These features build on the OpenAI API, so you must provide an OpenAI API key to use them. You can set it as OPENAI_API_KEY=<your-api-key> in a .env file, or pass it to individual functions as the keyword argument openai_api_key="<your-api-key>".
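If you prefer to resolve the key yourself and always pass it explicitly, something like the following sketch may help. It assumes OPENAI_API_KEY is already present in the process environment (for example, exported in your shell). The helper resolve_openai_api_key is hypothetical and not part of repliclust.

```python
import os


def resolve_openai_api_key(explicit_key=None, env=None):
    """Return an explicitly passed key, else OPENAI_API_KEY from the environment.

    Hypothetical helper for illustration; not a repliclust function.
    `env` defaults to os.environ and can be replaced by a dict for testing.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No OpenAI API key found: set OPENAI_API_KEY or pass the key explicitly."
        )
    return key


# Intended use (requires repliclust and a valid key):
#   X, y, _ = rpl.generate("...", openai_api_key=resolve_openai_api_key())
```

Failing early with a clear message avoids sending a request with a missing key.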
- Generating data directly:
import repliclust as rpl
X, y, _ = rpl.generate("three highly separated oblong clusters in 10D", openai_api_key="<your-api-key>")
rpl.plot(X, y)
- Creating an archetype:
archetype = rpl.Archetype.from_verbal_description(
    "seven gamma-distributed clusters in 2D of very different shapes",
    openai_api_key="<your-api-key>"
)
- Generating data from the archetype:
X, y = archetype.synthesize()
- Making cluster shapes more irregular:
X_irregular = rpl.distort(X)
X_directional = rpl.wrap_around_sphere(X)
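To give some intuition for how a stereographic projection can turn ordinary points into directional data, here is a minimal sketch of the textbook inverse stereographic projection, which maps a point in d dimensions onto the unit sphere in d+1 dimensions. This illustrates the general idea only; it is not necessarily how rpl.wrap_around_sphere is implemented, and the function name inverse_stereographic is hypothetical.

```python
import math


def inverse_stereographic(point):
    """Map a point in R^d onto the unit sphere in R^(d+1).

    Textbook formula: with s = |x|^2, the image is
    (2x / (s + 1), (s - 1) / (s + 1)).
    Illustrative sketch; not repliclust's implementation.
    """
    s = sum(c * c for c in point)
    return [2.0 * c / (s + 1.0) for c in point] + [(s - 1.0) / (s + 1.0)]


# Every image has unit length, so it can be read as a direction.
image = inverse_stereographic([3.0, -4.0])
length = math.sqrt(sum(c * c for c in image))
```

Points near the origin land near one pole of the sphere and points far from the origin approach the opposite pole, so a cluster in flat space becomes a cluster of directions.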
Documentation
- User Guide: Learn how to generate datasets from high-level descriptions.
- Reference: Explore the package Reference.
Citation
To reference repliclust in your work, please cite:
@article{Zellinger:2023,
title = {High-Level Synthetic Data Generation with Data Set Archetypes},
author = {Zellinger, Michael J and B{\"u}hlmann, Peter},
journal = {arXiv preprint arXiv:2303.14301},
doi = {10.48550/arXiv.2303.14301},
year = {2023}
}