
Generate synthetic data with clusters.


██████  ███████ ██████  ██      ██  ██████ ██      ██    ██ ███████ ████████ 
██   ██ ██      ██   ██ ██      ██ ██      ██      ██    ██ ██         ██    
██████  █████   ██████  ██      ██ ██      ██      ██    ██ ███████    ██    
██   ██ ██      ██      ██      ██ ██      ██      ██    ██      ██    ██    
██   ██ ███████ ██      ███████ ██  ██████ ███████  ██████  ███████    ██    

High-Level Synthetic Data Generation with Data Set Archetypes

repliclust is a Python package for generating synthetic datasets with clusters based on high-level descriptions. Instead of manually setting low-level parameters like cluster centroids or covariance matrices, you can simply describe the desired characteristics of your data, and repliclust will automatically generate datasets that match those specifications.
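For contrast, here is a minimal sketch of the manual, low-level approach described above, written with only the Python standard library. The helper `sample_gaussian_clusters` and all numbers in it are hypothetical and are not part of repliclust; the point is only that every centroid and every spread has to be picked by hand.

```python
import random

def sample_gaussian_clusters(centroids, std_devs, n_per_cluster, seed=0):
    """Draw axis-aligned Gaussian clusters from hand-picked parameters."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (center, stds) in enumerate(zip(centroids, std_devs)):
        for _ in range(n_per_cluster):
            # Each coordinate is sampled independently around the centroid.
            X.append([rng.gauss(mu, sigma) for mu, sigma in zip(center, stds)])
            y.append(label)
    return X, y

# Every value below must be chosen manually, for every cluster.
centroids = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
std_devs = [(1.0, 0.3), (0.5, 0.5), (0.2, 1.5)]
X, y = sample_gaussian_clusters(centroids, std_devs, n_per_cluster=100)
print(len(X), sorted(set(y)))
```

Changing the overall degree of overlap or the variety of cluster shapes in such a script means re-tuning many of these numbers together, which is the bookkeeping that a high-level description is meant to replace.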

What can this software do for you?

  • Simplify Synthetic Data Generation: Eliminate the need to fine-tune low-level simulation parameters. Describe your desired scenario, and let repliclust handle the rest.

  • Enhance Benchmark Quality: By controlling high-level aspects of the data, you can create more informative benchmarks that reveal the strengths and weaknesses of clustering algorithms under various conditions.

  • Accelerate Research: Quickly generate diverse datasets to test hypotheses, validate models, and perform robustness checks.

Key Features

  • Generate Data from High-Level Descriptions: Create datasets by specifying scenarios such as "clusters with very different shapes and sizes" or "highly overlapping oblong clusters."

  • Data Set Archetypes: Use archetypes to define the overall geometry of your datasets with intuitive parameters that summarize cluster overlaps, shapes, sizes, and distributions.

  • Flexible Cluster Shapes: Go beyond convex, blob-like clusters by applying nonlinear transformations, such as random neural networks for distortion or stereographic projections to create directional data.

  • Reproducible and Informative Benchmarks: Independently manipulate different aspects of the data to create benchmarks that effectively evaluate and compare clustering algorithms under various conditions.
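To illustrate how a stereographic projection can turn ordinary points into directional data, the sketch below applies the textbook inverse stereographic projection, which maps a point in d dimensions onto the unit sphere in d+1 dimensions. The function `inverse_stereographic` is a hypothetical helper written for this example; it is not claimed to match how repliclust implements this feature.

```python
import math

def inverse_stereographic(point):
    """Map a point in R^d onto the unit sphere in R^(d+1)."""
    sq_norm = sum(c * c for c in point)
    denom = sq_norm + 1.0
    # The first d coordinates are scaled copies of the input; the last one
    # encodes how far the input was from the origin.
    return [2.0 * c / denom for c in point] + [(sq_norm - 1.0) / denom]

flat_points = [(0.0, 0.0), (1.0, 0.0), (3.0, -4.0)]
for p in flat_points:
    s = inverse_stereographic(p)
    norm = math.sqrt(sum(c * c for c in s))
    print(p, "->", [round(c, 3) for c in s], "norm:", round(norm, 6))
```

Every output has unit length, so only its direction carries information. Points near the origin land near one pole of the sphere and distant points approach the opposite pole.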

Demo

Try our demo here!

Installation

pip install repliclust

Quickstart

The easiest way to get started with repliclust is to create synthetic datasets from high-level descriptions in English. These features build on the OpenAI API, so you must provide an OpenAI API key to use them. You can set it as OPENAI_API_KEY=<your-api-key> in a .env file, or pass it to individual functions as the keyword argument openai_api_key="<your-api-key>".
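If you prefer to pass the key explicitly, one way to avoid hard-coding it is to look it up yourself first. The sketch below uses only the standard library; `parse_env_text` and `resolve_openai_api_key` are hypothetical helpers written for this example, not repliclust functions, and the parser handles only simple KEY=VALUE lines.

```python
import os
from pathlib import Path

def parse_env_text(text):
    """Parse simple KEY=VALUE lines (as found in a .env file) into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def resolve_openai_api_key(environ=None, env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_API_KEY")
    if not key and Path(env_path).is_file():
        key = parse_env_text(Path(env_path).read_text()).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(f"OPENAI_API_KEY not found in environment or {env_path}")
    return key
```

The returned string can then be supplied as `openai_api_key=resolve_openai_api_key()` in the calls shown below.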

  • Generating data directly:
import repliclust as rpl

X, y, _ = rpl.generate("three highly separated oblong clusters in 10D", openai_api_key="<your-api-key>")
rpl.plot(X, y)
  • Creating an archetype:
archetype = rpl.Archetype.from_verbal_description(
    "seven gamma-distributed clusters in 2D of very different shapes",
    openai_api_key="<your-api-key>"
)
  • Generating data from the archetype:
X, y = archetype.synthesize()
  • Making cluster shapes more irregular:
X_irregular = rpl.distort(X)
X_directional = rpl.wrap_around_sphere(X)
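To give a rough sense of what distortion by a random neural network means, the sketch below pushes points through a small network with randomly drawn weights and a tanh hidden layer. It is a standard-library illustration under assumed choices (one hidden layer of 16 units, Gaussian weights); `make_random_network` and `distort_points` are hypothetical names, and this is not claimed to reproduce what rpl.distort does internally.

```python
import math
import random

def make_random_network(dim, hidden=16, seed=0):
    """Build a random map R^dim -> R^dim with one tanh hidden layer."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
          for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]
          for _ in range(dim)]

    def network(point):
        # Nonlinear hidden layer: this is what bends straight cluster outlines.
        h = [math.tanh(sum(w * c for w, c in zip(row, point)) + b)
             for row, b in zip(w1, b1)]
        # Linear output layer back to the original dimension.
        return [sum(w * a for w, a in zip(row, h)) for row in w2]

    return network

def distort_points(points, seed=0):
    """Apply the same random network to every point."""
    network = make_random_network(len(points[0]), seed=seed)
    return [network(p) for p in points]

points = [(0.0, 0.0), (1.0, 0.5), (-2.0, 3.0)]
print(distort_points(points, seed=42))
```

Because every point passes through the same smooth map, nearby points stay nearby and cluster membership is unchanged, but convex blobs can come out curved or stretched.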

Documentation

  • User Guide: Learn how to generate datasets from high-level descriptions in the User Guide.
  • Reference: Explore the package Reference.

Citation

To reference repliclust in your work, please cite:

@article{Zellinger:2023,
  title   = {High-Level Synthetic Data Generation with Data Set Archetypes},
  author  = {Zellinger, Michael J and B{\"u}hlmann, Peter},
  journal = {arXiv preprint arXiv:2303.14301},
  doi     = {10.48550/arXiv.2303.14301},
  year    = {2023}
}
