Generate synthetic data with clusters.

██████  ███████ ██████  ██      ██  ██████ ██      ██    ██ ███████ ████████ 
██   ██ ██      ██   ██ ██      ██ ██      ██      ██    ██ ██         ██    
██████  █████   ██████  ██      ██ ██      ██      ██    ██ ███████    ██    
██   ██ ██      ██      ██      ██ ██      ██      ██    ██      ██    ██    
██   ██ ███████ ██      ███████ ██  ██████ ███████  ██████  ███████    ██    

Description

repliclust is a Python package for generating synthetic datasets with clusters. It lets you generate many different datasets that are geometrically similar without ever touching low-level parameters such as cluster centroids or covariance matrices.

Features

  • Reproducibly generate clusters with defined geometric characteristics
  • Manage cluster overlaps, shapes, and probability distributions through intuitive, high-level controls
  • Define custom dataset archetypes to power reproducible and informative benchmarks

Installation

pip install repliclust

Quickstart

from repliclust import Archetype, DataGenerator

# Create archetype for 5 oblong clusters with typical "aspect ratio" of 3
oblong_clusters = Archetype(n_clusters=5, dim=2, n_samples=500,
                            aspect_ref=3, aspect_maxmin=1.5,
                            name="oblong")
# Define the data generator
data_generator = DataGenerator(archetype=oblong_clusters)

# Sample data points X and class labels y
X, y, _ = data_generator.synthesize()
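
To make the "aspect ratio" in the quickstart more concrete, the sketch below builds a single oblong 2D Gaussian cluster by hand using only the Python standard library. It is not repliclust's implementation. It assumes that aspect ratio means the ratio of the longest to the shortest cluster axis, and the helper names (sample_oblong_cluster, axis_std) are hypothetical. It illustrates the kind of low-level bookkeeping (centroid, axis lengths, orientation) that an archetype is meant to spare you.

```python
import math
import random


def sample_oblong_cluster(n_points, center, aspect_ratio, angle, scale=1.0, seed=None):
    """Sample a 2D Gaussian cluster whose long axis is `aspect_ratio` times its short axis.

    Hypothetical helper for illustration only; not part of repliclust.
    """
    rng = random.Random(seed)
    # Pick axis standard deviations so that long_sd / short_sd == aspect_ratio
    # while their geometric mean stays equal to `scale`.
    long_sd = scale * math.sqrt(aspect_ratio)
    short_sd = scale / math.sqrt(aspect_ratio)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    points = []
    for _ in range(n_points):
        u = rng.gauss(0.0, long_sd)   # coordinate along the long axis
        v = rng.gauss(0.0, short_sd)  # coordinate along the short axis
        # Rotate by `angle` and shift to the centroid.
        x = center[0] + cos_a * u - sin_a * v
        y = center[1] + sin_a * u + cos_a * v
        points.append((x, y))
    return points


def axis_std(points, angle):
    """Standard deviation of the points projected onto the direction `angle`."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    proj = [x * cos_a + y * sin_a for x, y in points]
    mean = sum(proj) / len(proj)
    return math.sqrt(sum((p - mean) ** 2 for p in proj) / len(proj))


angle = math.pi / 6
cluster = sample_oblong_cluster(5000, center=(2.0, -1.0), aspect_ratio=3.0,
                                angle=angle, seed=0)

# The spread along the long axis divided by the spread along the short axis
# should be close to the requested aspect ratio.
empirical_ratio = axis_std(cluster, angle) / axis_std(cluster, angle + math.pi / 2)
print(f"empirical aspect ratio: {empirical_ratio:.2f}")
```

With several clusters you would also have to choose each centroid, orientation, and scale so that the overlaps come out as intended, which is the part the Archetype parameters in the quickstart describe at a higher level.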

User Guide / Documentation

For a full user guide and documentation, visit the project website: https://repliclust.org.

Citation

To reference repliclust in your work, please cite:

@article{Zellinger:2023,
  title   = {repliclust: Synthetic Data for Cluster Analysis},
  author  = {Zellinger, Michael J and B{\"u}hlmann, Peter},
  journal = {arXiv preprint arXiv:2303.14301},
  doi     = {10.48550/arXiv.2303.14301},
  year    = {2023}
}

Download files

Source Distribution

repliclust-0.0.4.tar.gz (1.8 MB)

Built Distribution

repliclust-0.0.4-py3-none-any.whl (33.9 kB, Python 3)
