A package for generating synthetic clusters, with parameters to customize different aspects of the complexity of the cluster structure

HAWKS Data Generator

HAWKS is a tool for generating controllably difficult synthetic data, used primarily for clustering. This repo is associated with the following paper:

  1. Shand, C., Allmendinger, R., Handl, J., Webb, A. & Keane, J. 2019, Evolving Controllably Difficult Datasets for Clustering. In Proceedings of the Annual Conference on Genetic and Evolutionary Computation (GECCO '19). The Genetic and Evolutionary Computation Conference, Prague, Czech Republic, 13/07/19. https://doi.org/10.1145/3321707.3321761

The academic/technical details can be found there. What follows here is a practical guide to using this tool to generate synthetic data.

If you use this tool to generate data that forms part of a paper, please consider either linking to this work or citing the paper above.

Installation

HAWKS can be installed with pip:

pip install hawks

or by cloning this repo (and installing locally using pip install .).

Running HAWKS

To use HAWKS, import the hawks package. Its parameters are set through a config, and each parameter is described in the user guide. Any parameter you do not specify takes its default value (as defined in hawks/defaults.json).
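As a minimal sketch of what a config file for this system might look like, the snippet below writes a JSON file that sets only the seed and reads it back. It assumes the JSON file mirrors the nested dictionary form used elsewhere in this guide (a "hawks" section containing "seed_num"); the file name my_config.json is arbitrary and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Override only the seed; every other parameter is left to the defaults.
# Assumption: the JSON layout mirrors the dictionary form of the config.
override = {"hawks": {"seed_num": 42}}

# Write the override to a JSON file (hypothetical file name)
config_path = Path(tempfile.mkdtemp()) / "my_config.json"
config_path.write_text(json.dumps(override, indent=4))

# Read it back to confirm the file holds exactly what was written
loaded = json.loads(config_path.read_text())
print(loaded == override)

# With hawks installed, the path could then be given in place of a dictionary.
# Whether a str or a Path object is expected is an assumption to check in the user guide:
# generator = hawks.create_generator(str(config_path))
```

Keeping the override in a file rather than inline makes it easier to record which settings produced a given dataset.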

The example below illustrates how to run hawks. Either a dictionary or a path to a JSON config can be provided to override any of the default values.

from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import hawks

# Fix the seed number
config = {
    "hawks": {
        "seed_num": 42
    }
}
# Any missing parameters take their default values (see defaults.json)
generator = hawks.create_generator(config)
# Run the generator
generator.run()
# Get the best dataset found and its labels
data, labels = generator.get_best_dataset()
# # Plot the best dataset to see how it looks
# generator.plot_best_indiv()
# Run KMeans on the data
km = KMeans(
    n_clusters=len(np.unique(labels)), random_state=42
).fit(data)
# Get the Adjusted Rand Index for KMeans on the data
ari = adjusted_rand_score(labels, km.labels_)
print(f"ARI: {ari}")
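The example above scores KMeans with the Adjusted Rand Index (ARI). To show what that number measures, here is a small pure-Python sketch of the standard ARI formula. It is for illustration only and is not part of hawks; in practice sklearn's adjusted_rand_score is the function to use. The function name adjusted_rand_index is made up for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Illustrative ARI for two equal-length label sequences."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: the partitions trivially agree

    # Count pairs of points that share a cell, a true cluster, or a predicted cluster
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum attainable agreement
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both partitions are all-in-one or all-singletons
    return (sum_cells - expected) / (max_index - expected)


# Same grouping under different label names: perfect agreement
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the ARI depends only on which points are grouped together, the cluster numbers assigned by KMeans do not need to match the label values returned by the generator. An ARI of 1 means the two partitions are identical, and values near 0 mean the agreement is no better than chance.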

User Guide

For a more detailed explanation of the parameters and how to use HAWKS, please read the user guide.

Issues

As this work is still in development, plain sailing is not guaranteed. If you encounter an issue, first check that hawks is running as intended by navigating to the tests directory and running python tests.py. If any test fails, please include the details alongside your original problem in an issue on the GitHub repo.

Feature Requests


