
A package for generating synthetic clusters, with parameters to customize different aspects of the complexity of the cluster structure



HAWKS is a tool for generating controllably difficult synthetic data, used primarily for clustering.

This repo is associated with the following paper:

  1. Shand, C., Allmendinger, R., Handl, J., Webb, A., & Keane, J. (2019, July). Evolving controllably difficult datasets for clustering. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 463-471). https://doi.org/10.1145/3321707.3321761 (Nominated for best paper on the evolutionary machine learning track at GECCO’19)

The academic/technical details can be found there. What follows here is a practical guide to using HAWKS to generate synthetic data.

If you use HAWKS to generate data that forms part of a paper, please cite the paper above and link to this repo.

Installation

HAWKS can be installed from PyPI using pip:

pip install hawks

Alternatively, clone this repo and install it locally using pip install . from the repo root. HAWKS was written for Python 3.6+. Other dependencies are specified in the setup.py file.

Running HAWKS

The parameters of hawks are configured via a config file system. Details of the parameters are found in the documentation. For any parameters that are not specified, default values will be used (as defined in hawks/defaults.json).
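As a rough illustration of how a partial config can sit on top of a set of defaults, the sketch below loads a small JSON config and merges it over a nested dictionary. It is not the HAWKS implementation: merge_config is a hypothetical helper, and the values in EXAMPLE_DEFAULTS are made up for the example (the real ones live in hawks/defaults.json). Only the key names are taken from the example in this README.

```python
import copy
import json
import tempfile
from pathlib import Path

# Hypothetical default values for illustration only; the real defaults
# are defined in hawks/defaults.json and may differ.
EXAMPLE_DEFAULTS = {
    "hawks": {"folder_name": None, "seed_num": None},
    "dataset": {"num_clusters": 10},
    "objectives": {"silhouette": {"target": 0.9}},
}


def merge_config(defaults, overrides):
    """Return a copy of defaults with nested values replaced by overrides."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling keys in the same section are kept
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# A partial config, written to and read back from a JSON file
user_config = {"hawks": {"seed_num": 42}, "dataset": {"num_clusters": 5}}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "my_config.json"
    config_path.write_text(json.dumps(user_config, indent=4))
    loaded = json.loads(config_path.read_text())

full_config = merge_config(EXAMPLE_DEFAULTS, loaded)
print(json.dumps(full_config, indent=4))
```

Only the keys present in the partial config change; everything else keeps its default value, which is the behaviour described above.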

The example below illustrates how to run hawks. Either a dictionary or a path to a JSON config can be provided to override any of the default values. Further examples can be found in the documentation.

"""Single, simple HAWKS run, with KMeans applied to the best dataset
"""
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import hawks

# Set the magic seed number
SEED_NUM = 42
# Set the seed number in the config
config = {
    "hawks": {
        "folder_name": "simple_example",
        "seed_num": SEED_NUM
    },
    "dataset": {
        "num_clusters": 5
    },
    "objectives": {
        "silhouette": {
            "target": 0.9
        }
    }
}
# Any missing parameters are taken from hawks/defaults.json
generator = hawks.create_generator(config)
# Run the generator
generator.run()
# Let's plot the best individual found
generator.plot_best_indivs(show=True)
# Get the best dataset found and its labels
datasets, label_sets = generator.get_best_dataset()
# Stored as a list for multiple runs
data, labels = datasets[0], label_sets[0]
# Run KMeans on the data
km = KMeans(
    n_clusters=len(np.unique(labels)), random_state=SEED_NUM
).fit(data)
# Plot the output of KMeans
hawks.plotting.scatter_prediction(data, km.labels_)
# Get the Adjusted Rand Index for KMeans on the data
ari = adjusted_rand_score(labels, km.labels_)
print(f"ARI: {ari}")
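To make the final score easier to interpret, here is a minimal pure-Python sketch of the Adjusted Rand Index, computed from pair counts. It is for illustration only and is not how scikit-learn or HAWKS implement it; the function name adjusted_rand_index is hypothetical, and math.comb needs Python 3.8 or later. In practice, adjusted_rand_score from scikit-learn (as used above) is the better choice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Compute the Adjusted Rand Index between two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("Both labelings must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Count points per (true cluster, predicted cluster) cell and per cluster
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    # Number of point pairs grouped together in both / in each labeling
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    # Agreement expected by chance, and the maximum attainable agreement
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both labelings place everything in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# The score ignores the label names themselves: swapping 0 and 1 changes nothing
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Splitting one true cluster in two lowers the score below 1
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

A value of 1 means KMeans recovered the generated clusters exactly (up to label names), while values near 0 indicate agreement no better than chance.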

Documentation

For further information about how to use HAWKS, including examples, please see the documentation.

Issues

As this work is still in development, plain sailing is not guaranteed. If you encounter an issue, first ensure that hawks is running as intended by navigating to the tests directory, and running python tests.py. If any test fails, please add details of this alongside your original problem to an issue on the GitHub repo.

Contributing

At present, this is primarily academic work, so future developments will be released here after they have been published. If you have any suggestions or simple feature requests for HAWKS as a tool to use, please raise that on the GitHub repo.

I have several directions in mind for HAWKS but can only work on a subset of them, so involvement from more people would be very welcome. If you would like to extend this work or collaborate, please contact me.
