Skip to main content

An implementation of hierarchical minibatch kmeans

Project description

An implementation of hierarchical kmeans that uses mini-batches for increased efficiency for large datasets. # Usage Have the .npy files all in one root feature directory to do kmeans over (they can be in subdirectories). Here is the usage:

$ python3 -m hkmeans_minibatch -h
usage: __main__.py [-h] -r ROOT_FEATURE_PATH -p FEATURES_PREFIX [-b BATCH_SIZE] -s SAVE_DIR -c CENTROID_DIR -hr HIERARCHIES -k CLUSTERS [-e EPOCHS]
optional arguments:
  -h, --help            show this help message and exit
  -r ROOT_FEATURE_PATH, --root-feature_path ROOT_FEATURE_PATH
                        path to folder containing all the feature files
  -p FEATURES_PREFIX, --features-prefix FEATURES_PREFIX
                        prefix that contains the desired files to read
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch_size to use for the minibatch kmeans
  -s SAVE_DIR, --save-dir SAVE_DIR
                        save directory for sorted hierarchical kmeans vectors
  -c CENTROID_DIR, --centroid-dir CENTROID_DIR
                        directory to save the centroids in
  -hr HIERARCHIES, --hierarchies HIERARCHIES
                        number of hierarchies to run the kmeans on
  -k CLUSTERS, --clusters CLUSTERS
                        number of clusters for each part of the hierarchy
  -e EPOCHS, --epochs EPOCHS
                        number of epochs to run the kmeans for each hierarchy

For optimal results have the batch size be larger than the number of vectors in each .npy file. The features prefix is the common prefix of the .npy files to kmeans over. The save directory should be an empty directory, which the program will fill with sorted vectors and delete after it is finished. The centroid directory should be an empty directory where all the centroids will be stored. Note that the centroids will be stored in separate files in the centroid directory.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hkmeans_minibatch-1.0.0.tar.gz (3.3 kB view hashes)

Uploaded Source

Built Distribution

hkmeans_minibatch-1.0.0-py2-none-any.whl (8.0 kB view hashes)

Uploaded Python 2

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page