A package for running k-means on a Condor cluster

A Condor-powered K-means implementation
---------------------------------------
<p align="center">
<img src="https://github.com/tansey/condor-kmeans/blob/master/test/results.png?raw=true" alt="Example K-means Solution"/>
</p>


This package lets you run K-means on very large datasets of vectors. You can even stream the vectors from disk instead of loading them into memory, as long as you can hold two lists of doubles in memory, each as long as your vector count: one for cluster assignment IDs and one for the distance from each vector to its cluster.
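To make the memory claim concrete, here is a minimal sketch of one streaming assignment pass. It is not the package's actual code: the function name `stream_assign` and the in-memory stand-in for the CSV file are assumptions made for illustration. The point is that only two arrays of doubles grow with the number of vectors, while each vector is read, processed, and discarded.

```python
import csv
import io
import math
from array import array


def stream_assign(rows, centroids):
    """One assignment pass over an iterable of vectors (hypothetical helper).

    Each row is processed and discarded; only the two arrays below
    grow with the number of vectors.
    """
    assignments = array("d")  # cluster ID of each vector, stored as a double
    distances = array("d")    # distance from each vector to its cluster
    for row in rows:
        vec = [float(x) for x in row]
        best_id, best_dist = -1, float("inf")
        for cid, centroid in enumerate(centroids):
            dist = math.dist(vec, centroid)  # Euclidean distance
            if dist < best_dist:
                best_id, best_dist = cid, dist
        assignments.append(best_id)
        distances.append(best_dist)
    return assignments, distances


# Stand-in for a large CSV on disk; csv.reader yields rows lazily.
data = io.StringIO("0.0,0.0\n0.1,0.2\n5.0,5.0\n4.9,5.1\n")
centroids = [[0.0, 0.0], [5.0, 5.0]]
assignments, distances = stream_assign(csv.reader(data), centroids)
print(list(assignments))
```

With a real file, passing `csv.reader(open(path, newline=""))` in place of the `StringIO` object would keep the same row-at-a-time behavior.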

## Installation

Installation is available via `pip`:

```
pip install condor-kmeans
```

## Usage

The package assumes you have a CSV file of vectors which you wish to cluster, with one vector per row. Once installed, you can simply run the `kmeans` command:

```
kmeans path/to/mydata.csv path/to/save/centroids.csv path/to/save/assignments.csv --num_clusters 30 --plusplus --stream --condor --condor_workers 100 --condor_username myusername
```

The above command clusters the vectors stored in `mydata.csv` into 30 clusters on Condor, with no more than 100 jobs running at a time. It saves the resulting cluster centroids to `centroids.csv` and the vector-to-cluster assignments to `assignments.csv`. The `--plusplus` flag selects k-means++ initialization, and `--stream` streams `mydata.csv` from disk instead of loading it all into memory.
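For readers unfamiliar with k-means++ initialization, the following is a small sketch of the standard seeding procedure: the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function names here are hypothetical and this is not necessarily how the package implements `--plusplus`.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plusplus_init(vectors, k, seed=None):
    """Pick k initial centroids using k-means++ seeding (illustrative sketch)."""
    rng = random.Random(seed)
    # First centroid: uniform random choice.
    centroids = [list(rng.choice(vectors))]
    # Squared distance from each vector to its nearest chosen centroid.
    d2 = [sq_dist(v, centroids[0]) for v in vectors]
    while len(centroids) < k:
        if sum(d2) > 0:
            # Sample the next centroid with probability proportional to d2.
            idx = rng.choices(range(len(vectors)), weights=d2, k=1)[0]
        else:
            # Every vector coincides with a centroid; fall back to uniform.
            idx = rng.randrange(len(vectors))
        centroids.append(list(vectors[idx]))
        # Update nearest-centroid distances with the new centroid.
        d2 = [min(old, sq_dist(v, centroids[-1])) for old, v in zip(d2, vectors)]
    return centroids


vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.9, 5.1]]
centroids = kmeans_plusplus_init(vectors, k=2, seed=0)
print(centroids)
```

Because an already-chosen vector has weight zero, it cannot be drawn again unless the data contains duplicates, which tends to spread the initial centroids across the dataset.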

The current directory is used as the working directory. A working subdirectory named `condor` will be created. All temporary worker files are deleted after each batch of jobs is finished successfully, though the directory structure is maintained (feel free to just `rm -rf condor` afterward if you wish). If one of the workers fails, the master will throw an exception and alert you to the job that failed and where to find its output files; the temporary files will not be deleted if a worker fails.


