
Scalable and parallel programming implementation of Affinity Propagation clustering

Project description

Overview

A scalable and concurrent programming implementation of Affinity Propagation clustering.

Affinity Propagation is a clustering algorithm based on passing messages between data-points.
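
For readers unfamiliar with the message-passing scheme, the sketch below spells out one textbook round of the responsibility and availability updates of Frey and Dueck (see the reference at the bottom of this page) as a plain, dense NumPy function. It is an illustration only, not the HDF5-backed, multi-process code that Concurrent_AP actually runs:

import numpy as np

def ap_iteration(S, R, A, damping=0.5):
    """One textbook round of Affinity Propagation message passing on dense,
    in-memory float arrays (for illustration only)."""
    N = S.shape[0]
    rows = np.arange(N)
    # Responsibilities: r(i, k) <- s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
    AS = A + S
    idx = np.argmax(AS, axis=1)
    first_max = AS[rows, idx].copy()
    AS[rows, idx] = -np.inf
    second_max = AS.max(axis=1)
    R_new = S - first_max[:, np.newaxis]
    R_new[rows, idx] = S[rows, idx] - second_max
    R = damping * R + (1.0 - damping) * R_new
    # Availabilities: a(i, k) <- min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
    #                 a(k, k) <- sum_{i' != k} max(0, r(i', k))
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())
    A_new = Rp.sum(axis=0)[np.newaxis, :] - Rp
    diag_A = A_new.diagonal().copy()
    A_new = np.minimum(A_new, 0)
    np.fill_diagonal(A_new, diag_A)
    A = damping * A + (1.0 - damping) * A_new
    return R, A

Once the messages have settled, the data-points i for which a(i, i) + r(i, i) > 0 are taken as exemplars, and every other point is assigned to the exemplar it is most similar to.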

Storing and updating the matrices of ‘similarities’, ‘responsibilities’ and ‘availabilities’ between samples can be memory-intensive. We address this issue through the use of an HDF5 data structure, allowing Affinity Propagation clustering of arbitrarily large data-sets, where other Python implementations would raise a MemoryError on most machines.
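
As a sketch of the general idea, PyTables can allocate the N x N message matrices as chunked, disk-backed arrays; the file layout and node names below are hypothetical and not necessarily those used internally by Concurrent_AP:

import tables

n_samples = 20000
h5 = tables.open_file('AP_matrices.h5', mode='w')
atom = tables.Float32Atom()
# Chunked, disk-backed N x N arrays for the three kinds of messages
S = h5.create_carray(h5.root, 'similarities', atom, (n_samples, n_samples))
R = h5.create_carray(h5.root, 'responsibilities', atom, (n_samples, n_samples))
A = h5.create_carray(h5.root, 'availabilities', atom, (n_samples, n_samples))
# Rows are then read and updated slice by slice, e.g. S[i0:i1, :],
# so only the slice currently being processed has to fit in memory.
h5.close()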

We also significantly speed up the computations by splitting them up across subprocesses, thereby taking full advantage of the resources of multi-core processors and bypassing the Global Interpreter Lock of the standard Python interpreter, CPython.
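
The snippet below is a toy sketch of that pattern rather than Concurrent_AP's actual task decomposition: each worker process computes one horizontal slice of a (negative squared Euclidean) similarity matrix, so the heavy NumPy work runs in separate processes and is not serialized by the GIL:

import multiprocessing
import numpy as np

def similarity_slice(args):
    # Toy worker: -||x_i - x_j||^2 for one horizontal slice of the similarity
    # matrix; in a disk-backed setting the slice would be written to HDF5.
    data, start, stop = args
    chunk = data[start:stop]
    return start, -((chunk[:, np.newaxis, :] - data[np.newaxis, :, :]) ** 2).sum(axis=-1)

if __name__ == '__main__':
    data = np.random.rand(1000, 4)
    n_procs = multiprocessing.cpu_count()
    edges = np.linspace(0, data.shape[0], n_procs + 1).astype(int)
    jobs = [(data, lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    pool = multiprocessing.Pool(n_procs)   # one worker per core, each with its own interpreter
    slices = pool.map(similarity_slice, jobs)
    pool.close()
    pool.join()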

Installation and Requirements

Concurrent_AP requires Python 2.7, a few modules from the Python Standard Library, and the following packages:

  • NumPy >= 1.9
  • psutil
  • PyTables
  • scikit-learn
  • setuptools

Once the required dependencies are installed, you can install Concurrent_AP from the official Python Package Index (PyPI) as follows:

  • open a terminal window;
  • type in the command: pip install Concurrent_AP

Usage and Command Line Options

See the docstrings associated with each function of the Concurrent_AP module for more information and for an understanding of how the different tasks are organized and shared among subprocesses.

Usage: Concurrent_AP [options] file_name, where file_name denotes the path to the file holding the data to be processed by Affinity Propagation clustering. An illustrative invocation follows the list of options below.

  • -c or --convergence: specify the number of iterations without change in the number of clusters that signals convergence (defaults to 15);

  • -d or --damping: the damping parameter of Affinity Propagation (defaults to 0.5);

  • -f or --file: the file name or file handle of the hierarchical data format (HDF5) file where the matrices involved in Affinity Propagation clustering will be stored (defaults to a temporary file);

  • -i or --iterations: maximum number of message-passing iterations (defaults to 200);

  • -m or --multiprocessing: the number of processes to use;

  • -p or --preference: the preference parameter of Affinity Propagation (if not specified, will be determined as the median of the distribution of pairwise L2 Euclidean distances between samples);

  • -s or --similarities: indicate that a similarity matrix has been pre-computed and stored in the HDF5 data structure accessible at the location specified through the command-line option -f or --file (see above);

  • -v or --verbose: turn on verbose output.
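
As an illustration of how these options combine, the default preference described under -p or --preference (the median of the pairwise L2 Euclidean distances) can be reproduced by hand and then passed explicitly; the file my_data.txt below is a hypothetical whitespace-separated data file, not one shipped with the package:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import euclidean_distances

>>> data = np.loadtxt('./my_data.txt')
>>> D = euclidean_distances(data)
>>> preference = np.median(D[np.triu_indices_from(D, k=1)])

The resulting value can then be supplied on the command line, for instance: Concurrent_AP --preference <value> --multiprocessing 4 my_data.txt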

Demo of Concurrent_AP

The following few lines illustrate the use of Concurrent_AP on the ‘Iris data-set’ from the UCI Machine Learning Repository. While the number of samples is far too small for the benefits of the present multi-process implementation and of the HDF5 data structure to come fully into play, this data-set has the advantage of being well known and of lending itself to a quick comparison with scikit-learn’s version of Affinity Propagation clustering.

  • In a Python interpreter console, enter the following few lines, whose purpose is to create a file containing the Iris data-set that will later be subjected to Affinity Propagation clustering via Concurrent_AP:

>>> import numpy as np
>>> from sklearn import datasets

>>> iris = datasets.load_iris()
>>> data = iris.data
>>> with open('./iris_data.txt', 'w') as f:
...     np.savetxt(f, data, fmt='%.4f')

  • Open a terminal window.

  • Type in: Concurrent_AP --preference 5.47 --verbose iris_data.txt or simply Concurrent_AP iris_data.txt

The latter will automatically compute a preference parameter from the data-set.

When the rounds of message-passing among data-points have completed, a folder is created in your current working directory; it contains a file of cluster labels and a file of cluster-center indices, both in tab-separated format.
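
To follow up on the comparison with scikit-learn mentioned above, the labels written by Concurrent_AP can be checked against those of sklearn.cluster.AffinityPropagation, for instance with the adjusted Rand index. The path to the labels file below is a placeholder for the file created in your working directory, and the two implementations may use different preference conventions, so the partitions are not guaranteed to coincide exactly:

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.cluster import AffinityPropagation
>>> from sklearn.metrics import adjusted_rand_score

>>> data = datasets.load_iris().data
>>> sklearn_labels = AffinityPropagation(damping=0.5).fit(data).labels_
>>> concurrent_ap_labels = np.loadtxt('<path to the cluster labels file>', dtype=int)
>>> adjusted_rand_score(sklearn_labels, concurrent_ap_labels)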

References

Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science 315 (5814), pp. 972-976, Feb. 2007
