
Scalable and parallel programming implementation of Affinity Propagation clustering

Project description

Overview

A scalable and concurrent programming implementation of Affinity Propagation clustering.

Affinity Propagation is a clustering algorithm based on passing messages between data-points.
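
For readers unfamiliar with the message-passing scheme, the sketch below spells out one textbook round of the responsibility and availability updates of Frey and Dueck (see the reference at the bottom of this page) as a plain, dense NumPy function. It is an illustration only, not the HDF5-backed, multi-process code that Concurrent_AP actually runs:

import numpy as np

def ap_iteration(S, R, A, damping=0.5):
    """One textbook round of Affinity Propagation message passing on dense,
    in-memory float arrays (for illustration only)."""
    N = S.shape[0]
    rows = np.arange(N)
    # Responsibilities: r(i, k) <- s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
    AS = A + S
    idx = np.argmax(AS, axis=1)
    first_max = AS[rows, idx].copy()
    AS[rows, idx] = -np.inf
    second_max = AS.max(axis=1)
    R_new = S - first_max[:, np.newaxis]
    R_new[rows, idx] = S[rows, idx] - second_max
    R = damping * R + (1.0 - damping) * R_new
    # Availabilities: a(i, k) <- min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
    #                 a(k, k) <- sum_{i' != k} max(0, r(i', k))
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())
    A_new = Rp.sum(axis=0)[np.newaxis, :] - Rp
    diag_A = A_new.diagonal().copy()
    A_new = np.minimum(A_new, 0)
    np.fill_diagonal(A_new, diag_A)
    A = damping * A + (1.0 - damping) * A_new
    return R, A

Once the messages have settled, the data-points i for which a(i, i) + r(i, i) > 0 are taken as exemplars, and every other point is assigned to the exemplar it is most similar to.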

Storing and updating the matrices of ‘similarities’, ‘responsibilities’ and ‘availabilities’ between samples can be memory-intensive. We address this issue through the use of an HDF5 data structure, allowing Affinity Propagation clustering of arbitrarily large data-sets, where other Python implementations would raise a MemoryError on most machines.
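
As a sketch of the general idea, PyTables can allocate the N x N message matrices as chunked, disk-backed arrays; the file layout and node names below are hypothetical and not necessarily those used internally by Concurrent_AP:

import tables

n_samples = 20000
h5 = tables.open_file('AP_matrices.h5', mode='w')
atom = tables.Float32Atom()
# Chunked, disk-backed N x N arrays for the three kinds of messages
S = h5.create_carray(h5.root, 'similarities', atom, (n_samples, n_samples))
R = h5.create_carray(h5.root, 'responsibilities', atom, (n_samples, n_samples))
A = h5.create_carray(h5.root, 'availabilities', atom, (n_samples, n_samples))
# Rows are then read and updated slice by slice, e.g. S[i0:i1, :],
# so only the slice currently being processed has to fit in memory.
h5.close()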

We also significantly speed up the computations by splitting them up across subprocesses, thereby taking full advantage of the resources of multi-core processors and bypassing the Global Interpreter Lock of the standard Python interpreter, CPython.
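
The snippet below is a toy sketch of that pattern rather than Concurrent_AP's actual task decomposition: each worker process computes one horizontal slice of a (negative squared Euclidean) similarity matrix, so the heavy NumPy work runs in separate processes and is not serialized by the GIL:

import multiprocessing
import numpy as np

def similarity_slice(args):
    # Toy worker: -||x_i - x_j||^2 for one horizontal slice of the similarity
    # matrix; in a disk-backed setting the slice would be written to HDF5.
    data, start, stop = args
    chunk = data[start:stop]
    return start, -((chunk[:, np.newaxis, :] - data[np.newaxis, :, :]) ** 2).sum(axis=-1)

if __name__ == '__main__':
    data = np.random.rand(1000, 4)
    n_procs = multiprocessing.cpu_count()
    edges = np.linspace(0, data.shape[0], n_procs + 1).astype(int)
    jobs = [(data, lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    pool = multiprocessing.Pool(n_procs)   # one worker per core, each with its own interpreter
    slices = pool.map(similarity_slice, jobs)
    pool.close()
    pool.join()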

Installation and Requirements

Concurrent_AP requires Python 2.7, a few modules from the Python Standard Library, and the following packages:

  • NumPy >= 1.9
  • psutil
  • PyTables
  • scikit-learn
  • setuptools

Once the required dependencies are installed, you can install Concurrent_AP from the official Python Package Index (PyPI) as follows:

  • open a terminal window;
  • type in the command: pip install Concurrent_AP

Usage and Command Line Options

See the docstrings associated with each function of the Concurrent_AP module for more information and for an understanding of how the different tasks are organized and shared among subprocesses.

Usage: Concurrent_AP [options] file_name, where file_name denotes the path to the file holding the data to be processed by Affinity Propagation clustering. An illustrative invocation follows the list of options below.

  • -c or --convergence: specify the number of iterations without change in the number of clusters that signals convergence (defaults to 15);

  • -d or --damping: the damping parameter of Affinity Propagation (defaults to 0.5);

  • -f or --file: the file name or file handle of the hierarchical data format (HDF5) file where the matrices involved in Affinity Propagation clustering will be stored (defaults to a temporary file);

  • -i or --iterations: maximum number of message-passing iterations (defaults to 200);

  • -m or --multiprocessing: the number of processes to use;

  • -p or --preference: the preference parameter of Affinity Propagation (if not specified, will be determined as the median of the distribution of pairwise L2 Euclidean distances between samples);

  • -s or --similarities: indicate that a similarity matrix has been pre-computed and stored in the HDF5 data structure accessible at the location specified through the command-line option -f or --file (see above);

  • -v or --verbose: turn on verbose output.
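
As an illustration of how these options combine, the default preference described under -p or --preference (the median of the pairwise L2 Euclidean distances) can be reproduced by hand and then passed explicitly; the file my_data.txt below is a hypothetical whitespace-separated data file, not one shipped with the package:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import euclidean_distances

>>> data = np.loadtxt('./my_data.txt')
>>> D = euclidean_distances(data)
>>> preference = np.median(D[np.triu_indices_from(D, k=1)])

The resulting value can then be supplied on the command line, for instance: Concurrent_AP --preference <value> --multiprocessing 4 my_data.txt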

Demo of Concurrent_AP

The following few lines illustrate the use of Concurrent_AP on the ‘Iris data-set’ from the UCI Machine Learning Repository. While the number of samples is far too small for the benefits of the present multi-process implementation and of the HDF5 data structure to come fully into play, this data-set has the advantage of being well known and of lending itself to a quick comparison with scikit-learn’s version of Affinity Propagation clustering.

  • In a Python interpreter console, enter the following few lines, whose purpose is to create a file containing the Iris data-set that will later be subjected to Affinity Propagation clustering via Concurrent_AP:

>>> import numpy as np
>>> from sklearn import datasets

>>> iris = datasets.load_iris()
>>> data = iris.data
>>> with open('./iris_data.txt', 'w') as f:
...     np.savetxt(f, data, fmt='%.4f')

  • Open a terminal window.

  • Type in: Concurrent_AP --preference 5.47 --verbose iris_data.txt or simply Concurrent_AP iris_data.txt

The latter will automatically compute a preference parameter from the data-set.

When the rounds of message-passing among data-points have completed, a folder is created in your current working directory; it contains a file of cluster labels and a file of cluster-center indices, both in tab-separated format.
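
To follow up on the comparison with scikit-learn mentioned above, the labels written by Concurrent_AP can be checked against those of sklearn.cluster.AffinityPropagation, for instance with the adjusted Rand index. The path to the labels file below is a placeholder for the file created in your working directory, and the two implementations may use different preference conventions, so the partitions are not guaranteed to coincide exactly:

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.cluster import AffinityPropagation
>>> from sklearn.metrics import adjusted_rand_score

>>> data = datasets.load_iris().data
>>> sklearn_labels = AffinityPropagation(damping=0.5).fit(data).labels_
>>> concurrent_ap_labels = np.loadtxt('<path to the cluster labels file>', dtype=int)
>>> adjusted_rand_score(sklearn_labels, concurrent_ap_labels)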

References

Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science 315 (5814), pp. 972-976, Feb. 2007
