Compute the mutual information between two clusterings of the same objects

clustering-mi

Computing mutual information between clusterings

Maximilian Jerdee, Alec Kirkley, and Mark Newman

A Python package to compute the mutual information between two clusterings of the same set of objects. This implementation includes a number of variations and normalizations of the mutual information.

In particular, it implements the reduced mutual information (RMI) described in Jerdee, Kirkley, and Newman (2024), which corrects the usual measure's bias towards labelings with too many groups. It also includes the asymmetric normalization of Jerdee, Kirkley, and Newman (2023), which removes the bias of the typical symmetric normalization.

Installation

clustering-mi may be installed through pip:

pip install clustering-mi

or be built locally by cloning this repository and running

pip install .

in the base directory.

Typical usage

Once installed, the package can be imported as

import clustering_mi

Note that the module name uses an underscore (clustering_mi), not the hyphen of the package name (clustering-mi).

We can load two labelings in a number of ways; the names of the groups are irrelevant:

# As arrays:
labels1 = ["red", "red", "red", "blue", "blue", "blue", "green", "green"]
labels2 = [1, 1, 1, 1, 2, 2, 2, 2]

# As a contingency table, i.e. a matrix that counts label co-occurrences.
# Columns are the first labeling, rows are the second labeling:
contingency_table = [[3, 1, 0], [0, 2, 2]]

# Or as a space-separated file:
"""
red 1
red 1
red 1
blue 1
blue 2
blue 2
green 2
green 2
"""
filename = "data/example.txt"
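To make the relationship between the three input formats concrete, here is a minimal sketch in plain Python that parses the space-separated text and rebuilds the contingency table shown above. The helper names parse_label_pairs and build_contingency_table are hypothetical and are not part of clustering_mi; the package's own loader may behave differently (for example in how it orders groups).

```python
from collections import Counter


def parse_label_pairs(text):
    """Parse lines of 'label1 label2' into two parallel lists of string labels."""
    first, second = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        first.append(parts[0])
        second.append(parts[1])
    return first, second


def build_contingency_table(labels1, labels2):
    """Count label co-occurrences; rows follow labels2, columns follow labels1."""
    if len(labels1) != len(labels2):
        raise ValueError("both labelings must cover the same objects")
    # dict.fromkeys keeps groups in order of first appearance (an assumption
    # made here for readability; group order does not affect the counts).
    columns = list(dict.fromkeys(labels1))
    rows = list(dict.fromkeys(labels2))
    counts = Counter(zip(labels2, labels1))
    return [[counts[(r, c)] for c in columns] for r in rows]


example_text = """
red 1
red 1
red 1
blue 1
blue 2
blue 2
green 2
green 2
"""

file_labels1, file_labels2 = parse_label_pairs(example_text)
table = build_contingency_table(file_labels1, file_labels2)
print(table)  # [[3, 1, 0], [0, 2, 2]]
```

Because only co-occurrence counts matter, renaming the groups (for example "red" to "a") leaves the table unchanged.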

We then use the package to compute the mutual information (in bits) between the two labelings from any of these formats:

mutual_information = clustering_mi.mutual_information(
    labels1, labels2
)  # Defaults to the reduced mutual information (RMI)
mutual_information = clustering_mi.mutual_information(
    contingency_table
)  # Reads the contingency table
mutual_information = clustering_mi.mutual_information(filename)  # Reads the file

print(f"Mutual Information: {mutual_information:.3f} (bits)")

# Other variants of the mutual information can be computed by specifying the variation parameter.
adjusted_mutual_information = clustering_mi.mutual_information(
    labels1, labels2, variation="adjusted"
)  # Correcting for chance
simple_mutual_information = clustering_mi.mutual_information(
    labels1, labels2, variation="traditional"
)  # Traditional mutual information
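For intuition about what the uncorrected measure captures, the following sketch computes the textbook Shannon mutual information, in bits per object, directly from a contingency table. The function shannon_mutual_information is a hypothetical helper written for illustration; it is not the package's implementation, and the value returned by variation="traditional" may follow a different convention.

```python
from math import log2


def shannon_mutual_information(table):
    """Shannon mutual information (bits per object) of a contingency table."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty cells contribute nothing
            # p_ij * log2(p_ij / (p_i * p_j)), written in terms of counts
            mi += (count / n) * log2(count * n / (row_sums[i] * col_sums[j]))
    return mi


example_table = [[3, 1, 0], [0, 2, 2]]
mi_bits = shannon_mutual_information(example_table)
print(f"Shannon mutual information: {mi_bits:.3f} bits")  # 0.656 bits
```

The reduced and adjusted variants subtract correction terms from a quantity of this kind, so they are generally smaller than this plain value.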

We can also compute the normalized mutual information (NMI) between the two labelings, a measure bounded above by 1, a value attained when the two labelings are identical. Depending on the application, a symmetric or asymmetric normalization may be appropriate.

# Symmetric normalization
normalized_mutual_information = clustering_mi.normalized_mutual_information(
    labels1, labels2, normalization="mean"
)
normalized_traditional_mutual_information = clustering_mi.normalized_mutual_information(
    labels1, labels2, variation="traditional", normalization="mean"
)

print(
    f"(symmetric) Normalized Mutual Information (labels1 <-> labels2): {normalized_mutual_information:.3f}"
)

# Asymmetric normalization: measures how much the first labeling tells us about the second,
# as a fraction of all there is to know about the second labeling.
# This form is appropriate when the second labeling is a "ground truth" and the first is a prediction.
asymmetric_normalized_mutual_information_1_2 = (
    clustering_mi.normalized_mutual_information(
        labels1, labels2, normalization="second"
    )
)
asymmetric_normalized_mutual_information_2_1 = (
    clustering_mi.normalized_mutual_information(labels1, labels2, normalization="first")
)

print(
    f"(asymmetric) Normalized Mutual Information (labels1 -> labels2): {asymmetric_normalized_mutual_information_1_2:.3f}"
)
print(
    f"(asymmetric) Normalized Mutual Information (labels2 -> labels1): {asymmetric_normalized_mutual_information_2_1:.3f}"
)
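The difference between the symmetric and asymmetric normalizations can be sketched with the plain Shannon quantities: the mutual information is divided either by the mean of the two labeling entropies or by the entropy of one labeling alone. The helper normalized_shannon_mi below is hypothetical; its normalization strings mirror the ones used above, but it normalizes the uncorrected Shannon measure, not the reduced mutual information that the package uses by default, so its values will generally differ from the package's output.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a list of group sizes."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def normalized_shannon_mi(table, normalization="mean"):
    """Normalize Shannon mutual information; rows = second labeling, columns = first."""
    h_first = entropy([sum(col) for col in zip(*table)])
    h_second = entropy([sum(row) for row in table])
    h_joint = entropy([count for row in table for count in row])
    mi = h_first + h_second - h_joint  # I = H(first) + H(second) - H(first, second)
    denominators = {
        "mean": (h_first + h_second) / 2,  # symmetric
        "first": h_first,  # fraction of the first labeling explained
        "second": h_second,  # fraction of the second labeling explained
    }
    return mi / denominators[normalization]


example_table = [[3, 1, 0], [0, 2, 2]]
nmi_mean = normalized_shannon_mi(example_table, "mean")
nmi_first = normalized_shannon_mi(example_table, "first")
nmi_second = normalized_shannon_mi(example_table, "second")
print(f"mean: {nmi_mean:.3f}, first: {nmi_first:.3f}, second: {nmi_second:.3f}")
# mean: 0.512, first: 0.420, second: 0.656
```

In this example the two-group labeling has less entropy than the three-group one, so normalizing by the second labeling gives a higher score than normalizing by the first; the symmetric value lies between the two.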

Further usage examples can be found in the examples directory of the repository and in the package documentation.
