
A library for clusterlet induction, that is, sets of fair clusters.


Clusterlets

clusterlets is a Python library collecting algorithms for fair clustering, mainly aimed at clusterlet-based approaches. Like fair coresets and fairlets, a clusterlet defines a clustering of the data that respects some notion of fairness or balance. Under some assumptions, clusterlets (or their centroids) can themselves be clustered to achieve a fair clustering in which each cluster approximately follows the original label distribution.
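To make "approximately follows the original label distribution" concrete, here is a minimal sketch that compares the label shares of one cluster against those of the whole dataset. The function names and the max-absolute-difference measure are illustrative assumptions, not part of the clusterlets API.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def balance_gap(cluster_labels, dataset_labels):
    """Largest absolute difference between cluster and dataset label shares.

    Illustrative measure only (an assumption, not the library's definition).
    """
    reference = label_distribution(dataset_labels)
    observed = label_distribution(cluster_labels)
    return max(
        abs(observed.get(label, 0.0) - share)
        for label, share in reference.items()
    )


dataset_labels = [0] * 6 + [1] * 4   # 60% label 0, 40% label 1
fair_cluster = [0, 0, 0, 1, 1]       # same 60/40 mix as the dataset
skewed_cluster = [0, 0, 0, 0, 0]     # only label 0

print(balance_gap(fair_cluster, dataset_labels))
print(balance_gap(skewed_cluster, dataset_labels))
```

A gap of zero means the cluster reproduces the dataset's label mix exactly; larger values mean the cluster is further from balanced.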


Quickstart

Installation

Developed on Python 3.11.

pip install clusterlets

Getting started

import numpy

from clusterlets.extractors import RandomExtractor


# generate some data
data = numpy.random.rand(1000, 5)
labels = numpy.random.choice([0, 1], 1000, replace=True)

# creates a clusterlet extractor
extractor = RandomExtractor(random_state=42)
# extracts clusterlets, assigning a clusterlet to each data point
extracted_clusterlets = extractor.extract(data, labels, size_per_label="auto")

# you can access the data of each clusterlet!
for clusterlet in extracted_clusterlets:
    print(data[clusterlet.index])

Clusterlets. The Clusterlet class implements a clusterlet, which is defined by:

  • _id: int, an id that identifies the clusterlet
  • label_frequecies: Optional[numpy.ndarray], the label frequencies associated with the clusterlet
  • centroid: Optional[numpy.ndarray], a centroid
  • index: Optional[numpy.ndarray], indicating which instances of the starting data compose the clusterlet. It is stored in place of the data itself to keep the object light

Since clusterlets are extracted from a dataset, data[clusterlet.index] yields the data of the clusterlet. Clusterlets support == and hash, so they can be collected into a set[Clusterlet] and used as dictionary keys.
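The following sketch shows what hashability allows in practice. ToyClusterlet is a hypothetical stand-in written for this example, not the library's Clusterlet class, and comparing on _id alone is an assumption about how equality might be defined.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(eq=False)
class ToyClusterlet:
    """Hypothetical stand-in for Clusterlet; not the library class."""

    _id: int
    index: Optional[Sequence[int]] = None  # positions of member rows in the dataset

    # Assumption: two clusterlets are equal when their ids match.
    def __eq__(self, other):
        return isinstance(other, ToyClusterlet) and self._id == other._id

    def __hash__(self):
        return hash(self._id)


data = ["a", "b", "c", "d"]
first = ToyClusterlet(0, index=[0, 2])
duplicate = ToyClusterlet(0, index=[0, 2])
second = ToyClusterlet(1, index=[1, 3])

# Duplicates collapse when clusterlets are collected into a set.
unique = {first, duplicate, second}

# Clusterlets can key a dictionary, here mapping each one to its size.
sizes = {clusterlet: len(clusterlet.index) for clusterlet in unique}

# With a plain list the members are gathered one by one; with a numpy
# array, data[first.index] does the same in a single indexing step.
members = [data[i] for i in first.index]
print(len(unique), members)
```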

Extractors. Clusterlets are extracted by extractors implementing the ClusterletExtractor interface (extractors.*), which exposes the extract(data, labels) method.

  • RandomExtractor selects random subsets of each label, then pairs them to satisfy dataset balance. The number of samples per label in each clusterlet can also be set through the size_per_label parameter dictionary.
  • KMeansExtractor clusters each label separately (through K-Means), creating label-specific clusterlets. It then matches these clusterlets to achieve both clustering and balance.
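The idea of drawing random per-label subsets and combining them into balanced groups can be sketched in plain Python. This is not the RandomExtractor implementation: the function name, the n_groups parameter, and the round-robin dealing are assumptions chosen to keep the example short.

```python
import random
from collections import Counter, defaultdict


def random_balanced_groups(labels, n_groups, seed=0):
    """Split row positions into n_groups groups that mirror the label mix.

    Illustrative only; not how the library's RandomExtractor is written.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    groups = [[] for _ in range(n_groups)]
    for positions in by_label.values():
        rng.shuffle(positions)
        # Deal the shuffled positions of this label round-robin, so every
        # group receives an (almost) equal share of the label.
        for rank, position in enumerate(positions):
            groups[rank % n_groups].append(position)
    return groups


labels = [0] * 60 + [1] * 40
groups = random_balanced_groups(labels, n_groups=10)
for group in groups[:3]:
    print(Counter(labels[i] for i in group))
```

With 60 rows of one label and 40 of the other, each of the ten groups ends up with six and four rows respectively, matching the dataset's 60/40 split.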

The KMeansExtractor is an implementation of the ClusteringExtractor interface, which can be adapted to any clustering algorithm by overriding the cluster(data) method.
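The override pattern described above can be illustrated with a stand-in base class. Everything here is hypothetical: ToyClusteringExtractor is not the library's ClusteringExtractor, and the contract assumed for cluster(data) (one cluster id per row) is a guess made for the sake of the example.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ToyClusteringExtractor(ABC):
    """Hypothetical stand-in for the ClusteringExtractor interface."""

    @abstractmethod
    def cluster(self, data):
        """Return one cluster id per row of `data` (assumed contract)."""

    def extract(self, data, labels):
        # Group row positions by label, then cluster each label on its own.
        by_label = defaultdict(list)
        for position, label in enumerate(labels):
            by_label[label].append(position)

        groups = {}
        for label, positions in by_label.items():
            assignments = self.cluster([data[p] for p in positions])
            for position, cluster_id in zip(positions, assignments):
                groups.setdefault((label, cluster_id), []).append(position)
        return groups


class ThresholdExtractor(ToyClusteringExtractor):
    """Swaps in a trivial rule: split rows on the sign of the first feature."""

    def cluster(self, data):
        return [0 if row[0] < 0 else 1 for row in data]


data = [(-1.0, 0.2), (2.0, 0.1), (-3.0, 0.5), (4.0, 0.9)]
labels = ["a", "a", "b", "b"]
groups = ThresholdExtractor().extract(data, labels)
print(groups)
```

Only cluster needs to change to plug in a different algorithm; the per-label grouping in extract stays the same.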

Matchers. Matchers (extractors.matches.*) are objects which "match" existing clusterlets, creating larger ones, i.e., they cluster clusterlets. A Matcher implements a match(clusterlets, **kwargs) method, which is given a list of clusterings (one per label), and a desired label balance to achieve. Currently, we implement:

  • PinballMatcher, which provides matches by hopping hops times through two sets of clusterlets of different labels, each hop following the clusterlet of opposite label at minimum distance.
  • GreedyPinballMatcher, which greedily matches clusterlets maximizing some given objective:
    • GreedyBalanceMatcher maximizes label balance
    • GreedyDPbMatcher maximizes clusterlet distance
  • CentroidMatcher, which creates a set of candidate partitions of the set of clusterlets, then scores them for balance and compactness. Note: only a subsample of size sample_size is tested due to the superexponential number of possible partitions.
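To see why candidate partitions have to be subsampled, note that the number of ways to partition a set of n items is the n-th Bell number. The sketch below computes Bell numbers with the Bell triangle; it is a standalone illustration and does not use or reproduce any code from the library.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)] computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        bells.append(new_row[0])
        row = new_row
    return bells


for n, count in enumerate(bell_numbers(10)):
    print(n, count)
```

Ten clusterlets already admit 115975 partitions, so exhaustively scoring every candidate quickly becomes impractical as the number of clusterlets grows.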
