A library for clusterlet induction, that is, sets of fair clusters.

Clusterlets

clusterlets is a Python library collecting algorithms for fair clustering, mainly aimed at clusterlet-based approaches. Like fair coresets and fairlets, a clusterlet defines a clustering of the data that respects some notion of fairness or balance. Under some assumptions, clusterlets (or their centroids) can themselves be clustered to achieve a fair clustering in which each cluster approximately follows the original label distribution.


Quickstart

Installation

Developed on Python 3.11.

pip install clusterlets

Getting started

import numpy

from clusterlets.extractors import RandomExtractor


# generate some data
data = numpy.random.rand(1000, 5)
labels = numpy.random.choice([0, 1], 1000, replace=True)

# creates a clusterlet extractor
extractor = RandomExtractor(random_state=42)
# extracts clusterlets, assigning a clusterlet to each data point
extracted_clusterlets = extractor.extract(data, labels, size_per_label="auto")

# you can access the data of each clusterlet!
for clusterlet in extracted_clusterlets:
    print(data[clusterlet.index])

Clusterlets. The Clusterlet class implements a clusterlet, which is defined by:

  • _id: int, an id that identifies the clusterlet
  • label_frequecies: Optional[numpy.ndarray], the label frequencies of the clusterlet
  • centroid: Optional[numpy.ndarray], a centroid
  • index: Optional[numpy.ndarray], indicating which instances of the starting data compose the clusterlet. It is stored in place of the data itself to keep the object light

Since clusterlets are extracted from a dataset, data[clusterlet.index] yields the data of the clusterlet. Clusterlets support == and hash, so they can be collected into a set[Clusterlet] and used as dictionary keys.
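To illustrate the two properties above (index-based data access and hashability), here is a minimal sketch built on a stand-in class. ClusterletSketch is hypothetical and is not the library's Clusterlet; in particular, comparing and hashing by _id is an assumption, since the actual equality rule is not described here.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(eq=False)
class ClusterletSketch:
    """Hypothetical stand-in for a clusterlet; NOT the library class."""
    _id: int
    index: Optional[np.ndarray] = None
    centroid: Optional[np.ndarray] = None

    # Assumption: two clusterlets are equal when their ids match.
    def __eq__(self, other):
        return isinstance(other, ClusterletSketch) and self._id == other._id

    def __hash__(self):
        return hash(self._id)


rng = np.random.default_rng(0)
data = rng.random((10, 3))

a = ClusterletSketch(_id=0, index=np.array([0, 2, 4]))
b = ClusterletSketch(_id=1, index=np.array([1, 3]))
a_again = ClusterletSketch(_id=0, index=np.array([0, 2, 4]))

# Duplicates collapse in a set because equality and hash agree.
unique = {a, b, a_again}

# Clusterlets as dictionary keys: map each one to the number of rows it holds.
sizes = {c: data[c.index].shape[0] for c in unique}
```

Storing only the index keeps each object small: the rows are looked up in the original array on demand rather than copied into every clusterlet.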

Extractors. Clusterlets are extracted by extractors implementing the ClusterletExtractor interface (extractors.*), which exposes an extract(data, labels) method.

  • RandomExtractor selects random subsets of each label, then pairs them to satisfy dataset balance. A dictionary parameter, size_per_label, can also specify how many samples per label each clusterlet must have.
  • KMeansExtractor clusters each label separately (through K-Means), creating label-specific clusterlets. It then matches these clusterlets to achieve both clustering and balance.
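The idea behind random extraction can be sketched as follows. This is one possible reading, not the RandomExtractor implementation: random_balanced_groups and its parameters are hypothetical names, and it assumes that each label's indices are shuffled, split into equal parts, and the parts paired across labels.

```python
import numpy as np


def random_balanced_groups(labels, n_groups, seed=0):
    """Hypothetical helper: split indices into groups that mirror the label mix."""
    rng = np.random.default_rng(seed)
    groups = [[] for _ in range(n_groups)]
    for label in np.unique(labels):
        # Shuffle the indices of this label, then deal them out in chunks.
        idx = rng.permutation(np.flatnonzero(labels == label))
        for g, chunk in enumerate(np.array_split(idx, n_groups)):
            groups[g].append(chunk)
    # Pair the per-label chunks: group g gets chunk g of every label.
    return [np.concatenate(parts) for parts in groups]


labels = np.array([0] * 60 + [1] * 40)
groups = random_balanced_groups(labels, n_groups=4)

for g in groups:
    frequencies = np.bincount(labels[g], minlength=2) / len(g)
    print(len(g), frequencies)
```

Because every label is split into the same number of parts, each group reproduces the dataset's 60/40 label mix (exactly here, approximately when label counts do not divide evenly).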

The KMeansExtractor is an implementation of the ClusteringExtractor interface, which can be adapted to any clustering algorithm by overriding the cluster(data) method.
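The override pattern described above might look like the sketch below. ClusteringExtractorSketch is a hypothetical stand-in written for this example, not the library's ClusteringExtractor, whose constructor and return types are not documented here. The only element taken from the text is that subclasses supply cluster(data); the per-label loop and the median-split rule are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod

import numpy as np


class ClusteringExtractorSketch(ABC):
    """Hypothetical stand-in for the interface; NOT the library class."""

    @abstractmethod
    def cluster(self, data):
        """Return one integer cluster assignment per row of `data`."""

    def extract_per_label(self, data, labels):
        # Cluster each label separately and keep indices into the full dataset.
        result = {}
        for label in np.unique(labels):
            idx = np.flatnonzero(labels == label)
            assignment = self.cluster(data[idx])
            result[int(label)] = [idx[assignment == c] for c in np.unique(assignment)]
        return result


class MedianSplitExtractor(ClusteringExtractorSketch):
    """Toy clustering rule: split rows on the median of the first feature."""

    def cluster(self, data):
        return (data[:, 0] > np.median(data[:, 0])).astype(int)


rng = np.random.default_rng(0)
data = rng.random((100, 5))
labels = rng.integers(0, 2, size=100)

per_label = MedianSplitExtractor().extract_per_label(data, labels)
```

Any clustering routine that returns one assignment per row could replace the median split without touching the per-label loop.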

Matchers. Matchers (extractors.matches.*) are objects which "match" existing clusterlets, creating larger ones, i.e., they cluster clusterlets. A Matcher implements a match(clusterlets, **kwargs) method, which is given a list of clusterings (one per label), and a desired label balance to achieve. Currently, we implement:

  • PinballMatcher, which provides matches by hopping hops times through two sets of clusterlets of different labels, each hop following the clusterlet of opposite label at minimum distance.
  • GreedyPinballMatcher, which greedily matches clusterlets maximizing some given objective:
    • GreedyBalanceMatcher maximizes label balance
    • GreedyDPbMatcher maximizes clusterlet distance
  • CentroidMatcher, which creates a set of candidate partitions of the set of clusterlets, then scores them for balance and compactness. Note: only a subsample of size sample_size is tested due to the superexponential number of possible partitions.
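The hopping step described for PinballMatcher can be traced with a small sketch. It is an interpretation of the description above, not the library's code: pinball_path is a hypothetical function, Euclidean distance between centroids is assumed, and how matches are derived from the resulting path is left out because it is not specified here.

```python
import numpy as np


def pinball_path(centroids_a, centroids_b, start, hops):
    """Hypothetical helper: hop between two centroid sets, nearest neighbour each time.

    Returns a list of (side, index) pairs, where side 0 is `centroids_a`
    and side 1 is `centroids_b`.
    """
    sides = (centroids_a, centroids_b)
    side, idx = 0, start
    path = [(side, idx)]
    for _ in range(hops):
        current = sides[side][idx]
        other = 1 - side
        # Assumption: Euclidean distance to every centroid of the opposite label.
        distances = np.linalg.norm(sides[other] - current, axis=1)
        side, idx = other, int(np.argmin(distances))
        path.append((side, idx))
    return path


centroids_a = np.array([[0.0, 0.0], [5.0, 5.0]])
centroids_b = np.array([[0.5, 0.0], [4.0, 5.0], [9.0, 9.0]])

path = pinball_path(centroids_a, centroids_b, start=0, hops=3)
print(path)  # [(0, 0), (1, 0), (0, 0), (1, 0)]
```

In this toy input the path settles on a mutually nearest pair after the first hop, which is the kind of pair a matcher could merge into a larger clusterlet.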
