A library for clusterlet induction, that is, sets of fair clusters.

Clusfairing

Clusfairing is a Python library that collects algorithms for fair clustering, mainly clusterlet-based approaches. Like fair coresets and fairlets, clusterlets define a clustering of the data that respects some notion of fairness or balance. Under some assumptions, clusterlets (or their centroids) can themselves be clustered to achieve a fair clustering in which each cluster approximately follows the original label distribution.


Quickstart

Installation

mkvirtualenv -p python3.12 clusfairing
pip install -r requirements.txt

Getting started

import numpy

from clusterlets.extractors import RandomExtractor


# generate some data
data = numpy.random.rand(1000, 5)
labels = numpy.random.choice([0, 1], 1000, replace=True)

# creates a clusterlet extractor
extractor = RandomExtractor(random_state=42)
# extracts clusterlets, assigning a clusterlet to each data point
extracted_clusterlets = extractor.extract(data, labels, size_per_label="auto")

# you can access the data of each clusterlet!
for clusterlet in extracted_clusterlets:
    print(data[clusterlet.index])

Clusterlets. The Clusterlet class implements a clusterlet, which is defined by:

  • _id: int, an identifier for the clusterlet
  • label_frequecies: Optional[numpy.ndarray], the label frequencies of the clusterlet
  • centroid: Optional[numpy.ndarray], the centroid of the clusterlet
  • index: Optional[numpy.ndarray], the indices of the instances of the original data that compose the clusterlet. Storing indices instead of the data itself keeps the object light

Since clusterlets are extracted from a dataset, data[clusterlet.index] yields the data of the clusterlet. Clusterlets support == and hash, so they can be collected into a set[Clusterlet] and used as dictionary keys.
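As a minimal sketch of why equality and hashing matter, the toy class below mimics a clusterlet in plain Python. ToyClusterlet is hypothetical, and the choice to base identity on _id alone is an assumption made for illustration; the library's Clusterlet may compare on other fields.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyClusterlet:
    """Hypothetical stand-in for a clusterlet; not the library's class."""

    _id: int
    # Row positions in the original dataset.
    # Excluded from == and hash (an assumption of this sketch).
    index: Optional[Tuple[int, ...]] = field(default=None, compare=False)


a = ToyClusterlet(0, index=(0, 3, 7))
b = ToyClusterlet(1, index=(1, 2))
a_again = ToyClusterlet(0, index=(0, 3, 7))

# Duplicates collapse in a set because == and hash agree.
unique = {a, b, a_again}

# Clusterlets can key a dictionary, e.g. to store a per-clusterlet size.
sizes = {clusterlet: len(clusterlet.index) for clusterlet in unique}
```

Keeping only indices in the object means the same dataset can back many clusterlets without being copied.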

Extractors. Clusterlets are extracted by extractors implementing the ClusterletExtractor interface (extractors.*), which exposes an extract(data, labels) method.

  • RandomExtractor selects random subsets of each label, then pairs them to satisfy dataset balance. The number of samples per label in each clusterlet can also be set with the dictionary parameter size_per_label.
  • KMeansExtractor clusters each label separately (through K-Means), creating label-specific clusterlets, then matches these clusterlets to achieve both clustering and balance.
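The idea behind random, balance-preserving extraction can be sketched without the library. The function random_balanced_groups below is hypothetical and is not how RandomExtractor is implemented; it only assumes that each label's rows are shuffled and then dealt evenly across groups, so that every group roughly mirrors the dataset's label proportions.

```python
import random
from collections import defaultdict


def random_balanced_groups(labels, n_groups, seed=0):
    """Split row indices into n_groups, dealing each label's rows evenly."""
    rng = random.Random(seed)

    # Collect the row indices of each label.
    rows_by_label = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_label[label].append(row)

    groups = [[] for _ in range(n_groups)]
    for rows in rows_by_label.values():
        rng.shuffle(rows)  # random selection within a single label
        for position, row in enumerate(rows):
            # Round-robin dealing: group sizes per label differ by at most one.
            groups[position % n_groups].append(row)
    return groups


labels = [0] * 60 + [1] * 40
groups = random_balanced_groups(labels, n_groups=10)
```

With 60 rows of label 0 and 40 rows of label 1 dealt into 10 groups, each group holds 6 rows of label 0 and 4 rows of label 1, matching the 60/40 split of the dataset.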

The KMeansExtractor is an implementation of the ClusteringExtractor interface, which can be adapted to any clustering algorithm by overriding the cluster(data) method.
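The override pattern described above can be illustrated with a self-contained sketch. ToyClusteringExtractor and SignExtractor are hypothetical names, and the base class here is an assumption about the general shape of such an interface, not the library's ClusteringExtractor. The sketch stops at label-specific clusterlets and omits the matching step.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ToyClusteringExtractor(ABC):
    """Hypothetical base class: clusters the rows of each label separately."""

    @abstractmethod
    def cluster(self, data):
        """Return one cluster id per row of data."""

    def extract(self, data, labels):
        # Group row indices by label.
        rows_by_label = defaultdict(list)
        for row, label in enumerate(labels):
            rows_by_label[label].append(row)

        # Run the pluggable algorithm on each label's rows only.
        clusterlets = defaultdict(list)
        for label, rows in rows_by_label.items():
            assignments = self.cluster([data[row] for row in rows])
            for row, cluster_id in zip(rows, assignments):
                clusterlets[(label, cluster_id)].append(row)
        return dict(clusterlets)


class SignExtractor(ToyClusteringExtractor):
    """Toy algorithm: two clusters, split on the sign of the first feature."""

    def cluster(self, data):
        return [int(point[0] >= 0) for point in data]


data = [[-1.0], [2.0], [-3.0], [4.0]]
labels = ["a", "a", "b", "b"]
label_specific = SignExtractor().extract(data, labels)
```

Only cluster(data) changes between algorithms; the per-label bookkeeping lives in the base class.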

Matchers. Matchers (extractors.matches.*) are objects that "match" existing clusterlets into larger ones, i.e., they cluster clusterlets. A Matcher implements a match(clusterlets, **kwargs) method, which receives a list of clusterings (one per label) and a desired label balance to achieve. Currently, we implement:

  • PinballMatcher, which builds matches by hopping hops times between two sets of clusterlets of different labels; each hop moves to the closest clusterlet of the opposite label.
  • GreedyPinballMatcher, which greedily matches clusterlets to maximize a given objective:
    • GreedyBalanceMatcher maximizes label balance
    • GreedyDPbMatcher maximizes clusterlet distance
  • CentroidMatcher, which creates a set of candidate partitions of the set of clusterlets, then scores them for balance and compactness. Note: because the number of possible partitions grows superexponentially, only a subsample of size sample_size is tested.
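The hopping behind the pinball idea can be sketched on bare centroids. This is one possible reading of the description above, not the PinballMatcher implementation: pinball_path and nearest are hypothetical helpers, Euclidean distance is an assumption, and what the matcher does with the visited clusterlets is left out.

```python
import math


def nearest(point, candidates):
    """Index of the candidate centroid closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def pinball_path(start, own, other, hops):
    """Bounce between two centroid lists, always to the nearest opposite centroid.

    Returns the visited (side, index) pairs; side 0 is `own`, side 1 is `other`.
    """
    sides = (own, other)
    side, idx = 0, start
    path = [(side, idx)]
    for _ in range(hops):
        target = 1 - side  # switch to the opposite label
        idx = nearest(sides[side][idx], sides[target])
        side = target
        path.append((side, idx))
    return path


label0_centroids = [(0.0, 0.0), (5.0, 5.0)]
label1_centroids = [(0.5, 0.0), (4.0, 5.0), (9.0, 9.0)]
path = pinball_path(start=0, own=label0_centroids, other=label1_centroids, hops=3)
```

In this toy input the path settles into a back-and-forth between two mutually nearest centroids, which is the kind of pair a matcher could merge.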
