
scikit-query is a Python library for active query strategies in constrained clustering on top of SciPy and scikit-learn.

Project description


scikit-query

Clustering aims to group data into clusters without the help of labels, unlike classification algorithms. A well-known shortcoming of clustering algorithms is that they rely on an objective function geared toward specific types of clusters (convex, dense, well-separated), and hyperparameters that are hard to tune. Semi-supervised clustering mitigates these problems by injecting background knowledge in order to guide the clustering. Active clustering algorithms analyze the data to select interesting points to ask the user about, generating constraints that allow fast convergence towards a user-specified partition.

scikit-query is a library of active query strategies for constrained clustering, inspired by scikit-learn and by the now-inactive active-semi-supervised-clustering library by Jakub Švehla.

It is focused on algorithm-agnostic query strategies, i.e. methods that do not rely on a particular clustering algorithm. From an input dataset, they produce a set of constraints by making insightful queries to an oracle. A variant for incremental constrained clustering is provided for applicable algorithms, taking a data partition into account.
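To make the oracle idea concrete, here is a minimal, standard-library sketch of an algorithm-agnostic strategy: random sampling of pairs, answered by an oracle that holds the ground truth and a query budget. The names `BudgetedOracle` and `random_pairwise_queries`, and the dictionary layout with `"ml"` and `"cl"` keys, are hypothetical and chosen for illustration only; they are not scikit-query's API.

```python
import random
from itertools import combinations


class BudgetedOracle:
    """Hypothetical oracle: answers pairwise queries from ground-truth labels."""

    def __init__(self, truth, budget):
        self.truth = truth
        self.budget = budget
        self.queries = 0

    def query(self, i, j):
        """Return True for a must-link answer, False for a cannot-link answer."""
        if self.queries >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        return self.truth[i] == self.truth[j]


def random_pairwise_queries(n_points, oracle, seed=0):
    """Ask the oracle about randomly chosen distinct pairs until the budget is spent."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n_points), 2))
    rng.shuffle(pairs)
    constraints = {"ml": [], "cl": []}
    for i, j in pairs[: oracle.budget]:
        key = "ml" if oracle.query(i, j) else "cl"
        constraints[key].append((i, j))
    return constraints


# Toy ground truth: six points in three clusters.
labels = [0, 0, 1, 1, 2, 2]
oracle = BudgetedOracle(truth=labels, budget=5)
constraints = random_pairwise_queries(len(labels), oracle)
print(constraints)
```

Random sampling never inspects the data or a clustering, which is what makes it algorithm-agnostic; more informed strategies differ only in how they pick the pairs to ask about.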

In typical scikit-learn fashion, the library is used by instantiating a class and calling its fit method.

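from skquery.pairwise import AIPC
from skquery.oracle import MLCLOracle

# dataset and labels are assumed to be defined beforehand:
# the data to cluster and its ground-truth labels.
qs = AIPC()
oracle = MLCLOracle(truth=labels, budget=10)  # oracle built from the ground truth, with a budget of 10
constraints = qs.fit(dataset, oracle)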
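Once constraints are available, a natural next step is to check how well a given partition respects them. The sketch below assumes, purely for illustration, that constraints are stored as a dictionary with `"ml"` and `"cl"` lists of index pairs; the source does not state the format returned by fit, and `violated_constraints` is a hypothetical helper, not part of scikit-query.

```python
def violated_constraints(partition, constraints):
    """Return the ML/CL pairs that a partition (list of cluster ids) breaks."""
    broken_ml = [
        (i, j) for i, j in constraints.get("ml", []) if partition[i] != partition[j]
    ]
    broken_cl = [
        (i, j) for i, j in constraints.get("cl", []) if partition[i] == partition[j]
    ]
    return {"ml": broken_ml, "cl": broken_cl}


# Assumed constraint layout: index pairs under "ml" (must-link) and "cl" (cannot-link).
example_constraints = {"ml": [(0, 1), (2, 3)], "cl": [(0, 2), (1, 3)]}
partition = [0, 0, 0, 1]  # cluster id of each of the four points

violations = violated_constraints(partition, example_constraints)
print(violations)
```

In an incremental setting, the violated pairs indicate where the current partition disagrees with the user, which is where further queries or a constrained re-clustering would be most useful.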

Algorithms

| Algorithm | Description | Constraint type | Works in incremental setting? | Source | Date |
| --- | --- | --- | --- | --- | --- |
| Random sampling | | ML/CL, triplet | Yes | | |
| FFQS | Neighborhood-based | ML/CL | Yes | Basu et al. | 2004 |
| MMFFQS (MinMax) | Neighborhood-based, similarity | ML/CL | Yes | Mallapragada et al. | 2008 |
| NPU | Neighborhood-based, information theory | ML/CL | Yes | Xiong et al. | 2013 |
| SASC | SVDD, greedy approach | ML/CL | | Abin & Beigy | 2014 |
| AIPC | Fuzzy clustering, information theory | ML/CL | | Zhang et al. | 2019 |
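As an illustration of what "neighborhood-based" means, the following is a simplified sketch of the Explore phase of FFQS as described by Basu et al. (2004): a farthest-first traversal picks the point farthest from all points already placed, then queries it against one representative of each neighborhood. It is not scikit-query's implementation. The function name `explore`, the `same_cluster` callback standing in for the oracle, and the use of the first member of each neighborhood as its representative are all assumptions made for this example.

```python
import math
import random


def explore(points, same_cluster, k, budget, seed=0):
    """Build up to k disjoint neighborhoods with a farthest-first traversal.

    same_cluster(i, j) plays the oracle and returns True for must-link.
    Returns the neighborhoods (lists of point indices) and the queries used.
    """
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    neighborhoods = [[first]]
    assigned = {first}
    used = 0
    while used < budget and len(neighborhoods) < k and len(assigned) < len(points):
        # Farthest-first: the unassigned point whose nearest assigned point is farthest.
        candidate = max(
            (i for i in range(len(points)) if i not in assigned),
            key=lambda i: min(math.dist(points[i], points[j]) for j in assigned),
        )
        target = None
        exhausted = False
        for hood in neighborhoods:
            if used >= budget:
                exhausted = True
                break
            used += 1
            if same_cluster(candidate, hood[0]):  # must-link: joins this neighborhood
                target = hood
                break
        if exhausted:
            break  # budget ran out before the candidate could be placed
        if target is None:
            neighborhoods.append([candidate])  # cannot-link with all: new neighborhood
        else:
            target.append(candidate)
        assigned.add(candidate)
    return neighborhoods, used


# Toy data: three well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (1, 10)]
truth = [0, 0, 1, 1, 2, 2]
neighborhoods, used = explore(
    points, lambda i, j: truth[i] == truth[j], k=3, budget=10
)
print(neighborhoods, used)
```

Points inside a neighborhood imply must-link constraints with each other, and points in different neighborhoods imply cannot-link constraints, so a handful of queries can yield many pairwise constraints.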

Dependencies

scikit-query is developed on Python >= 3.10 and requires the following libraries:

  • pandas>=2.0.1
  • matplotlib>=3.7.1
  • numpy>=1.24.3
  • scikit-learn>=1.2.2
  • cvxopt>=1.3.1
  • scikit-fuzzy>=0.4.2
  • scipy>=1.10.1
  • plotly>=5.14.1
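The minimum versions above can be checked against the current environment with the standard library alone. This is a rough sketch: `version_tuple` and `check_requirements` are hypothetical helpers, and the comparison only looks at the leading numeric parts of each version, so it is less robust than a dedicated version parser.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list.
REQUIREMENTS = {
    "pandas": "2.0.1",
    "matplotlib": "3.7.1",
    "numpy": "1.24.3",
    "scikit-learn": "1.2.2",
    "cvxopt": "1.3.1",
    "scikit-fuzzy": "0.4.2",
    "scipy": "1.10.1",
    "plotly": "5.14.1",
}


def version_tuple(version):
    """Turn '1.24.3' (or '2.0.0rc1') into a tuple of ints, ignoring suffixes."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(requirements):
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


for name, status in check_requirements(REQUIREMENTS).items():
    print(f"{name}: {status}")
```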

Contributors

FFQS, MinMax and NPU are based upon Jakub Švehla's implementation. Other algorithms have been implemented by Aymeric Beauchamp or his students from the University of Orléans:

  • Salma Badri, Elis Ishimwe, Brice Jacquesson, Matthéo Pailler (2023)


