
scikit-query is a Python library for active query strategies in constrained clustering on top of SciPy and scikit-learn.



scikit-query

Clustering aims to group data into clusters without the help of labels, unlike classification algorithms. A well-known shortcoming of clustering algorithms is that they rely on an objective function geared toward specific types of clusters (convex, dense, well-separated), and hyperparameters that are hard to tune. Semi-supervised clustering mitigates these problems by injecting background knowledge in order to guide the clustering. Active clustering algorithms analyze the data to select interesting points to ask the user about, generating constraints that allow fast convergence towards a user-specified partition.

scikit-query is a library of active query strategies for constrained clustering, inspired by scikit-learn and the now-inactive active-semi-supervised-clustering library by Jakub Švehla.

It is focused on algorithm-agnostic query strategies, i.e. methods that do not rely on a particular clustering algorithm. From an input dataset, they produce a set of constraints by making insightful queries to an oracle. A variant for incremental constrained clustering is provided for applicable algorithms, taking a data partition into account.
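To make the query/oracle interaction concrete, here is a minimal sketch of a budgeted oracle that answers must-link (ML) / cannot-link (CL) queries from ground-truth labels. The names `LabelOracle`, `BudgetExhausted` and `collect_constraints` are hypothetical and written for this illustration only; they do not reflect scikit-query's own oracle classes.

```python
class BudgetExhausted(Exception):
    """Raised when the oracle has no queries left."""


class LabelOracle:
    """Hypothetical oracle: answers pairwise queries from known labels."""

    def __init__(self, truth, budget):
        self.truth = list(truth)
        self.budget = budget
        self.queries = 0

    def query(self, i, j):
        """Return True for must-link (same label), False for cannot-link."""
        if self.queries >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} queries spent")
        self.queries += 1
        return self.truth[i] == self.truth[j]


def collect_constraints(pairs, oracle):
    """Ask the oracle about each pair until the budget runs out."""
    constraints = {"ml": [], "cl": []}
    for i, j in pairs:
        try:
            same = oracle.query(i, j)
        except BudgetExhausted:
            break
        constraints["ml" if same else "cl"].append((i, j))
    return constraints


labels = [0, 0, 1, 1, 2]
oracle = LabelOracle(truth=labels, budget=3)
result = collect_constraints([(0, 1), (0, 2), (2, 3), (3, 4)], oracle)
print(result)
```

A query strategy differs from this sketch mainly in how it chooses the pairs: instead of taking a fixed list, it picks the pairs expected to be most informative.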

In typical scikit-learn fashion, the library is used by instantiating a class and calling its fit method.

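from skquery.pairwise import AIPC
from skquery.oracle import MLCLOracle

# dataset and labels are assumed to be defined beforehand:
# the data to cluster and its ground-truth labels, used by the oracle.
qs = AIPC()
oracle = MLCLOracle(truth=labels, budget=10)  # query budget of 10
constraints = qs.fit(dataset, oracle)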
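Once constraints have been collected, a quick sanity check is to count how many of them a candidate partition violates. The sketch below assumes the constraints are stored as a dict with "ml" and "cl" keys holding index pairs; that layout, the helper name `count_violations` and the sample data are assumptions for illustration and are not confirmed by this page.

```python
def count_violations(partition, constraints):
    """Count must-link / cannot-link constraints broken by a partition.

    partition: sequence of cluster ids, one per data point.
    constraints: dict with "ml" and "cl" lists of (i, j) index pairs
    (assumed layout).
    """
    broken_ml = sum(partition[i] != partition[j] for i, j in constraints.get("ml", []))
    broken_cl = sum(partition[i] == partition[j] for i, j in constraints.get("cl", []))
    return broken_ml, broken_cl


# Hypothetical constraints and partition over five points.
constraints = {"ml": [(0, 1), (2, 3)], "cl": [(0, 2), (1, 4)]}
partition = [0, 0, 0, 1, 1]
violations = count_violations(partition, constraints)
print(violations)
```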

Algorithms

| Algorithm | Description | Constraint type | Works in incremental setting? | Source | Date |
|---|---|---|---|---|---|
| Random sampling | | ML/CL, triplet | Yes | | |
| FFQS | Neighborhood-based | ML/CL | Yes | Basu et al. | 2004 |
| MMFFQS (MinMax) | Neighborhood-based, similarity | ML/CL | Yes | Mallapragada et al. | 2008 |
| NPU | Neighborhood-based, information theory | ML/CL | Yes | Xiong et al. | 2013 |
| SASC | SVDD, greedy approach | ML/CL | | Abin & Beigy | 2014 |
| AIPC | Fuzzy clustering, information theory | ML/CL | | Zhang et al. | 2019 |
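To illustrate the neighborhood-based family, the following is a simplified sketch of a farthest-first exploration loop in the spirit of FFQS. It is not the library's implementation: the function `explore`, the callable standing in for the oracle and the toy data are all assumptions made for this example.

```python
import math


def explore(points, same_cluster, budget, start=0):
    """Build neighborhoods by farthest-first traversal.

    points: list of coordinate tuples.
    same_cluster: callable (i, j) -> bool standing in for the oracle.
    budget: maximum number of oracle queries.
    Returns (neighborhoods, queries_used).
    """
    neighborhoods = [[start]]
    visited = {start}
    used = 0
    while used < budget and len(visited) < len(points):
        # Pick the unvisited point whose nearest visited point is farthest away.
        candidate = max(
            (i for i in range(len(points)) if i not in visited),
            key=lambda i: min(math.dist(points[i], points[v]) for v in visited),
        )
        visited.add(candidate)
        placed = False
        rejected = 0
        for hood in neighborhoods:
            if used >= budget:
                break
            used += 1
            if same_cluster(candidate, hood[0]):  # must-link answer
                hood.append(candidate)
                placed = True
                break
            rejected += 1  # cannot-link answer
        # Open a new neighborhood only if every existing one answered cannot-link.
        if not placed and rejected == len(neighborhoods):
            neighborhoods.append([candidate])
    return neighborhoods, used


# Toy data: three well-separated groups with known labels.
points = [(0, 0), (0, 2), (10, 10), (10, 11), (20, 0)]
labels = [0, 0, 1, 1, 2]
neighborhoods, used = explore(points, lambda i, j: labels[i] == labels[j], budget=10)
print(neighborhoods, used)
```

Points inside one neighborhood imply must-link constraints, and points from different neighborhoods imply cannot-link constraints, which is why each far-away point is queried first.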

Dependencies

scikit-query is developed on Python >= 3.10 and requires the following libraries:

  • pandas>=2.0.1
  • matplotlib>=3.7.1
  • numpy>=1.24.3
  • scikit-learn>=1.2.2
  • cvxopt>=1.3.1
  • scikit-fuzzy>=0.4.2
  • scipy>=1.10.1
  • plotly>=5.14.1
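A quick way to see which of these dependencies are already present in an environment is to query the installed distribution metadata. This is only a sketch using the standard library: the helper `installed_versions` is a hypothetical name, and the snippet reports versions without comparing them against the minimums.

```python
import sys
from importlib import metadata

# Minimum versions copied from the list above.
REQUIREMENTS = {
    "pandas": "2.0.1",
    "matplotlib": "3.7.1",
    "numpy": "1.24.3",
    "scikit-learn": "1.2.2",
    "cvxopt": "1.3.1",
    "scikit-fuzzy": "0.4.2",
    "scipy": "1.10.1",
    "plotly": "5.14.1",
}


def installed_versions(requirements):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in requirements:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


python_ok = sys.version_info >= (3, 10)
report = installed_versions(REQUIREMENTS)
for name, minimum in REQUIREMENTS.items():
    print(f"{name}: installed={report[name]}, required>={minimum}")
print("Python >= 3.10:", python_ok)
```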

Contributors

FFQS, MinMax and NPU are based on Jakub Švehla's implementation. The other algorithms were implemented by Aymeric Beauchamp or his students from the University of Orléans:

  • Salma Badri, Elis Ishimwe, Brice Jacquesson, Matthéo Pailler (2023)
