sequential Information Bottleneck (sIB)

Scope

This project provides an efficient implementation of the text clustering algorithm "sequential Information Bottleneck" (sIB), introduced by Slonim, Friedman and Tishby (2002). The project is packaged as a Python library with a Cython-wrapped C++ extension for the partition optimization code. A pure Python implementation is included as well. The implementation is documented here.
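For orientation, the partition optimization can be sketched in a few lines of pure Python. This is a simplified illustration of the draw-and-merge procedure described by Slonim et al. (2002), not the code shipped in this library. The names `sib_sketch`, `merge_cost` and `kl_bits` are hypothetical. The sketch assumes a dense count matrix, a uniform prior over documents, base-2 logarithms, and it never moves the last document out of a cluster.

```python
import math
import random


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms where p is not positive are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(px, py_x, pt, py_t):
    """Cost of merging document x into cluster t:
    (p(x) + p(t)) * JS(p(y|x), p(y|t)), with JS weights proportional to p(x), p(t)."""
    total = px + pt
    wx, wt = px / total, pt / total
    mix = [wx * a + wt * b for a, b in zip(py_x, py_t)]
    return total * (wx * kl_bits(py_x, mix) + wt * kl_bits(py_t, mix))


def sib_sketch(counts, n_clusters, max_iter=15, seed=0):
    """Cluster the rows of a dense count matrix (list of lists); returns labels."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    if n_docs < n_clusters or any(sum(row) == 0 for row in counts):
        raise ValueError("need n_docs >= n_clusters and no all-zero documents")
    px = 1.0 / n_docs  # assumption: uniform prior over documents
    py_x = [[c / sum(row) for c in row] for row in counts]  # p(y|x)

    # Random initial partition in which every cluster is non-empty.
    labels = [i % n_clusters for i in range(n_docs)]
    rng.shuffle(labels)
    sizes = [0] * n_clusters
    sums = [[0.0] * n_words for _ in range(n_clusters)]  # sum of p(y|x) per cluster
    for x, t in enumerate(labels):
        sizes[t] += 1
        sums[t] = [s + v for s, v in zip(sums[t], py_x[x])]

    for _ in range(max_iter):
        changed = 0
        order = list(range(n_docs))
        rng.shuffle(order)
        for x in order:
            old = labels[x]
            if sizes[old] == 1:
                continue  # simplification: never empty a cluster
            # Draw x out of its current cluster.
            sizes[old] -= 1
            sums[old] = [s - v for s, v in zip(sums[old], py_x[x])]
            # Merge x into the cluster with the lowest merge cost.
            costs = [
                merge_cost(px, py_x[x], sizes[t] / n_docs,
                           [s / sizes[t] for s in sums[t]])
                for t in range(n_clusters)
            ]
            new = min(range(n_clusters), key=costs.__getitem__)
            sizes[new] += 1
            sums[new] = [s + v for s, v in zip(sums[new], py_x[x])]
            labels[x] = new
            changed += new != old
        if changed == 0:
            break  # converged: a full pass moved no document
    return labels


# Toy corpus: the first three documents use words 0-1, the last three words 2-3.
toy = [[4, 2, 0, 0], [3, 3, 0, 0], [5, 1, 0, 0],
       [0, 0, 4, 2], [0, 0, 2, 4], [0, 0, 3, 3]]
print(sib_sketch(toy, n_clusters=2))
```

On real data the SIB class should be preferred: the sketch recomputes every cluster distribution for every document and makes no use of sparsity or parallelism.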

Installation

pip install sib-clustering

Usage

The main class in this library is SIB, which implements the scikit-learn clustering interface and provides methods such as fit(), fit_transform(), and fit_predict().

The sample code below clusters the 18.8K documents of the 20 Newsgroups dataset into 20 clusters:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
from sib import SIB

# read the dataset
dataset = fetch_20newsgroups(subset='all', categories=None,
                             shuffle=True, random_state=256)

gold_labels = dataset.target
n_clusters = np.unique(gold_labels).shape[0]

# create count vectors using the 10K most frequent words
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(dataset.data)

# SIB initialization and clustering; parameters:
# perform 10 random initializations (n_init=10) and keep the best one
# run up to 15 optimization iterations per initialization (max_iter=15)
# use all cores of the machine for parallel execution (n_jobs=-1)
sib = SIB(n_clusters=n_clusters, random_state=128, n_init=10,
          n_jobs=-1, max_iter=15, verbose=True)
sib.fit(X)

# report standard clustering metrics
print("Homogeneity: %0.3f" % metrics.homogeneity_score(gold_labels, sib.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(gold_labels, sib.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(gold_labels, sib.labels_))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(gold_labels, sib.labels_))

Expected result:

sIB information stats on best partition:
	I(T;Y) = 0.5685, H(T) = 4.1987
	I(T;Y)/I(X;Y) = 0.1468
	H(T)/H(X) = 0.2956
Homogeneity: 0.616
Completeness: 0.633
V-measure: 0.624
Adjusted Rand-Index: 0.507
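The information stats in the verbose output can be read as follows: I(T;Y) is the mutual information between clusters and words, H(T) is the entropy of the cluster sizes, and the two ratios normalize these by the corresponding quantities for the unclustered documents. The sketch below shows one way to compute such quantities in pure Python; it is an illustration, not the library's code, and `partition_stats` and its helpers are hypothetical names. It assumes base-2 logarithms and a uniform prior over documents. Both assumptions are at least consistent with the numbers above: H(T) = 4.1987 divided by 0.2956 gives H(X) of about 14.20, which is close to log2 of 18.8K documents.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_bits(joint):
    """I(A;B) in bits, where joint[a][b] = p(a, b)."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )


def partition_stats(counts, labels):
    """Information stats of a partition; counts is a dense list of lists
    with no all-zero rows, labels holds one cluster index per row."""
    n_docs, n_words = len(counts), len(counts[0])
    # p(x, y) under a uniform prior over documents (assumption).
    pxy = [[c / (sum(row) * n_docs) for c in row] for row in counts]
    # p(t, y): add up the rows of p(x, y) that share a cluster label.
    pty = [[0.0] * n_words for _ in range(max(labels) + 1)]
    for row, t in zip(pxy, labels):
        pty[t] = [acc + p for acc, p in zip(pty[t], row)]
    i_ty = mutual_information_bits(pty)
    i_xy = mutual_information_bits(pxy)
    h_t = entropy_bits([sum(row) for row in pty])
    h_x = entropy_bits([sum(row) for row in pxy])
    return {"I(T;Y)": i_ty, "H(T)": h_t,
            "I(T;Y)/I(X;Y)": i_ty / i_xy, "H(T)/H(X)": h_t / h_x}


# Four toy documents over two words, split into two clusters.
print(partition_stats([[2, 0], [2, 0], [0, 2], [0, 2]], [0, 0, 1, 1]))
```

A partition that keeps documents with the same word distribution together preserves all of I(X;Y), while a partition that mixes them loses information, which is what the I(T;Y)/I(X;Y) ratio measures.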

See the Examples directory for more illustrations and a comparison against K-Means.

License

Copyright IBM Corporation 2020

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

If you would like to see the detailed LICENSE click here.

Authors

If you have any questions or issues you can create a new issue here.

Reference

N. Slonim, N. Friedman, and N. Tishby (2002). Unsupervised Document Classification using Sequential Information Maximization. SIGIR 2002. https://dl.acm.org/doi/abs/10.1145/564376.564401
