sequential Information Bottleneck (sIB)

Scope

This project provides an efficient implementation of the text clustering algorithm "sequential Information Bottleneck" (sIB), introduced by Slonim, Friedman and Tishby (2002). The project is packaged as a Python library with a Cython-wrapped C++ extension for the partition optimization code. A pure Python implementation is included as well. The implementation is documented here.
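To give a feel for what "partition optimization" means here, the following is a minimal, standard-library-only sketch of the sequential draw-and-merge step described in the referenced paper: each document is drawn out of its cluster and re-inserted into the cluster with the smallest merge cost, a weighted Jensen-Shannon divergence. It is not the code shipped in this library. The function names (`merge_cost`, `sib_sketch`), the uniform prior over documents, and the stopping rule are assumptions made for illustration only.

```python
import math
import random


def merge_cost(jx, jt):
    """Cost of merging document x into cluster t, in bits.

    jx and jt are joint-mass vectors p(x, y) and p(t, y) over words y.
    The value equals (p(x) + p(t)) * JS[p(y|x), p(y|t)].
    """
    px, pt = sum(jx), sum(jt)
    cost = 0.0
    for a, b in zip(jx, jt):
        if a + b == 0:
            continue
        m = (a + b) / (px + pt)  # word probability in the merged cluster
        if a > 0:
            cost += a * math.log2((a / px) / m)
        if b > 0:
            cost += b * math.log2((b / pt) / m)
    return cost


def sib_sketch(counts, n_clusters, max_iter=15, seed=0):
    """Illustrative sequential clustering of a dense count matrix (list of rows).

    Assumes every row has at least one non-zero count.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Assumption: uniform p(x) = 1/N and p(y|x) = row-normalized counts.
    joint = [[c / (sum(row) * n_docs) for c in row] for row in counts]

    # Random, balanced initial partition.
    labels = [i % n_clusters for i in range(n_docs)]
    rng.shuffle(labels)
    mass = [[0.0] * n_words for _ in range(n_clusters)]
    for jx, t in zip(joint, labels):
        for y, p in enumerate(jx):
            mass[t][y] += p

    for _ in range(max_iter):
        changes = 0
        for x in rng.sample(range(n_docs), n_docs):
            jx, old = joint[x], labels[x]
            # Draw x out of its current cluster ...
            mass[old] = [max(m - p, 0.0) for m, p in zip(mass[old], jx)]
            # ... and merge it into the cluster with the lowest cost.
            new = min(range(n_clusters), key=lambda t: merge_cost(jx, mass[t]))
            mass[new] = [m + p for m, p in zip(mass[new], jx)]
            labels[x] = new
            changes += new != old
        if changes == 0:  # no document moved during a full pass
            break
    return labels


# Two topics with disjoint vocabularies; document length varies within a topic.
toy_counts = [
    [2, 2, 0, 0], [4, 4, 0, 0], [1, 1, 0, 0],
    [0, 0, 3, 3], [0, 0, 1, 1], [0, 0, 5, 5],
]
toy_labels = sib_sketch(toy_counts, n_clusters=2)
print(toy_labels)
```

On this toy input the first three documents end up in one cluster and the last three in the other. Which cluster gets which index depends on the random initialization. For real data, use the `SIB` class described under Usage, which works on sparse matrices and supports several random restarts.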

Installation

pip install sib-clustering

Usage

The main class in this library is SIB, which implements the clustering interface of scikit-learn and provides methods such as fit(), fit_transform() and fit_predict().

The sample code below clusters the 18.8K documents of the 20-News-Groups dataset into 20 clusters:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
from sib import SIB

# read the dataset
dataset = fetch_20newsgroups(subset='all', categories=None,
                             shuffle=True, random_state=256)

gold_labels = dataset.target
n_clusters = np.unique(gold_labels).shape[0]

# create count vectors using the 10K most frequent words
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(dataset.data)

# SIB initialization and clustering; parameters:
# perform 10 random initializations (n_init=10); the best one is returned.
# up to 15 optimization iterations in each initialization (max_iter=15)
# use all cores in the running machine for parallel execution (n_jobs=-1)
sib = SIB(n_clusters=n_clusters, random_state=128, n_init=10,
          n_jobs=-1, max_iter=15, verbose=True)
sib.fit(X)

# report standard clustering metrics
print("Homogeneity: %0.3f" % metrics.homogeneity_score(gold_labels, sib.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(gold_labels, sib.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(gold_labels, sib.labels_))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(gold_labels, sib.labels_))

Expected result:

sIB information stats on best partition:
	I(T;Y) = 0.5685, H(T) = 4.1987
	I(T;Y)/I(X;Y) = 0.1468
	H(T)/H(X) = 0.2956
Homogeneity: 0.616
Completeness: 0.633
V-measure: 0.624
Adjusted Rand-Index: 0.507
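The information statistics in the verbose output can be reproduced on small inputs with a few lines of plain Python. The sketch below is not taken from the library. It assumes that X denotes documents, Y words and T clusters, that p(x) is uniform, that p(y|x) is the row-normalized count vector, and that all quantities are in bits. The figures above look consistent with these assumptions (for example, H(T) stays below log2(20) ≈ 4.32), but that is an inference rather than a documented fact. The function name `sib_style_stats` is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as a list of rows."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return sum(
        p * math.log2(p / (row[i] * col[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0
    )


def sib_style_stats(counts, labels):
    """Compute I(T;Y), H(T) and their ratios to I(X;Y) and H(X).

    counts: dense document-by-word count matrix (list of rows, no empty rows).
    labels: cluster index of each document, in 0..K-1.
    """
    n_docs, n_words = len(counts), len(counts[0])
    # Assumption: uniform p(x) = 1/N, p(y|x) = row-normalized counts.
    joint_xy = [[c / (sum(row) * n_docs) for c in row] for row in counts]

    # p(t, y) is the sum of p(x, y) over the documents assigned to t.
    joint_ty = [[0.0] * n_words for _ in range(max(labels) + 1)]
    for row, t in zip(joint_xy, labels):
        for y, p in enumerate(row):
            joint_ty[t][y] += p

    i_ty, i_xy = mutual_information(joint_ty), mutual_information(joint_xy)
    h_t = entropy([sum(r) for r in joint_ty])
    h_x = entropy([sum(r) for r in joint_xy])
    return {
        "I(T;Y)": i_ty,
        "H(T)": h_t,
        "I(T;Y)/I(X;Y)": i_ty / i_xy,
        "H(T)/H(X)": h_t / h_x,
    }


# Four documents, two words, two clusters that match the word usage exactly.
stats = sib_style_stats([[2, 0], [4, 0], [0, 3], [0, 1]], [0, 0, 1, 1])
for name, value in stats.items():
    print("%s = %.4f" % (name, value))
```

In this toy case the partition keeps all of the information the documents carry about the words, so I(T;Y)/I(X;Y) is 1, while H(T)/H(X) is 0.5 because four documents are compressed into two equally sized clusters.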

See the Examples directory for more illustrations and a comparison against K-Means.

License

Copyright IBM Corporation 2020

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

The full license text is available in the LICENSE file.

Authors

If you have any questions or issues you can create a new issue here.

Reference

N. Slonim, N. Friedman, and N. Tishby (2002). Unsupervised Document Classification using Sequential Information Maximization. SIGIR 2002. https://dl.acm.org/doi/abs/10.1145/564376.564401
