sequential Information Bottleneck (sIB)

Scope

This project provides an efficient implementation of the text clustering algorithm "sequential Information Bottleneck" (sIB), introduced by Slonim, Friedman and Tishby (2002). The project is packaged as a Python library with a Cython-wrapped C++ extension for the partition optimization code. A pure Python implementation is included as well. The implementation is documented here.

Installation

pip install sib-clustering

Usage

The main class in this library is SIB, which implements the clustering interface of scikit-learn, providing methods such as fit(), fit_transform(), and fit_predict().

The sample code below clusters the 18.8K documents of the 20 Newsgroups dataset into 20 clusters:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
from sib import SIB

# read the dataset
dataset = fetch_20newsgroups(subset='all', categories=None,
                             shuffle=True, random_state=256)

gold_labels = dataset.target
n_clusters = np.unique(gold_labels).shape[0]

# create count vectors using the 10K most frequent words
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(dataset.data)

# SIB initialization and clustering; parameters:
# perform 10 random initializations (n_init=10); the best one is returned.
# up to 15 optimization iterations in each initialization (max_iter=15)
# use all cores in the running machine for parallel execution (n_jobs=-1)
sib = SIB(n_clusters=n_clusters, random_state=128, n_init=10,
          n_jobs=-1, max_iter=15, verbose=True)
sib.fit(X)

# report standard clustering metrics
print("Homogeneity: %0.3f" % metrics.homogeneity_score(gold_labels, sib.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(gold_labels, sib.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(gold_labels, sib.labels_))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(gold_labels, sib.labels_))

Expected result:

sIB information stats on best partition:
	I(T;Y) = 0.5685, H(T) = 4.1987
	I(T;Y)/I(X;Y) = 0.1468
	H(T)/H(X) = 0.2956
Homogeneity: 0.616
Completeness: 0.633
V-measure: 0.624
Adjusted Rand-Index: 0.507
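
The information stats in the output above can be read with X as the documents, Y as the words and T as the clusters, following Slonim et al. (2002). The sketch below shows one way such quantities can be computed on a toy count matrix. It is an illustration only: the helper names (entropy, mutual_information, cluster_stats) are hypothetical, and it assumes the joint distribution p(x, y) is obtained by normalizing the raw counts and that logarithms are base 2. The library may normalize differently, so its numbers need not match this sketch on real data.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Mutual information, in bits, of a joint distribution given as rows."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (row_marginal[i] * col_marginal[j]))
    return mi


def cluster_stats(counts, labels, n_clusters):
    """Hypothetical helper: information stats of a partition.

    counts  -- document-by-word count matrix (list of lists)
    labels  -- cluster index of each document
    Assumes p(x, y) is the count matrix divided by its total.
    """
    total = sum(sum(row) for row in counts)
    p_xy = [[c / total for c in row] for row in counts]

    # p(t, y): add up the rows of all documents assigned to cluster t
    n_words = len(counts[0])
    p_ty = [[0.0] * n_words for _ in range(n_clusters)]
    for row, t in zip(p_xy, labels):
        for j, p in enumerate(row):
            p_ty[t][j] += p

    return {
        "I(T;Y)": mutual_information(p_ty),
        "I(X;Y)": mutual_information(p_xy),
        "H(T)": entropy([sum(row) for row in p_ty]),
        "H(X)": entropy([sum(row) for row in p_xy]),
    }


# Toy data: two documents use only word 0, two use only word 1.
toy_counts = [[4, 0], [4, 0], [0, 4], [0, 4]]
toy_labels = [0, 0, 1, 1]
stats = cluster_stats(toy_counts, toy_labels, n_clusters=2)
print("I(T;Y)/I(X;Y) = %.4f" % (stats["I(T;Y)"] / stats["I(X;Y)"]))
print("H(T)/H(X) = %.4f" % (stats["H(T)"] / stats["H(X)"]))
```

In this toy case the two clusters keep all of the information the documents carry about the words, so the first ratio is 1, while the partition needs only half of the bits required to identify a single document, so the second ratio is 0.5.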

See the Examples directory for more illustrations and a comparison against K-Means.
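
As a quick sanity check on the reported metrics, the V-measure is the harmonic mean of homogeneity and completeness when the weighting parameter beta is 1, which is the scikit-learn default. The sketch below recomputes it from the rounded values shown in the expected result; the function name v_measure is a hypothetical helper, not part of this library, and because the inputs are rounded the result is only approximate.

```python
def v_measure(homogeneity, completeness, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    if homogeneity + completeness == 0.0:
        return 0.0
    return ((1.0 + beta) * homogeneity * completeness
            / (beta * homogeneity + completeness))


# Rounded values taken from the expected result above
print("V-measure: %0.3f" % v_measure(0.616, 0.633))
```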

License

Copyright IBM Corporation 2020

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

If you would like to see the detailed LICENSE click here.

Authors

If you have any questions or issues, you can create a new issue here.

Reference

N. Slonim, N. Friedman, and N. Tishby (2002). Unsupervised Document Classification using Sequential Information Maximization. SIGIR 2002. https://dl.acm.org/doi/abs/10.1145/564376.564401
