Super Simple Labelled Topic Clustering
Labelled Topic Clustering
Labelled Topic Clustering does what the name suggests: feed it an array of sentences and it will group them into clusters and give each cluster a human-readable name.
The aim of this project is to make it as easy as possible to:
- generate topic clusters on a text dataset using a cosine-similarity approach.
- get human-readable labels for those clusters
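To give a feel for what a cosine-similarity approach means, here is a minimal sketch in plain Python. It is not this package's implementation: the `greedy_cluster` helper, the `threshold` value, and the toy 2-D vectors are all assumptions made for illustration, standing in for real sentence-transformer embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def greedy_cluster(embeddings, threshold=0.8):
    """Hypothetical helper: put each vector in the first cluster whose
    first member (the seed) is at least `threshold` similar to it."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy_embeddings = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0],
]
toy_clusters = greedy_cluster(toy_embeddings, threshold=0.8)
print(toy_clusters)
```

Vectors that point in roughly the same direction end up in the same group, regardless of their length, which is why cosine similarity is a common choice for comparing sentence embeddings.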
Installation
To use the TopicClusterer class, install the package. Assuming you have a package manager like pip, you can install it, along with its dependencies, as follows:
pip install labelled-topic-clustering
Usage
- Initialize the TopicClusterer:
from topic_clusterer import TopicClusterer
hf_token = "your_hugging_face_token"
# This can be any sentence-transformer model; anecdotally, I've found this one works best.
model = "sentence-transformers/all-mpnet-base-v2"
clusterer = TopicClusterer(hf_token, model, debug=True)
- Get clusters:
sentences = [
"the weather is great",
"This is some perfect weather",
"we're having some really good weather",
"my dog ate my homework",
"why do dogs love homework?",
"dog keeps devouring my homework"
]
clusters = clusterer.get_clusters(sentences)
Example Output
[[0, 1, 2], [3, 4, 5]]
clusters will be a 2D array in which each inner array is one cluster, holding the indices of its sentences in the original dataset.
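Because the result holds indices rather than text, you will usually want to map it back to the sentences. A small sketch, with the `clusters` value hard-coded from the example output above so that it runs without the library (the `grouped` name is just for illustration):

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Hard-coded from the example output; normally this comes from get_clusters().
clusters = [[0, 1, 2], [3, 4, 5]]

# Replace each index with the sentence it points to.
grouped = [[sentences[i] for i in cluster] for cluster in clusters]

for number, group in enumerate(grouped):
    print(f"Cluster {number}:")
    for sentence in group:
        print(f"  {sentence}")
```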
- Get labels from clusters:
clusters_labelled = clusterer.get_labels_from_clusters(clusters, sentences)
Example Output
{'Weather great perfect': [0, 1, 2], 'Dog eat homework': [3, 4, 5]}
clusters_labelled is a dictionary where the keys are topic labels, and the values are arrays of sentence indices corresponding to the original dataset.
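The labelled dictionary can be reshaped in a couple of handy ways. The sketch below hard-codes the example output shown above so it runs on its own; `sentences_by_label` and `label_by_index` are illustrative names, not part of the library.

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Hard-coded from the example output; normally this comes from
# get_labels_from_clusters().
clusters_labelled = {
    "Weather great perfect": [0, 1, 2],
    "Dog eat homework": [3, 4, 5],
}

# Topic label -> the actual sentences in that topic.
sentences_by_label = {
    label: [sentences[i] for i in indices]
    for label, indices in clusters_labelled.items()
}

# Sentence index -> topic label, for looking up a single sentence's topic.
label_by_index = {
    i: label
    for label, indices in clusters_labelled.items()
    for i in indices
}

print(sentences_by_label["Dog eat homework"])
print(label_by_index[1])
```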
You can also just get it all at once:
# Get clusters with labels
labelled_clusters = clusterer.get_clusters_with_labels(sentences)
print(labelled_clusters)
Contributing
You can view all the info on development and contributing here
Looking Forward
I have done virtually no performance testing: I wrote this once for a side project, and it did everything I needed.
Some ideas to work on:
- Allow custom tokenizers
- Benchmark performance on large datasets
- Allow for feature extraction locally