Super Simple Labelled Topic Clustering

Labelled Topic Clustering

Labelled Topic Clustering does what the name suggests: feed it an array of sentences and it will group them into clusters and give each cluster a human-readable name.

The aim of this project is to make it as easy as possible to:

  1. generate topic clusters on a text dataset using a cosine-similarity approach
  2. get human-readable labels for those clusters
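
To give a feel for what "a cosine-similarity approach" can mean, here is a minimal sketch in plain Python. It is not the code this package runs: the greedy grouping strategy, the `greedy_clusters` helper, the `0.8` threshold and the two-dimensional toy vectors are all assumptions made for illustration. Real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def greedy_clusters(embeddings, threshold=0.8):
    """Hypothetical helper: put each vector in the first cluster whose
    first member is at least `threshold` similar, else start a new cluster."""
    clusters = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            if cosine_similarity(embeddings[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy 2-dimensional "embeddings" (made up for this example).
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
]

toy_clusters = greedy_clusters(toy_embeddings)
print(toy_clusters)  # [[0, 1], [2, 3]]
```

The result has the same shape as the output described in the Usage section: a list of clusters, each holding indices into the input.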

Installation

To use the TopicClusterer class, install the package with a package manager such as pip:

pip install labelled-topic-clustering

Usage

  1. Initialize the TopicClusterer:
from topic_clusterer import TopicClusterer

hf_token = "your_hugging_face_token"
# This can be any sentence-transformers model; anecdotally, I've found this one works best.
model = "sentence-transformers/all-mpnet-base-v2"

clusterer = TopicClusterer(hf_token, model, debug=True)
  2. Get clusters:
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework"
]

clusters = clusterer.get_clusters(sentences)

Example Output

[[0, 1, 2], [3, 4, 5]]

clusters is a 2D array: each inner array is one cluster, holding the indices of its sentences in the original dataset.
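
Because the clusters hold indices rather than text, you will usually want to map them back to the sentences. A small sketch, assuming the example output above and plain Python lists (the `clustered_sentences` name is just for illustration):

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# The example output from get_clusters, copied in so this runs on its own.
clusters = [[0, 1, 2], [3, 4, 5]]

# Replace every index with the sentence it points to.
clustered_sentences = [[sentences[i] for i in cluster] for cluster in clusters]

for number, group in enumerate(clustered_sentences):
    print(f"Cluster {number}:")
    for sentence in group:
        print(f"  - {sentence}")
```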

  3. Get labels from clusters:
clusters_labelled = clusterer.get_labels_from_clusters(clusters, sentences)

Example Output

{'Weather great perfect': [0, 1, 2], 'Dog eat homework': [3, 4, 5]}

clusters_labelled is a dictionary where the keys are topic labels, and the values are arrays of sentence indices corresponding to the original dataset.
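
If you would rather look up the topic for a given sentence, the dictionary can be inverted. This is a sketch built on the example output above; `label_for_sentence` is an illustrative name, not part of the package.

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# The example output from get_labels_from_clusters, copied in so this runs on its own.
clusters_labelled = {
    "Weather great perfect": [0, 1, 2],
    "Dog eat homework": [3, 4, 5],
}

# Build a sentence -> label lookup from the label -> indices dictionary.
label_for_sentence = {}
for label, indices in clusters_labelled.items():
    for i in indices:
        label_for_sentence[sentences[i]] = label

print(label_for_sentence["my dog ate my homework"])  # Dog eat homework
```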

You can also just get it all at once:

# Get clusters with labels
labelled_clusters = clusterer.get_clusters_with_labels(sentences)
print(labelled_clusters)

Contributing

You can view all the info on development and contributing here

Looking Forward

I have done virtually no performance testing as I wrote this once and it was all I needed for a side project.

Some ideas to work on:

  • Allow custom tokenizers
  • Benchmark performance on large datasets
  • Allow for feature extraction locally
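
For the benchmarking idea, a starting point could look like the sketch below. Everything in it is an assumption: `time_call` is a hypothetical helper, and `fake_get_clusters` is a stand-in so the snippet runs without a Hugging Face token. To measure the real thing, pass `clusterer.get_clusters` instead.

```python
import time


def time_call(func, *args, repeats=3):
    """Hypothetical helper: best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def fake_get_clusters(sentences):
    # Stand-in for clusterer.get_clusters: puts every sentence in its own cluster.
    return [[i] for i in range(len(sentences))]


timings = {}
for size in (100, 1_000, 10_000):
    sample = [f"sentence number {i}" for i in range(size)]
    timings[size] = time_call(fake_get_clusters, sample)
    print(f"{size:>6} sentences: {timings[size]:.6f} s")
```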
