
User-friendly, low-code text clustering


Text Clustering

This repository contains tools to easily embed and cluster texts, semantically label the resulting clusters, and produce visualizations of those labeled clusters.

Clustering of texts in the Cosmopedia dataset.

This project is a fork of 'huggingface/text-clustering'. The following changes have been made:

  1. Projection and clustering algorithms can now be selected by the user as appropriate for their use case.
  2. Each algorithm's relevant hyperparameters can be provided by the user as a dictionary, without having to store all possible hyperparameters.
  3. Visualizations can now be rendered interactively in three dimensions.
  4. The pipeline can be run and re-run with new hyperparameters, or even new projection and/or clustering algorithms, without unnecessarily repeating computationally expensive embedding or projection steps.
  5. Texts can be batched into groups prior to clustering.
  6. A simple automated test suite has been added to the repo.
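As an illustration of item 5, texts can be grouped into batches before clustering so that related short texts are treated as a single document. The sketch below is plain Python for intuition only; the library's own batching option may use a different interface.

```python
def batch_texts(texts, batch_size):
    """Group texts into batches of `batch_size`, joining each group
    into a single document so the group is embedded and clustered
    as one unit."""
    return [
        " ".join(texts[i:i + batch_size])
        for i in range(0, len(texts), batch_size)
    ]

texts = ["a b", "c d", "e f", "g h", "i j"]
batches = batch_texts(texts, 2)
# batches == ["a b c d", "e f g h", "i j"]
```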

Additionally, a substantial amount of documentation has been added to this repository, covering both the new and the original functionality and improving readability and usability. This documentation is available as comments in the code and as a standalone document.

Documentation can be found here.

How it works

The pipeline consists of several distinct, customizable blocks, and the whole pipeline can run in a few minutes on a consumer laptop. Each block uses existing standard methods and works robustly. The default pipeline is shown in the graphic below.

Text clustering pipeline.

As in the original repo, users can choose alternative models for embedding and labeling. Additionally, in this version, users can choose alternative algorithms for projection and clustering, and customize all hyperparameters for those algorithms.
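To build intuition for what the clustering block does, here is a toy k-means in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is only a sketch; the actual pipeline uses established implementations (e.g. scikit-learn's KMeans or HDBSCAN), not this code.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of 2D points; k-means recovers them.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```

In the real pipeline the "points" are projected text embeddings rather than hand-written 2D coordinates.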

Install

Install the following libraries to get started:

pip install scikit-learn umap-learn sentence_transformers faiss-cpu plotly matplotlib datasets

Clone this repository and navigate to the folder:

git clone https://github.com/billingsmoore/text-clustering.git
cd text-clustering

Basic Usage

Run pipeline and visualize results:

from src.text_clustering import ClusterClassifier
from datasets import load_dataset

SAMPLE = 100_000

texts = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train").select(range(SAMPLE))["text"]

cc = ClusterClassifier()

# run the pipeline:
embs, labels, summaries = cc.fit(texts)

# show the results
cc.show()

# save 
cc.save("./cc_100k")

Load classifier and run inference:

from src.text_clustering import ClusterClassifier

cc = ClusterClassifier()

# load state
cc.load("./cc_100k")

# visualize
cc.show()

# classify new texts with k-nearest neighbour search
some_texts = ["A new document to classify.", "Another unseen text."]
cluster_labels, embeddings = cc.infer(some_texts, top_k=1)

If you want to reproduce the color scheme in the plot above, add the following code before you run cc.show():

from cycler import cycler
import matplotlib.pyplot as plt

default_cycler = cycler(color=[
    "#0F0A0A",
    "#FF6600",
    "#FFBE00",
    "#496767",
    "#87A19E",
    "#FF9200",
    "#0F3538",
    "#F8E08E",
    "#0F2021",
    "#FAFAF0",
])
plt.rc('axes', prop_cycle=default_cycler)

If you would like to customize the plotting further, the easiest way is to customize or override the _show_mpl and _show_plotly methods.
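The override pattern looks like the sketch below. Note that the base class here is a minimal stand-in written only to make the example self-contained; the real ClusterClassifier (and the exact signatures of _show_mpl and _show_plotly) live in src.text_clustering.

```python
# Minimal stand-in base class for illustration only; in practice you
# would subclass the real ClusterClassifier from src.text_clustering.
class ClusterClassifier:
    def _show_mpl(self):
        return "default static plot"

    def _show_plotly(self):
        return "default interactive plot"


class StyledClusterClassifier(ClusterClassifier):
    def _show_mpl(self):
        # Custom matplotlib styling (colors, labels, annotations, ...)
        # would go here instead of the default rendering.
        return "styled static plot"


cc = StyledClusterClassifier()
```

Only the overridden method changes; the plotly rendering is inherited unchanged.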

Advanced Usage

from src.text_clustering import ClusterClassifier
from datasets import load_dataset

SAMPLE = 100_000

texts = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train").select(range(SAMPLE))["text"]

# initialize the ClusterClassifier to use TruncatedSVD with appropriate params
# also set the clustering to use KMeans clustering with appropriate params
cc = ClusterClassifier(
    projection_algorithm='tsvd', 
    projection_args={'n_components': 5, 'n_iter': 7, 'random_state': 42},
    clustering_algorithm='kmeans',
    clustering_args={'n_clusters': 2, 'random_state': 0, 'n_init': "auto"})

# run the pipeline:
cc.fit(texts)

# show the results
cc.show()

# if results are unsatisfactory, refit with new selections
cc.fit(
    projection_algorithm='pca', 
    projection_args={'n_components': 3},
    clustering_algorithm='hdbscan',
    clustering_args={'min_cluster_size': 10})

cc.show()


# still unsatisfied? you can keep projections, but change clustering params
cc.fit(clustering_args={'min_cluster_size': 25})

cc.show()

# save when done
cc.save("./cc_100k")

Command Line Usage

You can also run the pipeline using a script with:

# run a new pipeline
python run_pipeline.py --mode run  --save_load_path './cc_100k' --n_samples 100000 --build_hf_ds
# load existing pipeline
python run_pipeline.py --mode load --save_load_path './cc_100k' --build_hf_ds
# inference mode on new texts from an input dataset
python run_pipeline.py --mode infer --save_load_path './cc_100k'  --n_samples <NB_INFERENCE_SAMPLES> --input_dataset <HF_DATA_FOR_INFERENCE>

The --build_hf_ds flag builds and pushes HF datasets (for both the files and the clusters) that can be used directly in the FW visualization space. In infer mode, the clusters dataset is pushed by default.

You can also change how the clusters are labeled using the --topic_mode flag: multiple topics (the default) or a single topic with an educational score.

Examples

Check the examples folder for an example of clustering and topic labeling applied to the AutoMathText dataset, utilizing Cosmopedia's web labeling approach.
