
A new transformer-based topic modeling library.

LeetTopic builds upon Top2Vec, BERTopic, and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic allows users to control the degree to which outliers are resolved into neighboring topics.

It also lets you turn any DataFrame into a Bokeh application for exploring your documents and topics. As of 0.0.10, LeetTopic also allows users to generate an Annoy Index as part of the LeetTopic pipeline, so that they can query their data.

Installation

pip install leet-topic

Parameters

  • df => a Pandas DataFrame that contains the documents that you want to model
  • document_field => the DataFrame column name where your documents sit
  • html_filename => the filename used to generate the Bokeh application
  • extra_fields => a list of extra columns to include in the Bokeh application
  • max_distance => the maximum distance between an outlier document and the nearest topic vector at which the outlier is still assigned to that topic

Usage

import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          extra_fields=["names", "hdbscan_labels"],
                                          max_distance=.5)

Multilingual Support

With LeetTopic, you can work with texts in any language supported by spaCy for lemmatization and any model from HuggingFace via Sentence Transformers.

Here is an example working with Croatian:

import pandas as pd
from leet_topic import leet_topic

df = pd.DataFrame(["Bok. Kako ste?", "Drago mi je"]*20, columns=["text"])
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="text",
                                          html_filename="demo.html",
                                          extra_fields=["hdbscan_labels"],
                                          spacy_model="hr_core_news_sm",
                                          max_distance=.5)

Custom UMAP and HDBSCAN Parameters

It is often necessary to control how your embeddings are reduced with UMAP and clustered with HDBSCAN. As of 0.0.9, you can control these parameters with dictionaries.

import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          extra_fields=["names", "hdbscan_labels"],
                                          umap_params={"n_neighbors": 15, "min_dist": 0.01, "metric": 'correlation'},
                                          hdbscan_params={"min_samples": 10, "min_cluster_size": 5},
                                          max_distance=.5)
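The dictionaries above only need to list the settings you want to change. As a minimal sketch of that idea (not LeetTopic's actual code), the hypothetical helper merge_params below overlays user-supplied values on a set of defaults. The default values shown are umap-learn's own defaults and are only assumed here for illustration; LeetTopic's internal defaults may differ.

```python
# Hypothetical helper for illustration; it is not part of LeetTopic's API.
def merge_params(defaults, overrides=None):
    """Return a new dict in which user-supplied overrides win over defaults."""
    merged = dict(defaults)          # copy so the defaults are never mutated
    merged.update(overrides or {})   # apply only the keys the user passed
    return merged


# Assumed defaults (these mirror umap-learn's defaults, not necessarily LeetTopic's).
UMAP_DEFAULTS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "euclidean"}

umap_params = merge_params(UMAP_DEFAULTS, {"min_dist": 0.01, "metric": "correlation"})
print(umap_params)
```

A merged dictionary like this can then be unpacked into a constructor with the ** syntax, which is a common way for a library to forward such settings.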

Create an Annoy Index

As of 0.0.10, LeetTopic can also return an Annoy Index when build_annoy=True is set.

import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data, annoy_index = leet_topic.LeetTopic(df, "descriptions",
            "demo.html",
            build_annoy=True)

With the Annoy Index, you can build a simple semantic search engine. For example, you can query the index by encoding a new text with the same model.

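from sentence_transformers import SentenceTransformer

# annoy_index and df come from the previous example (build_annoy=True).
# Load the same model that was used to embed the documents.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the query text into a vector
emb = model.encode("An individual who was arrested.")

# Get the row positions of the 10 nearest documents
res = annoy_index.get_nns_by_vector(emb, 10)

# Look up the matching documents in the original DataFrame
print(df.iloc[res].descriptions.tolist())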
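To make the query step more concrete, here is a small sketch of what a nearest-neighbor lookup returns. It uses an exact brute-force search in NumPy as a stand-in for Annoy, which is approximate and tree-based. The function name nearest_rows and the toy vectors are hypothetical, and ranking by cosine similarity is an assumption about the metric, not something this README specifies.

```python
import numpy as np


def nearest_rows(query, vectors, n=10):
    """Return the indices of the n rows of `vectors` most similar to `query`.

    Exact brute-force search by cosine similarity (a stand-in for Annoy).
    """
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize so that the dot product equals the cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarity = unit_vectors @ unit_query
    # Highest similarity first; a stable sort keeps ties in row order.
    return np.argsort(-similarity, kind="stable")[:n].tolist()


# Toy 2-D "embeddings" standing in for real sentence embeddings.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_rows([0.2, 1.0], toy_vectors, n=2))
```

As in the Annoy example, the returned integers are row positions, so they can be passed to df.iloc to retrieve the matching documents. Annoy's get_nns_by_vector also accepts include_distances=True if you want the distances alongside the indices.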

Outputs

The code above generates a new DataFrame with the UMAP projection (x, y), hdbscan_labels, leet_labels, and the top-n words for each document. It also outputs data about each topic, including the central point of each topic vector, the documents assigned to it, and the top-n words associated with it.

Finally, the code writes an HTML file that is a self-contained Bokeh application like the image below.

demo
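Because the output DataFrame keeps both hdbscan_labels and leet_labels, you can check how many outliers were resolved. The sketch below uses a small hand-made DataFrame in place of a real LeetTopic result. It assumes that unresolved outliers keep the label -1 in leet_labels, which this README does not state explicitly.

```python
import pandas as pd

# Toy stand-in for the returned DataFrame; only the columns named above are used.
leet_df = pd.DataFrame({
    "x": [0.1, 0.2, 5.0, 5.1, 9.0],
    "y": [0.0, 0.1, 5.0, 5.2, 9.0],
    "hdbscan_labels": [0, -1, 1, 1, -1],
    "leet_labels": [0, 0, 1, 1, -1],  # assumption: -1 means "still an outlier"
})

# Documents that HDBSCAN originally marked as outliers
was_outlier = leet_df["hdbscan_labels"] == -1
# Outliers that ended up with a topic after the LeetTopic step
resolved = was_outlier & (leet_df["leet_labels"] != -1)
# Outliers that stayed unassigned
still_outlier = was_outlier & (leet_df["leet_labels"] == -1)

print(f"original outliers: {int(was_outlier.sum())}")
print(f"resolved into a topic: {int(resolved.sum())}")
print(f"left as outliers: {int(still_outlier.sum())}")
```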

Steps

LeetTopic takes an input DataFrame and converts the document field (the texts to model) into embeddings via a transformer model. Next, UMAP is used to reduce the embeddings to 2 dimensions. HDBSCAN is then used to assign documents to topics. As with BERTopic and Top2Vec, there are often many outliers at this stage (documents assigned to topic -1).

LeetTopic, like Top2Vec, then calculates the centroid of each topic based on the HDBSCAN labels while ignoring topic -1 (the outliers). Next, outlier documents are assigned to the nearest topic centroid. Unlike Top2Vec, LeetTopic lets the user set a maximum distance, so that outliers that are too far from every topic centroid are not assigned to a topic. At the same time, the output DataFrame contains the original HDBSCAN topics, so users know whether a document was originally an outlier.
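The outlier step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not LeetTopic's actual implementation: the function name resolve_outliers is hypothetical, distances are assumed to be Euclidean in the 2-D UMAP space, and an outlier exactly at max_distance is assumed to be assigned.

```python
import numpy as np


def resolve_outliers(points, labels, max_distance):
    """Assign each outlier (label -1) to the nearest topic centroid,
    but only if that centroid lies within max_distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    topics = [t for t in np.unique(labels) if t != -1]
    new_labels = labels.copy()
    if not topics:
        return new_labels  # nothing to assign outliers to
    # Centroid of each topic, computed from its non-outlier points only.
    centroids = np.array([points[labels == t].mean(axis=0) for t in topics])
    for i in np.flatnonzero(labels == -1):
        distances = np.linalg.norm(centroids - points[i], axis=1)
        nearest = int(np.argmin(distances))
        if distances[nearest] <= max_distance:
            new_labels[i] = topics[nearest]
    return new_labels


points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [0.4, 0.0], [9.0, 9.0]]
labels = [0, 0, 1, 1, -1, -1]
print(resolve_outliers(points, labels, max_distance=0.5).tolist())
```

In this toy data, the outlier at (0.4, 0.0) is close to the centroid of topic 0 and is absorbed, while the outlier at (9.0, 9.0) is far from both centroids and stays at -1. Raising max_distance resolves more outliers; lowering it keeps more documents unassigned.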

Future Roadmap

0.0.9

  • Control UMAP parameters
  • Control HDBSCAN parameters
  • Multilingual support for lemmatization
  • Multilingual support for embedding
  • Add support for custom App Titles

0.0.10

  • Output an Annoy Index so that the data can be queried

0.0.11

  • Support for embedding text, images, or both via CLIP and displaying the results in the same Bokeh application
