Topic Modeling using Transformers and advanced visualization

Bunkatopics

Bunkatopics is a topic modeling and visualization package that leverages Hugging Face Transformers through langchain. It is built with the same philosophy as BERTopic but goes deeper into visualization to help users quickly and intuitively grasp the content of thousands of texts. It also allows for a supervised visual representation by letting the user create continuums with natural language.

Installation

First, create a new virtual environment using pyenv:

pyenv virtualenv 3.9 bunkatopics_env

Activate the environment:

pyenv activate bunkatopics_env

Then install the Bunkatopics package:

pip install bunkatopics

Install the spaCy tokenizer model for English:

python -m spacy download en_core_web_sm
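
Before running the examples, it can help to confirm that the installation steps above succeeded. The following is a minimal sketch, not part of Bunkatopics: `missing_packages` is a hypothetical helper that uses only the standard library, and it relies on spaCy models installed with `spacy download` being importable as regular packages.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the installation steps above.
    missing = missing_packages(["bunkatopics", "spacy", "en_core_web_sm"])
    print("Missing:", missing if missing else "nothing, the environment looks ready")
```

An empty result suggests the environment is ready; any name that is listed points to the installation step to repeat.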

Contributing

Any contribution is more than welcome. To set up a development environment:

pip install poetry
git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics

# Create the environment from the poetry.lock file.
poetry install  # Installs all packages listed in poetry.lock inside a virtual environment

# Start the environment
poetry shell

Getting Started

Name	Link
Visual Topic Modeling With Bunkatopics

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

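from bunkatopics import Bunka  # Bunka is the class instantiated below
from sklearn.datasets import fetch_20newsgroups
import random

# Load the corpus without headers, footers and quoted replies.
full_docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Draw a random sample of 1,000 documents.
full_docs_random = random.sample(full_docs, 1000)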
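
Two practical details are worth handling when sampling: stripping headers, footers and quotes can leave some newsgroup posts empty, and `random.sample` returns a different subset on every run. The following is a minimal sketch, assuming you want a reproducible sample of non-empty documents; `sample_docs` and its parameters are hypothetical names, not part of Bunkatopics.

```python
import random


def sample_docs(docs, k, seed=42):
    """Return up to `k` non-empty documents, chosen reproducibly."""
    # Drop documents that are empty or whitespace-only.
    non_empty = [doc for doc in docs if doc.strip()]
    # A dedicated Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(non_empty, min(k, len(non_empty)))


# Toy corpus standing in for `full_docs`.
toy_docs = ["first post", "", "second post", "   ", "third post"]
print(sample_docs(toy_docs, k=2))
```

With the real corpus, the call would look like `sample_docs(full_docs, 1000)`.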

You can then load any embedding model available through langchain. Some of them are large, so check the langchain documentation before downloading one.

If you want to start with a small model:

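from langchain.embeddings import HuggingFaceEmbeddings

# Small sentence-embedding model.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

bunka = Bunka(model_hf=embedding_model)

# Fit on the documents; pass full_docs_random instead to use the 1,000-document sample.
bunka.fit(full_docs)
# Extract 20 topics.
df_topics = bunka.get_topics(n_clusters=20)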

If you want a larger embedding model such as Instructor:

from langchain.embeddings import HuggingFaceInstructEmbeddings

embedding_model = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    embed_instruction="Embed the documents for visualisation of Topic Modeling on a map : ",
)

bunka = Bunka(model_hf=embedding_model)

bunka.fit(full_docs)
df_topics = bunka.get_topics(n_clusters=20)

Then, we can visualize the topics:

topic_fig = bunka.visualize_topics(width=800, height=800)
topic_fig
...

The map displays the texts on a two-dimensional, unsupervised scale. Every region of the map is a topic described by its most specific terms.
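
To make "most specific terms" concrete, here is a minimal sketch of one way to score specificity. How Bunkatopics actually scores terms is not stated here; the sketch uses a simple relative-frequency ratio as a stand-in, and `specific_terms` and the toy documents are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def specific_terms(topic_docs, all_docs, top_n=3):
    """Rank a topic's terms by over-representation relative to the whole corpus.

    Assumes `topic_docs` is a subset of `all_docs`, so every topic term
    also occurs in the corpus counts.
    """
    topic_counts = Counter(w for doc in topic_docs for w in doc.lower().split())
    corpus_counts = Counter(w for doc in all_docs for w in doc.lower().split())
    topic_total = sum(topic_counts.values())
    corpus_total = sum(corpus_counts.values())

    def lift(word):
        # (share of the term in the topic) / (share of the term in the corpus),
        # kept as an exact fraction so that ties compare reliably.
        return Fraction(topic_counts[word] * corpus_total,
                        corpus_counts[word] * topic_total)

    # Highest lift first; ties broken by topic frequency, then alphabetically.
    ranked = sorted(topic_counts, key=lambda w: (-lift(w), -topic_counts[w], w))
    return ranked[:top_n]


space_docs = ["the rocket reached orbit", "the probe left orbit"]
sport_docs = ["the team won the game", "the game went long"]
print(specific_terms(space_docs, space_docs + sport_docs))
```

A word such as "the", which is spread evenly across topics, ranks below words that appear only in the topic at hand.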

bourdieu_fig = bunka.visualize_bourdieu(x_left_words=["past"],
                                        x_right_words=["future", "futuristic"],
                                        y_top_words=["politics", "Government"],
                                        y_bottom_words=["cultural phenomenons"],
                                        height=2000,
                                        width=2000)

The power of this visualisation lies in constraining the axes: you define continuums in natural language and see how the data is distributed along them. The inspiration comes from the French sociologist Bourdieu, who projected items onto two-dimensional maps.
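
One plausible way to place a document on such a continuum is to compare its embedding with the embeddings of the two poles. The sketch below is an illustration under that assumption, not a description of Bunkatopics' internals; `cosine`, `continuum_position` and the toy 2-D vectors are hypothetical stand-ins for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def continuum_position(doc_vec, left_vec, right_vec):
    """Place a document on a left-right axis: negative is nearer the left pole."""
    return cosine(doc_vec, right_vec) - cosine(doc_vec, left_vec)


# Toy 2-D vectors standing in for real embeddings of the poles and documents.
past, future = [1.0, 0.0], [0.0, 1.0]
history_doc = [0.9, 0.1]
scifi_doc = [0.2, 0.8]

print(continuum_position(history_doc, past, future))  # negative: closer to "past"
print(continuum_position(scifi_doc, past, future))    # positive: closer to "future"
```

Applying the same function with a second pair of poles would give the vertical coordinate, so each document ends up as a point on a two-axis map.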

Multilanguage

The package uses spaCy to extract meaningful terms for the topic representation.

If you wish to change the language to French, first download the corresponding spaCy model:

python -m spacy download fr_core_news_lg

from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="distiluse-base-multilingual-cased-v2")

bunka = Bunka(model_hf=embedding_model, language='fr_core_news_lg')

# full_docs should contain French documents here.
bunka.fit(full_docs)
df_topics = bunka.get_topics(n_clusters=20)

Functionality

Here are all the things you can do with Bunkatopics.

Common

Below, you will find an overview of common functions in Bunkatopics.

Method	Code
Fit the model	`.fit(docs)`
Fit the model and get the topics	`.fit_transform(docs)`
Access the topics	`.get_topics(n_clusters=10)`
Access the top documents per topic	`.get_top_documents()`
Access the distribution of topics	`.get_topic_repartition()`
Visualize the topics on a Map	`.visualize_topics()`
Visualize the topics on Natural Language Supervised axis	`.visualize_bourdieu()`
Access the Coherence of Topics	`.get_topic_coherence()`
Get the closest documents to your search	`.search('politics')`
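
As an illustration of what "closest documents" can mean, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. This is an assumption about the general technique, not a description of how `.search()` is implemented; `closest_documents` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closest_documents(query_vec, doc_vecs, top_k=2):
    """Return (doc_id, similarity) pairs for the `top_k` most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical pre-computed embeddings keyed by document id.
doc_vecs = {
    "doc_election": [0.9, 0.1, 0.0],
    "doc_football": [0.0, 0.2, 0.9],
    "doc_parliament": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedding of "politics"

for doc_id, score in closest_documents(query_vec, doc_vecs):
    print(doc_id, round(score, 3))
```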

Attributes

You can access several attributes:

Attribute	Description
`.docs`	The documents, stored as Document pydantic models.
`.topics`	The topics, stored as Topic pydantic models.

This version: 0.38, released Jun 13, 2023.
