BunkaTopics
BunkaTopics is a topic modeling package that leverages embeddings and focuses on topic representation to extract meaningful, interpretable topics from a list of documents.
Installation
Before installing bunkatopics, download the following spaCy language models:
python -m spacy download fr_core_news_lg
python -m spacy download en_core_web_sm
Finally, install bunkatopics using pip:
pip install bunkatopics
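To confirm that the language models are available before running the quick start, a small check along the following lines may help. This is only a sketch: it assumes that models downloaded with `python -m spacy download` are installed as importable Python packages under the same names, and the helper `missing_models` is a hypothetical name, not part of bunkatopics.

```python
from importlib.util import find_spec

# Model names taken from the download commands in the installation steps.
REQUIRED_MODELS = ("fr_core_news_lg", "en_core_web_sm")


def missing_models(names=REQUIRED_MODELS):
    """Return the names that cannot be located as importable packages.

    Hypothetical helper for illustration; not part of bunkatopics.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing spaCy models:", ", ".join(missing))
    else:
        print("All required spaCy models are installed.")
```

If a model is reported missing, rerun the matching `python -m spacy download` command.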
Quick Start with BunkaTopics
from bunkatopics import BunkaTopics
import pandas as pd
data = pd.read_csv('data/imdb.csv', index_col = [0])
data = data.sample(2000, random_state = 42)
# Instantiate the model, extract the terms, and embed the documents
model = BunkaTopics(
    data,  # DataFrame
    text_var="description",  # Text column
    index_var="imdb",  # Index column (mandatory)
    extract_terms=True,  # Extract terms?
    terms_embeddings=True,  # Extract term embeddings?
    docs_embeddings=True,  # Extract document embeddings?
    embeddings_model="distiluse-base-multilingual-cased-v1",  # Choose an embeddings model
    multiprocessing=True,  # Multiprocessing of embeddings
    language="en",  # Choose between English "en" and French "fr"
    sample_size_terms=len(data),
    terms_limit=10000,  # Top terms to output
    terms_ents=True,  # Extract entities
    terms_ngrams=(1, 2),  # Choose n-grams to extract
    terms_ncs=True,  # Extract noun chunks
    terms_include_pos=["NOUN", "PROPN", "ADJ"],  # Parts of speech to include
    terms_include_types=["PERSON", "ORG"],  # Entity types to include
)
# Extract the topics
topics = model.get_clusters(
    topic_number=15,  # Number of topics
    top_terms_included=1000,  # Compute the specific terms from the top n terms
    top_terms=5,  # Most specific terms to describe the topics
    term_type="lemma",  # Use "lemma" or "text"
    ngrams=[1, 2],  # N-grams for topic representation
    clusterer="hdbscan",  # Choose between KMeans and HDBSCAN
)
# Visualize the clusters. It is advised to use no more than 5 terms (top_terms = 5) to avoid overloading the figure
fig = model.visualize_clusters(
    search=None,
    width=1000,
    height=1000,
    fit_clusters=True,  # Fit UMAP to visually separate the clusters
    density_plot=False,  # Plot a density map to get a territory overview
)
fig.show()
centroid_documents = model.get_centroid_documents(top_elements=2)
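The `top_terms` comment above describes each topic by its "most specific terms". As an illustration of that idea only, the sketch below ranks terms inside each cluster by how often they occur there and how exclusive they are to it. It is an assumption made for this example, not the scoring BunkaTopics actually uses; the function `specific_terms` and the toy data are hypothetical.

```python
from collections import Counter


def specific_terms(clusters, top_terms=5):
    """Rank terms per cluster by frequency weighted by exclusivity.

    clusters: dict mapping a cluster label to a list of tokenized documents.
    Illustrative scoring only; not the BunkaTopics implementation.
    """
    # Count term occurrences inside each cluster.
    per_cluster = {
        label: Counter(term for doc in docs for term in doc)
        for label, docs in clusters.items()
    }
    # Count term occurrences across all clusters.
    overall = Counter()
    for counts in per_cluster.values():
        overall.update(counts)

    result = {}
    for label, counts in per_cluster.items():
        # score = count in cluster * (count in cluster / count overall);
        # ties are broken alphabetically so the output is deterministic.
        ranked = sorted(
            counts,
            key=lambda term: (-(counts[term] ** 2 / overall[term]), term),
        )
        result[label] = ranked[:top_terms]
    return result


# Toy data: two clusters of already tokenized documents.
toy_clusters = {
    "space": [["alien", "ship", "crew"], ["alien", "planet", "ship"]],
    "crime": [["detective", "murder", "crew"], ["detective", "city", "murder"]],
}
toy_topics = specific_terms(toy_clusters, top_terms=2)
```

A term shared across clusters, such as "crew" in the toy data, scores lower than terms concentrated in a single cluster, which is the intuition behind describing a topic with a handful of specific terms.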