BunkaTopics

BunkaTopics is a Topic Modeling package that leverages Embeddings and focuses on Topic Representation to extract meaningful and interpretable topics from a list of documents.

Installation

Before installing bunkatopics, download the spaCy language models it uses:

python -m spacy download fr_core_news_lg
python -m spacy download en_core_web_sm

Finally, install bunkatopics using pip:

pip install bunkatopics
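
To confirm that the steps above succeeded, a small check such as the following may help. It is only a sketch: the helper name check_installed is hypothetical, and it assumes that the spaCy models are installed as importable Python packages under the names used in the download commands, and that the package is imported as bunkatopics (as in the Quick Start).

```python
from importlib.util import find_spec

# Names taken from the install commands above. Assumption: spaCy models
# are installed as importable packages under these exact names.
REQUIRED = ["spacy", "fr_core_news_lg", "en_core_web_sm", "bunkatopics"]


def check_installed(names):
    """Return {name: bool} telling whether each top-level package is importable."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING points to the install command that should be run again.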

Quick Start with BunkaTopics

from bunkatopics import BunkaTopics
import pandas as pd

data = pd.read_csv('data/imdb.csv', index_col = [0])
data = data.sample(2000, random_state = 42)

# Instantiate the model, extract the terms, and embed the documents

model = BunkaTopics(data,  # DataFrame
                    text_var='description',  # Text column
                    index_var='imdb',  # Index column (mandatory)
                    extract_terms=True,  # Extract terms?
                    terms_embeddings=True,  # Extract term embeddings?
                    docs_embeddings=True,  # Extract document embeddings?
                    embeddings_model="distiluse-base-multilingual-cased-v1",  # Choose an embeddings model
                    multiprocessing=True,  # Multiprocessing of embeddings
                    language="en",  # Choose between English "en" and French "fr"
                    sample_size_terms=len(data),
                    terms_limit=10000,  # Top terms to output
                    terms_ents=True,  # Extract entities
                    terms_ngrams=(1, 2),  # Choose n-grams to extract
                    terms_ncs=True,  # Extract noun chunks
                    terms_include_pos=["NOUN", "PROPN", "ADJ"],  # Part-of-speech tags to include
                    terms_include_types=["PERSON", "ORG"])  # Entity types to include

# Extract the topics

topics = model.get_clusters(topic_number=15,  # Number of topics
                            top_terms_included=1000,  # Compute the specific terms from the top n terms
                            top_terms=5,  # Most specific terms used to describe each topic
                            term_type="lemma",  # Use "lemma" or "text"
                            ngrams=[1, 2],  # N-grams for topic representation
                            clusterer='hdbscan')  # Choose between KMeans and HDBSCAN

# Visualize the clusters. It is advised to use no more than 5 terms per topic
# (top_terms = 5) to avoid overloading the figure.

fig = model.visualize_clusters(search=None,
                               width=1000,
                               height=1000,
                               fit_clusters=True,  # Fit UMAP to visually separate the clusters
                               density_plot=False)  # Plot a density map to get an overview of the territory

fig.show()


# Retrieve the centroid documents of each cluster (here, the top 2)
centroid_documents = model.get_centroid_documents(top_elements=2)
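
The terms_ngrams=(1, 2) and ngrams=[1, 2] settings above ask for single words and two-word sequences. The sketch below illustrates what that means on a tokenized sentence. It is not the library's implementation: the ngrams helper is hypothetical, the bounds are assumed to be inclusive, and the lemmatization and part-of-speech filtering that BunkaTopics applies through spaCy are left out.

```python
def ngrams(tokens, n_range=(1, 2)):
    """Return all contiguous n-grams, as space-joined strings, for every n
    between n_range[0] and n_range[1] (both bounds assumed inclusive)."""
    low, high = n_range
    out = []
    for n in range(low, high + 1):
        # Slide a window of size n over the token list
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "a quiet family drama".split()
print(ngrams(tokens))
# ['a', 'quiet', 'family', 'drama', 'a quiet', 'quiet family', 'family drama']
```

Raising the upper bound produces longer candidate terms, at the cost of a larger and sparser vocabulary.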
