Automatic detection of robust parametrizations for LDA and NMF. Compatible with scikit-learn and gensim.
robics
robics (robustTopics) is a library targeted at non-machine-learning experts interested in building robust topic models. Its main goal is to provide a simple-to-use framework for checking whether a topic model reaches the same, or at least a similar, result on each run.
Features
- Supports sklearn (LatentDirichletAllocation, NMF) and gensim (LdaModel, ldamulticore, nmf) topic models
- Creates samples based on the Sobol sequence, which requires fewer samples than grid search and ensures the whole parameter space is covered, something random sampling cannot guarantee.
- Simple topic matching between the different re-initializations for each sample using word vector based coherence scores.
- Ranking of all models based on three metrics:
  - Jaccard distance of the top n words of each topic
  - Similarity of the topic distributions based on the Jensen-Shannon divergence
  - Rank correlation of the top n words based on Kendall's tau
- Word-based analysis of samples and topic model instances.
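To see how Sobol sampling spreads points evenly over a parameter space, here is a minimal sketch using SciPy's quasi-Monte Carlo module; the hyperparameter names and ranges are made up for illustration, robics handles this step internally:

```python
# Illustration of low-discrepancy (Sobol) sampling of a hyperparameter
# space. The parameters (n_topics, alpha) and their ranges are
# hypothetical, not robics' actual search space.
from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=True, seed=0)
unit_points = sampler.random_base2(m=3)  # 2**3 = 8 points in [0, 1)^2

# Map the unit square onto example ranges: n_topics in [5, 50], alpha in [0.01, 1.0]
points = qmc.scale(unit_points, l_bounds=[5, 0.01], u_bounds=[50, 1.0])
for n_topics, alpha in points:
    print(f"n_topics={int(round(n_topics))}, alpha={alpha:.3f}")
```

Unlike independent random draws, consecutive Sobol points deliberately avoid each other, so even a small sample touches every region of the space.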
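The three ranking metrics can be illustrated on a toy pair of re-initializations of the same topic; the word lists and probabilities below are made up, and this is only a sketch of the underlying ideas, not robics' actual implementation:

```python
from scipy.spatial.distance import jensenshannon
from scipy.stats import kendalltau

# Hypothetical top-5 words of the same topic from two re-initializations
run_a = ["space", "nasa", "launch", "orbit", "moon"]
run_b = ["space", "launch", "nasa", "earth", "moon"]

# 1) Jaccard distance of the top-n word sets (0 = identical, 1 = disjoint)
set_a, set_b = set(run_a), set(run_b)
jaccard_dist = 1 - len(set_a & set_b) / len(set_a | set_b)

# 2) Jensen-Shannon divergence of the (made-up) topic-word distributions;
#    scipy's jensenshannon returns the distance, i.e. the square root
js_div = jensenshannon([0.4, 0.3, 0.15, 0.1, 0.05],
                       [0.35, 0.3, 0.2, 0.1, 0.05]) ** 2

# 3) Kendall's tau rank correlation on the words both runs share
shared = [w for w in run_a if w in set_b]
tau, _ = kendalltau([run_a.index(w) for w in shared],
                    [run_b.index(w) for w in shared])

print(f"jaccard={jaccard_dist:.3f}, js={js_div:.4f}, tau={tau:.3f}")
```

A stable parametrization yields a low Jaccard distance, a low Jensen-Shannon divergence, and a tau close to 1 across re-initializations.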
Install
- Python Version: 3.5+
- Package Managers: pip
pip
Using pip, robics releases are available as source packages and binary wheels:
pip install robics
Example
This is a full example including the preprocessing steps. Feel free to adapt it to your own needs.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from gensim.models import LdaModel, nmf, ldamulticore
from gensim.utils import simple_preprocess
from gensim import corpora
import spacy
import pandas as pd
from robics import RobustTopics
nlp = spacy.load("en_core_web_sm")
# PREPROCESSING
dataset = fetch_20newsgroups(
    shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data[:1000]  # Only 1000 documents for performance reasons
# sklearn
no_features = 1000
# tf-idf features for the NMF model
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
# raw term counts for the LDA model
tf_vectorizer = CountVectorizer(
    max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()
# gensim
def docs_to_words(docs):
    for doc in docs:
        yield simple_preprocess(str(doc), deacc=True)
tokenized_data = list(docs_to_words(documents))
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
# TOPIC MODELLING
robustTopics = RobustTopics(nlp)
# Load 4 different models
robustTopics.load_gensim_model(
    ldamulticore.LdaModel, corpus, dictionary, n_samples=5, n_initializations=6)
robustTopics.load_gensim_model(
    nmf.Nmf, corpus, dictionary, n_samples=5, n_initializations=6)
robustTopics.load_sklearn_model(
    LatentDirichletAllocation, tf, tf_vectorizer, n_samples=5, n_initializations=6)
robustTopics.load_sklearn_model(
    NMF, tfidf, tfidf_vectorizer, n_samples=5, n_initializations=3)
robustTopics.fit_models()
# ANALYSIS
# Compare different samples
robustTopics.rank_models()
# Look at the topics
robustTopics.display_sample_topics(1, 0, 0.5)
robustTopics.display_run_topics(0, 0, 0, 10)
# Look at the full reports including separate values for each initialization
robustTopics.models[model_id].report_full
# Convert the reports to a pandas dataframe
pd.DataFrame.from_records(robustTopics.models[model_id].report)
Next Steps
- Adding support for more models if required.
- Writing unit tests.
- Improving the overall performance.
- Implementing the Cv coherence measure from this paper
Contribution
I am happy to receive help with any of the items mentioned above, or other interesting feature requests.
Project details
Source distribution: robics-0.11.tar.gz (11.0 kB)