Skip to main content

Topic Model User Evaluation

Project description

# Topiceval

Topiceval is a python library with functionalities to learn topic models (namely gensim LDA, Thresholded SVD, Sparse Coding - Sparse Dictionary Learning) over user mails stored in Outlook with an end interface for user-study on topic quality evaluation, as well as providing topic model classes for usage on independent data-sets.

Added functions include ability to run diagnostics on learned models, which include visualizations on topic-document distribution, topic entropy matrix, topic weights required to cross thresholds etc.

# Requires
- python 3.x or python >= 2.6
- For user-email evaluation - Windows and Outlook

# Dependencies
- scipy>=0.19.0
- gensim
- numpy
- matplotlib
- pandas
- scikit-learn
- nltk (optional)

# Installation

$ pip install topiceval
or
Download from source and "$ python setup.py install" (not yet tested)

# Usage
## User Email Study

$ python -m topiceval

You can also add options:
- -numtopics <int>: Number of topics to learn
- -usethreads <0/1>: Whether to treat each message in threaded conversation as different document (0) or not (1)
- -uselemma <0/1>: Whether to lemmatize words, set to 1 only if you have downloaded wordnet corpora for nltk package.
- -numtopicseval <int>: How many topics out of the learned ones to be shown to the user for evaluation.
- excludefolders <str>: Comma separated name of folders to exclude during mail extraction
- makeWtsvd <0/1>: Whether to make W matrix for TSVD, setting at 1 not yet tested

For example, to learn 20 topics from user mails using lemmatization, excluding folders named 'private' & 'work', and displaying 10 topics for evaluation, you have to use:

$ python -m topiceval -numtopics 20 -numtopicseval 10 -excludefolders private,work -uselemma 1

### User Evaluation Interface
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/email_interface.PNG)

## TSVD Topic Model

### Case 1
You have the docword file and vocab file.
"docword file" is in the format *doc_id word_id word_freq* per each line. Both *doc_id* and *word_id* are 1-indexed
"vocab file" has a vocabulary word in each line, where line number is the word_id.

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

learned_M_matrix = model.M_matrix
topic_catchwords = model.topicwise_catchwords

model.save_topic_top_words(filename="tsvdtopics.txt")
model.plot_topic_topwords(topicid_list=range(10))
model.plot_entropy_distribution()
model.plot_topic_entropy_colormap()

model.save_M_matrix(filename="<desired_filepath>")

### Outputs
The following are outputs for TSVD models on NIPS dataset from UCI BoW repository
- Snap from text file made by model.save_topic_topwords:
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_save_topic_topwords.PNG)

- Plot from model.plot_topic_topwords
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_topic_topwords.PNG)

- Plot from model.plot_entropy_distribution
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_entropy_distribution.PNG)

- Plot from model.plot_topic_entropy_colormap
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_topic_entropy_colormap.PNG)

### Case 2
You have corpus in matrix-market format

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(corpus_filepath="<your_corpus_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded corpus in your program, just use argument *corpus=<your_corpus_variable>*

### Case 3
You have made the A matrix

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(A_matrix_path="<your_Amatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)
or, if you have loaded Amatrix in your program, just use argument *A_matrix=<your_Amatrix_variable>*

## BCD Topic Model

### Case 1
You have the docword file and vocab file.

from topiceval.dictionarylearning.bcd import BCD

model = BCD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

# and sparsity, num iterations can be controlled by:
model = BCD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20, gamma_frac=0.35, nu_frac=0.15, bcd_iters=10)

### Case 2
You have corpus in matrix-market format

from topiceval.thresholdedsvd.tsvd import TSVD

model = BCD(corpus_filepath="<your_corpus_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded corpus in your program, just use argument *corpus=<your_corpus_variable>*

### Case 3
You have made the X matrix (num_docs x vocab) or A matrix (vocab x num_docs)

from topiceval.thresholdedsvd.tsvd import TSVD

# for A matrix
model = BCD(A_matrix_path="<your_Amatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)
# for X matrix
model = BCD(X_matrix_path="<your_Xmatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded Amatrix in your program, just use argument *A_matrix=<your_Amatrix_variable>*. Similarly for X matrix.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topiceval-1.1.0.dev1.tar.gz (402.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page