Skip to main content

Topic Model User Evaluation

Project description

# Topiceval

Topiceval is a python library with functionalities to learn topic models (namely gensim LDA, Thresholded SVD, Sparse Coding - Sparse Dictionary Learning) over user mails stored in Outlook with an end interface for user-study on topic quality evaluation, as well as providing topic model classes for usage on independent data-sets.

Added functions include ability to run diagnostics on learned models, which include visualizations on topic-document distribution, topic entropy matrix, topic weights required to cross thresholds etc.

# Requires
- python 3.x or python >= 2.6
- For user-email evaluation - Windows and Outlook

# Dependencies
- scipy>=0.19.0
- gensim
- numpy
- matplotlib
- pandas
- scikit-learn
- nltk (optional)

# Installation

$ pip install topiceval
or
Download from source and "$ python setup.py install" (not yet tested)

# Usage
## User Email Study

$ python -m topiceval

You can also add options:
- -numtopics <int>: Number of topics to learn
- -usethreads <0/1>: Whether to treat each message in threaded conversation as different document (0) or not (1)
- -uselemma <0/1>: Whether to lemmatize words, set to 1 only if you have downloaded wordnet corpora for nltk package.
- -numtopicseval <int>: How many topics out of the learned ones to be shown to the user for evaluation.
- excludefolders <str>: Comma separated name of folders to exclude during mail extraction
- makeWtsvd <0/1>: Whether to make W matrix for TSVD, setting at 1 not yet tested

For example, to learn 20 topics from user mails using lemmatization, excluding folders named 'private' & 'work', and displaying 10 topics for evaluation, you have to use:

$ python -m topiceval -numtopics 20 -numtopicseval 10 -excludefolders private,work -uselemma 1

### User Evaluation Interface
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/email_interface.PNG)

## TSVD Topic Model

### Case 1
You have the docword file and vocab file.
"docword file" is in the format *doc_id word_id word_freq* per each line. Both *doc_id* and *word_id* are 1-indexed
"vocab file" has a vocabulary word in each line, where line number is the word_id.

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

learned_M_matrix = model.M_matrix
topic_catchwords = model.topicwise_catchwords

model.save_topic_top_words(filename="tsvdtopics.txt")
model.plot_topic_topwords(topicid_list=range(10))
model.plot_entropy_distribution()
model.plot_topic_entropy_colormap()

model.save_M_matrix(filename="<desired_filepath>")

### Outputs
The following are outputs for TSVD models on NIPS dataset from UCI BoW repository
- Snap from text file made by model.save_topic_topwords:
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_save_topic_topwords.PNG)

- Plot from model.plot_topic_topwords
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_topic_topwords.PNG)

- Plot from model.plot_entropy_distribution
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_entropy_distribution.PNG)

- Plot from model.plot_topic_entropy_colormap
![](https://raw.githubusercontent.com/Avikalp7/CG-Accumulator/master/Screenshots/tsvd_plot_topic_entropy_colormap.PNG)

### Case 2
You have corpus in matrix-market format

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(corpus_filepath="<your_corpus_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded corpus in your program, just use argument *corpus=<your_corpus_variable>*

### Case 3
You have made the A matrix

from topiceval.thresholdedsvd.tsvd import TSVD

model = TSVD(A_matrix_path="<your_Amatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)
or, if you have loaded Amatrix in your program, just use argument *A_matrix=<your_Amatrix_variable>*

## BCD Topic Model

### Case 1
You have the docword file and vocab file.

from topiceval.dictionarylearning.bcd import BCD

model = BCD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

# and sparsity, num iterations can be controlled by:
model = BCD(docword_filepath="<your_docword_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20, gamma_frac=0.35, nu_frac=0.15, bcd_iters=10)

### Case 2
You have corpus in matrix-market format

from topiceval.thresholdedsvd.tsvd import TSVD

model = BCD(corpus_filepath="<your_corpus_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded corpus in your program, just use argument *corpus=<your_corpus_variable>*

### Case 3
You have made the X matrix (num_docs x vocab) or A matrix (vocab x num_docs)

from topiceval.thresholdedsvd.tsvd import TSVD

# for A matrix
model = BCD(A_matrix_path="<your_Amatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)
# for X matrix
model = BCD(X_matrix_path="<your_Xmatrix_filepath>", vocab_filepath="<your_vocab_filepath>", num_topics = 20)

or, if you have loaded Amatrix in your program, just use argument *A_matrix=<your_Amatrix_variable>*. Similarly for X matrix.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topiceval-1.1.0.dev1.tar.gz (402.3 kB view details)

Uploaded Source

File details

Details for the file topiceval-1.1.0.dev1.tar.gz.

File metadata

File hashes

Hashes for topiceval-1.1.0.dev1.tar.gz
Algorithm Hash digest
SHA256 82b3bcb9649e061d787c31bbbe81d1264d7140de85c7b1d6a69a4704265f14ec
MD5 7cc67964b981bbc36700cda26e2b9a17
BLAKE2b-256 a92d38535fe956b5b11b454edacb5ad97c3deeb60b288eccd6da70f3b754a3fd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page