A little wrapper for the topic modeling functions of MALLET
Project description
little-mallet-wrapper
This is a little Python wrapper around the topic modeling functions of MALLET.
Currently under construction; please send feedback/requests to Maria Antoniak.
Installation
pip install little_mallet_wrapper
Requirements
Usage
See demo.ipynb for a demonstration of how to use the functions in little-mallet-wrapper.
Documentation
print_dataset_stats(training_data)
Displays basic statistics about the training dataset.
Name | Type | Description |
---|---|---|
training_data |
list of strings | Documents that will be used to train the topic model. |
process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)
A simple string processor that prepares raw text for topic modeling. CAUTION: Depending on your data, you might need to write your own processing function. Do not rely on this function for non-English languages; both the stopword list and the punctuation removal assume English as input.
Name | Type | Description |
---|---|---|
text |
string | Individual document to process. |
lowercase |
boolean | Whether or not to lowercase the text. |
remove_short_words |
boolean | Whether or not to remove words with fewer than 2 characters. |
remove_stop_words |
boolean | Whether or not to remove stopwords. |
remove_punctuation |
boolean | Whether or not to remove punctuation (not A-Za-z0-9) |
remove_numbers |
string | 'replace' replaces all numbers with the normalized token NUM; 'remove' removes all numbers. |
stop_words |
list of strings | Custom list of words to remove. |
RETURNS | string | Processed version of the input text. |
quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)
Imports training data, trains an LDA topic model using MALLET, and returns the topic keys and document distributions.
Name | Type | Description |
---|---|---|
path_to_mallet |
string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
output_directory_path |
string | Path to where the output files should be stored. |
num_topics |
integer | The number of topics to use for training. |
training_data |
list of strings | Processed documents for training the topic model. |
RETURNS | list of lists of strings | The 20 most probable words for each topic. |
RETURNS | list of lists of integers | Topic distribution (list of probabilities) for each document. |
import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)
Imports the training data into MALLET formatted data that can be used for training.
Name | Type | Description |
---|---|---|
path_to_mallet |
string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_training_data |
string | Path to where the training data should be stored. |
path_to_formatted_training_data |
string | Path to where the MALLET formatted training data should be stored. |
training_data |
list of strings | Processed documents for training the topic model. |
training_ids |
list of strings | Unique identifiers for the training data. |
use_pipe_from |
string | If you want to import the documents using the same model as a previous set of documents, include the path to the previous MALLET formatted training data. |
train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, num_topics)
Trains an LDA topic model using MALLET.
Name | Type | Description |
---|---|---|
path_to_mallet |
string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_formatted_training_data |
string | Path to where the MALLET formatted training data is stored. |
path_to_model |
string | Path to where the model should be stored. |
path_to_topic_key |
string | Path to where the topic keys should be stored. |
path_to_topic_distributions |
string | Path to where the topic distributions should be stored. |
path_to_word_weights |
string | Path to where the word weights should be stored. |
num_topics |
integer | The number of topics to use for training. |
load_topic_keys(topic_keys_path)
Loads the most sets of most probable words for each topic after training a topic model.
Name | Type | Description |
---|---|---|
topic_keys_path |
string | Path to where the topic keys are stored. |
RETURNS | list of lists of strings | The 20 most probable words for each topic. |
load_topic_distributions(topic_distributions_path)
Loads the topic distribution for each document after training a topic model.
Name | Type | Description |
---|---|---|
topic_distributions_path |
string | Path to where the topic distributions are stored. |
RETURNS | list of lists of integers | Topic distribution (list of probabilities) for each document. |
load_training_ids(topic_distributions_path)
Loads the training IDs. This is either a list of sequential integers or the user-specified training IDs passed to import_data()
.
Name | Type | Description |
---|---|---|
topic_distributions_path |
string | Path to where the topic distributions are stored. |
RETURNS | list of lists of strings | List of training IDs in the same order as the topic distributions. |
load_topic_word_distributions(word_weight_path)
Loads the topic word distributions. These are the probabilities for each word for each topic.
Name | Type | Description |
---|---|---|
word_weight_path |
string | Path to where the word weights are stored. |
RETURNS | defaultdict of defaultdict of float | Map of topics to words to probabilities. |
get_top_docs(training_data, topic_distributions, topic_index, n=5)
Gets the documents with the highest probability for the target topic.
Name | Type | Description |
---|---|---|
training_data |
list of strings | Processed documents that was used to train the topic model. |
topic_distributions |
list of lists of integers | Topic distribution (list of probabilities) for each document. |
topic_index |
integer | The index of the target topic. |
n |
integer | The number of documents to return. |
RETURNS | list of tuples (float, string) | The topic probability and document text for the n documents with the highest probability for the target topic. |
plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)
If the dataset includes some time of categorical labels, creates a heatmap of the labels x topics.
Name | Type | Description |
---|---|---|
labels |
list of strings | Document labels (e.g., authors of the documents, genres of the documents). |
topic_distributions |
list of lists of integers | Topic distribution (list of probabilities) for each document. |
topic_keys |
list of lists of strings | The 20 most probable words for each topic. |
output_path |
string | Path to where the resulting figure should be saved. |
target_labels |
list of strings | A subset of labels to use for plotting. |
dim |
tuple of integers | (x, y) dimensions for the resulting figure. |
plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)
If the dataset includes some time of categorical labels, creates a set of boxplots, one plot for each topic.
Name | Type | Description |
---|---|---|
labels |
list of strings | Document labels (e.g., authors of the documents, genres of the documents). |
topic_distributions |
list of lists of integers | Topic distribution (list of probabilities) for each document. |
topic_keys |
list of lists of strings | The 20 most probable words for each topic. |
output_path |
string | Path to where the resulting figure should be saved. |
target_labels |
list of strings | A subset of labels to use for plotting. |
dim |
tuple of integers | (x, y) dimensions for the resulting figure. |
divide_training_data(documents, num_chunks=10)
Given a dataset, divides each document into a set of equally sized chunks.
Name | Type | Description |
---|---|---|
documents |
list of strings | Documents to split. |
num_chunks |
integer | How many times to split each document. |
RETURNS | tuple (list of strings, list of integers, list of floats) | The divided documents, the indices of the input documents, and the positions within the documents (0-1.0). |
infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)
Get topic distributions for a set of new documents using a model that has been trained on another set of documents.
Name | Type | Description |
---|---|---|
path_to_mallet |
string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_original_model |
string | Path to where the topic model was stored. |
path_to_new_formatted_training_data |
string | Path to where the MALLET formatted training data is stored. |
path_to_new_topic_distributions |
string | Path to where the topic distributions should be stored. |
plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)
Creates lineplots, one for each topic, showing the mean topic probability over document segments.
Name | Type | Description |
---|---|---|
topic_distributions |
list of lists of integers | Topic distribution (list of probabilities) for each document. |
topic_keys |
list of lists of strings | The 20 most probable words for each topic. |
times |
list of floats | The division indices within the document. |
topic_index |
integer | The index of the target topic. |
output_path |
string | Path to where the resulting figure should be saved. |
get_js_divergence(topic_index_1, topic_index_2, topic_distributions)
Calculates the Jensen-Shannon divergence between the two target topic distributions.
Name | Type | Description |
---|---|---|
topic_index_1 |
integer | Index of the first target topic distribution. |
topic_index_2 |
integer | Index of the second target topic distribution. |
topic_distributions |
list of lists of integers | Topic distribution (list of probabilities) for each document. |
RETURNS | float | Jensen-Shannon divergence of the requested topic distributions. |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for little-mallet-wrapper-0.5.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8c98592af4d4be4e732ae9ae3c6fae6a15af790029489779118aa92b11338d18 |
|
MD5 | 358701c4eac6cee067085cd189bb6421 |
|
BLAKE2b-256 | 9835ce9eba08be0264a743fbd77d319d963dfb14cb293bcb662dd5476f816c0a |
Hashes for little_mallet_wrapper-0.5.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 850af1d206cea986e8dff98e6d69124c92e49ce1d082fd17de0aaabf9e35d57e |
|
MD5 | 585aad91951298e656057152fdda884c |
|
BLAKE2b-256 | e3017e8561e33e79b408d9526b22b50e20bfdd8e551979237ad5c972759fe7d8 |