A little wrapper for the topic modeling functions of MALLET
little-mallet-wrapper
This is a little Python wrapper around the topic modeling functions of MALLET.
Currently under construction; please send feedback/requests to Maria Antoniak.
Installation
pip install little_mallet_wrapper
Requirements
Usage
See demo.ipynb for a demonstration of how to use the functions in little-mallet-wrapper.
Documentation
print_dataset_stats(training_data)
Displays basic statistics about the training dataset.
Name | Type | Description |
---|---|---|
training_data | list of strings | Documents that will be used to train the topic model. |
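As a rough illustration of the kind of statistics such a function might report, the sketch below computes the number of documents, the mean number of whitespace-separated words per document, and the vocabulary size. The helper name `dataset_stats` and the choice of statistics are assumptions for illustration, not the wrapper's implementation.

```python
def dataset_stats(training_data):
    """Return basic corpus statistics (hypothetical helper, not part of the wrapper)."""
    # Tokenize on whitespace, which suits documents that are already processed.
    tokenized = [doc.split() for doc in training_data]
    num_docs = len(tokenized)
    total_words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "num_documents": num_docs,
        "mean_words_per_document": total_words / num_docs if num_docs else 0.0,
        "vocabulary_size": len(vocabulary),
    }


stats = dataset_stats(["the cat sat", "the dog ran fast"])
print(stats)
```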
process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)
A simple string processor that prepares raw text for topic modeling. CAUTION: Depending on your data, you might need to write your own processing function. Do not rely on this function for non-English languages; both the stopword list and the punctuation removal assume English as input.
Name | Type | Description |
---|---|---|
text | string | Individual document to process. |
lowercase | boolean | Whether or not to lowercase the text. |
remove_short_words | boolean | Whether or not to remove words with fewer than 2 characters. |
remove_stop_words | boolean | Whether or not to remove stopwords. |
remove_punctuation | boolean | Whether or not to remove punctuation (any character not in A-Za-z0-9). |
numbers | string | 'replace' replaces all numbers with the normalized token NUM; 'remove' removes all numbers. |
stop_words | list of strings | Custom list of words to remove. |
RETURNS | string | Processed version of the input text. |
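If you need to write your own processing function, as the caution above suggests, a minimal sketch following the parameter descriptions could look like this. The name `simple_process_string`, the tiny `EXAMPLE_STOPS` set, and the order of the steps are assumptions for illustration; the wrapper's own implementation and its `STOPS` default may differ.

```python
import re

# Tiny illustrative stopword set; the wrapper's STOPS default is not reproduced here.
EXAMPLE_STOPS = {"the", "a", "an", "and", "of", "to", "in"}


def simple_process_string(text, lowercase=True, remove_short_words=True,
                          remove_stop_words=True, remove_punctuation=True,
                          numbers="replace", stop_words=EXAMPLE_STOPS):
    """Hypothetical stand-in for a custom processing function."""
    if lowercase:
        text = text.lower()
    if numbers == "replace":
        # Replace every run of digits with the normalized token NUM.
        text = re.sub(r"[0-9]+", " NUM ", text)
    elif numbers == "remove":
        text = re.sub(r"[0-9]+", " ", text)
    if remove_punctuation:
        # Treat anything outside A-Za-z0-9 as punctuation.
        text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    if remove_short_words:
        # Drop words with fewer than 2 characters.
        tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)


processed = simple_process_string("The 3 cats, and a dog!")
print(processed)
```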
quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)
Imports training data, trains an LDA topic model using MALLET, and returns the topic keys and document distributions.
Name | Type | Description |
---|---|---|
path_to_mallet | string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
output_directory_path | string | Path to where the output files should be stored. |
num_topics | integer | The number of topics to use for training. |
training_data | list of strings | Processed documents for training the topic model. |
RETURNS | list of lists of strings | The 20 most probable words for each topic. |
RETURNS | list of lists of floats | Topic distribution (list of probabilities) for each document. |
import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)
Imports the training data into MALLET formatted data that can be used for training.
Name | Type | Description |
---|---|---|
path_to_mallet | string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_training_data | string | Path to where the training data should be stored. |
path_to_formatted_training_data | string | Path to where the MALLET formatted training data should be stored. |
training_data | list of strings | Processed documents for training the topic model. |
training_ids | list of strings | Unique identifiers for the training data. |
use_pipe_from | string | If you want to import the documents using the same model as a previous set of documents, include the path to the previous MALLET formatted training data. |
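To make the import step more concrete, here is a sketch of what it plausibly involves: writing one document per line in the `<id> <label> <text>` layout that MALLET's `import-file` command reads by default, then building the command line. The helper names, the `no_label` placeholder, and the exact flags chosen are assumptions; the wrapper may call MALLET differently.

```python
import os
import tempfile


def write_mallet_input(path_to_training_data, training_data, training_ids=None):
    """Write one document per line as '<id> <label> <text>' (hypothetical helper)."""
    if training_ids is None:
        # Fall back to sequential integer IDs.
        training_ids = [str(i) for i in range(len(training_data))]
    with open(path_to_training_data, "w", encoding="utf-8") as f:
        for doc_id, doc in zip(training_ids, training_data):
            # Collapse whitespace so each document stays on a single line.
            f.write(f"{doc_id} no_label {' '.join(doc.split())}\n")


def build_import_command(path_to_mallet, path_to_training_data,
                         path_to_formatted_training_data, use_pipe_from=None):
    """Build (but do not run) a MALLET import-file command."""
    command = [path_to_mallet, "import-file",
               "--input", path_to_training_data,
               "--output", path_to_formatted_training_data,
               "--keep-sequence"]  # topic modeling needs token sequences
    if use_pipe_from is not None:
        command += ["--use-pipe-from", use_pipe_from]
    return command


with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "training.txt")
    write_mallet_input(input_path, ["cats purr", "dogs bark"], ["a", "b"])
    with open(input_path, encoding="utf-8") as f:
        written_lines = f.read().splitlines()

# "mallet" is a placeholder for your real path_to_mallet.
import_command = build_import_command("mallet", "training.txt", "training.mallet")
print(written_lines)
print(import_command)
```

A command list like this could be passed to `subprocess.run(command, check=True)` once `path_to_mallet` points at a real installation.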
train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, num_topics)
Trains an LDA topic model using MALLET.
Name | Type | Description |
---|---|---|
path_to_mallet | string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_formatted_training_data | string | Path to where the MALLET formatted training data is stored. |
path_to_model | string | Path to where the model should be stored. |
path_to_topic_key | string | Path to where the topic keys should be stored. |
path_to_topic_distributions | string | Path to where the topic distributions should be stored. |
path_to_word_weights | string | Path to where the word weights should be stored. |
num_topics | integer | The number of topics to use for training. |
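The parameters above map naturally onto options of MALLET's `train-topics` command. The sketch below builds such a command without running it. The helper name and the mapping of `path_to_model` to `--inferencer-filename` are assumptions; the wrapper may use other flags or add more (for example hyperparameter optimization).

```python
def build_train_command(path_to_mallet, path_to_formatted_training_data,
                        path_to_model, path_to_topic_keys,
                        path_to_topic_distributions, path_to_word_weights,
                        num_topics):
    """Build (but do not run) a MALLET train-topics command (hypothetical helper)."""
    return [path_to_mallet, "train-topics",
            "--input", path_to_formatted_training_data,
            "--num-topics", str(num_topics),
            # Assumption: the stored model is a topic inferencer.
            "--inferencer-filename", path_to_model,
            "--output-topic-keys", path_to_topic_keys,
            "--output-doc-topics", path_to_topic_distributions,
            "--topic-word-weights-file", path_to_word_weights]


# All paths below are placeholders.
train_command = build_train_command(
    "mallet", "training.mallet", "model.inferencer",
    "topic_keys.txt", "topic_distributions.txt", "word_weights.txt", 20)
print(" ".join(train_command))
```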
load_topic_keys(topic_keys_path)
Loads the sets of most probable words for each topic after training a topic model.
Name | Type | Description |
---|---|---|
topic_keys_path | string | Path to where the topic keys are stored. |
RETURNS | list of lists of strings | The 20 most probable words for each topic. |
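For reference, a MALLET topic keys file typically has one line per topic with three tab-separated fields: the topic index, a Dirichlet parameter, and the space-separated top words. The sketch below parses lines in that layout. The function name and sample lines are made up for illustration, and the wrapper's own parsing may differ.

```python
def parse_topic_keys(lines):
    """Parse topic-keys lines into a list of word lists (hypothetical helper)."""
    topic_keys = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Fields: topic index, Dirichlet parameter, space-separated top words.
        _topic_id, _alpha, words = line.split("\t", 2)
        topic_keys.append(words.split())
    return topic_keys


sample_topic_keys = ["0\t0.5\tcat dog pet \n", "1\t0.5\tstock market trade \n"]
topic_keys = parse_topic_keys(sample_topic_keys)
print(topic_keys)
```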
load_topic_distributions(topic_distributions_path)
Loads the topic distribution for each document after training a topic model.
Name | Type | Description |
---|---|---|
topic_distributions_path | string | Path to where the topic distributions are stored. |
RETURNS | list of lists of floats | Topic distribution (list of probabilities) for each document. |
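The sketch below shows one way to read such a file, assuming the dense layout written by MALLET 2.0.8: a document index, a document ID, and then one probability per topic, separated by whitespace. The function name and sample lines are illustrative; older MALLET versions use a different layout that this sketch does not handle.

```python
def parse_topic_distributions(lines):
    """Parse doc-topics lines into one list of probabilities per document."""
    distributions = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and header comments
        # fields[0] is the document index, fields[1] the document ID.
        distributions.append([float(p) for p in fields[2:]])
    return distributions


sample_doc_topics = ["0\tdoc_a\t0.9\t0.1\n", "1\tdoc_b\t0.25\t0.75\n"]
topic_distributions = parse_topic_distributions(sample_doc_topics)
print(topic_distributions)
```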
load_training_ids(topic_distributions_path)
Loads the training IDs. This is either a list of sequential integers or the user-specified training IDs passed to import_data().
Name | Type | Description |
---|---|---|
topic_distributions_path | string | Path to where the topic distributions are stored. |
RETURNS | list of strings | List of training IDs in the same order as the topic distributions. |
load_topic_word_distributions(word_weight_path)
Loads the topic word distributions. These are the probabilities for each word for each topic.
Name | Type | Description |
---|---|---|
word_weight_path | string | Path to where the word weights are stored. |
RETURNS | defaultdict of defaultdict of float | Map of topics to words to probabilities. |
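As an illustration of the returned structure, the sketch below turns lines of the form `topic<TAB>word<TAB>weight` (the layout of MALLET's topic word weights file) into a nested `defaultdict` of per-topic probabilities. Normalizing each topic's weights to sum to 1 and keying topics by integer are assumptions about how the probabilities are derived.

```python
from collections import defaultdict


def parse_topic_word_distributions(lines):
    """Map topic -> word -> probability from word-weight lines (hypothetical helper)."""
    weights = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed or blank lines
        topic, word, weight = int(fields[0]), fields[1], float(fields[2])
        weights[topic][word] += weight
        totals[topic] += weight
    # Normalize each topic's weights so they sum to 1.
    for topic, word_weights in weights.items():
        for word in word_weights:
            word_weights[word] /= totals[topic]
    return weights


sample_word_weights = ["0\tcat\t3.0\n", "0\tdog\t1.0\n", "1\tcat\t1.0\n", "1\tdog\t1.0\n"]
topic_word_probs = parse_topic_word_distributions(sample_word_weights)
print(topic_word_probs[0]["cat"])
```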
get_top_docs(training_data, topic_distributions, topic_index, n=5)
Gets the documents with the highest probability for the target topic.
Name | Type | Description |
---|---|---|
training_data | list of strings | Processed documents that were used to train the topic model. |
topic_distributions | list of lists of floats | Topic distribution (list of probabilities) for each document. |
topic_index | integer | The index of the target topic. |
n | integer | The number of documents to return. |
RETURNS | list of tuples (float, string) | The topic probability and document text for the n documents with the highest probability for the target topic. |
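The ranking described above can be sketched in a few lines: pair each document with its probability for the target topic, sort in descending order, and keep the first n. The function name `top_docs` is a stand-in for illustration, not the wrapper's code.

```python
def top_docs(training_data, topic_distributions, topic_index, n=5):
    """Return (probability, document) pairs for the n highest-scoring documents."""
    scored = [(dist[topic_index], doc)
              for dist, doc in zip(topic_distributions, training_data)]
    # Sort by probability only, highest first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]


example_docs = ["doc a", "doc b", "doc c"]
example_distributions = [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]]
best = top_docs(example_docs, example_distributions, topic_index=0, n=2)
print(best)
```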
plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)
If the dataset includes some type of categorical labels, creates a heatmap of the labels x topics.
Name | Type | Description |
---|---|---|
labels | list of strings | Document labels (e.g., authors of the documents, genres of the documents). |
topic_distributions | list of lists of floats | Topic distribution (list of probabilities) for each document. |
topic_keys | list of lists of strings | The 20 most probable words for each topic. |
output_path | string | Path to where the resulting figure should be saved. |
target_labels | list of strings | A subset of labels to use for plotting. |
dim | tuple of integers | (x, y) dimensions for the resulting figure. |
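A labels x topics heatmap needs one value per (label, topic) cell. The sketch below computes the mean topic probability for each label, which is one reasonable choice for those cells. Whether the wrapper plots plain means or applies further normalization is not stated here, so treat the aggregation and the function name as assumptions.

```python
from collections import defaultdict


def mean_topic_probabilities_by_label(labels, topic_distributions, target_labels=None):
    """Map each label to its mean topic distribution (hypothetical helper)."""
    sums = {}
    counts = defaultdict(int)
    for label, dist in zip(labels, topic_distributions):
        if target_labels is not None and label not in target_labels:
            continue  # only keep the requested subset of labels
        if label not in sums:
            sums[label] = [0.0] * len(dist)
        sums[label] = [s + p for s, p in zip(sums[label], dist)]
        counts[label] += 1
    return {label: [s / counts[label] for s in totals]
            for label, totals in sums.items()}


example_labels = ["poetry", "poetry", "news"]
example_dists = [[0.25, 0.75], [0.75, 0.25], [1.0, 0.0]]
label_means = mean_topic_probabilities_by_label(example_labels, example_dists)
print(label_means)
```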
plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)
If the dataset includes some type of categorical labels, creates a set of boxplots, one plot for each topic.
Name | Type | Description |
---|---|---|
labels | list of strings | Document labels (e.g., authors of the documents, genres of the documents). |
topic_distributions | list of lists of floats | Topic distribution (list of probabilities) for each document. |
topic_keys | list of lists of strings | The 20 most probable words for each topic. |
output_path | string | Path to where the resulting figure should be saved. |
target_labels | list of strings | A subset of labels to use for plotting. |
dim | tuple of integers | (x, y) dimensions for the resulting figure. |
divide_training_data(documents, num_chunks=10)
Given a dataset, divides each document into a set of equally sized chunks.
Name | Type | Description |
---|---|---|
documents | list of strings | Documents to split. |
num_chunks | integer | The number of chunks to divide each document into. |
RETURNS | tuple (list of strings, list of integers, list of floats) | The divided documents, the indices of the input documents, and the positions within the documents (0-1.0). |
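To show the shape of the three returned lists, here is a sketch that splits each document into `num_chunks` runs of whitespace-separated tokens. Splitting on tokens and defining the position as `chunk_index / num_chunks` are assumptions; the wrapper may chunk and position documents differently.

```python
def divide_documents(documents, num_chunks=10):
    """Split each document into num_chunks token runs (hypothetical helper)."""
    divided, doc_indices, positions = [], [], []
    for doc_index, doc in enumerate(documents):
        tokens = doc.split()
        for chunk_index in range(num_chunks):
            # Integer arithmetic keeps chunk sizes within one token of each other.
            start = chunk_index * len(tokens) // num_chunks
            end = (chunk_index + 1) * len(tokens) // num_chunks
            divided.append(" ".join(tokens[start:end]))
            doc_indices.append(doc_index)
            positions.append(chunk_index / num_chunks)
    return divided, doc_indices, positions


chunks, chunk_doc_indices, chunk_positions = divide_documents(["a b c d"], num_chunks=2)
print(chunks, chunk_doc_indices, chunk_positions)
```

Note that in this sketch a document with fewer tokens than `num_chunks` produces some empty chunks.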
infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)
Get topic distributions for a set of new documents using a model that has been trained on another set of documents.
Name | Type | Description |
---|---|---|
path_to_mallet | string | Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet |
path_to_original_model | string | Path to where the topic model was stored. |
path_to_new_formatted_training_data | string | Path to where the MALLET formatted training data is stored. |
path_to_new_topic_distributions | string | Path to where the topic distributions should be stored. |
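These parameters correspond to options of MALLET's `infer-topics` command. The sketch below builds such a command without running it, assuming the stored model is a topic inferencer file; the helper name is hypothetical and the wrapper's exact invocation may differ.

```python
def build_infer_command(path_to_mallet, path_to_original_model,
                        path_to_new_formatted_training_data,
                        path_to_new_topic_distributions):
    """Build (but do not run) a MALLET infer-topics command (hypothetical helper)."""
    return [path_to_mallet, "infer-topics",
            "--input", path_to_new_formatted_training_data,
            # Assumption: the original model was saved as an inferencer.
            "--inferencer", path_to_original_model,
            "--output-doc-topics", path_to_new_topic_distributions]


# All paths below are placeholders.
infer_command = build_infer_command(
    "mallet", "model.inferencer", "new.mallet", "new_topic_distributions.txt")
print(" ".join(infer_command))
```

The new documents would usually be imported with `use_pipe_from` pointing at the original formatted training data so that both sets share one vocabulary.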
plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)
Creates a line plot for the target topic, showing the mean topic probability over document segments.
Name | Type | Description |
---|---|---|
topic_distributions | list of lists of floats | Topic distribution (list of probabilities) for each document. |
topic_keys | list of lists of strings | The 20 most probable words for each topic. |
times | list of floats | The division indices within the document. |
topic_index | integer | The index of the target topic. |
output_path | string | Path to where the resulting figure should be saved. |
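The values behind such a plot can be sketched as a group-by: collect the target topic's probability for every chunk at each time position, then average. The function name is hypothetical, and the plotting itself is left out.

```python
from collections import defaultdict


def mean_topic_probability_by_time(topic_distributions, times, topic_index):
    """Return (time, mean probability) pairs sorted by time (hypothetical helper)."""
    grouped = defaultdict(list)
    for dist, time in zip(topic_distributions, times):
        grouped[time].append(dist[topic_index])
    return [(time, sum(values) / len(values))
            for time, values in sorted(grouped.items())]


# Two documents, each divided into two chunks at positions 0.0 and 0.5.
chunk_dists = [[0.25, 0.75], [0.75, 0.25], [0.5, 0.5], [1.0, 0.0]]
chunk_times = [0.0, 0.5, 0.0, 0.5]
series = mean_topic_probability_by_time(chunk_dists, chunk_times, topic_index=0)
print(series)
```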
get_js_divergence(topic_index_1, topic_index_2, topic_distributions)
Calculates the Jensen-Shannon divergence between the two target topic distributions.
Name | Type | Description |
---|---|---|
topic_index_1 | integer | Index of the first target topic distribution. |
topic_index_2 | integer | Index of the second target topic distribution. |
topic_distributions | list of lists of floats | Topic distribution (list of probabilities) for each document. |
RETURNS | float | Jensen-Shannon divergence of the requested topic distributions. |
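For readers unfamiliar with the measure, the sketch below computes the Jensen-Shannon divergence between two probability vectors using base-2 logarithms, so the result lies between 0 and 1. How the wrapper builds the two vectors from `topic_index_1` and `topic_index_2`, and whether it reports the divergence or its square root (the Jensen-Shannon distance), is not specified here; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    # The mixture m is positive wherever p or q is, so no division by zero occurs.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)
```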