A little wrapper for the topic modeling functions of MALLET

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

little-mallet-wrapper

This is a little Python wrapper around the topic modeling functions of MALLET.

Currently under construction; please send feedback/requests to Maria Antoniak.

Installation

pip install little_mallet_wrapper

Requirements

Python 3.7
MALLET
pandas
numpy
seaborn (for plotting functions)

Usage

See demo.ipynb for a demonstration of how to use the functions in little-mallet-wrapper.

Documentation

`print_dataset_stats(training_data)`

Displays basic statistics about the training dataset.

Name	Type	Description
`training_data`	list of strings	Documents that will be used to train the topic model.
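To give a sense of what "basic statistics" might cover, here is a minimal sketch. The function name `dataset_stats`, the whitespace tokenization, and the choice of statistics are assumptions made for illustration; they are not taken from the library.

```python
def dataset_stats(training_data):
    """Return simple corpus statistics for a list of document strings."""
    # Whitespace tokenization is an assumption made for this sketch.
    tokenized = [doc.split() for doc in training_data]
    num_docs = len(tokenized)
    total_words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "num_documents": num_docs,
        "mean_words_per_document": total_words / num_docs if num_docs else 0.0,
        "vocabulary_size": len(vocabulary),
    }


stats = dataset_stats(["the cat sat", "the dog ran away"])
print(stats)
```

Checking numbers like these before training can help catch an empty or badly tokenized corpus early.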

`process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)`

A simple string processor that prepares raw text for topic modeling. CAUTION: Depending on your data, you might need to write your own processing function. Do not rely on this function for non-English languages; both the stopword list and the punctuation removal assume English as input.

Name	Type	Description
`text`	string	Individual document to process.
`lowercase`	boolean	Whether or not to lowercase the text.
`remove_short_words`	boolean	Whether or not to remove words with fewer than 2 characters.
`remove_stop_words`	boolean	Whether or not to remove stopwords.
`remove_punctuation`	boolean	Whether or not to remove punctuation (any character that is not A-Za-z0-9).
`numbers`	string	'replace' replaces all numbers with the normalized token NUM; 'remove' removes all numbers.
`stop_words`	list of strings	Custom list of words to remove.
RETURNS	string	Processed version of the input text.
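If you need to write your own processing function, the sketch below shows one way the options above could fit together. The name `simple_process`, the tiny `SKETCH_STOPS` list, and the order of the steps are assumptions for illustration, not the library's implementation.

```python
import re

# Hypothetical, deliberately tiny stopword list used only in this sketch.
SKETCH_STOPS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def simple_process(text, lowercase=True, remove_short_words=True,
                   remove_stop_words=True, remove_punctuation=True,
                   numbers="replace", stop_words=SKETCH_STOPS):
    if lowercase:
        text = text.lower()
    # Normalize digits after lowercasing so the NUM token stays uppercase.
    if numbers == "replace":
        text = re.sub(r"[0-9]+", " NUM ", text)
    elif numbers == "remove":
        text = re.sub(r"[0-9]+", " ", text)
    if remove_punctuation:
        # Anything outside A-Za-z0-9 becomes a single space.
        text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    tokens = text.split()
    if remove_short_words:
        tokens = [t for t in tokens if len(t) >= 2]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(tokens)


print(simple_process("The 3 cats, and a dog!"))
print(simple_process("The 3 cats, and a dog!", numbers="remove"))
```

For non-English text, the character class and the stopword list are the two pieces you would most likely need to replace.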

`quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)`

Imports training data, trains an LDA topic model using MALLET, and returns the topic keys and document distributions.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`output_directory_path`	string	Path to where the output files should be stored.
`num_topics`	integer	The number of topics to use for training.
`training_data`	list of strings	Processed documents for training the topic model.
RETURNS	list of lists of strings	The 20 most probable words for each topic.
RETURNS	list of lists of floats	Topic distribution (list of probabilities) for each document.

`import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)`

Imports the training data into MALLET formatted data that can be used for training.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_training_data`	string	Path to where the training data should be stored.
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data should be stored.
`training_data`	list of strings	Processed documents for training the topic model.
`training_ids`	list of strings	Unique identifiers for the training data.
`use_pipe_from`	string	If you want to import the documents using the same model as a previous set of documents, include the path to the previous MALLET formatted training data.
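The import step can be pictured as two parts: writing the documents to a plain text file, then asking MALLET's `import-file` command to convert it. The sketch below only builds the file and the argument list; it does not run MALLET. The helper names, the `no_label` placeholder, and `path/to/mallet` are assumptions for illustration, and the wrapper may pass different options.

```python
import os
import tempfile


def write_training_file(path, training_data, training_ids=None):
    """Write one document per line as '<id> <label> <text>'."""
    if training_ids is None:
        training_ids = range(len(training_data))
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, doc in zip(training_ids, training_data):
            # Collapse whitespace so each document stays on a single line.
            # 'no_label' is a placeholder label chosen for this sketch.
            f.write(f"{doc_id} no_label {' '.join(doc.split())}\n")


def build_import_command(path_to_mallet, path_to_training_data,
                         path_to_formatted_training_data, use_pipe_from=None):
    command = [path_to_mallet, "import-file",
               "--input", path_to_training_data,
               "--output", path_to_formatted_training_data,
               "--keep-sequence"]
    if use_pipe_from is not None:
        # Reuse the vocabulary and processing pipe of an earlier import.
        command += ["--use-pipe-from", use_pipe_from]
    return command


# "path/to/mallet" is a placeholder, not a real installation path.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "training.txt")
    write_training_file(raw_path, ["first  doc\n", "second doc"], ["a", "b"])
    with open(raw_path, encoding="utf-8") as f:
        written = f.read()

command = build_import_command("path/to/mallet", "training.txt", "training.mallet")
print(command)
```

A list like `command` could then be handed to `subprocess.run(command, check=True)` once the MALLET path points at a real installation.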

`train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, num_topics)`

Trains an LDA topic model using MALLET.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data is stored.
`path_to_model`	string	Path to where the model should be stored.
`path_to_topic_key`	string	Path to where the topic keys should be stored.
`path_to_topic_distributions`	string	Path to where the topic distributions should be stored.
`path_to_word_weights`	string	Path to where the word weights should be stored.
`num_topics`	integer	The number of topics to use for training.
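Each path argument above corresponds naturally to an option of MALLET's `train-topics` command. The sketch below shows one plausible mapping as an argument list; the helper name `build_train_command` and the exact set of options are assumptions, and the wrapper may pass additional ones.

```python
def build_train_command(path_to_mallet, path_to_formatted_training_data,
                        path_to_model, path_to_topic_keys,
                        path_to_topic_distributions, path_to_word_weights,
                        num_topics):
    # Build the argument list only; nothing is executed here.
    return [path_to_mallet, "train-topics",
            "--input", path_to_formatted_training_data,
            "--num-topics", str(num_topics),
            "--inferencer-filename", path_to_model,
            "--output-topic-keys", path_to_topic_keys,
            "--output-doc-topics", path_to_topic_distributions,
            "--topic-word-weights-file", path_to_word_weights]


# All paths here are placeholders for illustration.
train_command = build_train_command(
    "path/to/mallet", "training.mallet", "model.bin",
    "topic_keys.txt", "topic_distributions.txt", "word_weights.txt", 20)
print(" ".join(train_command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces.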

`load_topic_keys(topic_keys_path)`

Loads the set of most probable words for each topic after training a topic model.

Name	Type	Description
`topic_keys_path`	string	Path to where the topic keys are stored.
RETURNS	list of lists of strings	The 20 most probable words for each topic.
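As a rough picture of what loading topic keys involves, the sketch below parses lines in the layout MALLET typically writes: a topic index, a weight, and a space-separated word list, separated by tabs. The function name `parse_topic_keys` and the sample data are hypothetical.

```python
import io


def parse_topic_keys(lines):
    topic_keys = []
    for line in lines:
        if not line.strip():
            continue
        # Assumed layout: topic index, weight, then space-separated words.
        _index, _weight, words = line.rstrip("\n").split("\t", 2)
        topic_keys.append(words.split())
    return topic_keys


sample_keys = io.StringIO("0\t0.5\tcat dog bird \n1\t0.5\tcar road wheel \n")
topic_keys = parse_topic_keys(sample_keys)
print(topic_keys)
```

The position of each word list in the result is the topic index, which is how the other functions refer to topics.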

`load_topic_distributions(topic_distributions_path)`

Loads the topic distribution for each document after training a topic model.

Name	Type	Description
`topic_distributions_path`	string	Path to where the topic distributions are stored.
RETURNS	list of lists of floats	Topic distribution (list of probabilities) for each document.

`load_training_ids(topic_distributions_path)`

Loads the training IDs. This is either a list of sequential integers or the user-specified training IDs passed to import_data().

Name	Type	Description
`topic_distributions_path`	string	Path to where the topic distributions are stored.
RETURNS	list of strings	List of training IDs in the same order as the topic distributions.
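Both loaders above read the same topic distributions file. The sketch below assumes each line is tab-separated with a document index, a document ID, and then one probability per topic; that layout, the function name `parse_topic_distributions`, and the sample data are assumptions for illustration.

```python
import io


def parse_topic_distributions(lines):
    ids, distributions = [], []
    for line in lines:
        # Skip blank lines and any header line starting with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ids.append(fields[1])
        distributions.append([float(x) for x in fields[2:] if x])
    return ids, distributions


sample_distributions = io.StringIO("0\tdoc_a\t0.25\t0.75\n1\tdoc_b\t0.5\t0.5\n")
training_ids, topic_distributions = parse_topic_distributions(sample_distributions)
print(training_ids, topic_distributions)
```

Because IDs and distributions come from the same lines, their order always matches.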

`load_topic_word_distributions(word_weight_path)`

Loads the topic word distributions. These are the probabilities for each word for each topic.

Name	Type	Description
`word_weight_path`	string	Path to where the word weights are stored.
RETURNS	defaultdict of defaultdict of float	Map of topics to words to probabilities.
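To illustrate the nested map this returns, the sketch below reads tab-separated `topic`, `word`, `weight` lines and normalizes the weights within each topic so they sum to 1. The function name `parse_word_weights`, the integer topic keys, and the normalization step are assumptions for this sketch.

```python
import io
from collections import defaultdict


def parse_word_weights(lines):
    weights = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue
        topic, word, weight = line.rstrip("\n").split("\t")
        weights[int(topic)][word] += float(weight)
        totals[int(topic)] += float(weight)
    # Convert raw weights into per-topic probabilities.
    probabilities = defaultdict(lambda: defaultdict(float))
    for topic, word_weights in weights.items():
        for word, weight in word_weights.items():
            probabilities[topic][word] = weight / totals[topic]
    return probabilities


sample_weights = io.StringIO("0\tcat\t3.0\n0\tdog\t1.0\n1\tcar\t2.0\n1\tcat\t2.0\n")
topic_word_probs = parse_word_weights(sample_weights)
print(topic_word_probs[0]["cat"], topic_word_probs[1]["car"])
```

With `defaultdict`, looking up a word that never appeared under a topic yields 0.0 instead of raising a `KeyError`.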

`get_top_docs(training_data, topic_distributions, topic_index, n=5)`

Gets the documents with the highest probability for the target topic.

Name	Type	Description
`training_data`	list of strings	Processed documents that were used to train the topic model.
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_index`	integer	The index of the target topic.
`n`	integer	The number of documents to return.
RETURNS	list of tuples (float, string)	The topic probability and document text for the n documents with the highest probability for the target topic.
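The ranking described above amounts to pairing each document with its probability for the target topic and sorting. A minimal sketch, with the hypothetical name `top_docs` and made-up data:

```python
def top_docs(training_data, topic_distributions, topic_index, n=5):
    # Pair each document with its probability for the target topic.
    scored = [(dist[topic_index], doc)
              for doc, dist in zip(training_data, topic_distributions)]
    # Highest probability first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


docs = ["doc a", "doc b", "doc c"]
dists = [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]]
best = top_docs(docs, dists, topic_index=0, n=2)
print(best)
```

Reading the top documents for a topic is often the quickest way to judge whether its top words form a coherent theme.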

`plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some type of categorical labels, creates a heatmap of the labels x topics.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.
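A labels x topics heatmap needs one value per label and topic. One reasonable choice, assumed here and not confirmed as what the library plots, is the mean topic probability over all documents sharing a label. The name `mean_topic_by_label` and the data are hypothetical.

```python
from collections import defaultdict


def mean_topic_by_label(labels, topic_distributions, target_labels=None):
    grouped = defaultdict(list)
    for label, dist in zip(labels, topic_distributions):
        if target_labels is None or label in target_labels:
            grouped[label].append(dist)
    # zip(*dists) iterates over topics, giving one column of values per topic.
    return {label: [sum(column) / len(dists) for column in zip(*dists)]
            for label, dists in grouped.items()}


labels = ["poetry", "poetry", "news"]
dists = [[0.2, 0.8], [0.4, 0.6], [1.0, 0.0]]
matrix = mean_topic_by_label(labels, dists)
print(matrix)
```

The resulting rows could be passed to a plotting function such as a seaborn heatmap, with labels on one axis and topics on the other.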

`plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some type of categorical labels, creates a set of boxplots, one plot for each topic.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.

`divide_training_data(documents, num_chunks=10)`

Given a dataset, divides each document into a set of equally sized chunks.

Name	Type	Description
`documents`	list of strings	Documents to split.
`num_chunks`	integer	The number of chunks to split each document into.
RETURNS	tuple (list of strings, list of integers, list of floats)	The divided documents, the indices of the input documents, and the positions within the documents (0-1.0).
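The three returned lists line up element by element: chunk text, source document index, and position. The sketch below shows one way to produce them. The name `divide_documents`, the whitespace tokenization, and defining position as the chunk number divided by `num_chunks` are assumptions for illustration.

```python
def divide_documents(documents, num_chunks=10):
    chunks, doc_indices, positions = [], [], []
    for doc_index, doc in enumerate(documents):
        tokens = doc.split()
        for i in range(num_chunks):
            # Integer arithmetic spreads any remainder across the chunks.
            start = i * len(tokens) // num_chunks
            end = (i + 1) * len(tokens) // num_chunks
            chunks.append(" ".join(tokens[start:end]))
            doc_indices.append(doc_index)
            positions.append(i / num_chunks)
    return chunks, doc_indices, positions


chunks, doc_indices, positions = divide_documents(["a b c d", "e f"], num_chunks=2)
print(chunks, doc_indices, positions)
```

Note that in this sketch a document with fewer tokens than `num_chunks` produces some empty chunks, which you may want to filter out before training.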

`infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)`

Get topic distributions for a set of new documents using a model that has been trained on another set of documents.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_original_model`	string	Path to where the topic model was stored.
`path_to_new_formatted_training_data`	string	Path to where the MALLET formatted training data is stored.
`path_to_new_topic_distributions`	string	Path to where the topic distributions should be stored.
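Inference corresponds to MALLET's `infer-topics` command, which takes a saved inferencer and newly imported documents. The sketch below builds the argument list only; the helper name `build_infer_command` and the placeholder paths are assumptions. The new documents generally need to be imported with the same pipe as the original training data (see `use_pipe_from` in `import_data`).

```python
def build_infer_command(path_to_mallet, path_to_original_model,
                        path_to_new_formatted_training_data,
                        path_to_new_topic_distributions):
    # Build the argument list only; nothing is executed here.
    return [path_to_mallet, "infer-topics",
            "--input", path_to_new_formatted_training_data,
            "--inferencer", path_to_original_model,
            "--output-doc-topics", path_to_new_topic_distributions]


# All paths here are placeholders for illustration.
infer_command = build_infer_command(
    "path/to/mallet", "model.bin", "new.mallet", "new_topic_distributions.txt")
print(" ".join(infer_command))
```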

`plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)`

Creates a line plot for the target topic, showing the mean topic probability over document segments.

Name	Type	Description
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`times`	list of floats	The division indices within the document.
`topic_index`	integer	The index of the target topic.
`output_path`	string	Path to where the resulting figure should be saved.
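The quantity being plotted can be computed without any plotting library: group the chunks by their position and average the target topic's probability at each position. A minimal sketch, with the hypothetical name `mean_topic_over_time` and made-up data shaped like the output of `divide_training_data`:

```python
from collections import defaultdict


def mean_topic_over_time(topic_distributions, times, topic_index):
    grouped = defaultdict(list)
    for dist, t in zip(topic_distributions, times):
        grouped[t].append(dist[topic_index])
    # One (time, mean probability) pair per distinct time, in time order.
    return [(t, sum(values) / len(values)) for t, values in sorted(grouped.items())]


dists = [[0.2, 0.8], [0.6, 0.4], [0.4, 0.6], [1.0, 0.0]]
times = [0.0, 0.5, 0.0, 0.5]
series = mean_topic_over_time(dists, times, topic_index=0)
print(series)
```

Here topic 0 is more prominent in the second half of the documents than in the first.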

`get_js_divergence(topic_index_1, topic_index_2, topic_distributions)`

Calculates the Jensen-Shannon divergence between the two target topic distributions.

Name	Type	Description
`topic_index_1`	integer	Index of the first target topic distribution.
`topic_index_2`	integer	Index of the second target topic distribution.
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
RETURNS	float	Jensen-Shannon divergence of the requested topic distributions.
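For reference, the Jensen-Shannon divergence of two distributions P and Q is the average of their Kullback-Leibler divergences from the midpoint M = (P + Q) / 2. The sketch below computes it for two probability vectors; the function names and the base-2 logarithm (which bounds the result between 0 and 1) are choices made for this sketch, and the library may select its vectors or log base differently.

```python
import math


def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # The midpoint is positive wherever p or q is, so no division by zero occurs.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and always finite. Note that `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the Jensen-Shannon distance).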

Release history

0.5.0

May 12, 2021

0.4.0

May 11, 2021

0.3.0

May 10, 2021

0.2.0

Dec 13, 2020

0.1.0

Sep 28, 2020

0.0.12

May 8, 2020

0.0.11

Apr 25, 2020

0.0.10

Apr 2, 2020

0.0.9

Apr 2, 2020

0.0.8

Apr 2, 2020

0.0.7

Mar 30, 2020

0.0.6

Mar 26, 2020

0.0.5

Mar 25, 2020

0.0.4

Mar 20, 2020

0.0.3

Mar 20, 2020

0.0.2

Mar 20, 2020

0.0.1

Mar 2, 2020

Download files

Source Distribution

little-mallet-wrapper-0.5.0.tar.gz (20.0 kB)

Uploaded May 12, 2021 Source

Built Distribution

little_mallet_wrapper-0.5.0-py3-none-any.whl (19.5 kB)

Uploaded May 12, 2021 Python 3

File details

Details for the file little-mallet-wrapper-0.5.0.tar.gz.

File metadata

Download URL: little-mallet-wrapper-0.5.0.tar.gz
Upload date: May 12, 2021
Size: 20.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/1.6.0 pkginfo/1.5.0.1 requests/2.25.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9

File hashes

Hashes for little-mallet-wrapper-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`8c98592af4d4be4e732ae9ae3c6fae6a15af790029489779118aa92b11338d18`
MD5	`358701c4eac6cee067085cd189bb6421`
BLAKE2b-256	`9835ce9eba08be0264a743fbd77d319d963dfb14cb293bcb662dd5476f816c0a`

File details

Details for the file little_mallet_wrapper-0.5.0-py3-none-any.whl.

File metadata

Download URL: little_mallet_wrapper-0.5.0-py3-none-any.whl
Upload date: May 12, 2021
Size: 19.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/1.6.0 pkginfo/1.5.0.1 requests/2.25.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9

File hashes

Hashes for little_mallet_wrapper-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`850af1d206cea986e8dff98e6d69124c92e49ce1d082fd17de0aaabf9e35d57e`
MD5	`585aad91951298e656057152fdda884c`
BLAKE2b-256	`e3017e8561e33e79b408d9526b22b50e20bfdd8e551979237ad5c972759fe7d8`
