Text Module of REACT
Project description
Urban Risk Lab (URL) Text Analysis Module
This module applies machine learning algorithms to featurized text from crowdsourced crisis reports, producing efficient and accurate predictions that enable quick categorization and support an aggregate summary of an unfolding crisis event. Beyond classification, the module includes utilities for exploratory data analysis (EDA) and clustering of text data. It supports experimentation with several text embeddings, including unigrams, bigrams, TF-IDF, and pretrained BERT with CLS-pooling embeddings, for both classification and clustering. For these experiments, the module also provides analysis tools for EDA, visualizing clustering results, assessing classification model performance, and the associated plotting.
This project is compatible with Python version >= 3.7.
Instructions to Install
Using PyPI -- latest version of the package on PyPI
pip install url-text-module
Using GitLab credentials -- most recent commit
- Get the `.env` file by requesting it from url_googleai@mit.edu. Use the subject line [Read Credentials URL Text Module GitLab] and describe your plans for using it.
- Load the variables into the environment:
  `source <path_to_.env>`
- Run:
  `pip install -e git+https://$GITLAB_TOKEN_USER:$GITLAB_TOKEN@gitlab.com/react76/url-text-module.git@master#egg=url-text-module`
How to use in Python
At the moment, all classes, constants, and functions can be imported at the root level of the package, like so:
```python
from url_text_module import (
    ...
)
```
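For example, a few of the utilities described in the sections below can be imported directly from the package root:

```python
from url_text_module import (
    BOWVectorizer,
    TFIDFVectorizer,
    tokenize_with_fugashi,
    seed_everything,
)
```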
Package Structure & Utilities
This module provides various utilities for conducting reproducible experiments on crowdsourced crisis report text. These utilities include:
Preprocessing & Featurizing Text Data
- preprocessing_utils.py - Utilities for preprocessing and cleaning raw text inputs, e.g. tokenizing, removing stopwords, lemmatizing.
  - Includes language-specific utilities such as `tokenize_with_fugashi` for Japanese text and the `PretrainedBERTTokenizerAndModel` class, which pairs a tokenizer with a BERT model pretrained on a large language-specific corpus to create a contextualized feature embedding (feature vector) of an input text document with minimal preprocessing.
  - Also contains utilities for normalizing the input featurizations/embeddings before they are passed to a machine learning model (see `standardize_X`).
  - Provides a function that uses the metadata of a parameterized featurization to load the correct raw input string preprocessor, so that the corresponding featurizer object receives the proper input (see `create_input_string_preprocessor`).
- vectorizer_utils.py - Contains the language-agnostic Bag-of-Words (BOW) vectorizer (see `BOWVectorizer`) and TF-IDF vectorizer (see `TFIDFVectorizer`) for featurizing preprocessed, tokenized text data before it is passed to a machine learning model. Once fitted on a training text corpus, the vectorizers store several useful attributes, including the vocabulary of unique n-grams found in the training document corpus and a mapping from each n-gram to its corresponding index in the featurization vector. A minimal usage sketch follows below. Note: If models that use these vectorizers on the inputs are open-sourced, the vocabulary attributes may expose private information that might need to be scrubbed (e.g. names, addresses).
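As a rough illustration of the vectorizer workflow, here is a minimal usage sketch. The constructor arguments and the `fit`/`transform` method names are assumptions about the API rather than documented behavior; check vectorizer_utils.py for the actual signatures.

```python
from url_text_module import BOWVectorizer, TFIDFVectorizer

# A toy corpus of already-preprocessed, tokenized crisis report texts.
tokenized_corpus = [
    ["flood", "road", "blocked"],
    ["power", "outage", "downtown"],
    ["flood", "water", "rising", "downtown"],
]

# Hypothetical usage: fit on a training corpus, then featurize documents.
bow = BOWVectorizer()              # assumed: defaults to unigram counts
bow.fit(tokenized_corpus)          # assumed method name
X_bow = bow.transform(tokenized_corpus)

tfidf = TFIDFVectorizer()
tfidf.fit(tokenized_corpus)
X_tfidf = tfidf.transform(tokenized_corpus)

# Once fitted, the vectorizers expose the n-gram vocabulary and an
# n-gram -> feature-index mapping (attribute names omitted here).
```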
Classifying Text Data
- classification.py - Utilities for performing classification on crisis text data including:
  - Performing cross-validation (CV) for tuning via grid search, given a grid of hyperparameters for an algorithm and the featurizations to apply to the input text (see `CrossValidationWithGridSearch`)
  - Performing algorithm selection using nested cross-validation (nested CV) (see `NestedCVWithGridSearch`; a conceptual sketch of nested CV follows this list)
  - Utilities for saving the results of classification experiments that use cross-validation or nested CV
  - A function for building a dummy classifier baseline for a classification experiment (see `get_dummy_classifier_preds_and_probs`)
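To make the nested CV pattern concrete, here is a minimal sketch using scikit-learn's generic grid-search and cross-validation utilities. It only illustrates the idea of an inner tuning loop wrapped in an outer evaluation loop; it is not a description of how `NestedCVWithGridSearch` is implemented.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data: raw crisis report strings and binary relevance labels.
texts = ["road flooded near bridge", "all clear downtown",
         "water rising fast", "no damage reported"] * 5
labels = [1, 0, 1, 0] * 5

# Inner loop: grid search over hyperparameters (the module's version also
# searches over featurizations of the input text).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
inner_cv = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")

# Outer loop: estimate the generalization performance of the tuned pipeline.
outer_scores = cross_val_score(inner_cv, texts, labels, cv=3, scoring="f1")
print("Nested CV F1 scores:", outer_scores, "mean:", outer_scores.mean())
```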
Clustering Text Data
- dimensionality_reduction.py - Utilities for reducing the dimensionality of featurized text data
- clustering.py - Contains utilities for performing clustering experiments:
  - Experimenting with different K values for a clustering algorithm (see `run_clustering_experiment`; a conceptual K-selection sketch follows this list)
  - Experimenting with various hyperparameter combinations, including the type of featurization to apply to the input data (unigrams, bigrams, TF-IDF, or BERT with CLS pooling), the dimensionality reduction algorithm to apply to the featurized text data (none, PCA, or t-SNE), and the clustering algorithm to use on the data points (K-means or K-medoids) (see `clustering_tuning`)
  - For a specific combination of the hyperparameters above, auxiliary utilities for analyzing each discovered cluster, such as viewing the n text inputs closest to the cluster center and finding the top n-grams with the highest TF-IDF score for a cluster, computed by concatenating the documents in each cluster into a single cluster-level document (see `visualize_cluster_results` & `save_cluster_dfs`)
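For background, the K-selection experiment can be sketched directly with scikit-learn: compute the within-cluster sum-of-squares (inertia) for several values of K and look for the elbow. This illustrates the idea only; the arguments of `run_clustering_experiment` and `clustering_tuning` may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy featurized text data (e.g. TF-IDF vectors); random here for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Optionally reduce dimensionality before clustering.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# Fit K-means for a range of K and record the inertia (WCSS) for an elbow plot.
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_reduced)
    inertias[k] = km.inertia_
print(inertias)
```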
Plotting Utilities
- plotting_utils.py - Utilities for producing visualizations for analysis, including exploratory data analysis and analysis of the results of clustering and classification experiments:
  - EDA:
    - Plotting a histogram of a feature's distribution, e.g. the character count of text strings in a dataset (see `plot_data_hist`)
    - Plotting a box-and-whisker plot of a feature's distribution (see `plot_data_box_plot`)
    - Plotting a side-by-side comparison, via box-and-whisker plots, of a feature's distribution across two text datasets (see `plot_box_plot_data_comparison`)
  - Classification:
    - Plotting the results of various algorithms and their associated hyperparameter + featurization grids from a nested cross-validation procedure for algorithm selection (see `plot_nested_cv_results`)
    - Plotting the precision-recall curve for a binary classifier (see `plot_precision_recall_curve_for_model`; a brief matplotlib sketch follows this list)
  - Clustering:
    - Plotting the Within-Cluster Sum-of-Squares (WCSS) scores for different values of K, i.e. the "elbow" plot (see `plot_inertia_scores`)
    - Plotting the results of clustering data that has been reduced to 2D, e.g. by first applying a dimensionality reduction technique and then a clustering algorithm (see `plot_clustering`)
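For reference, the precision-recall curve plotted by `plot_precision_recall_curve_for_model` can be approximated directly with scikit-learn and matplotlib; the sketch below shows the underlying computation, not the module's function signature.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Ground-truth binary labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.3, 0.35, 0.8, 0.65, 0.2, 0.9, 0.45, 0.7, 0.55]

precision, recall, _ = precision_recall_curve(y_true, y_prob)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve")
plt.show()
```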
Miscellaneous Utilities
- eda.py - Utilities for performing non-plotting exploratory data analysis on crisis text datasets, e.g. computing the skew of a distribution (see `compute_distribution_skew`)
- metric_utils.py - Utilities for computing metric scores (F2, AUCPR, etc.) by comparing ground-truth labels against model predictions, and for creating scorer functions for a grid search procedure
- classes.py - Defines classes that are useful across the package (a brief named-tuple sketch follows this list):
  - A named tuple for structuring the metadata of a parameterized input featurization (see the `Featurization` class)
  - A named tuple for storing the metadata of an algorithm to be used in a cross-validation or nested cross-validation procedure, e.g. the name of the algorithm, the hyperparameter + featurization grid to use, and the type of normalization to apply to the input featurizations (see `AlgorithmMetadata`)
  - A named tuple for storing the metadata of a fitted model, including the hyperparameters used to train it, the featurization to apply to the input data, the normalization type to apply to the featurized inputs, and the version of the package used to fit the model (see the `ModelMetadata` class)
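To make the named-tuple pattern concrete, here is a minimal sketch. The field names below are illustrative assumptions and do not necessarily match the actual fields of `Featurization`, `AlgorithmMetadata`, or `ModelMetadata`.

```python
from typing import Any, Dict, NamedTuple, Optional

# Hypothetical shape of an algorithm-metadata record; field names are assumptions.
class ExampleAlgorithmMetadata(NamedTuple):
    algo_name: str
    hyperparam_featurization_grid: Dict[str, Any]
    normalization_type: Optional[str]

meta = ExampleAlgorithmMetadata(
    algo_name="logistic_regression",
    hyperparam_featurization_grid={"C": [0.1, 1.0], "featurization": ["tfidf", "bigram"]},
    normalization_type="standardize",
)
print(meta.algo_name)
```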
- constants.py - Defines various constants used throughout the package, including:
  - Keys and names of defaults, algorithms, featurization types, metadata, filenames, column names of pandas dataframes, etc., for consistent naming in dictionaries, pandas dataframes, and saved metadata (e.g. see `INPUT_COL_NAME`, `UNIGRAM_NAME`, `ALGO_NAME_KEY`)
  - Dictionaries mapping the names or types of algorithms, metrics, etc. to the corresponding object type or callable that can be instantiated or called -- these are the models/metrics/cross-validation types/etc. that are compatible with the package (e.g. see `METRIC_FUNC_DICT`, `NORMALIZER_DICT`, `DIMENSIONALITY_REDUCTION_ALGORITHMS`, `CLUSTERING_ALGORITHMS`, `CROSS_VAL_DICT`, `ALGO_DICT`, `VECTORIZER_MAP`, `HF_BERT_MODELS`, etc.)
  - Sets referring to compatible types of featurizations (see `FEATURIZATIONS_SET`), pretrained tokenizers (see `PRETRAINED_TOKENIZERS_SET`), algorithms that do not use a random state (see `DOESNT_USE_RANDOM_STATE_SET`), strategies for a dummy classifier (see `DUMMY_CLASSIFIER_STRATEGIES`), etc.
- misc_utils.py - Miscellaneous utilities that are useful across the package for ensuring reproducibility, saving metadata, making tables for LaTeX, etc. (e.g. see `seed_everything`, `latexify_num`, `save_dict_as_json`; a short reproducibility sketch follows)
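As an illustration of the reproducibility concern that `seed_everything` addresses, the generic pattern below seeds the common random number generators manually; it is not the module's implementation.

```python
import os
import random

import numpy as np

def example_seed_everything(seed: int = 42) -> None:
    """Generic seeding pattern (hypothetical stand-in, not the module's function)."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

example_seed_everything(42)
```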
For Maintainers
Updating GitLab Repository
To add all modified files, commit them, push to the GitLab repo, and update the repo with the changes and tag number, run:
sh update_repo.sh -t <tag> -m <commit message>
When updating dependencies:
- Install the dependency with: `pipenv install <name-of-package>`
- To update requirements.txt, run: `sh update_requirements.sh`
- Commit & push with the update command above
Adding New Files to Python Package
If you want to add a file that contains new functionality, i.e. functionality that merits its own file separate from the existing ones, you must add it to the __init__.py file. You can do the following to import specific functions, classes, etc. from the file into the Python package. Anything that isn't imported can't be used by the end user.

In __init__.py (specific imports):
```python
from .name_of_new_file import (
    specific_function_you_want_to_import,
    specific_class_you_want_to_import,
    ...
)
...
del name_of_new_file
```
If you want all functionality from the file to be available to the end user, do the following.

In __init__.py (import everything):
```python
from .name_of_new_file import *
...
del name_of_new_file
```
Publish Package to PyPI
- Launch the virtual environment with `pipenv shell`
- Install dependencies with `pipenv install`
- Run `python setup.py bdist_wheel sdist`. To test, run: `sh test.sh`
- Run `import url_text_module` (in a Python session) -- this should give no errors if everything is working properly
- Run `twine upload dist/*`. Note: You will need login credentials for the URL PyPI account in order to publish to PyPI.
Notes
Installing torch with pipenv is a bit of a hassle; this GitHub post was helpful in figuring it out. To install a torch build with both CPU & GPU capabilities via pipenv, run:
pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.9.0"
File details
Details for the file url_text_module-0.6.0.tar.gz.
File metadata
- Download URL: url_text_module-0.6.0.tar.gz
- Upload date:
- Size: 73.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.3 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.64.0 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.4 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | d26bb5b919b074b597893c981dfd1d2769eef2e191bdf4ff04accdefcbee9c6f
MD5 | 01e4641300fe1e64afc41107fd1710cb
BLAKE2b-256 | ed661be07bb60371d49d3eccd3c01cb1d075527b09503a6a7d0353ffc7927038
File details
Details for the file url_text_module-0.6.0-py3-none-any.whl.
File metadata
- Download URL: url_text_module-0.6.0-py3-none-any.whl
- Upload date:
- Size: 57.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.3 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.64.0 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.4 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1ef928cf4c01741a7a1705806940d647c48cf1efb12aacc787a729c6eb06a6e5
MD5 | 09e809f497c999d984a578ced575c7c8
BLAKE2b-256 | 5a32c811299a1b99da196edf82b4711015b001111d3e80f8730eca09ebd05cef