Image Module of REACT
Urban Risk Lab (URL) Image Analysis Module
The goal of this module is to apply effective Convolutional Neural Network (CNN) models to produce efficient and accurate predictions from image data in crowdsourced crisis reports, providing quick categorization that can be used to construct an aggregate summary of an unfolding crisis event. In addition to utilities for conducting training and inference and saving those results, it provides analysis tools for image annotation, model performance, and associated plotting.
This project is compatible with Python version >= 3.6.
Instructions to Install
Using PyPI -- latest version of package on PyPI
pip install url-image-module
Using GitLab Credentials -- using most recent commit
- Get the .env file by requesting it from url_googleai@mit.edu; use the subject headline [Read Credentials URL Image Module GitLab] and describe your plans for using it.
- Load the variables into the environment:
source <path_to_.env>
- Run:
pip install -e git+https://$GITLAB_TOKEN_USER:$GITLAB_TOKEN@gitlab.com/react76/url-image-module.git@master#egg=url-image-module
How to use in Python
At the moment, all classes, constants, and functions can be imported at the root level of the package, like so:
from url_image_module import (
...
)
Package Structure & Utilities
This module provides various utilities for conducting reproducible experiments on crowdsourced crisis report images. These utilities include:
Training, Testing, and Prediction with PyTorch Models
- training.py -- Contains utilities for training a model for multiple epochs on a train split of image data & validating on a dev split at each epoch. Applies online data augmentation as a form of regularization during training.
- testing.py -- Contains utilities for testing a trained model on a set of labeled images & saving those test results.
- prediction.py -- Contains utilities for using a trained model to predict on a folder of images located on the host and creating a dataframe to store prediction metadata (i.e. predicted label, prediction scores).
- classes.py -- Defines classes & helper functions used across the package:
  - Creating image datasets for prediction (see PredictionImageDataset)
  - Instantiating PyTorch models pretrained on ImageNet (see PretrainedImageCNN)
  - Dictionary of possible pretrained single-label model architectures (see PRETRAINED_MODELS_DICT)
  - Dictionary of possible optimizer algorithms (see OPTIMIZER_DICT)
  - Helpers for constructing the correct model architecture, loading pretrained weights from a .pt file, and constructing an optimizer object with a user-specified learning rate & the correct weights to update
- constants.py -- Defines various constants used throughout the package, including:
  - Constants for transforming images before they are input to the model (for training, this includes constants for data augmentation techniques)
  - Constants for consistent naming conventions used throughout the package (e.g. TRAIN_SPLIT, DEV_SPLIT, TEST_SPLIT)
  - Dictionary of loss criteria used for training a model (see CRITERION_DICT)
  - Evaluation metrics used for assessing model performance (see EVALUATION_METRICS_FUNC_DICT)
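The model and optimizer dictionaries described above (PRETRAINED_MODELS_DICT, OPTIMIZER_DICT) follow a common name-to-constructor dispatch pattern. A minimal sketch of that pattern is below; the entries and function names here are illustrative placeholders, not the package's actual contents:

```python
# Sketch of the name -> constructor dispatch used by dictionaries such as
# PRETRAINED_MODELS_DICT and OPTIMIZER_DICT. Entries are placeholders; in the
# real package the constructors would build pretrained PyTorch models.

def _make_resnet18():
    # Stand-in for building a torchvision model pretrained on ImageNet.
    return "resnet18-model"

def _make_vgg16():
    return "vgg16-model"

MODELS_DICT = {
    "resnet18": _make_resnet18,
    "vgg16": _make_vgg16,
}

def build_model(name):
    """Look up a constructor by name and call it, failing loudly on typos."""
    try:
        constructor = MODELS_DICT[name]
    except KeyError:
        raise ValueError(
            f"Unknown model '{name}'. Choose from: {sorted(MODELS_DICT)}"
        )
    return constructor()
```

Keeping the supported architectures in a dictionary makes the valid choices easy to enumerate in error messages and command-line help.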
Operating System, PyTorch, and Pandas Utilities
- os_utils.py -- Utilities for interacting with the host's operating system, i.e. the filesystem:
  - Making/deleting directories
  - Copying files
  - Extracting filepaths
  - Updating filepaths
- pd_utils.py -- Utilities for interacting with a pandas dataframe (df), including:
  - Copying files from one location on the host to another using information stored in a df
  - Subsetting columns in a df to a user-provided relevant subset
  - Cleaning a df of empty or partially-empty rows
  - Left-joining dfs by filenames
  - Saving a df as a CSV on the host's filesystem
- pt_utils.py -- Utilities for interacting with PyTorch, including:
  - Naming a file with the proper PyTorch extension (.pt)
  - Determining the appropriate device (i.e. CPU or GPU) to put tensors on
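Copying files based on rows of a dataframe, as pd_utils.py does, can be sketched as follows. The helper name and column name are hypothetical, not the package's actual API:

```python
import os
import shutil
import tempfile

import pandas as pd

def copy_files_from_df(df, src_col, dest_dir):
    """Copy every file listed in df[src_col] into dest_dir.

    Hypothetical helper illustrating the pattern; the real package's
    function names and signatures may differ.
    """
    os.makedirs(dest_dir, exist_ok=True)
    for src_path in df[src_col]:
        shutil.copy2(src_path, dest_dir)

# Demonstration with two throwaway files in a temp directory.
work = tempfile.mkdtemp()
paths = []
for name in ("a.txt", "b.txt"):
    p = os.path.join(work, name)
    with open(p, "w") as f:
        f.write(name)
    paths.append(p)

df = pd.DataFrame({"filepath": paths})
dest = os.path.join(work, "copied")
copy_files_from_df(df, "filepath", dest)
print(sorted(os.listdir(dest)))  # ['a.txt', 'b.txt']
```

Driving file operations from a dataframe keeps the file layout reproducible: the same CSV of filepaths always yields the same copy.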
Data Labeling & Annotation Analysis Utilities
- data_labeling_utils.py - Utilities for conducting annotation efforts & performing interannotator agreement analysis -- agnostic to data type (i.e. works for images & text):
- Creating CSV for annotating a folder of unlabeled data
- Methods for assessing type of agreement on a single data point, i.e. complete agreement, complete disagreement, plurality agreement, etc.
- Methods for computing statistics for a labeled dataset including:
- Number of unique labels provided for a task
- Plurality agreement percentage
- Complete agreement percentage
- Fleiss' Kappa coefficient
- Cohen's Kappa coefficient (weighted/unweighted)
- Methods for ground-truthing a dataset i.e. by plurality label
- Methods for wrangling a dataframe of labels (i.e. melting) and standardizing column names
- Methods for reviewing data labeling, i.e. making directory of all data points which had complete agreement, plurality agreement but not complete agreement, etc.
- Methods for reviewing predictions made by a model and contrasting them against ground-truth labels.
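The per-datapoint agreement checks and plurality ground-truthing described above reduce to counting label votes. A small self-contained sketch of those two computations (the function names are illustrative, not the package's API):

```python
from collections import Counter

def agreement_type(labels):
    """Classify annotator agreement on one data point's list of labels."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    if top_count == len(labels):
        return "complete_agreement"
    # Plurality: exactly one label has the strictly highest vote count.
    ties = [label for label, c in counts.items() if c == top_count]
    if len(ties) == 1:
        return "plurality_agreement"
    if all(c == 1 for c in counts.values()):
        return "complete_disagreement"
    return "tie"

def plurality_label(labels):
    """Ground-truth a data point by its plurality label (None on ties)."""
    counts = Counter(labels)
    top, top_count = counts.most_common(1)[0]
    ties = [label for label, c in counts.items() if c == top_count]
    return top if len(ties) == 1 else None
```

For example, labels ["flood", "flood", "fire"] give plurality agreement with ground-truth label "flood", while ["flood", "flood", "fire", "fire"] is a tie and yields no plurality label.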
Plotting Utilities
- plotting_utils.py - Utilities for producing visualizations for conducting analysis:
- Generating Confusion Matrices for Classification for visualizing ground-truth labels vs. model predictions
- Plot for visualizing performance of a model on each epoch of training on both train & dev sets, i.e. learning curves
- Plot for visualizing model performance on each class of a task
- EDA Plot on labeled datasets -- useful for visualizing class imbalance prior to modeling
- Plot for annotation analysis showing the number of images in a dataset with at least n unique annotators who provided a label for the image on that task -- useful for determining a cutoff for interannotator analysis and ground-truthing a dataset
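The "at least n annotators" plot is driven by a simple cumulative count over how many unique annotators labeled each image. A sketch of that underlying computation (names and data layout are assumptions, not the package's API):

```python
def images_with_at_least_n_annotators(annotators_per_image, max_n):
    """For each n in 1..max_n, count images labeled by >= n unique annotators.

    annotators_per_image maps an image filename to the set of annotators
    who provided a label for it.
    """
    counts = [len(annotators) for annotators in annotators_per_image.values()]
    return {n: sum(1 for c in counts if c >= n) for n in range(1, max_n + 1)}

# Toy dataset: three images with 3, 2, and 1 annotators respectively.
annotators = {
    "img1.jpg": {"ann_a", "ann_b", "ann_c"},
    "img2.jpg": {"ann_a", "ann_b"},
    "img3.jpg": {"ann_c"},
}
curve = images_with_at_least_n_annotators(annotators, 3)
print(curve)  # {1: 3, 2: 2, 3: 1}
```

Plotting this dictionary makes the cutoff choice visible: the largest n whose count is still acceptably high is a natural threshold for inclusion in interannotator analysis.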
Python Programs -- Data Labeling & Creating ImageFolders
- create_image_split_folders.py -- Python program which organizes image data into train, dev, & test split folders using corresponding files (CSV, TSV, etc.) that provide filenames and labels for each split, and saves these splits to a destination folder on the host.
- make_image_labeling_csv.py -- Python program which creates a labeling CSV for various classification tasks using filenames located in a directory on the host.
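Building split folders of this kind essentially means reading (filename, label) rows for a split and copying each image into <dest>/<split>/<label>/. A stdlib-only sketch under those assumptions -- the CSV columns and directory layout here are illustrative, not the program's documented interface:

```python
import csv
import os
import shutil
import tempfile

def build_split_folder(split_csv, src_dir, dest_dir, split_name):
    """Copy images listed in split_csv into dest_dir/split_name/<label>/."""
    with open(split_csv, newline="") as f:
        for row in csv.DictReader(f):
            label_dir = os.path.join(dest_dir, split_name, row["label"])
            os.makedirs(label_dir, exist_ok=True)
            shutil.copy2(os.path.join(src_dir, row["filename"]), label_dir)

# Tiny end-to-end demonstration with empty stand-in "images".
work = tempfile.mkdtemp()
src = os.path.join(work, "images")
os.makedirs(src)
for name in ("a.jpg", "b.jpg"):
    open(os.path.join(src, name), "w").close()

csv_path = os.path.join(work, "train.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    writer.writerow(["a.jpg", "flood"])
    writer.writerow(["b.jpg", "no_flood"])

dest = os.path.join(work, "splits")
build_split_folder(csv_path, src, dest, "train")
print(sorted(os.listdir(os.path.join(dest, "train"))))  # ['flood', 'no_flood']
```

The split/label directory layout matches what torchvision's ImageFolder-style dataset loaders expect, which is why programs like this are a common preprocessing step.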
Miscellaneous Utilities
- model_utils.py -- Utilities for saving & loading trained model weights and other model metadata (i.e. hyperparameters, training settings, classes for the task, etc.) for future use, constructing the correct architecture for a model, and extracting outputs from a model for analysis
- metric_utils.py -- Utilities for computing metric scores (Precision, Recall, F1, etc.) & confusion matrices by comparing ground-truth labels against model predictions
- misc_utils.py -- Miscellaneous utilities useful across the package (see prettify_underscore_string)
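The metric computations in metric_utils.py boil down to standard precision, recall, and F1 derived from true/false positive and negative counts for each class. A minimal per-class sketch (not the package's actual functions):

```python
def per_class_scores(y_true, y_pred, positive_class):
    """Compute precision, recall, and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One true positive, one false positive, one false negative for "flood".
y_true = ["flood", "flood", "no_flood", "no_flood"]
y_pred = ["flood", "no_flood", "no_flood", "flood"]
scores = per_class_scores(y_true, y_pred, "flood")
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Guarding the divisions keeps the scores defined for classes that never appear in the predictions or ground truth, which is easy to hit with imbalanced crisis-image datasets.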
For Maintainers
Updating GitLab Repository
To add all modified files, commit them, push the changes to the GitLab repo, and update the tag number, run:
sh update.sh -t <tag> -m <commit message>
When updating dependencies, make sure to:
- Install with:
pipenv install <name-of-package>
- Update requirements.txt:
pipenv run pip freeze > requirements.txt
- Commit & push with the update command above
Adding New Files to Python Package
If you want to add a file which contains new functionality (i.e. it merits its own file separate from the existing ones), you must add it to the __init__.py file. Import the specific functions, classes, etc. that you want to expose from the file; anything that isn't imported can't be used by the end user.
In __init__.py (specific imports):
from .name_of_new_file import (
specific_function_you_want_to_import,
specific_class_you_want_to_import,
...
)
...
del name_of_new_file
If you want all functionality from the file to be available to the end-user, do the following:
in __init__.py
(import everything):
from .name_of_new_file import *
...
del name_of_new_file
Publish Package to PyPI
- Launch a virtual environment with:
pipenv shell
- Install dependencies with:
pipenv install
- Build the distributions:
python setup.py bdist_wheel sdist
- To test, run:
pip install -e .
python
import url_image_module
which should give no errors if everything is working properly.
- Upload to PyPI:
twine upload dist/*
Note: You will need login credentials for the URL PyPI account in order to publish to PyPI.
Building & Pushing Docker Images on AWS ECR
A. Locally
In order to use this package on AWS infrastructure, we must first build & push Docker images. There are two separate Dockerfiles, one for training and the other for inference. Run ./sm-containers/train/make.sh or ./sm-containers/inference/make.sh respectively to build those Docker images & push them to AWS ECR. Running these bash scripts builds the images (including installing the url-image-module Python library) and uploads them to ECR, where SageMaker can pull them. Make sure the correct .env file is in the root of the url-image-module repo. Running either make.sh locally will build the container on your host and then push the image to ECR.
B. Using CodeBuild on AWS SageMaker
If you want to build the image & push it to ECR using CodeBuild, there is a notebook that can be run on a SageMaker instance which builds the containers using CodeBuild: ./sm-containers/make-docker-image.ipynb. If a SageMaker instance does not exist for building images, make a new one with a volume of at least 40 GB. Once inside the instance, select the GitHub icon in the top-right corner of the file menu on the left. It will prompt you for the HTTPS link to the repo you want to add; you can find this link under the 'Clone' button in the GitLab repo. Once you provide the link, it will prompt you for credentials, which you can find under Settings > Repository > Deploy Tokens -- it should be the only one with username gitlab+deploy-token-1070133.
Note that in either case you will need a .env file with the deploy-token credentials in the root of the url-image-module repo when building the containers. Please contact url_googleai@mit.edu to get this .env file.
Notes
Installing torch & torchvision with pipenv is a bit of a hassle; a GitHub post was helpful in figuring it out. To install torch & torchvision with both CPU & GPU capabilities under pipenv, run:
pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.9.0"
pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torchvision==0.10.0"