Image Module of REACT

Project description

Urban Risk Lab (URL) Image Analysis Module

The goal of this module is to use effective Convolutional Neural Network (CNN) models to yield efficient and accurate predictions from image data in crowdsourced crisis reports, providing quick categorizations that can be aggregated into a summary of the unfolding crisis event. In addition to utilities for conducting training and inference and saving those results, it provides analysis tools for image annotation, model performance, and associated plotting.

This project is compatible with Python version >= 3.6.

Instructions to Install

Using PyPI -- latest version of the package on PyPI

pip install url-image-module

Using GitLab Credentials -- most recent commit

  1. Get the .env file by requesting it from url_googleai@mit.edu; use the subject line [Read Credentials URL Image Module GitLab] and describe your plans for using it.
  2. Load variables into the environment: source <path_to_.env>
  3. Run pip install -e git+https://$GITLAB_TOKEN_USER:$GITLAB_TOKEN@gitlab.com/react76/url-image-module.git@master#egg=url-image-module

How to use in Python

At the moment, all classes, constants, and functions can be imported at the root level of the package, like so:

from url_image_module import (
    ...
)
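
For example, using names that are documented in the sections below:

from url_image_module import (
    PredictionImageDataset,
    PretrainedImageCNN,
    PRETRAINED_MODELS_DICT,
    OPTIMIZER_DICT,
)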

Package Structure & Utilities

This module provides various utilities for conducting reproducible experiments on crowdsourced crisis report images. These utilities include:

Training, Testing, and Prediction with PyTorch Models
  • training.py - Contains utilities for training a model for multiple epochs on a train split of image data & validating on a dev split at each epoch. Applies online data augmentation as a form of regularization during training (see the sketch after this list).

  • testing.py - Contains utilities for testing a trained model on a set of labeled images & saving those test results.

  • prediction.py - Contains utilities for using a trained model to predict on a folder of images located on the host and for creating a dataframe of prediction metadata (i.e. predicted label, prediction scores).

  • classes.py - Defines classes & helper functions used across the package:

    • Creating image datasets for prediction (see PredictionImageDataset)
    • Instantiating pretrained PyTorch models (pretrained on ImageNet) (see PretrainedImageCNN)
    • Dictionary of possible pretrained single label model architectures (see PRETRAINED_MODELS_DICT)
    • Dictionary of possible optimizer algorithms (see OPTIMIZER_DICT)
    • Helpers for constructing the correct model architecture, loading pretrained weights from a .pt file, and constructing an optimizer object with a user-specified learning rate & the correct weights to update.
  • constants.py - Defines various constants used throughout the package, including:

    • Constants necessary for transforming images prior to being inputted into the model (for training, this includes constants for data augmentation techniques)
    • Constants for consistent naming conventions used throughout the package (e.g. TRAIN_SPLIT, DEV_SPLIT, TEST_SPLIT)
    • Dictionary of loss criteria used for training a model (see CRITERION_DICT)
    • Evaluation metrics used to assess the performance of a model (see EVALUATION_METRICS_FUNC_DICT)
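
As a rough illustration of the items above that mention transforms (online augmentation for the train split, fixed normalization constants), here is a minimal sketch using plain torchvision; the names and parameter values are illustrative, not this package's actual API:

from torchvision import transforms

# Standard ImageNet normalization constants used with pretrained models
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Train split: random crops & flips act as online augmentation (regularization)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Dev/test splits: deterministic resize & crop, no augmentation
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
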
Operating System, PyTorch, and Pandas Utilities
  • os_utils.py - Utilities for interacting with the host's operating system, i.e. the filesystem:

    • Making/deleting directories
    • Copying files
    • Extracting filepaths
    • Updating filepaths
  • pd_utils.py - Utilities for interacting with a pandas dataframe (df) including:

    • Copying files from one location on host to another using information stored in a df
    • Subsetting columns in a df to a user-provided relevant subset
    • Cleaning df of empty or partially-empty rows
    • Left-joining dfs by filenames
    • Saving df as CSV on host's filesystem
  • pt_utils.py - Utilities for interacting with PyTorch (see the sketch after this list), including:

    • Naming a file with a proper PyTorch extension (.pt)
    • Determining the appropriate device (i.e. CPU or GPU) to put tensors on
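
A minimal sketch of the kind of helpers pt_utils.py provides; the function names here are hypothetical, not this package's actual API:

import torch

def get_device() -> torch.device:
    # Hypothetical helper: prefer the GPU when one is available
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def with_pt_extension(filename: str) -> str:
    # Hypothetical helper: ensure the standard PyTorch '.pt' extension
    return filename if filename.endswith(".pt") else f"{filename}.pt"
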
Data Labeling & Annotation Analysis Utilities
  • data_labeling_utils.py - Utilities for conducting annotation efforts & performing interannotator agreement analysis -- agnostic to data type (i.e. works for images & text):
    • Creating CSV for annotating a folder of unlabeled data
    • Methods for assessing type of agreement on a single data point, i.e. complete agreement, complete disagreement, plurality agreement, etc.
    • Methods for computing statistics for a labeled dataset (see the sketch after this list), including:
      • Number of unique labels provided for a task
      • Plurality agreement percentage
      • Complete agreement percentage
      • Fleiss' Kappa coefficient
      • Cohen's Kappa coefficient (weighted/unweighted)
    • Methods for ground-truthing a dataset i.e. by plurality label
    • Methods for wrangling a dataframe of labels (i.e. melting) and making column names consistent.
    • Methods for reviewing data labeling, i.e. making directory of all data points which had complete agreement, plurality agreement but not complete agreement, etc.
    • Methods for reviewing predictions by a model and contrasting them against ground-truth labels.
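
As a rough illustration of the agreement statistics listed above, here is a minimal sketch using standard libraries (statsmodels & scikit-learn) rather than this package's own helpers; the labels are hypothetical:

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical annotations: rows are data points, columns are annotators
labels = np.array([
    ["flood", "flood", "flood"],
    ["flood", "fire",  "flood"],
    ["fire",  "fire",  "fire"],
])

# Fleiss' kappa across all annotators
table, _ = aggregate_raters(labels)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Cohen's kappa between a single pair of annotators
print("Cohen's kappa:", cohen_kappa_score(labels[:, 0], labels[:, 1]))
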
Plotting Utilities
  • plotting_utils.py - Utilities for producing visualizations for conducting analysis:
    • Generating confusion matrices for visualizing ground-truth labels vs. model predictions (see the sketch after this list)
    • Plot for visualizing performance of a model on each epoch of training on both train & dev sets, i.e. learning curves
    • Plot for visualizing model performance on each class of a task
    • EDA Plot on labeled datasets -- useful for visualizing class imbalance prior to modeling
    • Plot for Annotation Analysis showing number of images in a dataset which have at least n or more unique annotators who provided a label for the image on that task -- useful for determining a cutoff for interannotator analysis and ground-truthing a dataset
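
A minimal sketch of the confusion-matrix visualization described above, using scikit-learn & matplotlib directly rather than this package's wrappers; the labels are hypothetical:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["flood", "flood", "fire", "none"]  # hypothetical ground-truth labels
y_pred = ["flood", "fire", "fire", "none"]   # hypothetical model predictions

# Plot ground truth vs. predictions as a confusion matrix
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
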
Python Programs -- Data Labeling & Creating ImageFolders
  • create_image_split_folders.py - Python program which constructs image data folders for train, dev, & test splits using corresponding label files (CSV, TSV, etc.) which provide filenames and labels for each split, and saves these splits to a destination folder on the host.

  • make_image_labeling_csv.py - Python program which creates a labeling CSV for various classification tasks using filenames located in a directory on the host.

Miscellaneous Utilities
  • model_utils.py - Utilities for saving & loading trained model weights and other model metadata (i.e. hyperparameters, training settings, classes for the task, etc.) for future use, constructing the correct architecture for a model, and extracting outputs from a model for analysis (see the sketch below)

  • metric_utils.py - Utilities for computing metric scores (Precision, Recall, F1, etc.) & confusion matrices by comparing ground truth labels against model predictions

  • misc_utils.py - Miscellaneous utilities which are useful across the package (see prettify_underscore_string)
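
A minimal sketch, assuming nothing about this package's API, of the two ideas behind model_utils.py and metric_utils.py: checkpointing model weights together with metadata in a single .pt file, and computing metric scores from predictions. The model, classes, and labels here are hypothetical:

import torch
from sklearn.metrics import precision_recall_fscore_support
from torchvision.models import resnet18

# Hypothetical model & task classes, purely for illustration
model = resnet18()
classes = ["fire", "flood", "none"]

# Save weights plus metadata (hyperparameters, classes, etc.) in one file
torch.save({
    "state_dict": model.state_dict(),
    "classes": classes,
    "learning_rate": 1e-4,
}, "model.pt")

# Later: rebuild the same architecture, then load the saved weights
checkpoint = torch.load("model.pt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])

# Score hypothetical predictions against ground-truth labels
y_true = ["fire", "flood", "none", "flood"]
y_pred = ["fire", "flood", "fire", "flood"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, average="macro"
)
print(precision, recall, f1)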

For Maintainers

Updating GitLab Repository

To add all modified files, commit them, push to the GitLab repo, and update the repo with the changes and tag number, run:

sh update.sh -t <tag> -m <commit message>
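
For example, with a hypothetical tag & commit message:

sh update.sh -t 0.27.1 -m "Add plotting utilities"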

When updating dependencies, make sure to use:

  1. pipenv install <name-of-package>
  2. Update requirements.txt: pipenv run pip freeze > requirements.txt
  3. Commit & push with the update command above

Adding New Files to Python Package

If you want to add a file which contains new functionality, i.e. functionality that merits its own file separate from the existing ones, you must add it to the __init__.py file, like so:

You can do the following to import specific functions, classes, etc. from the file into the Python package. Anything that isn't imported can't be used by the end user.

in __init__.py (specific imports):
from .name_of_new_file import (
   specific_function_you_want_to_import,
   specific_class_you_want_to_import,
   ...
)
...
del name_of_new_file  # remove the submodule reference from the package namespace

If you want all functionality from the file to be available to the end-user, do the following:

in __init__.py (import everything):
from .name_of_new_file import *
...
del name_of_new_file

Publish Package to PyPI

  1. Launch virtual environment with pipenv shell
  2. Install dependencies with pipenv install
  3. Run python setup.py bdist_wheel sdist. To test, run:
    1. Run pip install -e .
    2. Run python
    3. Run import url_image_module -- should give no errors if it's working properly
  4. Run twine upload dist/*. Note: You will need login credentials for the URL PyPI Account in order to publish to PyPI.

Building & Pushing Docker Images on AWS ECR

A. Locally

In order to use this package on AWS infrastructure, we must first build & push Docker images. There are two separate Dockerfiles, one for training and one for inference. Run ./sm-containers/train/make.sh or ./sm-containers/inference/make.sh respectively to build the corresponding image on your host (including installing the url-image-module Python library) and push it to AWS ECR, where SageMaker can pull it. Make sure the correct .env file is in the root of the url-image-module repo.

B. Using CodeBuild on AWS SageMaker

If you want to build the image & push it to ECR using CodeBuild, there is a notebook, ./sm-containers/make-docker-image.ipynb, that can be run on a SageMaker instance to build the containers using CodeBuild. If a SageMaker instance for building images does not exist, make a new one with a volume of at least 40 GB. Once inside the instance, select the GitHub icon in the top-right corner of the file menu on the left. It will prompt you for the HTTPS link to the repo you want to add, which you can find under the 'Clone' button of the GitLab repository. It will then prompt you for credentials, which you can find under Settings > Repository > Deploy Tokens; it should be the only token with username gitlab+deploy-token-1070133.

Note that in either case you will need a .env file with the deploy token credentials in the root of the url-image-module repo when building the containers. Please contact url_googleai@mit.edu to get this .env file.

Notes

Installing torch & torchvision with pipenv is a bit of a hassle. This GitHub post was helpful in figuring it out. To install torch & torchvision versions with both CPU & GPU capabilities using pipenv, run:

  1. pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.9.0"
  2. pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torchvision==0.10.0"
