Image Module of REACT

Project description

Urban Risk Lab (URL) Image Analysis Module

The goal of this module is to use effective Convolutional Neural Network (CNN) models to yield efficient and accurate predictions from image data in crowdsourced crisis reports, providing quick categorizations that can be aggregated into a summary of the unfolding crisis event. In addition to utilities for conducting training and inference and saving those results, it provides analysis tools for image annotation, model performance, and associated plotting.

This project is compatible with Python version >= 3.6.

Instructions to Install

Using PyPI -- latest version of the package on PyPI

pip install url-image-module

Using GitLab Credentials -- most recent commit

  1. Get the .env file by requesting it from url_googleai@mit.edu; use the subject line [Read Credentials URL Image Module GitLab] and describe your plans for using it.
  2. Load variables into the environment: source <path_to_.env>
  3. Run pip install -e git+https://$GITLAB_TOKEN_USER:$GITLAB_TOKEN@gitlab.com/react76/url-image-module.git@master#egg=url-image-module

How to use in Python

At the moment, all classes, constants, and functions can be imported at the root level of the package, like so:

from url_image_module import (
    ...
)
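
For example, using names that are documented in the sections below:

from url_image_module import (
    PredictionImageDataset,
    PretrainedImageCNN,
    PRETRAINED_MODELS_DICT,
    OPTIMIZER_DICT,
)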

Package Structure & Utilities

This module provides various utilities for conducting reproducible experiments on crowdsourced crisis report images. These utilities include:

Training, Testing, and Prediction with PyTorch Models
  • training.py - Contains utilities for training a model for multiple epochs on a train split of image data & validating on a dev split at each epoch. Applies online data augmentation as a form of regularization during training (see the sketch after this list).

  • testing.py - Contains utilities for testing a trained model on a set of labeled images & saving those test results.

  • prediction.py - Contains utilities for using a trained model to predict on a folder of images located on the host and for creating a dataframe of prediction metadata (i.e. predicted label, prediction scores).

  • classes.py - Defines classes & helper functions used across the package:

    • Creating image datasets for prediction (see PredictionImageDataset)
    • Instantiating pretrained PyTorch models (pretrained on ImageNet) (see PretrainedImageCNN)
    • Dictionary of possible pretrained single label model architectures (see PRETRAINED_MODELS_DICT)
    • Dictionary of possible optimizer algorithms (see OPTIMIZER_DICT)
    • Helpers for constructing the correct model architecture, loading pretrained weights from a .pt file, and constructing an optimizer object with a user-specified learning rate & the correct weights to update.
  • constants.py - Defines various constants used throughout the package, including:

    • Constants necessary for transforming images prior to being inputted into the model (for training, this includes constants for data augmentation techniques)
    • Constants for consistent naming conventions used throughout the package (e.g. TRAIN_SPLIT, DEV_SPLIT, TEST_SPLIT)
    • Dictionary of loss criteria used for training a model (see CRITERION_DICT)
    • Evaluation metrics used to assess the performance of a model (see EVALUATION_METRICS_FUNC_DICT)
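
As a rough illustration of the items above that mention transforms (online augmentation for the train split, fixed normalization constants), here is a minimal sketch using plain torchvision; the names and parameter values are illustrative, not this package's actual API:

from torchvision import transforms

# Standard ImageNet normalization constants used with pretrained models
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Train split: random crops & flips act as online augmentation (regularization)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Dev/test splits: deterministic resize & crop, no augmentation
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
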
Operating System, PyTorch, and Pandas Utilities
  • os_utils.py - Utilities for interacting with the host's operating system, i.e. the filesystem:

    • Making/deleting directories
    • Copying files
    • Extracting filepaths
    • Updating filepaths
  • pd_utils.py - Utilities for interacting with a pandas dataframe (df) including:

    • Copying files from one location on host to another using information stored in a df
    • Subsetting columns in a df to a user-provided relevant subset
    • Cleaning df of empty or partially-empty rows
    • Left-joining dfs by filenames
    • Saving df as CSV on host's filesystem
  • pt_utils.py - Utilities for interacting with PyTorch (see the sketch after this list), including:

    • Naming a file with a proper PyTorch extension (.pt)
    • Determining the appropriate device (i.e. CPU or GPU) to put tensors on
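
A minimal sketch of the kind of helpers pt_utils.py provides; the function names here are hypothetical, not this package's actual API:

import torch

def get_device() -> torch.device:
    # Hypothetical helper: prefer the GPU when one is available
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def with_pt_extension(filename: str) -> str:
    # Hypothetical helper: ensure the standard PyTorch '.pt' extension
    return filename if filename.endswith(".pt") else f"{filename}.pt"
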
Data Labeling & Annotation Analysis Utilities
  • data_labeling_utils.py - Utilities for conducting annotation efforts & performing interannotator agreement analysis -- agnostic to data type (i.e. works for images & text):
    • Creating CSV for annotating a folder of unlabeled data
    • Methods for assessing type of agreement on a single data point, i.e. complete agreement, complete disagreement, plurality agreement, etc.
    • Methods for computing statistics for a labeled dataset (see the sketch after this list), including:
      • Number of unique labels provided for a task
      • Plurality agreement percentage
      • Complete agreement percentage
      • Fleiss' Kappa coefficient
      • Cohen's Kappa coefficient (weighted/unweighted)
    • Methods for ground-truthing a dataset i.e. by plurality label
    • Methods for wrangling a dataframe of labels (i.e. melting) and making column names consistent.
    • Methods for reviewing data labeling, i.e. making directory of all data points which had complete agreement, plurality agreement but not complete agreement, etc.
    • Methods for reviewing predictions by a model and contrasting them against ground-truth labels.
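
As a rough illustration of the agreement statistics listed above, here is a minimal sketch using standard libraries (statsmodels & scikit-learn) rather than this package's own helpers; the labels are hypothetical:

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical annotations: rows are data points, columns are annotators
labels = np.array([
    ["flood", "flood", "flood"],
    ["flood", "fire",  "flood"],
    ["fire",  "fire",  "fire"],
])

# Fleiss' kappa across all annotators
table, _ = aggregate_raters(labels)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Cohen's kappa between a single pair of annotators
print("Cohen's kappa:", cohen_kappa_score(labels[:, 0], labels[:, 1]))
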
Plotting Utilities
  • plotting_utils.py - Utilities for producing visualizations for conducting analysis:
    • Generating confusion matrices for visualizing ground-truth labels vs. model predictions (see the sketch after this list)
    • Plot for visualizing performance of a model on each epoch of training on both train & dev sets, i.e. learning curves
    • Plot for visualizing model performance on each class of a task
    • EDA Plot on labeled datasets -- useful for visualizing class imbalance prior to modeling
    • Plot for Annotation Analysis showing number of images in a dataset which have at least n or more unique annotators who provided a label for the image on that task -- useful for determining a cutoff for interannotator analysis and ground-truthing a dataset
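
A minimal sketch of the confusion-matrix visualization described above, using scikit-learn & matplotlib directly rather than this package's wrappers; the labels are hypothetical:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["flood", "flood", "fire", "none"]  # hypothetical ground-truth labels
y_pred = ["flood", "fire", "fire", "none"]   # hypothetical model predictions

# Plot ground truth vs. predictions as a confusion matrix
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
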
Python Programs -- Data Labeling & Creating ImageFolders
  • create_image_split_folders.py - Python program which constructs image data folders for train, dev, & test splits using corresponding label files (CSV, TSV, etc.) which provide filenames and labels for each split, and saves these splits to a destination folder on the host.

  • make_image_labeling_csv.py - Python program which creates a labeling CSV for various classification tasks using filenames located in a directory on the host.

Miscellaneous Utilities
  • model_utils.py - Utilities for saving & loading trained model weights and other model metadata (i.e. hyperparameters, training settings, classes for the task, etc.) for future use, constructing the correct architecture for a model, and extracting outputs from a model for analysis (see the sketch below)

  • metric_utils.py - Utilities for computing metric scores (Precision, Recall, F1, etc.) & confusion matrices by comparing ground truth labels against model predictions

  • misc_utils.py - Miscellaneous utilities which are useful across the package (see prettify_underscore_string)
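
A minimal sketch, assuming nothing about this package's API, of the two ideas behind model_utils.py and metric_utils.py: checkpointing model weights together with metadata in a single .pt file, and computing metric scores from predictions. The model, classes, and labels here are hypothetical:

import torch
from sklearn.metrics import precision_recall_fscore_support
from torchvision.models import resnet18

# Hypothetical model & task classes, purely for illustration
model = resnet18()
classes = ["fire", "flood", "none"]

# Save weights plus metadata (hyperparameters, classes, etc.) in one file
torch.save({
    "state_dict": model.state_dict(),
    "classes": classes,
    "learning_rate": 1e-4,
}, "model.pt")

# Later: rebuild the same architecture, then load the saved weights
checkpoint = torch.load("model.pt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])

# Score hypothetical predictions against ground-truth labels
y_true = ["fire", "flood", "none", "flood"]
y_pred = ["fire", "flood", "fire", "flood"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, average="macro"
)
print(precision, recall, f1)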

For Maintainers

Updating GitLab Repository

To add all modified files, commit them, push to the GitLab repo, and update the repo with the changes and tag number, run:

sh update.sh -t <tag> -m <commit message>
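
For example, with a hypothetical tag & commit message:

sh update.sh -t 0.27.1 -m "Add plotting utilities"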

When updating dependencies, make sure to use:

  1. pipenv install <name-of-package>
  2. Update requirements.txt: pipenv run pip freeze > requirements.txt
  3. Commit & push with the update command above

Adding New Files to Python Package

If you want to add a file which contains new functionality, i.e. functionality that merits its own file separate from the existing ones, you must add it to the __init__.py file, like so:

You can do the following to import specific functions, classes, etc. from the file into the Python package. Anything that isn't imported can't be used by the end user.

in __init__.py (specific imports):
from .name_of_new_file import (
   specific_function_you_want_to_import,
   specific_class_you_want_to_import,
   ...
)
...
del name_of_new_file  # remove the submodule reference from the package namespace

If you want all functionality from the file to be available to the end-user, do the following:

in __init__.py (import everything):
from .name_of_new_file import *
...
del name_of_new_file

Publish Package to PyPI

  1. Launch virtual environment with pipenv shell
  2. Install dependencies with pipenv install
  3. Run python setup.py bdist_wheel sdist. To test, run:
    1. Run pip install -e .
    2. Run python
    3. Run import url_image_module -- should give no errors if it's working properly
  4. Run twine upload dist/*. Note: You will need login credentials for the URL PyPI Account in order to publish to PyPI.

Building & Pushing Docker Images on AWS ECR

A. Locally

In order to use this package on AWS infrastructure, we must first build & push Docker images. There are two separate Dockerfiles, one for training and one for inference. Run ./sm-containers/train/make.sh or ./sm-containers/inference/make.sh respectively to build the corresponding image on your host (including installing the url-image-module Python library) and push it to AWS ECR, where SageMaker can pull it. Make sure the correct .env file is in the root of the url-image-module repo.

B. Using CodeBuild on AWS SageMaker

If you want to build the image & push it to ECR using CodeBuild, there is a notebook, ./sm-containers/make-docker-image.ipynb, that can be run on a SageMaker instance to build the containers using CodeBuild. If a SageMaker instance for building images does not exist, make a new one with a volume of at least 40 GB. Once inside the instance, select the GitHub icon in the top-right corner of the file menu on the left. It will prompt you for the HTTPS link to the repo you want to add, which you can find under the 'Clone' button of the GitLab repository. It will then prompt you for credentials, which you can find under Settings > Repository > Deploy Tokens; it should be the only token with username gitlab+deploy-token-1070133.

Note that in either case you will need a .env file with the deploy token credentials in the root of the url-image-module repo when building the containers. Please contact url_googleai@mit.edu to get this .env file.

Notes

Installing torch & torchvision with pipenv is a bit of a hassle. This GitHub post was helpful in figuring it out. To install torch & torchvision versions with both CPU & GPU capabilities using pipenv, run:

  1. pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.9.0"
  2. pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torchvision==0.10.0"
