ContentAI Activity Classification Service

activity-classifier-extractor

Generates activity classifications from low-level feature inputs in support of analytic workflows within the ContentAI Platform, published as the extractor dsai_activity_classifier.

  1. Getting Started

  2. Execution

  3. Creating Models

  4. Testing

  5. Future Development

  6. Changes

Getting Started

This library is used as a single-run executable.
Runtime parameters can be passed to configure the returned results; they can be examined in more detail in the main script. A sketch of how the post-processing parameters interact follows this list.
  • verbose - (bool) - verbose input/output configuration printing (default=false)

  • path_content - (str) - input video path for files to label (default=video.mp4)

  • path_result - (str) - output path for samples (default=.)

  • path_models - (str) - manifest path for model information (default=data/models/manifest.json)

  • time_interval - (float) - time interval for predictions from models (default=3.0)

  • average_predictions - (bool) - flatten predictions across time and class (default=false)

  • round_decimals - (int) - rounding decimals for predictions (default=5)

  • score_min - (float) - apply a minimum score threshold for classes (default=0.1)
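
As a rough sketch of how the post-processing parameters interact (the variable names here are illustrative, not the package's actual internals):

import numpy as np

# toy scores: 4 time windows x 3 classes
scores = np.random.rand(4, 3)

average_predictions = True
round_decimals = 5
score_min = 0.1

if average_predictions:
    scores = scores.mean(axis=0)             # flatten across time
scores = np.round(scores, round_decimals)    # apply round_decimals
# suppress low-scoring classes; the package's exact handling may differ
scores = np.where(scores >= score_min, scores, 0.0)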

dependencies

To install package dependencies on a fresh system, the recommended technique is a set of
vanilla pip packages. The latest requirements should be validated from the requirements.txt file and installed as follows.
pip install --no-cache-dir -r requirements.txt

Execution and Deployment

This package is meant to be run as a one-off processing tool that aggregates the insights of other extractors.

command-line standalone

Run the code as if it were an extractor. In this mode, configure a few environment variables to let the code know where to look for content.

One can also run the command line with a single argument as input and optionally add runtime configuration (see runtime variables) as part of the EXTRACTOR_METADATA variable as JSON.

EXTRACTOR_METADATA='{"compressed":true}'
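
As a sketch of how such metadata could be consumed inside an extractor, using only the standard library (the actual parsing lives in the main script and may differ):

import json
import os

# read runtime configuration from the environment, defaulting to empty
metadata = json.loads(os.environ.get("EXTRACTOR_METADATA", "{}"))
verbose = bool(metadata.get("verbose", False))
compressed = bool(metadata.get("compressed", False))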

Locally Run Classifier on Results

For utility, the above line has been wrapped in the bash script run_local.sh.

./run_local.sh <docker_image> [<source_directory> <output_data_dir> [<json_args>]] [<all_args>]
   - run activity classification on source with prior processing

  <docker_image> = 0 IF local command-line based (args using arg parse)
                 = 1 IF local docker emulation
                 = IMAGE_NAME IF docker image name to run

  ./run_local.sh 0 --path_content features/ --path_result results/ --verbose
  ./run_local.sh 1 features/ results/ 0 '{\"verbose\":true}'

Through all of the above examples, the underlying command-line execution is similar to this execution run on the testing data.

python -u activity_classifier/main.py --path_content testing/data/launch/video.mp4
        --path_result testing/class --path_models activity_classifier/data/models/manifest.json --verbose

Feature-Based Similarity

A helper script is also available to compute the similarity of clips in one or more feature files. (v1.1.0)

python -u activity_classifier/features.py --path_content testing/data/dummy.txt \\
        --feature_type dsai_videocnn dsai_vggish --path_result testing/dist
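
The underlying idea is pairwise distances between per-clip feature vectors; a minimal sketch (the script's actual metric and file handling are not shown here):

import numpy as np
from scipy.spatial.distance import cdist

# toy per-clip feature vectors: 10 clips, 512 dimensions each
feats = np.random.rand(10, 512)

# pairwise cosine distances between clips (10 x 10 matrix)
dist = cdist(feats, feats, metric="cosine")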

ContentAI

Deployment

Deployment is easy and follows standard ContentAI steps.

contentai deploy dsai_activity_classifier
Deploying...
writing workflow.dot
done

Alternatively, you can pass an image name to avoid rebuilding the docker image.

docker build -t dsai_activity_classifier .
contentai deploy dsai_activity_classifier

Locally Downloading Results

You can locally download data from a specific job for this extractor to directly analyze.

contentai data wHaT3ver1t1s --dir data

Run as an Extractor

contentai run https://bucket/video.mp4  -w 'digraph { dsai_videocnn -> dsai_activity_classifier; dsai_vggish -> dsai_activity_classifier }'

JOB ID:     1Tfb1vPPqTQ0lVD1JDPUilB8QNr
CONTENT:    s3://bucket/video.mp4
STATE:      complete
START:      Fri Feb 15 04:38:05 PM (6 minutes ago)
UPDATED:    1 minute ago
END:        Fri Feb 15 04:43:04 PM (1 minute ago)
DURATION:   4 minutes

EXTRACTORS

my_extractor

TASK      STATE      START           DURATION
724a493   complete   5 minutes ago   1 minute

Or run it via the docker image. Please review the run_local.sh file for more information.

View Extractor Logs (stdout)

contentai logs -f <my_extractor>
my_extractor Fri Nov 15 04:39:22 PM writing some data
Job complete in 4m58.265737799s

Creating Models

There are two steps to adding new models.

  1. First, train the models and formulate a well-known structure (this can be done exhaustively across a number of model types).

  2. Update the manifest according to the structure described in Adding Model Assets below; it indicates how the activity classifier should load the model (e.g. the framework), the required features, and a few descriptive fields (e.g. the name and the id). A sketch of manifest-driven loading follows this list.
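
A minimal sketch of manifest-driven loading, assuming the field names from the manifest example in Adding Model Assets and that model paths are relative to the manifest's directory (the package's loader internals may differ):

import gzip
import json
import os
import pickle

manifest_path = "activity_classifier/data/models/manifest.json"
with open(manifest_path) as f:
    manifest = json.load(f)   # a list of model entries

base = os.path.dirname(manifest_path)
for entry in manifest:
    # only sklearn-framework entries are handled in this sketch
    if entry.get("framework") == "sklearn":
        with gzip.open(os.path.join(base, entry["path"]), "rb") as fh:
            model = pickle.load(fh)
        print("loaded", entry["name"], "using", entry["video"], "+", entry["audio"])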

Exhaustive Training

To ease standardized training across different models, several scripting options have been created; they are described below. Coupled with the manifest file described above, one can choose the best individual model (via cross-validation analysis) for every label or dataset configuration, yet combine them all into a unified output with this package.

Binary Models

Also referred to as one-vs-all models, binary models offer only two outputs.

(label format to be described here)

Models may be trained, tested, and saved as follows, using modeling.py.

python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5
                    -p Logistic-performance.csv -m models/Logistic

The preceding run used a LogisticRegression estimator with its default settings, but customized settings may be supplied directly on the command line. The following is a ‘tuned’ Logistic estimator which has better overall performance.

python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5
                    -p Logistic_tuned-performance.csv -m models/LogisticTuned
                    --est_params '{"C":4.0,"random_state":0,"max_iter":500,"class_weight":"balanced","solver":"lbfgs"}'
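
Presumably --est_params decodes straight into estimator keyword arguments; a sketch of that assumed mapping:

import json
from sklearn.linear_model import LogisticRegression

# the JSON string passed via --est_params becomes constructor kwargs
params = json.loads('{"C":4.0,"random_state":0,"max_iter":500,'
                    '"class_weight":"balanced","solver":"lbfgs"}')
clf = LogisticRegression(**params)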

In fact, the estimator itself may be specified on the command line as well, and if it already exists (such as the estimators in sklearn), no additional coding is necessary. For example, the following uses a Multi-Layer Perceptron estimator.

python modeling.py  -b labelset_1 -l completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5
                    -p MLP-performance.csv -m models/MLP  --estimator sklearn.neural_network.MLPClassifier --est_params '{"max_iter":500}'
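
Resolving an estimator from a dotted path like sklearn.neural_network.MLPClassifier needs no extra coding; a sketch of the assumed mechanism behind --estimator:

import importlib

dotted = "sklearn.neural_network.MLPClassifier"
module_name, class_name = dotted.rsplit(".", 1)
estimator_cls = getattr(importlib.import_module(module_name), class_name)
clf = estimator_cls(max_iter=500)   # kwargs as passed via --est_params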

Custom estimators may need to specify the ‘framework’ so that they can be saved and loaded properly. This example is a keras-based estimator with an sklearn-like wrapper.

python modeling.py  -b labelset_1 -l labels/video/completions/ -v content/video_feature.hdf5 -a content/audio_feature.hdf5
                    -p NL-performance.csv -m models/NL --framework wrapped --estimator nine_layers.NineLayerEstimator
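
A minimal sketch of what such an sklearn-like wrapper might look like (the real nine_layers.NineLayerEstimator is not shown here; this only illustrates the assumed fit/predict_proba interface):

import numpy as np

class WrappedEstimator:
    """Sketch of an sklearn-style interface around a custom (e.g. keras) model."""

    def __init__(self, **params):
        self.params = params
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # ... build and train the underlying network here ...
        return self

    def predict_proba(self, X):
        # placeholder output: uniform probability for each class
        n = len(self.classes_)
        return np.full((len(X), n), 1.0 / n)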

The script collate.py collects the statistics from all the runs and formats them for easier (web) viewing.

python collate.py models
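
The collation step amounts to concatenating the per-run performance CSVs; a hypothetical sketch using pandas (the file layout and output format are assumptions):

import glob

import pandas as pd

# gather every per-run performance CSV under the models directory
paths = glob.glob("models/**/*-performance.csv", recursive=True)
report = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
report.to_html("models/performance.html")   # easier (web) viewing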

Multiclass Models

Models can be built with other labels using the same framework as above but in a multi-line format.

(format to be described here)

python modeling.py  --label_type multi-csv -l emote_data.csv -a emotions_audio.hdf5 -v emotions_video.hdf5
                    -p Logistic-performance.csv -m models/Logistic

As above, other estimator types may be specified on the command line:

python modeling.py --label_type multi-csv -l emote_data.csv -a emotions_audio.hdf5 -v emotions_video.hdf5
                   -p MLP500-performance.csv -m models/MLP500 --estimator sklearn.neural_network.MLPClassifier --est_params '{"max_iter":500}'

As above, collate.py may be used to format the results.

Adding Model Assets

Adding models to the pre-determined set of models is as easy as editing a manifest file and adding a model into git LFS.

  1. Archive the new model into a serialized fileset. At the time of writing, this meant serializing models from sklearn with simple pickle load/save serialization (see the serialization sketch after this list).

  2. Gather all of the relevant output files and compress them if you can. Currently, the library understands gzip compression extensions (e.g. “.gz”).

  3. Choose the appropriate sub-directory that corresponds to the upstream feature extractor. For example, models built on 3dcnn features correspond to the extractor dsai_3dcnn, which processes new videos via extractor chaining. If one doesn’t exist yet, please create a new directory, but remember what combination of audio and video features is required.

  4. Modify the manifest file in activity_classifier/data/models/manifest.json for your new entry. Specifically, the input video and audio features must be defined as well as the serialization library. Below is an example block that indicates 3dcnn video and vggish audio features for a model created with sklearn, where prediction results will be nested under the name Running.

    [ ...
    {
        "path": "3dcnn-vggish/lr-Running.pkl.gz",
        "name": "Running",
        "id": "ugc",
        "framework": "sklearn",
        "video": "dsai_videocnn",
        "audio": "dsai_vggish"
    },
    ... ]
  5. Prepare to add your model files to the repo. NOTE This repo uses git-lfs (https://git-lfs.github.com/) to store all binary files like models. If your model is added with regular git tools alone, you will get a sternly worded email (and friendly advice on how to re-add correctly).

    (from the base directory only)
    git lfs track activity_classifier/data/models/3dcnn/moonwalk_model.pkl.gz
    git add activity_classifier/data/models/3dcnn/moonwalk_model.pkl.gz
    git add activity_classifier/data/models/manifest.json
  6. Test your model with the data in the testing directory. The CI/CD process should do this too, but it’s always easier to find and fix problems here than with a vague email. The features in this directory came from processing of the HBO Max Launch Video, which is publicly available as a reference.

    (from the base directory)
    
    ./run_local.sh 0 --path_content testing/data/test.mp4 --time_interval 1.5
    
    (check for predictions from your new model in data.json)
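
A minimal sketch of steps 1 and 2, saving an sklearn model as a gzip-compressed pickle (toy data only; the real training happens in modeling.py):

import gzip
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-in for a real training set
X = np.random.rand(32, 8)
y = np.random.randint(0, 2, size=32)
model = LogisticRegression().fit(X, y)

# gzip-compressed pickle, matching the ".pkl.gz" naming in the manifest
with gzip.open("lr-Running.pkl.gz", "wb") as f:
    pickle.dump(model, f)

# loading mirrors saving
with gzip.open("lr-Running.pkl.gz", "rb") as f:
    restored = pickle.load(f)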

Testing

Testing is included via tox. To launch testing for the entire package, just run tox at the command line. Testing can also be run for a specific file within the package by setting the environment variable TOX_ARGS.

TOX_ARGS=test_basic.py tox

Future Development

  • additional training hooks?

Changes

Generates activity classifications from low-level feature inputs in support of analytic workflows within the ContentAI Platform.

1.2

1.2.0

  • switch to package representation, push to pypi

  • several updates for MANIFEST definition (id)

  • inclusion of multi-parameter training and testing framework

  • safety for model loading, catch exceptions, return gracefully

  • update documents to split for binary models

1.1

1.1.1

  • cosmetic change for reuse in other libraries

1.1.0

  • refactor feature code, add utility for difference computation among segments

  • min value thresholding to avoid low scoring results in output (default=0.1)

  • refactor caching information for feature load (allow flatten, remove cache, allow multi-asset)

  • allow recursive feature load for distance compute

1.0

1.0.2

  • fixes for output, modify to require other extractors as dependencies

  • fix order of parameters for local runs

1.0.1

  • updates for integration of other models, fixes for prediction output

  • add l2norm after average/merge in time of source features

1.0.0

  • initial project merge from other sources

  • generates json prediction dict

  • callable as package

  • includes some testing routines with windowing comparison
