# PySpacer

Spatial image analysis with PyTorch and Caffe backends.
PySpacer (also known as spacer) provides utilities to extract features from random point locations in images and then train classifiers over those features. It is used in the vision backend of [CoralNet](https://github.com/coralnet/coralnet).

Spacer currently supports Python >= 3.8.
## Installation

The spacer repo can be installed in three ways:

- Pip install -- for integration with other Python projects.
- Local clone -- ideal for testing and development.
- From the Dockerfile -- the only option that supports Caffe, which is used for the legacy feature extractor.
## Config

Setting spacer config variables is only necessary when using certain features. If you don't need S3 storage, and you won't load extractors remotely, you can skip this section.

See `CONFIGURABLE_VARS` in `config.py` for a full list of available variables, and for an explanation of when each variable must be configured.
Spacer's config variables can be set in any of the following ways:

- As environment variables; recommended if you `pip install` the package. Each variable name must be prefixed with `SPACER_`:

  ```shell
  export SPACER_AWS_ACCESS_KEY_ID='YOUR_AWS_KEY_ID'
  export SPACER_AWS_SECRET_ACCESS_KEY='YOUR_AWS_SECRET_KEY'
  export SPACER_AWS_REGION='us-west-2'
  export SPACER_EXTRACTORS_CACHE_DIR='/your/cache'
  ```

- In a `secrets.json` file in the same directory as this README; recommended for Docker builds and local clones. Example `secrets.json` contents:

  ```json
  {
    "AWS_ACCESS_KEY_ID": "YOUR_AWS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "YOUR_AWS_SECRET_KEY",
    "AWS_REGION": "us-west-2",
    "EXTRACTORS_CACHE_DIR": "/your/cache"
  }
  ```

- As a Django setting; recommended for a Django project that uses spacer. Example code in a Django settings module:

  ```python
  SPACER = {
      'AWS_ACCESS_KEY_ID': 'YOUR_AWS_KEY_ID',
      'AWS_SECRET_ACCESS_KEY': 'YOUR_AWS_SECRET_KEY',
      'AWS_REGION': 'us-west-2',
      'EXTRACTORS_CACHE_DIR': '/your/cache',
  }
  ```
Spacer supports the following schemes for combining multiple settings sources:

- Check environment variables first, then use `secrets.json` as a fallback.
- Check environment variables first, then use Django settings as a fallback.

However, spacer will not read from multiple file-based settings sources; if a `secrets.json` file is present, then spacer will not check for Django settings as a fallback.
To debug your configuration, open a Python shell and run:
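```python
from spacer import config
config.check()
```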
## Docker build

The Docker build is used in coralnet's deployment.

- Install Docker on your system.
- Clone this repo to a local folder; let's say it's `/your/local/pyspacer`.
- Set up configuration as detailed above.
- Choose a local folder for caching extractor files; let's say it's `/your/local/cache`.
- Build the image, using the clone folder as the build context: `docker build -f /your/local/pyspacer/Dockerfile -t myimagename /your/local/pyspacer`
- Run: `docker run -v /your/local/cache:/workspace/cache -v /your/local/pyspacer:/workspace/spacer -it myimagename`
  - The `-v /your/local/cache:/workspace/cache` part ensures that all runs use the same cache folder of your host storage.
  - The `-v /your/local/pyspacer:/workspace/spacer` part mounts your local spacer clone (including `secrets.json`, if used) so that the container has the right permissions.
  - Overall, this runs the default CMD command specified in the Dockerfile (unit tests with coverage). If you want to enter the Docker container, run the same command but append `bash` at the end.
## Pip install

- `pip install pyspacer`
- Set up configuration. `secrets.json` isn't an option here, so either use environment variables, or Django settings if you have a Django project.
## Local clone

- Clone this repo.
- If using Windows: turn Git's `autocrlf` setting off before your initial checkout. Otherwise, pickled classifiers in `spacer/tests/fixtures` will get checked out with `\r\n` newlines, and the pickle module will fail to load them, leading to test failures. However, `autocrlf` should be left on when adding any new non-pickle files.
- `pip install -r requirements.txt`
- Set up configuration.
## Code overview

Spacer executes tasks as defined in messages. The message types are defined in `messages.py` and the tasks in `tasks.py`. Several data types which can be used for input and output serialization are defined in `data_classes.py`.

For examples of how to create spacer tasks, refer to the Core API section below, and the unit tests in `test_tasks.py`.

Tasks can be executed directly by calling the methods in `tasks.py`. However, spacer also supports an interface with AWS Batch, handled by `env_job()` in `mailman.py`.

Spacer supports four storage types: `s3`, `filesystem`, `memory`, and `url`. Refer to `storage.py` for details. The memory storage is mostly used for testing, and the url storage is read-only.

`config.py` defines configurable variables/settings and various constants.
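For illustration, here's a sketch of `DataLocation` usage for each storage type. This is not from spacer's docs: the positional `(storage_type, key)` pattern follows the Core API examples below, and passing `bucket_name` as a keyword argument is an assumption based on the train_classifier notes.

```python
from spacer.messages import DataLocation

# Filesystem storage: the key is a local path.
fs_loc = DataLocation('filesystem', '/path/to/features.json')

# S3 storage: the key is the object key; bucket_name is assumed here to be
# an optional keyword argument.
s3_loc = DataLocation('s3', 'features/image1.featurevector',
                      bucket_name='my-bucket')

# URL storage: read-only; the key is the URL itself (placeholder URL).
url_loc = DataLocation('url', 'https://example.com/image1.jpg')

# Memory storage: mostly used for testing.
mem_loc = DataLocation('memory', 'my_temp_key')
```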
## Feature extractors

The first step when analyzing an image, or preparing an image as training data, is extracting features from the image. Therefore, you need a feature extractor to use spacer, but spacer does not provide one out of the box.

Spacer's `extract_features.py` provides the Python classes `EfficientNetExtractor`, for loading EfficientNet extractors in PyTorch format (CoralNet 1.0's default extraction scheme), and `VGG16CaffeExtractor`, for loading VGG16 extractors in Caffe format (CoralNet's legacy extraction scheme).

You'll either want to match one of these schemes so you can use the provided classes, or write your own extractor class which inherits from the base class `FeatureExtractor`. Of the two provided classes, `EfficientNetExtractor` will probably be the easier one to use, because Caffe is old software which is more complicated to install.

If you're loading the extractor files remotely (from S3 or from a URL), the files will be automatically cached to your configured `EXTRACTORS_CACHE_DIR` for faster subsequent loads.
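For instance, a remotely-loaded extractor might look like the following sketch (the weights URL is a placeholder; the `data_locations` pattern matches the Core API examples below):

```python
from spacer.extract_features import EfficientNetExtractor
from spacer.messages import DataLocation

# Weights loaded over HTTP: the file is downloaded on first use and then
# cached in EXTRACTORS_CACHE_DIR for faster subsequent loads.
# The URL below is a placeholder, not a real weights file.
extractor = EfficientNetExtractor(
    data_locations=dict(
        weights=DataLocation('url', 'https://example.com/efficientnet_weights.pt'),
    ),
)
```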
## Core API

The `tasks.py` module has four functions which comprise the main interface of pyspacer:

### extract_features

Takes an image, a list of pixel locations on that image, and a feature extractor. Produces a single feature vector out of the image data at those pixel locations. Example:
```python
from spacer.extract_features import EfficientNetExtractor
from spacer.messages import DataLocation, ExtractFeaturesMsg
from spacer.tasks import extract_features

message = ExtractFeaturesMsg(
    # This token is purely for your bookkeeping; you may find it useful if you
    # choose to track tasks by saving these task messages. For example, you
    # can make the token something that uniquely identifies the input image.
    job_token='image1',
    # Instantiated feature extractor. Each extractor class defines the
    # data_locations which must be specified. In EfficientNetExtractor's case,
    # a PyTorch 'weights' file is required.
    extractor=EfficientNetExtractor(
        data_locations=dict(
            weights=DataLocation('filesystem', '/path/to/weights.pt'),
        ),
    ),
    # (row, column) tuples specifying pixel locations in the image.
    # Note that row is y, column is x.
    rowcols=[(2200, 1000), (1400, 1500), (3000, 450)],
    # Where the input image should be read from.
    image_loc=DataLocation('filesystem', '/path/to/image1.jpg'),
    # Where the feature vector should be output to.
    # CoralNet uses a custom .featurevector extension for these, but the
    # format is just JSON.
    feature_loc=DataLocation('filesystem', '/path/to/image1.featurevector'),
)
return_message = extract_features(message)
print("Feature vector stored at: /path/to/image1.featurevector")
print(f"Extraction runtime: {return_message.runtime:.1f} s")
```
### train_classifier

Takes:

- Feature vectors, each vector corresponding to a set of pixel locations in one image
- Ground-truth (typically human-confirmed) annotations corresponding to those feature vectors
- Optionally, previously-created classifiers to re-evaluate with these annotations
- Training parameters

Produces a classifier (model) loadable in scikit-learn, and classifier evaluation results. Example:
```python
from spacer.data_classes import ImageLabels
from spacer.messages import DataLocation, TrainClassifierMsg
from spacer.tasks import train_classifier

message = TrainClassifierMsg(
    # For your bookkeeping.
    job_token='classifier1',
    # 'minibatch' is currently the only trainer that spacer defines.
    trainer_name='minibatch',
    # How many iterations the training algorithm should run; more epochs
    # = more opportunity to converge to a better fit, but slower.
    nbr_epochs=10,
    # Classifier types available:
    # 1. 'MLP': multi-layer perceptron; newer classifier type for CoralNet
    # 2. 'LR': logistic regression; older classifier type for CoralNet
    clf_type='MLP',
    # Point-locations to ground-truth-labels (annotations) mapping. Used for
    # training the classifier.
    # The data dict-keys must be the same as the `key` used in the
    # extract-features task's `feature_loc`.
    # The data dict-values are lists of tuples of
    # (row, column, label ID). You'll need to be tracking a mapping of
    # integer label IDs to the labels you use.
    train_labels=ImageLabels(data={
        '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
        '/path/to/image2.featurevector': [(1000, 2000, 3), (3000, 2000, 1)],
        '/path/to/image3.featurevector': [(1234, 2857, 11), (3094, 2262, 25)],
    }),
    # Point-locations to ground-truth-labels mapping. Used for evaluating
    # the classifier's accuracy. Should be disjoint from train_labels.
    # CoralNet uses a 7-to-1 ratio of train_labels to val_labels.
    val_labels=ImageLabels(data={
        '/path/to/image4.featurevector': [(500, 2500, 1), (2500, 1500, 3)],
        '/path/to/image5.featurevector': [(4321, 5582, 25), (4903, 2622, 19)],
    }),
    # All the feature vectors should use the same storage_type, and the same
    # S3 bucket_name if applicable. This DataLocation's purpose is to describe
    # those common storage details. The key arg is ignored, because that will
    # be different for each feature vector.
    features_loc=DataLocation('filesystem', ''),
    # List of previously-created models (classifiers) to also evaluate
    # using this dataset, for informational purposes only.
    # A classifier is stored as a pickled CalibratedClassifierCV.
    previous_model_locs=[
        DataLocation('filesystem', '/path/to/oldclassifier1.pkl'),
        DataLocation('filesystem', '/path/to/oldclassifier2.pkl'),
    ],
    # Where the new model (classifier) should be output to.
    model_loc=DataLocation('filesystem', '/path/to/classifier1.pkl'),
    # Where the detailed evaluation results of the new model should be stored.
    valresult_loc=DataLocation('filesystem', '/path/to/valresult.json'),
)
return_message = train_classifier(message)
print("Classifier stored at: /path/to/classifier1.pkl")
print("Evaluation results stored at: /path/to/valresult.json")
print(f"New model's accuracy (0.0 = 0%, 1.0 = 100%): {return_message.acc}")
print(f"Previous models' accuracies: {return_message.pc_accs}")
print(
    "New model's accuracy progression (calculated on part of train_labels)"
    f" after each epoch of training: {return_message.ref_accs}")
print(f"Training runtime: {return_message.runtime:.1f} s")
```
Evaluation results consist of three arrays:

- `gt`: Ground-truth label IDs (which were passed in as `val_labels`) for each point.
- `est`: Estimated (classifier-predicted) label IDs for each point.
- `scores`: Classifier's confidence scores (0.0 = 0%, 1.0 = 100%) for each estimated label ID.

The ith elements of `gt`, `est`, and `scores` correspond to each other, but the elements are otherwise in an undefined order.

Accuracy is defined as the percentage of `gt` labels that match the corresponding `est` labels.
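As a sketch of how these arrays relate, here's how accuracy could be recomputed from the valresult file; this assumes the JSON stores the three parallel arrays under the keys `gt`, `est`, and `scores`, which is not explicitly stated above:

```python
import json

# Load the detailed evaluation results written by train_classifier.
with open('/path/to/valresult.json') as f:
    valresult = json.load(f)

# Assumed key names for the parallel arrays.
gt, est = valresult['gt'], valresult['est']

# Accuracy: the fraction of points whose ground-truth label matches the
# classifier's estimated label.
accuracy = sum(g == e for g, e in zip(gt, est)) / len(gt)
print(f"Accuracy: {accuracy:.2%}")
```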
### classify_features

Takes a feature vector (representing points in an image) to classify, and a classifier trained on the same type of features (EfficientNet or VGG16). Produces prediction results (scores) for the image points, as posterior probabilities for each class. Example:
```python
from spacer.messages import DataLocation, ClassifyFeaturesMsg
from spacer.tasks import classify_features

message = ClassifyFeaturesMsg(
    # For your bookkeeping.
    job_token='image1',
    # Where the input feature-vector should be read from.
    feature_loc=DataLocation('filesystem', '/path/to/image1.featurevector'),
    # Where the classifier should be read from.
    classifier_loc=DataLocation('filesystem', '/path/to/classifier1.pkl'),
)
return_message = classify_features(message)
print(f"Classification runtime: {return_message.runtime:.1f} s")
print(f"Classes (recognized labels): {return_message.classes}")
print(
    "Classifier's scores for each point in the feature vector;"
    " scores are posterior probabilities of each class, with classes"
    " ordered as above:")
for row, col, scores in return_message.scores:
    print(f"Row {row}, column {col}: {scores}")
```
The label which has the highest score for a particular point (row-column position) can be considered the classifier's predicted label for that point.
One possible usage strategy is to trust the classifier's predictions for points where the highest confidence score is above a certain threshold, such as 0.8 (80%), and have human annotators check all other points.
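A minimal sketch of that strategy, reusing `return_message` from the example above (the threshold value and the review handling are illustrative choices, not part of spacer's API):

```python
# Trust predictions whose top confidence score clears the threshold;
# flag everything else for human review.
CONFIDENCE_THRESHOLD = 0.8

for row, col, scores in return_message.scores:
    # scores are ordered the same way as return_message.classes.
    best_index, best_score = max(enumerate(scores), key=lambda pair: pair[1])
    best_label = return_message.classes[best_index]
    if best_score >= CONFIDENCE_THRESHOLD:
        print(f"({row}, {col}): accept predicted label {best_label}")
    else:
        print(f"({row}, {col}): flag for human review")
```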
### classify_image

This basically does `extract_features` and `classify_features` together in one go, without needing to specify a storage location for the feature vector.

Takes an image, a list of pixel locations on that image, a feature extractor, and a classifier. Produces prediction results (scores) for the image points, as posterior probabilities for each class. Example:
```python
from spacer.extract_features import EfficientNetExtractor
from spacer.messages import DataLocation, ClassifyImageMsg
from spacer.tasks import classify_image

message = ClassifyImageMsg(
    # For your bookkeeping.
    job_token='image1',
    # Where the input image should be read from.
    image_loc=DataLocation('filesystem', '/path/to/image1.jpg'),
    # Instantiated feature extractor.
    extractor=EfficientNetExtractor(
        data_locations=dict(
            weights=DataLocation('filesystem', '/path/to/weights.pt'),
        ),
    ),
    # (row, column) tuples specifying pixel locations in the image.
    # Note that row is y, column is x.
    rowcols=[(2200, 1000), (1400, 1500), (3000, 450)],
    # Where the classifier should be read from.
    classifier_loc=DataLocation('filesystem', '/path/to/classifier1.pkl'),
)
return_message = classify_image(message)
print(f"Runtime: {return_message.runtime:.1f} s")
print(f"Classes (recognized labels): {return_message.classes}")
print(
    "Classifier's scores for each point in rowcols;"
    " scores are posterior probabilities of each class, with classes"
    " ordered as above:")
for row, col, scores in return_message.scores:
    print(f"Row {row}, column {col}: {scores}")
```
## Code coverage

If you are using the Docker build or a local clone, you can check code coverage like so:

```shell
coverage run --source=spacer --omit=spacer/tests/* -m unittest
coverage report -m
coverage html
```