IrisML
Proof of Concept for a simple framework to create an ML pipeline.
Features
- Run ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.
Getting started
Installation
Prerequisite: Python 3.8+
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
Run an example job
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision
# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
Available commands
# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]
# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]
# Manage cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
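For example, a hypothetical invocation (my_pipeline.json and DATASET_NAME are placeholders) that passes an environment variable, disables the cache, and enables verbose logging:
# Pass DATASET_NAME to the pipeline, skip the cache and print verbose logs.
irisml_run my_pipeline.json -e DATASET_NAME=mnist --no_cache -v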
Pipeline definition
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}
TaskDefinition = {
"task": <task module name>,
"name": <optional unique name of the task>,
"inputs": <list of input objects>,
"config": <config for the task. Use irisml_show command to find the available configurations.>
}
In the TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name> This variable will be replaced by the environment variable that was provided as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name> This variable will be replaced by the outputs of the specified previous task.
An exception is raised at runtime if the specified variable is not found.
If a task raises an exception, the tasks specified in the on_error field will be executed. The exception object will be assigned to the "$env.IRISML_EXCEPTION" variable.
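As an illustrative sketch only (the input/output field names and config values below are hypothetical, and inputs are shown as a mapping from input field names to values; use irisml_show to check the actual fields of each task), a pipeline using these variables might look like this:
{
  "tasks": [
    {
      "task": "load_torchvision_dataset",
      "name": "dataset",
      "config": {"name": "$env.DATASET_NAME"}
    },
    {
      "task": "train",
      "inputs": {"train_dataset": "$outputs.dataset.train_dataset"},
      "config": {"num_epochs": 10}
    }
  ],
  "on_error": [
    {"task": "print", "inputs": {"data": "$env.IRISML_EXCEPTION"}}
  ]
}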
Pipeline cache
Using the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs/configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.
To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.
To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
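For example, a hypothetical setup that uses an Azure Blob Storage container with a SAS token as the cache store (the URL is a placeholder):
# Point IrisML at the cache container, then run the pipeline as usual.
export IRISML_CACHE_URL='https://<account>.blob.core.windows.net/<container>?<sas_token>'
irisml_run docs/examples/mobilenetv2_mnist_training.json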
Python API
To run a pipeline from Python code, you can use the following API.
import json
import pathlib
from irisml.core import JobRunner
job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)
runner.run({'DATASET_NAME': 'mnist'})
runner.run({'DATASET_NAME': 'cifar10'})
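The dictionary passed to run() supplies the values that the pipeline references as $env.<variable_name>, equivalent to the -e option of irisml_run.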
Available official tasks
To show the detailed help for each task, run the following command after installing the package.
irisml_show <task_name>
irisml-tasks
Task | Description |
---|---|
assertion | Assert the given input. |
assign_class_to_strings | Assigns a class to a string based on the class name being present in the string. |
branch | 'If' conditional branch. |
calculate_cosine_similarity | Calculate cosine similarity between two sets of vectors. |
check_model_parameters | Check Inf/NaN values in model parameters. |
compare | Compare two values |
deserialize_tensor | Deserialize a pytorch tensor. |
divide_float | Floating point division. |
download_azure_blob | Download a single blob from Azure Blob Storage. |
extract_image_bytes_from_dataset | Extract images from a dataset and convert them to bytes. |
get_current_time | Get the current time in seconds since the epoch |
get_dataset_split | Get a train/val split of a dataset. |
get_dataset_stats | Get statistics of a dataset. |
get_dataset_subset | Get a subset of a dataset. |
get_fake_image_classification_dataset | Generate a fake image classification dataset. |
get_fake_object_detection_dataset | Generate a fake object detection dataset. |
get_int_from_json_strings | Get an integer from a JSON string. |
get_item | Get an item from the given list. |
get_kfold_cross_validation_dataset | Get train/test dataset for k-fold cross validation. |
get_secret_from_azure_keyvault | Get a secret from Azure KeyVault. |
get_topk | Get the largest Topk values and indices. |
join_filepath | Join a given dir_path and a filename. |
load_state_dict | Load a state_dict from various sources. |
make_cached_dataset | Save dataset cache on disk. |
make_prompt_for_each_string | Make a prompt for each string. |
make_prompt_with_strings | Make a prompt with a list of strings. |
pickling_object | Pickling an object. |
print | Print or Pretty Print the input object. |
print_environment_info | Print various environment information to stdout/stderr. |
read_file | Reads a file and returns its contents as bytes. |
repeat_tasks | Repeat the given tasks multiple times. |
run_parallel | Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name. |
run_profiler | Run profiler on the given tasks. |
run_sequential | Run the given tasks in sequence. Each task must have a unique name. |
save_file | Save the given input binary to a file. |
save_images_from_dataset | Save images from a dataset to disk. |
save_state_dict | Save the model's state_dict to the specified file. |
search_grid_sequential | Grid search hyperparameters. Tasks are run in sequence. |
serialize_tensor | Serialize a pytorch tensor. |
switch_pick | Pick from vals based on conditions. The task returns the first val whose condition is True. |
upload_azure_blob | Upload a binary file to Azure Storage Blob. |
irisml-tasks-training
This package contains tasks related to pytorch training.
Task | Description |
---|---|
append_classifier | Append a classifier model to a given model. A predictor and a loss module will be added, too. |
benchmark_dataset | Benchmark dataset loading and preprocessing |
benchmark_model | Benchmark a given model using a given dataset. |
benchmark_model_with_grad_cache | Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching. |
build_classification_prompt_dataset | Create a classification prompt dataset. |
build_zero_shot_classifier | Create a zero-shot classification layer. |
concatenate_datasets | Concatenate the given two datasets together. |
create_classification_prompt_generator | Create a prompt generator for a classification task. |
evaluate_accuracy | Calculate accuracy of the given prediction results. |
evaluate_detection_average_precision | Calculate mean average precision for object detection task results. |
exclude_negative_samples_from_classification_dataset | Exclude negative samples from classification dataset. |
export_onnx | Export the given model as ONNX. |
get_targets_from_dataset | Extract only targets from a given Dataset. |
load_simple_classification_dataset | Load a simple classification dataset from a directory of images and an index file. |
make_classification_dataset_from_object_detection | Convert an object detection dataset into a classification dataset. |
make_classification_dataset_from_predictions | Make a classification dataset from predictions. |
make_feature_extractor_model | Make a wrapper model to extract a feature vector from a vision model. |
make_fixed_prompt_image_transform | Make a transform function for image and a fixed prompt. |
make_image_text_contrastive_model | Make a model for image-text contrastive training. |
make_image_text_transform | Make a transform function for image-text classification. |
make_oversampled_dataset | Make an oversampled dataset. |
num_iters_to_epochs | Convert number of iterations to number of epochs. Min value is 1. |
predict | Predict using a given model. |
remove_empty_images_from_dataset | Remove empty images from dataset. |
sample_few_shot_dataset | Few-shot sampling of an IC/OD dataset. |
split_image_text_model | Split an image-text model into an image model and a text model. |
train | Train a pytorch model. |
train_with_gradient_cache | Train a model using gradient cache. Useful for contrastive learning with a large model. |
irisml-tasks-torchvision
Adapter tasks for the torchvision library.
Task | Description |
---|---|
create_torchvision_model | Create a torchvision model. |
create_torchvision_transform | Create transform objects in torchvision library. |
load_torchvision_dataset | Load a dataset from torchvision package. |
irisml-tasks-transformers
Adapter tasks for the HuggingFace transformers library.
Task | Description |
---|---|
create_transformers_model | Create a model using the transformers library. |
create_transformers_tokenizer | Create a tokenizer using the transformers library. |
irisml-tasks-timm
Adapter tasks for models in the timm library.
Task | Description |
---|---|
create_timm_model | Create a model using the timm library. |
create_timm_transform | Create a preprocessing function using the timm library. |
irisml-tasks-onnx
Adapter tasks for the OnnxRuntime library.
Task | Description |
---|---|
predict_onnx | Run inference for an ONNX model. |
irisml-tasks-azureml
Task | Description |
---|---|
run_azureml_child | Run tasks as a new child AzureML Run. |
add_aml_tag | Tag the AML Run with a string key and optional value. |
irisml-tasks-fiftyone
Task | Description |
---|---|
launch_fiftyone | Launch a fiftyone interface. |
Development
Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core
class Task(irisml.core.TaskBase): # The class name must be "Task".
VERSION = '1.0.0'
CACHE_ENABLED = True # (default: True) This is optional.
@dataclasses.dataclass
class Inputs: # You can remove this class if the task doesn't require inputs.
int_value: int
float_value: float
@dataclasses.dataclass
class Config: # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
another_float: float
child_dataclass: dataclass # If you'd like to define a nested config, you can define another dataclass.
@dataclasses.dataclass
class Outputs: # Can be removed if the task doesn't have outputs.
float_value: float = 0 # If dry_run() is not implemented, Outputs fields must have default value or default factory.
def execute(self, inputs: Inputs) -> Outputs:
return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)
def dry_run(self, inputs: Inputs) -> Outputs: # This method is optional.
return self.Outputs(0) # Must return immediately without actual processing.
Each Task must define an "execute" method. The base class has an empty implementation for Inputs, Config, Outputs and dry_run(). For details, please see the documentation for the TaskBase class.
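Once the module is importable as irisml.tasks.my_custom_task, the task can be referenced from a pipeline by its module name. The fragment below is an illustrative sketch only: it assumes literal input values are accepted and omits the child_dataclass config field for brevity; run irisml_show my_custom_task to confirm the exact input and config formats.
{
  "tasks": [
    {
      "task": "my_custom_task",
      "inputs": {"int_value": 3, "float_value": 1.5},
      "config": {"another_float": 2.0}
    }
  ]
}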
Related repositories