Simple ML pipeline platform

IrisML

A proof of concept for a simple framework for building ML pipelines.

Features

  • Run ML training or inference with a simple JSON configuration.
  • Modularized interfaces for task components.
  • Cache task outputs for faster experiments.

Getting started

Installation

Prerequisite: Python 3.8+

# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training

Run an example job

# Install additional packages that are required for the example
pip install irisml-tasks-torchvision

# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json

Available commands

# Run the specified pipeline. Environment variables passed with the "-e" option are accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]

# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]

# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]

Pipeline definition

PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}

TaskDefinition = {
    "task": <task module name>,
    "name": <optional unique name of the task>,
    "inputs": <list of input objects>,
    "config": <config for the task. Use irisml_show command to find the available configurations.>
}

In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.

  • $env.<variable_name> is replaced by the environment variable that was passed as an argument to the irisml_run command.
  • $outputs.<task_name>.<field_name> is replaced by the output of the specified earlier task.

An exception is raised at runtime if a specified variable is not found.
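The substitution rules above can be illustrated with a minimal sketch. This is not IrisML's implementation: the resolve_variables function, the load_data task name, and the config fields are all hypothetical, and the sketch only handles values that consist entirely of a variable reference.

```python
def resolve_variables(value, env, outputs):
    """Recursively substitute $env.* and $outputs.*.* strings in a JSON-like value.

    Illustrative only; IrisML's own resolver may behave differently.
    """
    if isinstance(value, dict):
        return {key: resolve_variables(item, env, outputs) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_variables(item, env, outputs) for item in value]
    if isinstance(value, str) and value.startswith("$env."):
        name = value[len("$env."):]
        if name not in env:
            raise KeyError(f"Environment variable not found: {name}")
        return env[name]
    if isinstance(value, str) and value.startswith("$outputs."):
        parts = value.split(".", 2)
        if len(parts) != 3:
            raise ValueError(f"Expected $outputs.<task_name>.<field_name>, got: {value}")
        _, task_name, field_name = parts
        if task_name not in outputs or field_name not in outputs[task_name]:
            raise KeyError(f"Output not found: {task_name}.{field_name}")
        return outputs[task_name][field_name]
    return value  # Plain values pass through unchanged.


# Hypothetical environment, previous-task outputs, and task config.
env = {"DATASET_NAME": "mnist"}
outputs = {"load_data": {"num_classes": 10}}
config = {"dataset": "$env.DATASET_NAME", "layer_sizes": [64, "$outputs.load_data.num_classes"]}

resolved = resolve_variables(config, env, outputs)
print(resolved)
```

A missing name, such as "$env.MISSING", raises KeyError in this sketch, mirroring the runtime exception described above.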

If a task raises an exception, the tasks listed in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
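The control flow of on_error can be sketched as follows. This is a toy model under stated assumptions: tasks are plain Python callables rather than task definitions, run_with_on_error is a hypothetical helper, and whether the real runner re-raises the exception after the handlers finish is not covered here.

```python
def run_with_on_error(tasks, on_error, env):
    """Run callables in order; if one fails, run the on_error callables.

    Returns the caught exception, or None if every task succeeded.
    Illustrative only; not IrisML's runner.
    """
    env = dict(env)  # Copy so the caller's dict is left untouched.
    try:
        for task in tasks:
            task(env)
    except Exception as e:
        # Expose the exception to the handlers, as $env.IRISML_EXCEPTION does.
        env["IRISML_EXCEPTION"] = e
        for handler in on_error:
            handler(env)
        return e
    return None


log = []

def ok(env):
    log.append("ok")

def fail(env):
    raise ValueError("boom")

def skipped(env):
    log.append("skipped")  # Never reached because the previous task fails.

def report(env):
    log.append(f"error: {env['IRISML_EXCEPTION']}")

result = run_with_on_error([ok, fail, skipped], [report], {})
print(log)
```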

Pipeline cache

With the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs and configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.
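The idea of hash-keyed caching can be sketched like this. It is an assumption-laden illustration, not IrisML's actual scheme: task_hash and run_with_cache are hypothetical helpers, the inclusion of the task name and version in the hash is an assumption, an in-memory dict stands in for the storage, and inputs are assumed to be JSON-serializable.

```python
import hashlib
import json


def task_hash(task_name, version, inputs, config):
    """Hash a task's identity, inputs, and config into a stable hex digest."""
    payload = json.dumps(
        {"task": task_name, "version": version, "inputs": inputs, "config": config},
        sort_keys=True,  # Key order must not change the hash.
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(storage, task_name, version, inputs, config, execute):
    """Return cached outputs when the hash matches; otherwise execute and store."""
    key = task_hash(task_name, version, inputs, config)
    if key in storage:
        return storage[key]  # Cache hit: skip execution.
    outputs = execute(inputs, config)
    storage[key] = outputs
    return outputs


storage = {}  # Stand-in for Azure Blob Storage or a local directory.
calls = []

def multiply(inputs, config):
    calls.append((inputs, config))
    return {"value": inputs["x"] * config["factor"]}

first = run_with_cache(storage, "multiply", "1.0.0", {"x": 3}, {"factor": 2}, multiply)
second = run_with_cache(storage, "multiply", "1.0.0", {"x": 3}, {"factor": 2}, multiply)
third = run_with_cache(storage, "multiply", "1.0.0", {"x": 3}, {"factor": 5}, multiply)
print(len(calls))
```

The second call reuses the stored outputs, while changing the config in the third call produces a different hash and triggers a fresh execution.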

To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.

To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.

Python API

To run a pipeline from Python code, you can use the following API.

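import json
import pathlib
from irisml.core import JobRunner

# Load the pipeline definition from a JSON file.
job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)

# Run the same pipeline twice, passing a different DATASET_NAME each time.
runner.run({'DATASET_NAME': 'mnist'})

runner.run({'DATASET_NAME': 'cifar10'})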
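For context, a pipeline file that such a runner could consume might be produced as below. This is a hedged sketch: it assumes the dictionary passed to run() supplies the values for $env variables, and the "name" config field of load_torchvision_dataset is an assumption (use irisml_show to find the real fields). The file is written to a temporary directory to avoid touching the working directory.

```python
import json
import pathlib
import tempfile

# Hypothetical pipeline definition following the PipelineDefinition layout.
pipeline = {
    "tasks": [
        {
            "task": "load_torchvision_dataset",
            "name": "dataset",
            # The "name" config field is assumed; check `irisml_show load_torchvision_dataset`.
            "config": {"name": "$env.DATASET_NAME"},
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "example.json"
    path.write_text(json.dumps(pipeline, indent=2))
    # Reading the file back yields the dict that would be handed to JobRunner.
    job_description = json.loads(path.read_text())

print(job_description["tasks"][0]["config"])
```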

Available official tasks

To show the detailed help for each task, run the following command after installing the package.

irisml_show <task_name>

irisml-tasks

Task Description
assertion Assert the given input.
assign_class_to_strings Assigns a class to a string based on the class name being present in the string.
branch 'If' conditional branch.
calculate_cosine_similarity Calculate cosine similarity between two sets of vectors.
check_model_parameters Check Inf/NaN values in model parameters.
compare Compare two values
deserialize_tensor Deserialize a pytorch tensor.
divide_float Floating point division.
download_azure_blob Download a single blob from Azure Blob Storage.
extract_image_bytes_from_dataset Extract images from a dataset and convert them to bytes.
get_current_time Get the current time in seconds since the epoch
get_dataset_split Get a train/val split of a dataset.
get_dataset_stats Get statistics of a dataset.
get_dataset_subset Get a subset of a dataset.
get_fake_image_classification_dataset Generate a fake image classification dataset.
get_fake_object_detection_dataset Generate a fake object detection dataset.
get_int_from_json_strings Get an integer from a JSON string.
get_item Get an item from the given list.
get_kfold_cross_validation_dataset Get train/test dataset for k-fold cross validation.
get_secret_from_azure_keyvault Get a secret from Azure KeyVault.
get_topk Get the largest Topk values and indices.
join_filepath Join a given dir_path and a filename.
load_state_dict Load a state_dict from various sources.
make_cached_dataset Save dataset cache on disk.
make_prompt_for_each_string Make a prompt for each string.
make_prompt_with_strings Make a prompt with a list of strings.
pickling_object Pickling an object.
print Print or Pretty Print the input object.
print_environment_info Print various environment information to stdout/stderr.
read_file Reads a file and returns its contents as bytes.
repeat_tasks Repeat the given tasks for multiple times.
run_parallel Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name.
run_profiler Run profiler on the given tasks.
run_sequential Run the given tasks in sequence. Each task must have a unique name.
save_file Save the given input binary to a file.
save_images_from_dataset Save images from a dataset to disk.
save_state_dict Save the model's state_dict to the specified file.
search_grid_sequential Grid search hyperparameters. Tasks are run in sequence.
serialize_tensor Serialize a pytorch tensor.
switch_pick Pick from vals based on conditions. The task returns the first val whose condition is True.
upload_azure_blob Upload a binary file to Azure Storage Blob.

irisml-tasks-training

This package contains tasks related to PyTorch training.

Task Description
append_classifier Append a classifier model to a given model. A predictor and a loss module will be added, too.
benchmark_dataset Benchmark dataset loading and preprocessing
benchmark_model Benchmark a given model using a given dataset.
benchmark_model_with_grad_cache Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching.
build_classification_prompt_dataset Create a classification prompt dataset.
build_zero_shot_classifier Create a zero-shot classification layer.
concatenate_datasets Concatenate the given two datasets together.
create_classification_prompt_generator Create a prompt generator for a classification task.
evaluate_accuracy Calculate accuracy of the given prediction results.
evaluate_detection_average_precision Calculate mean average precision for object detection task results.
exclude_negative_samples_from_classification_dataset Exclude negative samples from classification dataset.
export_onnx Export the given model as ONNX.
get_targets_from_dataset Extract only targets from a given Dataset.
load_simple_classification_dataset Load a simple classification dataset from a directory of images and an index file.
make_classification_dataset_from_object_detection Convert an object detection dataset into a classification dataset.
make_classification_dataset_from_predictions Make a classification dataset from predictions.
make_feature_extractor_model Make a wrapper model to extract a feature vector from a vision model.
make_fixed_prompt_image_transform Make a transform function for image and a fixed prompt.
make_image_text_contrastive_model Make a model for image-text contrastive training.
make_image_text_transform Make a transform function for image-text classification.
make_oversampled_dataset Make an oversampled dataset.
num_iters_to_epochs Convert number of iterations to number of epochs. Min value is 1.
predict Predict using a given model.
remove_empty_images_from_dataset Remove empty images from dataset.
sample_few_shot_dataset Few-shot sampling of an IC/OD dataset.
split_image_text_model Split an image-text model into an image model and a text model.
train Train a pytorch model.
train_with_gradient_cache Train a model using gradient cache. Useful for contrastive learning with a large model.

irisml-tasks-torchvision

Adapter tasks for torchvision library.

Task Description
create_torchvision_model Create a torchvision model.
create_torchvision_transform Create transform objects in torchvision library.
load_torchvision_dataset Load a dataset from torchvision package.

irisml-tasks-transformers

Adapter tasks for HuggingFace transformers library.

Task Description
create_transformers_model Create a model using the transformers library.
create_transformers_tokenizer Create a tokenizer using the transformers library.

irisml-tasks-timm

Adapter for models in timm library.

Task Description
create_timm_model Create a model using the timm library.
create_timm_transform Create a preprocessing function using the timm library.

irisml-tasks-onnx

Adapter tasks for OnnxRuntime library.

Task Description
predict_onnx Run inference for an ONNX model.

irisml-tasks-azureml

Task Description
run_azureml_child Run tasks as a new child AzureML Run.
add_aml_tag Tag the AML Run with a string key and optional value.

irisml-tasks-fiftyone

Task Description
launch_fiftyone Launch a fiftyone interface.

Development

Create a new task

To create a Task, you must define a module that contains a "Task" class. Here is a simple example:

# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Illustrative nested config; the class and field names are examples only.
  scale: float = 1.0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
  VERSION = '1.0.0'
  CACHE_ENABLED = True  # (default: True) This is optional.

  @dataclasses.dataclass
  class Inputs:  # You can remove this class if the task doesn't require inputs.
    int_value: int
    float_value: float

  @dataclasses.dataclass
  class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
    another_float: float
    child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

  @dataclasses.dataclass
  class Outputs:  # Can be removed if the task doesn't have outputs.
    float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

  def execute(self, inputs: Inputs) -> Outputs:
    # Multiply both inputs by the configured factor.
    return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

  def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
    return self.Outputs(0)  # Must return immediately without actual processing.

Each Task must define an "execute" method. The base class provides empty implementations of Inputs, Config, Outputs, and dry_run(). For details, see the documentation for the TaskBase class.

Related repositories
