Simple ML pipeline platform
IrisML
Proof of concept for a simple framework to create an ML pipeline.
Features
- Run ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.
Getting started
Installation
Prerequisite: Python 3.8+
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
Run an example job
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision
# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
Available commands
# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]
# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]
# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
Pipeline definition
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}
TaskDefinition = {
"task": <task module name>,
"name": <optional unique name of the task>,
"inputs": <list of input objects>,
"config": <config for the task. Use irisml_show command to find the available configurations.>
}
In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name> This variable will be replaced by the environment variable that was provided as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name> This variable will be replaced by the output of the specified previous task.
An exception is raised at runtime if the specified variable is not found.
If a task raises an exception, the tasks specified in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
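The substitution rules above can be illustrated with a small, self-contained sketch. This is a toy resolver written for this document, not IrisML's implementation: the function name `resolve`, the task name `load_data`, and the field name `num_classes` are all hypothetical, and it assumes that a variable occupies the whole string value.

```python
import re
from typing import Any, Dict

# Patterns follow the variable syntax described above (assumed to span the whole string).
_ENV_PATTERN = re.compile(r"^\$env\.(\w+)$")
_OUTPUTS_PATTERN = re.compile(r"^\$outputs\.(\w+)\.(\w+)$")


def resolve(value: Any, env: Dict[str, Any], outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Recursively replace $env.* and $outputs.*.* strings in a nested structure."""
    if isinstance(value, dict):
        return {key: resolve(item, env, outputs) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, env, outputs) for item in value]
    if isinstance(value, str):
        match = _ENV_PATTERN.match(value)
        if match:
            name = match.group(1)
            if name not in env:
                raise KeyError(f"Environment variable not found: {name}")
            return env[name]
        match = _OUTPUTS_PATTERN.match(value)
        if match:
            task_name, field_name = match.groups()
            if task_name not in outputs or field_name not in outputs[task_name]:
                raise KeyError(f"Output not found: {task_name}.{field_name}")
            return outputs[task_name][field_name]
    # Anything else (numbers, plain strings, None) is returned unchanged.
    return value


# Hypothetical config fragment and previous-task outputs, for illustration only.
config = {"dataset": "$env.DATASET_NAME", "values": ["$outputs.load_data.num_classes", 3]}
resolved = resolve(config, {"DATASET_NAME": "mnist"}, {"load_data": {"num_classes": 10}})
print(resolved)
```

Raising on a missing name mirrors the runtime exception described above; the exception type used by IrisML itself is not stated here, so `KeyError` is only a stand-in.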
Pipeline cache
With the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs/configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.
To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.
To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
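The hash-then-skip idea can be sketched in a few lines. The exact hashing scheme IrisML uses is not described here, so everything below is an assumption made for illustration: SHA-256 over a JSON dump, including the task version in the key, a plain dictionary standing in for the storage, and the helper names `cache_key` and `run_with_cache`.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


def cache_key(task_name: str, version: str, inputs: Any, config: Any) -> str:
    """Build a deterministic key from everything that affects the task outputs."""
    payload = json.dumps(
        {"task": task_name, "version": version, "inputs": inputs, "config": config},
        sort_keys=True,  # Key order must not change the hash.
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(
    storage: Dict[str, Any],
    task_name: str,
    version: str,
    inputs: Any,
    config: Any,
    execute: Callable[[Any, Any], Any],
) -> Tuple[Any, bool]:
    """Return (outputs, cache_hit). The task is executed only on a cache miss."""
    key = cache_key(task_name, version, inputs, config)
    if key in storage:
        return storage[key], True
    outputs = execute(inputs, config)
    storage[key] = outputs
    return outputs, False


# Hypothetical task body that counts how often it really runs.
call_count = 0


def multiply(inputs: Dict[str, float], config: Dict[str, float]) -> float:
    global call_count
    call_count += 1
    return inputs["value"] * config["factor"]


storage: Dict[str, Any] = {}
first = run_with_cache(storage, "multiply", "1.0.0", {"value": 2.0}, {"factor": 3.0}, multiply)
second = run_with_cache(storage, "multiply", "1.0.0", {"value": 2.0}, {"factor": 3.0}, multiply)
changed = run_with_cache(storage, "multiply", "1.0.0", {"value": 2.0}, {"factor": 4.0}, multiply)
print(first, second, changed, call_count)
```

Only the second call is served from the dictionary. Changing the config produces a different key, so the task runs again, which is the behavior that makes editing and re-running a pipeline cheap.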
Python API
To run a pipeline from python code, you can use the following APIs.
import json
import pathlib
from irisml.core import JobRunner
job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)
runner.run({'DATASET_NAME': 'mnist'})
runner.run({'DATASET_NAME': 'cifar10'})
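The dictionary passed to run() plays the same role as the "-e" options of irisml_run. When a script receives those options as `NAME=value` strings, a small helper can turn them into that dictionary. The helper name `parse_env_options` is hypothetical, and splitting on the first `=` only is a design choice made for this sketch, not documented IrisML behavior.

```python
from typing import Dict, Iterable


def parse_env_options(options: Iterable[str]) -> Dict[str, str]:
    """Convert NAME=value strings into a dict suitable for runner.run()."""
    env: Dict[str, str] = {}
    for option in options:
        # Split on the first "=" so that values may themselves contain "=".
        name, separator, value = option.partition("=")
        if not separator or not name:
            raise ValueError(f"Expected NAME=value, got: {option!r}")
        env[name] = value
    return env


env = parse_env_options(["DATASET_NAME=mnist", "QUERY=a=b"])
print(env)

# With IrisML installed, the dictionary could then be passed to the runner:
# runner = JobRunner(job_description)
# runner.run(env)
```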
Available official tasks
To show the detailed help for each task, run the following command after installing the package.
irisml_show <task_name>
irisml-tasks
Task | Description |
---|---|
assertion | Assert the given input. |
assign_class_to_strings | Assigns a class to a string based on the class name being present in the string. |
branch | 'If' conditional branch. |
calculate_cosine_similarity | Calculate cosine similarity between two sets of vectors. |
check_model_parameters | Check Inf/NaN values in model parameters. |
compare | Compare two values |
compare_ints | Compare two int values. |
convert_detection_to_multilabel | Convert targets or predictions of object detection to multilabel. |
convert_string_to_string_list | Convert a string to a list of strings. |
deserialize_tensor | Deserialize a pytorch tensor. |
divide_float | Floating point division. |
download_azure_blob | Download a single blob from Azure Blob Storage. |
emulate_fp8_quantization | Emulate FP8 quantization. |
extract_image_bytes_from_dataset | Extract images from a dataset and convert them to bytes. |
get_current_time | Get the current time in seconds since the epoch |
get_dataset_split | Get a train/val split of a dataset. |
get_dataset_stats | Get statistics of a dataset. |
get_dataset_subset | Get a subset of a dataset. |
get_fake_image_classification_dataset | Generate a fake image classification dataset. |
get_fake_image_text_classification_dataset | Generate a fake image-text classification dataset. |
get_fake_object_detection_dataset | Generate a fake object detection dataset. |
get_fake_phrase_grounding_dataset | Generate a fake phrase grounding dataset. |
get_fake_visual_question_answering_dataset | Generate a fake visual question answering dataset. |
get_int_from_json_strings | Get an integer from a JSON string. |
get_int_list_from_json_strings | Get a list of ints from a JSON string. |
get_item | Get an item from the given list. |
get_key_and_int_list_from_json_string | Parse a JSON string and return a list of keys and a list of lists of ints. |
get_kfold_cross_validation_dataset | Get train/test dataset for k-fold cross validation. |
get_secret_from_azure_keyvault | Get a secret from Azure KeyVault. |
get_topk | Get the largest Topk values and indices. |
join_filepath | Join a given dir_path and a filename. |
join_two_strings | Join two strings to one string. |
load_coco_detections | Load coco detections from a JSON to a list of tensors. |
load_float_tensor_jsonl | Load a 2D float tensor from a JSONL file. |
load_state_dict | Load a state_dict from various sources. |
load_str_list_jsonl | Load a list of strings from a JSONL file. |
load_strs_from_json_file | Load strings from a JSON file. |
load_tensor_list | Load a list of tensors from file. |
make_cached_dataset | Save dataset cache on disk. |
make_prompt_for_each_string | Make a prompt for each string. |
make_prompt_list_with_strings | Make a list of prompts from a template and a list of strings. |
make_prompt_with_strings | Make a prompt with a list of strings. |
make_random_choice_text_transform | Make a text transform function that randomly chooses one of the substrings separated by the delimiter. |
make_text_transform | Make a text transform function. |
map_int_list | Map a list of integers to a list of integers. |
pickling_object | Pickling an object. |
print | Print or Pretty Print the input object. |
print_environment_info | Print various environment information to stdout/stderr. |
read_file | Reads a file and returns its contents as bytes. |
repeat_tasks | Repeat the given tasks for multiple times. |
run_parallel | Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name. |
run_profiler | Run profiler on the given tasks. |
run_sequential | Run the given tasks in sequence. Each task must have a unique name. |
save_file | Save the given input binary to a file. |
save_float_tensor_jsonl | Save a 2D float tensor to a JSONL file. |
save_images_from_dataset | Save images from a dataset to disk. |
save_jit_model | Save an offline version of a pytorch model. torch.jit.save() |
save_state_dict | Save the model's state_dict to the specified file. |
save_str_list_jsonl | Save a list of strings to a JSONL file. |
search_grid_sequential | Grid search hyperparameters. Tasks are run in sequence. |
serialize_tensor | Serialize a pytorch tensor. |
split_string | Split string to a list of strings. |
switch_pick | Pick from vals based on conditions. The task returns the first val whose condition is True. |
upload_azure_blob | Upload a binary file to Azure Storage Blob. |
upload_azure_blob_directory | Upload a directory to Azure Blob Storage. |
irisml-tasks-training
This package contains tasks related to pytorch training.
Task | Description |
---|---|
append_classifier | Append a classifier model to a given model. A predictor and a loss module will be added, too. |
benchmark_dataset | Benchmark dataset loading and preprocessing |
benchmark_model | Benchmark a given model using a given dataset. |
benchmark_model_with_grad_cache | Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching. |
build_classification_prompt_dataset | Create a classification prompt dataset. |
build_zero_shot_classifier | Create a zero-shot classification layer. |
concatenate_datasets | Concatenate the given two datasets together. |
convert_vqa_dataset_to_image_text_classification_dataset | Convert VQA dataset to image text classification dataset. |
create_classification_prompt_generator | Create a prompt generator for a classification task. |
create_prompt_generator | Create a prompt generator that returns a list of prompts for a given label. |
evaluate_accuracy | Calculate accuracy of the given prediction results. |
evaluate_captioning | Evaluate captioning prediction results. |
evaluate_detection_average_precision | Calculate mean average precision for object detection task results. |
evaluate_phrase_grounding | Calculate precision/recall for phrase grounding. |
evaluate_phrase_grounding_recall | Calculate recall for phrase grounding. |
evaluate_string_matching_accuracy | Calculate accuracy of string matching. |
exclude_negative_samples_from_classification_dataset | Exclude negative samples from classification dataset. |
export_coco_from_torch_dataset | Export coco dataset from a given torch dataset. Supports IC and OD only. |
export_onnx | Export the given model as ONNX. |
extract_val_by_key_from_jsonl | Extract value for each entry in a JSONL by a key. |
find_incorrect_classification_indices | Find incorrect classification indices. |
find_incorrect_classification_multilabel_indices | Find incorrect classification indices for multilabel classification. |
flatten_captioning_dataset | Flatten a captioning dataset with multiple targets per image into a dataset with a single target per image. |
get_questions_from_vqa_dataset | Extracts questions from a VQA dataset. |
get_subclass_dataset | Get the sub-dataset with given class ids from a dataset. |
get_targets_from_dataset | Extract only targets from a given Dataset. |
load_jsonl_vqa_dataset | Load a VQA dataset from a jsonl file. |
load_simple_classification_dataset | Load a simple classification dataset from a directory of images and an index file. |
make_classification_dataset_from_object_detection | Convert an object detection dataset into a classification dataset. |
make_classification_dataset_from_predictions | Make a classification dataset from predictions. |
make_detection_dataset_from_predictions | Make a detection dataset from predictions. |
make_feature_extractor_model | Make a wrapper model to extract a feature vector from a vision model. |
make_fixed_prompt_image_transform | Make a transform function for image and a fixed prompt. |
make_fixed_text_dataset | Create a dataset with a list of strings. |
make_image_text_contrastive_model | Make a model for image-text contrastive training. |
make_image_text_transform | Make a transform function for image-text classification. |
make_oversampled_dataset | Make an oversampled dataset. |
make_phrase_grounding_image_transform | Make phrase grounding image transform. |
make_prompt_list_image_transform | Make a transform function for image and prompt list. |
make_vqa_collate_function | Creates a collate_function for Visual Question Answering (VQA) and Phrase Grounding task. |
make_vqa_image_transform | Creates a transform function for VQA task. |
map_classification_predictions_to_detection | Map classification predictions back to detection predictions or targets. |
num_iters_to_epochs | Convert number of iterations to number of epochs. Min value is 1. |
predict | Predict using a given model. |
remove_empty_images_from_dataset | Remove empty images from dataset. |
sample_few_shot_dataset | Few-shot sampling of an IC/OD dataset. |
save_jsonl_vqa_dataset | Save a VQA dataset to a JSONL file. |
split_image_text_model | Split an image-text model into an image model and a text model. |
train | Train a pytorch model. |
train_with_gradient_cache | Train a model using gradient cache. Useful for contrastive learning with a large model. |
irisml-tasks-azure-computervision
Task | Description |
---|---|
create_azure_computervision_caption_model | Create Azure Computer Vision Caption Model. |
create_azure_computervision_classification_model | Create Azure Computer Vision Classification Model. |
create_azure_computervision_custom_model | Create a model that runs inference with a custom model in Azure Computer Vision. |
create_azure_computervision_ocr_model | Create Azure Computer Vision OCR model. |
create_azure_computervision_product_recognizer_model | Create a model that runs inference with a product recognizer model in Azure Computer Vision. |
create_azure_computervision_vectorization_model | Create Azure Computer Vision Vectorization Model. |
delete_azure_computervision_custom_model | Delete Azure Computer Vision Custom Model. |
train_azure_computervision_custom_model | Train Azure Computer Vision Custom Model. |
irisml-tasks-azure-customvision
Task | Description |
---|---|
create_azure_customvision_docker_model | Create a model from an exported Azure Custom Vision Docker image. |
create_azure_customvision_model | Create a prediction model from an Azure Custom Vision project. |
create_azure_customvision_project | Create a new Azure Custom Vision project. |
delete_azure_customvision_project | Delete an Azure Custom Vision project |
export_azure_customvision_model | Export a model from an Azure Custom Vision project. |
train_azure_customvision_project | Train an Azure Custom Vision project. |
irisml-tasks-azure-openai
Task | Description |
---|---|
call_azure_openai_completion | Call Azure OpenAI Text Completion API. |
create_azure_openai_chat_model | Create a model that generates text using Azure OpenAI completion API. |
create_azure_openai_completion_model | Create a model that generates text using Azure OpenAI completion API. |
irisml-tasks-azureml
Task | Description |
---|---|
run_azureml_child | Run tasks as a new child AzureML Run. |
irisml-tasks-fiftyone
Task | Description |
---|---|
launch_fiftyone | Launch a fiftyone app. |
irisml-tasks-llava
Task | Description |
---|---|
create_llava_model | Create a LLaVA model from pretrained weights. |
irisml-tasks-onnx
Adapter tasks for OnnxRuntime library.
Task | Description |
---|---|
benchmark_onnx | Benchmark a given onnx model using onnxruntime. |
predict_onnx | Predict using a given onnx model traced with the export_onnx task. |
irisml-tasks-timm
Adapter for models in timm library.
Task | Description |
---|---|
create_timm_model | Create a timm model. |
create_timm_transform | Create timm transforms. |
irisml-tasks-torchmetrics
Adapter tasks for torchmetrics library.
Task | Description |
---|---|
evaluate_torchmetrics_classification_multiclass | Evaluate predictions results using torchmetrics classification metrics for multiclass classification problems. |
evaluate_torchmetrics_classification_multilabel | Evaluate predictions results using torchmetrics classification metrics for multilabel classification problems. |
irisml-tasks-torchvision
Adapter tasks for torchvision library.
Task | Description |
---|---|
create_torchvision_model | Create a torchvision model. |
create_torchvision_transform | Create transform objects in torchvision library. |
create_torchvision_transform_v2 | Create torchvision transform v2 object from string expressions. |
load_torchvision_dataset | Load a dataset from torchvision package. |
irisml-tasks-transformers
Adapter tasks for HuggingFace transformers library.
Task | Description |
---|---|
cache_transformers_model_on_azure_blob | Cache a model from transformers on Azure Blob Storage. |
create_transformers_model | Create a model using transformers library. |
create_transformers_raw_tokenizer | Create a Tokenizer using transformers library. Return the tokenizer as-is. |
create_transformers_text_model | Create a text-generation model using transformers library. |
create_transformers_tokenizer | Create a Tokenizer using transformers library. |
Development
Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Hypothetical nested config, defined here only so that the example is complete.
    value: int = 0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
    VERSION = '1.0.0'
    CACHE_ENABLED = True  # (default: True) This is optional.

    @dataclasses.dataclass
    class Inputs:  # You can remove this class if the task doesn't require inputs.
        int_value: int
        float_value: float

    @dataclasses.dataclass
    class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
        another_float: float
        child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

    @dataclasses.dataclass
    class Outputs:  # Can be removed if the task doesn't have outputs.
        float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

    def execute(self, inputs: Inputs) -> Outputs:
        return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

    def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
        return self.Outputs(0)  # Must return immediately without actual processing.
Each Task must define an "execute" method. The base class has empty implementations for Inputs, Config, Outputs, and dry_run(). For details, please see the documentation for the TaskBase class.
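Two of the rules above, JSON-serializable Config fields and default values on Outputs fields, can be checked with plain dataclasses, without IrisML installed. The sketch below is only an illustration: the names `OptimizerConfig`, `optimizer`, and `history` are hypothetical, and the idea that the framework needs to build an Outputs object without arguments is an assumed reading of the default-value rule, not something stated in this document.

```python
import dataclasses
import json
from typing import List


@dataclasses.dataclass
class OptimizerConfig:  # Hypothetical nested config.
    name: str = "sgd"
    lr: float = 0.1


@dataclasses.dataclass
class Config:
    another_float: float
    # A nested dataclass field; default_factory avoids sharing one instance.
    optimizer: OptimizerConfig = dataclasses.field(default_factory=OptimizerConfig)


@dataclasses.dataclass
class Outputs:
    float_value: float = 0.0  # Plain default value.
    # Mutable defaults must use default_factory in dataclasses.
    history: List[float] = dataclasses.field(default_factory=list)


config = Config(another_float=1.5)

# asdict() converts nested dataclasses to nested dicts, which json can serialize.
config_json = json.dumps(dataclasses.asdict(config), sort_keys=True)
print(config_json)

# Because every field has a default, Outputs can be built with no arguments.
placeholder = Outputs()
print(placeholder)
```

If a Config field held a value that json.dumps cannot encode, such as an arbitrary object, the serialization step would raise a TypeError, which is a quick way to catch a non-serializable field before running a pipeline.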
Related repositories
- irisml-tasks
- irisml-tasks-training
- irisml-tasks-azure-computervision
- irisml-tasks-azure-customvision
- irisml-tasks-azure-openai
- irisml-tasks-azureml
- irisml-tasks-fiftyone
- irisml-tasks-llava
- irisml-tasks-onnx
- irisml-tasks-torchmetrics
- irisml-tasks-torchvision
- irisml-tasks-transformers
- irisml-tasks-timm