Simple ML pipeline platform
IrisML
Proof of Concept for a simple framework to create a ML pipeline.
Features
- Run a ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.
Getting started
Installation
Prerequisite: Python 3.8+
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
Run an example job
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision
# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
Available commands
# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]
# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]
# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
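When pipelines are launched from scripts, the irisml_run command line above can be assembled programmatically. The following is a minimal sketch: `build_irisml_run_command` is a hypothetical helper, not part of IrisML, and the `DATASET_NAME` variable is only illustrative. It uses nothing beyond the options listed above and leaves the actual launch to `subprocess.run`.

```python
import subprocess
from typing import Dict, List, Optional


def build_irisml_run_command(pipeline_json_path: str,
                             env: Optional[Dict[str, str]] = None,
                             no_cache: bool = False,
                             no_cache_read: bool = False,
                             verbose: bool = False) -> List[str]:
    """Hypothetical helper: assemble an argument list for irisml_run."""
    command = ['irisml_run', pipeline_json_path]
    for name, value in (env or {}).items():
        # Each entry becomes "-e NAME=value", readable as $env.NAME in the config.
        command += ['-e', f'{name}={value}']
    if no_cache:
        command.append('--no_cache')
    if no_cache_read:
        command.append('--no_cache_read')
    if verbose:
        command.append('-v')
    return command


command = build_irisml_run_command('docs/examples/mobilenetv2_mnist_training.json',
                                   env={'DATASET_NAME': 'mnist'}, verbose=True)
print(' '.join(command))

# To actually launch the job (requires irisml to be installed):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a value contains spaces.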
Pipeline definition
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}
TaskDefinition = {
"task": <task module name>,
"name": <optional unique name of the task>,
"inputs": <list of input objects>,
"config": <config for the task. Use irisml_show command to find the available configurations.>
}
In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name> This variable is replaced by the environment variable that was passed as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name> This variable is replaced by the output of the specified previous task.
An exception is raised at runtime if the specified variable is not found.
If a task raises an exception, the tasks specified in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
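To make the variable rules concrete, here is a toy resolver written for illustration only. It is not IrisML's implementation; the task names, field names, and the dict shape used for "inputs" are assumptions. It replaces "$env.<variable_name>" and "$outputs.<task_name>.<field_name>" strings and raises an exception when a variable is missing, as described above.

```python
import re
from typing import Any, Dict

VARIABLE_PATTERN = re.compile(r'^\$(env|outputs)\.(.+)$')


def resolve(value: Any, env: Dict[str, str], outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Toy resolver: recursively replace $env.* and $outputs.*.* strings."""
    if isinstance(value, dict):
        return {k: resolve(v, env, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env, outputs) for v in value]
    if not isinstance(value, str):
        return value
    match = VARIABLE_PATTERN.match(value)
    if not match:
        return value  # A plain string, not a variable reference.
    kind, path = match.groups()
    if kind == 'env':
        if path not in env:
            raise ValueError(f'Environment variable not found: {path}')
        return env[path]
    task_name, _, field_name = path.partition('.')
    if task_name not in outputs or field_name not in outputs[task_name]:
        raise ValueError(f'Task output not found: {path}')
    return outputs[task_name][field_name]


# All task and field names below are hypothetical.
task_definition = {
    'task': 'my_trainer',
    'name': 'train_model',
    'inputs': {'dataset': '$outputs.load_data.dataset'},
    'config': {'dataset_name': '$env.DATASET_NAME', 'num_epochs': 10},
}
env = {'DATASET_NAME': 'mnist'}
outputs = {'load_data': {'dataset': 'DATASET_OBJECT'}}

resolved = resolve(task_definition, env, outputs)
print(resolved['inputs'], resolved['config'])
```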
Patch definition (Experimental)
PatchesDefinition = {"patches": List[PatchDefinition], "patches_on_error": List[PatchDefinition]} # At least one of the fields must be specified.
PatchDefinition = { # One of the filtering conditions and one of the actions must be specified.
# Filtering conditions
"match": List[MatchCondition],
"match_if_exists": List[MatchCondition], # Matches the task if it exists. If not, the patch will be ignored.
"match_oneof": List[MatchCondition], # Matches the first task that matches one of the conditions.
"top": bool, # Matches the top of the pipeline. Used with "insert" action.
"bottom": bool, # Matches the bottom of the pipeline. Used with "insert" action.
# Actions
"insert": List[TaskDefinition],
"remove": bool,
"replace": Tuple[List[TaskDefinition], Dict[str, str]], # The second element is a mapping from the old output name to the new output name. All "$output" variables will be replaced by the new output name.
"update": TaskDefinition
}
MatchCondition = { # All fields are optional.
"task": str,
"name": str,
"config": Dict[str, Any]
}
The available actions are as follows:
- insert: Insert the specified tasks after the matched task.
- remove: Remove the matched task.
- replace: Replace the matched task with the specified tasks.
- update: Update the matched task with the given configuration.
Note that the patch command doesn't guarantee the correctness of the patched pipeline. It is recommended to validate the patched pipeline.
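As an illustration of the schema above, the sketch below writes one patch as a Python dict and checks a single MatchCondition against task definitions. The `matches` function is a toy, not IrisML's matcher: treating a "config" condition as a subset match is an assumption, and all task names are hypothetical. How a list of several conditions is combined is not covered here.

```python
from typing import Any, Dict, List

# A patch that inserts a task after the task named "train_model".
patches_definition = {
    'patches': [
        {
            'match': [{'name': 'train_model'}],
            'insert': [{'task': 'my_logger', 'config': {'message': 'training finished'}}],
        }
    ]
}


def matches(task_definition: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Toy matcher: every field given in the condition must agree with the task."""
    for key in ('task', 'name'):
        if key in condition and task_definition.get(key) != condition[key]:
            return False
    task_config = task_definition.get('config', {})
    # Assumption: a config condition matches if its entries are a subset of the task config.
    return all(k in task_config and task_config[k] == v
               for k, v in condition.get('config', {}).items())


def find_matching_tasks(tasks: List[Dict[str, Any]], condition: Dict[str, Any]) -> List[int]:
    """Return the indices of the tasks that satisfy the condition."""
    return [index for index, task in enumerate(tasks) if matches(task, condition)]


pipeline_tasks = [
    {'task': 'my_loader', 'name': 'load_data', 'config': {'split': 'train'}},
    {'task': 'my_trainer', 'name': 'train_model', 'config': {'num_epochs': 10, 'lr': 0.1}},
]
condition = patches_definition['patches'][0]['match'][0]
print(find_matching_tasks(pipeline_tasks, condition))
```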
Pipeline cache
With caching, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs and configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.
To enable the cache, specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.
To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it is used for authentication. Otherwise, interactive authentication and Managed Identity authentication are used.
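The idea of a hash-based cache key can be sketched as follows. This is a toy under stated assumptions, not IrisML's actual hashing scheme: `compute_cache_key` and its arguments are hypothetical, and it assumes that the task name, version, config, and hashes of the inputs are what identify a cached result.

```python
import hashlib
import json
from typing import Any, Dict


def compute_cache_key(task_name: str, version: str,
                      config: Dict[str, Any], input_hashes: Dict[str, str]) -> str:
    """Toy cache key: SHA-256 over a canonical JSON dump of what affects the output."""
    payload = json.dumps(
        {'task': task_name, 'version': version, 'config': config, 'inputs': input_hashes},
        sort_keys=True,  # Key order must not change the hash.
    )
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


key_a = compute_cache_key('my_trainer', '1.0.0', {'lr': 0.1, 'num_epochs': 10}, {'dataset': 'abc123'})
key_b = compute_cache_key('my_trainer', '1.0.0', {'num_epochs': 10, 'lr': 0.1}, {'dataset': 'abc123'})
key_c = compute_cache_key('my_trainer', '1.0.0', {'lr': 0.2, 'num_epochs': 10}, {'dataset': 'abc123'})

print(key_a == key_b)  # Same config in a different order: same key, cache hit.
print(key_a == key_c)  # Changed config: different key, the task runs again.
```

Because the key of a task depends on the hashes of its inputs, changing one task's config also changes the keys of the tasks that consume its outputs, while unaffected tasks can still be served from the cache.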
Python API
To run a pipeline from Python code, you can use the following API.
import json
import pathlib
from irisml.core import JobRunner
job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)
runner.run({'DATASET_NAME': 'mnist'})
runner.run({'DATASET_NAME': 'cifar10'})
Available official tasks
To show the detailed help for each task, run the following command after installing the package.
irisml_show <task_name>
irisml-tasks
Task | Description |
---|---|
assertion | Assert the given input. |
assign_class_to_strings | Assigns a class to a string based on the class name being present in the string. |
branch | 'If' conditional branch. |
calculate_cosine_similarity | Calculate cosine similarity between two sets of vectors. |
check_model_parameters | Check Inf/NaN values in model parameters. |
compare | Compare two values |
compare_ints | Compare two int values. |
convert_detection_to_multilabel | Convert targets or predictions of object detection to multilabel. |
convert_string_to_string_list | Convert a string to a list of strings. |
deserialize_tensor | Deserialize a pytorch tensor. |
divide_float | Floating point division. |
download_azure_blob | Download a single blob from Azure Blob Storage. |
emulate_fp8_quantization | Emulate FP8 quantization. |
extract_image_bytes_from_dataset | Extract images from a dataset and convert them to bytes. |
get_current_time | Get the current time in seconds since the epoch |
get_dataset_split | Get a train/val split of a dataset. |
get_dataset_stats | Get statistics of a dataset. |
get_dataset_subset | Get a subset of a dataset. |
get_fake_image_classification_dataset | Generate a fake image classification dataset. |
get_fake_image_text_classification_dataset | Generate a fake image-text classification dataset. |
get_fake_object_detection_dataset | Generate a fake object detection dataset. |
get_fake_phrase_grounding_dataset | Generate a fake phrase grounding dataset. |
get_fake_visual_question_answering_dataset | Generate a fake visual question answering dataset. |
get_int_from_json_strings | Get an integer from a JSON string. |
get_int_list_from_json_strings | Get a list of ints from a JSON string. |
get_item | Get an item from the given list. |
get_key_and_int_list_from_json_string | Parse a JSON string and return a list of keys and a list of lists of ints. |
get_kfold_cross_validation_dataset | Get train/test dataset for k-fold cross validation. |
get_secret_from_azure_keyvault | Get a secret from Azure KeyVault. |
get_topk | Get the largest Topk values and indices. |
join_filepath | Join a given dir_path and a filename. |
join_two_strings | Join two strings to one string. |
load_coco_detections | Load coco detections from a JSON to a list of tensors. |
load_float_tensor_jsonl | Load a 2D float tensor from a JSONL file. |
load_state_dict | Load a state_dict from various sources. |
load_str_list_jsonl | Load a list of strings from a JSONL file. |
load_strs_from_json_file | Load strings from a JSON file. |
load_tensor_list | Load a list of tensors from file. |
make_cached_dataset | Save dataset cache on disk. |
make_prompt_for_each_string | Make a prompt for each string. |
make_prompt_list_with_strings | Make a list of prompts from a template and a list of strings. |
make_prompt_with_strings | Make a prompt with a list of strings. |
make_random_choice_text_transform | Make a text transform function that randomly chooses one of the substrings separated by the delimiter. |
make_text_transform | Make a text transform function. |
map_int_list | Map a list of integers to a list of integers. |
pickling_object | Pickling an object. |
Print or Pretty Print the input object. | |
print_environment_info | Print various environment information to stdout/stderr. |
read_file | Reads a file and returns its contents as bytes. |
repeat_tasks | Repeat the given tasks for multiple times. |
run_parallel | Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name. |
run_profiler | Run profiler on the given tasks. |
run_sequential | Run the given tasks in sequence. Each task must have a unique name. |
save_file | Save the given input binary to a file. |
save_float_tensor_jsonl | Save a 2D float tensor to a JSONL file. |
save_images_from_dataset | Save images from a dataset to disk. |
save_jit_model | Save an offline version of a pytorch model. torch.jit.save() |
save_state_dict | Save the model's state_dict to the specified file. |
save_str_list_jsonl | Save a list of strings to a JSONL file. |
search_grid_sequential | Grid search hyperparameters. Tasks are run in sequence. |
serialize_tensor | Serialize a pytorch tensor. |
split_string | Split string to a list of strings. |
switch_pick | Pick from vals based on conditions. The task returns the first val whose condition is True. |
upload_azure_blob | Upload a binary file to Azure Storage Blob. |
upload_azure_blob_directory | Upload a directory to Azure Blob Storage. |
irisml-tasks-training
This package contains tasks related to pytorch training.
Task | Description |
---|---|
append_classifier | Append a classifier model to a given model. A predictor and a loss module will be added, too. |
benchmark_dataset | Benchmark dataset loading and preprocessing |
benchmark_model | Benchmark a given model using a given dataset. |
benchmark_model_with_grad_cache | Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching. |
build_classification_prompt_dataset | Create a classification prompt dataset. |
build_zero_shot_classifier | Create a zero-shot classification layer. |
concatenate_datasets | Concatenate the given two datasets together. |
convert_vqa_dataset_to_image_text_classification_dataset | Convert VQA dataset to image text classification dataset. |
create_classification_prompt_generator | Create a prompt generator for a classification task. |
create_prompt_generator | Create a prompt generator that returns a list of prompts for a given label. |
evaluate_accuracy | Calculate accuracy of the given prediction results. |
evaluate_captioning | Evaluate captioning prediction results. |
evaluate_detection_average_precision | Calculate mean average precision for object detection task results. |
evaluate_phrase_grounding | Calculate precision/recall for phrase grounding. |
evaluate_phrase_grounding_recall | Calculate recall for phrase grounding. |
evaluate_string_matching_accuracy | Calculate accuracy of string matching. |
exclude_negative_samples_from_classification_dataset | Exclude negative samples from classification dataset. |
export_coco_from_torch_dataset | Export coco dataset from a given torch dataset. Support IC and OD only. |
export_onnx | Export the given model as ONNX. |
extract_val_by_key_from_jsonl | Extract value for each entry in a JSONL by a key. |
find_incorrect_classification_indices | Find incorrect classification indices. |
find_incorrect_classification_multilabel_indices | Find incorrect classification indices for multilabel classification. |
flatten_captioning_dataset | Flatten a captioning dataset with multiple targets per image into a dataset with a single target per image. |
get_questions_from_vqa_dataset | Extracts questions from a VQA dataset. |
get_subclass_dataset | Get the sub-dataset with given class ids from a dataset. |
get_targets_from_dataset | Extract only targets from a given Dataset. |
load_jsonl_vqa_dataset | Load a VQA dataset from a jsonl file. |
load_simple_classification_dataset | Load a simple classification dataset from a directory of images and an index file. |
make_classification_dataset_from_object_detection | Convert an object detection dataset into a classification dataset. |
make_classification_dataset_from_predictions | Make a classification dataset from predictions. |
make_detection_dataset_from_predictions | Make a detection dataset from predictions. |
make_feature_extractor_model | Make a wrapper model to extract a feature vector from a vision model. |
make_fixed_prompt_image_transform | Make a transform function for image and a fixed prompt. |
make_fixed_text_dataset | Create a dataset with a list of strings. |
make_image_text_contrastive_model | Make a model for image-text contrastive training. |
make_image_text_transform | Make a transform function for image-text classification. |
make_oversampled_dataset | Make an oversampled dataset. |
make_phrase_grounding_image_transform | Make phrase grounding image transform. |
make_prompt_list_image_transform | Make a transform function for image and prompt list. |
make_vqa_collate_function | Creates a collate_function for Visual Question Answering (VQA) and Phrase Grounding task. |
make_vqa_image_transform | Creates a transform function for VQA task. |
map_classification_predictions_to_detection | Map classification predictions back to detection predictions or targets. |
num_iters_to_epochs | Convert number of iterations to number of epochs. Min value is 1. |
predict | Predict using a given model. |
remove_empty_images_from_dataset | Remove empty images from dataset. |
sample_few_shot_dataset | Few-shot sampling of an IC/OD dataset. |
save_jsonl_vqa_dataset | Save a VQA dataset to a JSONL file. |
split_image_text_model | Split an image-text model into an image model and a text model. |
train | Train a pytorch model. |
train_with_gradient_cache | Train a model using gradient cache. Useful for contrastive learning with a large model. |
irisml-tasks-azure-computervision
Task | Description |
---|---|
create_azure_computervision_caption_model | Create Azure Computer Vision Caption Model. |
create_azure_computervision_classification_model | Create Azure Computer Vision Classification Model. |
create_azure_computervision_custom_model | Create a model that runs inference with a custom model in Azure Computer Vision. |
create_azure_computervision_ocr_model | Create Azure Computer Vision OCR model. |
create_azure_computervision_product_recognizer_model | Create a model that runs inference with a product recognizer model in Azure Computer Vision. |
create_azure_computervision_vectorization_model | Create Azure Computer Vision Vectorization Model. |
delete_azure_computervision_custom_model | Delete Azure Computer Vision Custom Model. |
train_azure_computervision_custom_model | Train Azure Computer Vision Custom Model. |
irisml-tasks-azure-customvision
Task | Description |
---|---|
create_azure_customvision_docker_model | Create a model from an exported Azure Custom Vision Docker image. |
create_azure_customvision_model | Create a prediction model from an Azure Custom Vision project. |
create_azure_customvision_project | Create a new Azure Custom Vision project. |
delete_azure_customvision_project | Delete an Azure Custom Vision project |
export_azure_customvision_model | Export a model from an Azure Custom Vision project. |
train_azure_customvision_project | Train an Azure Custom Vision project. |
irisml-tasks-azure-openai
Task | Description |
---|---|
call_azure_openai_completion | Call Azure OpenAI Text Completion API. |
create_azure_openai_chat_model | Create a model that generates text using Azure OpenAI completion API. |
create_azure_openai_completion_model | Create a model that generates text using Azure OpenAI completion API. |
irisml-tasks-azureml
Task | Description |
---|---|
run_azureml_child | Run tasks as a new child AzureML Run. |
irisml-tasks-fiftyone
Task | Description |
---|---|
launch_fiftyone | Launch a fiftyone app. |
irisml-tasks-llava
Task | Description |
---|---|
create_llava_model | Create a LLaVA model from pretrained weights. |
irisml-tasks-onnx
Adapter tasks for OnnxRuntime library.
Task | Description |
---|---|
benchmark_onnx | Benchmark a given onnx model using onnxruntime. |
predict_onnx | Predict using a given onnx model traced with the export_onnx task |
irisml-tasks-timm
Adapter for models in timm library.
Task | Description |
---|---|
create_timm_model | Create a timm model. |
create_timm_transform | Create timm transforms. |
irisml-tasks-torchmetrics
Adapter tasks for torchmetrics library.
Task | Description |
---|---|
evaluate_torchmetrics_classification_multiclass | Evaluate prediction results using torchmetrics classification metrics for multiclass classification problems. |
evaluate_torchmetrics_classification_multilabel | Evaluate prediction results using torchmetrics classification metrics for multilabel classification problems. |
irisml-tasks-torchvision
Adapter tasks for torchvision library.
Task | Description |
---|---|
create_torchvision_model | Create a torchvision model. |
create_torchvision_transform | Create transform objects in torchvision library. |
create_torchvision_transform_v2 | Create torchvision transform v2 object from string expressions. |
load_torchvision_dataset | Load a dataset from torchvision package. |
irisml-tasks-transformers
Adapter tasks for HuggingFace transformers library.
Task | Description |
---|---|
cache_transformers_model_on_azure_blob | Cache a model from transformers on Azure Blob Storage. |
create_transformers_model | Create a model using transformers library. |
create_transformers_raw_tokenizer | Create a Tokenizer using transformers library. Return the tokenizer as-is. |
create_transformers_text_model | Create a text-generation model using transformers library. |
create_transformers_tokenizer | Create a Tokenizer using transformers library. |
Development
Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Hypothetical nested config, used by Task.Config below.
    value: int = 0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
    VERSION = '1.0.0'
    CACHE_ENABLED = True  # (default: True) This is optional.

    @dataclasses.dataclass
    class Inputs:  # You can remove this class if the task doesn't require inputs.
        int_value: int
        float_value: float

    @dataclasses.dataclass
    class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
        another_float: float
        child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

    @dataclasses.dataclass
    class Outputs:  # Can be removed if the task doesn't have outputs.
        float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have a default value or default factory.

    def execute(self, inputs: Inputs) -> Outputs:
        return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

    def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
        return self.Outputs(0)  # Must return immediately without actual processing.
Each Task must define an "execute" method. The base class provides empty implementations of Inputs, Config, Outputs, and dry_run(). For details, see the documentation for the TaskBase class.
Related repositories
- irisml-tasks
- irisml-tasks-training
- irisml-tasks-azure-computervision
- irisml-tasks-azure-customvision
- irisml-tasks-azure-openai
- irisml-tasks-azureml
- irisml-tasks-fiftyone
- irisml-tasks-llava
- irisml-tasks-onnx
- irisml-tasks-torchmetrics
- irisml-tasks-torchvision
- irisml-tasks-transformers
- irisml-tasks-timm