Simple ML pipeline platform

Project description

IrisML

Proof of concept for a simple framework to create an ML pipeline.

Features

Run ML training/inference with a simple JSON configuration.
Modularized interfaces for task components.
Cache task outputs for faster experiments.

Getting started

Installation

Prerequisite: Python 3.8+

# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training

Run an example job

# Install additional packages that are required for the example
pip install irisml-tasks-torchvision

# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json

Available commands

# Run the specified pipeline. You can provide environment variables with the "-e" option; they will be accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [--no_cache_read] [-v]

# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]

# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]

Pipeline definition

PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}

TaskDefinition = {
    "task": <task module name>,
    "name": <optional unique name of the task>,
    "inputs": <list of input objects>,
    "config": <config for the task. Use irisml_show command to find the available configurations.>
}

In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.

$env.<variable_name> This variable will be replaced by the environment variable that was provided as arguments for irisml_run command.
$outputs.<task_name>.<field_name> This variable will be replaced by the outputs of the specified previous task.

An exception is raised at runtime if the specified variable is not found.
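
The substitution rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resolve function, the regular expressions, and the example task and field names (load_data, num_classes) are hypothetical and are not taken from IrisML's implementation, which may treat partial strings or other value types differently.

```python
import re
from typing import Any, Dict

# Hypothetical resolver for illustration only; IrisML's own logic may differ.
_ENV_PATTERN = re.compile(r"^\$env\.(\w+)$")
_OUTPUTS_PATTERN = re.compile(r"^\$outputs\.(\w+)\.(\w+)$")


def resolve(value: Any, env: Dict[str, Any], outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Recursively replace $env.* and $outputs.*.* strings in a JSON-like value."""
    if isinstance(value, dict):
        return {k: resolve(v, env, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env, outputs) for v in value]
    if isinstance(value, str):
        m = _ENV_PATTERN.match(value)
        if m:
            name = m.group(1)
            if name not in env:
                raise KeyError(f"Environment variable not found: {name}")
            return env[name]
        m = _OUTPUTS_PATTERN.match(value)
        if m:
            task_name, field_name = m.groups()
            if task_name not in outputs or field_name not in outputs[task_name]:
                raise KeyError(f"Output not found: {task_name}.{field_name}")
            return outputs[task_name][field_name]
    # Anything else (numbers, plain strings, None) is returned unchanged.
    return value


# Hypothetical config and previous-task outputs.
config = {"name": "$env.DATASET_NAME", "values": ["$outputs.load_data.num_classes", 3]}
resolved = resolve(config, {"DATASET_NAME": "mnist"}, {"load_data": {"num_classes": 10}})
print(resolved)
```

In this sketch a missing variable raises KeyError, which mirrors the runtime exception described above; the exception type IrisML actually raises is not specified here.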

If a task raises an exception, the tasks specified in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
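
The on_error flow can be pictured with the following minimal sketch. It assumes a task is just a Python callable that receives the environment dict; run_pipeline, failing_task, and report are hypothetical names, and whether IrisML re-raises the exception after running the handlers is not stated here, so this sketch simply returns False.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in: a "task" is a callable that receives the env dict.
TaskFn = Callable[[Dict[str, Any]], None]


def run_pipeline(tasks: List[TaskFn], on_error: Optional[List[TaskFn]], env: Dict[str, Any]) -> bool:
    """Run tasks in order; on failure, run on_error tasks with the exception in env."""
    try:
        for task in tasks:
            task(env)
        return True
    except Exception as e:
        # Expose the exception to the handlers under the IRISML_EXCEPTION name.
        handler_env = {**env, "IRISML_EXCEPTION": e}
        for handler in on_error or []:
            handler(handler_env)
        return False


log: List[str] = []


def failing_task(env: Dict[str, Any]) -> None:
    raise ValueError("bad input")


def report(env: Dict[str, Any]) -> None:
    log.append(f"failed: {env['IRISML_EXCEPTION']}")


succeeded = run_pipeline([failing_task], [report], {"DATASET_NAME": "mnist"})
print(succeeded, log)
```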

Pipeline cache

With the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs and configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.

To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.

To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
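
The idea of hash-keyed caching can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the key is a SHA-256 over a JSON dump of the task name, version, inputs, and config, and the storage is a plain dict. IrisML's actual hashing scheme and storage layout are not described in this document and may differ.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


def cache_key(task_name: str, version: str, inputs: Dict[str, Any], config: Dict[str, Any]) -> str:
    """Build a deterministic key from JSON-serializable task inputs and config."""
    payload = json.dumps(
        {"task": task_name, "version": version, "inputs": inputs, "config": config},
        sort_keys=True,  # Key order must not change the hash.
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(
    task_name: str,
    version: str,
    inputs: Dict[str, Any],
    config: Dict[str, Any],
    fn: Callable[[Dict[str, Any], Dict[str, Any]], Any],
    storage: Dict[str, Any],
) -> Tuple[Any, bool]:
    """Return (outputs, cache_hit). The task function runs only on a cache miss."""
    key = cache_key(task_name, version, inputs, config)
    if key in storage:
        return storage[key], True
    outputs = fn(inputs, config)
    storage[key] = outputs
    return outputs, False


calls = []


def multiply(inputs: Dict[str, Any], config: Dict[str, Any]) -> float:
    calls.append(1)  # Count how many times the task really executes.
    return inputs["value"] * config["factor"]


storage: Dict[str, Any] = {}
first = run_with_cache("multiply", "1.0.0", {"value": 2}, {"factor": 3.0}, multiply, storage)
second = run_with_cache("multiply", "1.0.0", {"value": 2}, {"factor": 3.0}, multiply, storage)
third = run_with_cache("multiply", "1.0.0", {"value": 2}, {"factor": 4.0}, multiply, storage)
print(first, second, third, len(calls))
```

Changing only the config produces a different key, so only the affected task is re-executed while unchanged tasks are served from storage.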

Python API

To run a pipeline from Python code, you can use the following APIs.

import json
import pathlib
from irisml.core import JobRunner

job_description = json.loads(pathlib.Path('example.json').read_text())
runner = JobRunner(job_description)

runner.run({'DATASET_NAME': 'mnist'})

runner.run({'DATASET_NAME': 'cifar10'})
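
The run() calls above take a dict of environment variables. When those values arrive as NAME=value strings, as with the "-e" option of irisml_run, a small helper can build the dict. The parse_env_args function below is a hypothetical helper written for this example and is not part of irisml.

```python
from typing import Dict, Iterable


def parse_env_args(pairs: Iterable[str]) -> Dict[str, str]:
    """Convert NAME=value strings into a dict; the value may itself contain '='."""
    env: Dict[str, str] = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")  # Split at the first '=' only.
        if not sep or not name:
            raise ValueError(f"Expected NAME=value, got: {pair!r}")
        env[name] = value
    return env


env = parse_env_args(["DATASET_NAME=mnist", "URL=https://example.com/?a=b"])
print(env)
```

The resulting dict could then be passed to runner.run(env) in the same way as the literal dicts in the snippet above.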

Available official tasks

To show the detailed help for each task, run the following command after installing the package.

irisml_show <task_name>

irisml-tasks

Task	Description
assertion	Assert the given input.
assign_class_to_strings	Assigns a class to a string based on the class name being present in the string.
branch	'If' conditional branch.
calculate_cosine_similarity	Calculate cosine similarity between two sets of vectors.
check_model_parameters	Check Inf/NaN values in model parameters.
compare	Compare two values.
deserialize_tensor	Deserialize a pytorch tensor.
divide_float	Floating point division.
download_azure_blob	Download a single blob from Azure Blob Storage.
extract_image_bytes_from_dataset	Extract images from a dataset and convert them to bytes.
get_current_time	Get the current time in seconds since the epoch.
get_dataset_split	Get a train/val split of a dataset.
get_dataset_stats	Get statistics of a dataset.
get_dataset_subset	Get a subset of a dataset.
get_fake_image_classification_dataset	Generate a fake image classification dataset.
get_fake_object_detection_dataset	Generate a fake object detection dataset.
get_int_from_json_strings	Get an integer from a JSON string.
get_item	Get an item from the given list.
get_kfold_cross_validation_dataset	Get train/test dataset for k-fold cross validation.
get_secret_from_azure_keyvault	Get a secret from Azure KeyVault.
get_topk	Get the largest Topk values and indices.
join_filepath	Join a given dir_path and a filename.
load_state_dict	Load a state_dict from various sources.
make_cached_dataset	Save dataset cache on disk.
make_prompt_for_each_string	Make a prompt for each string.
make_prompt_with_strings	Make a prompt with a list of strings.
pickling_object	Pickling an object.
print	Print or Pretty Print the input object.
print_environment_info	Print various environment information to stdout/stderr.
read_file	Reads a file and returns its contents as bytes.
repeat_tasks	Repeat the given tasks for multiple times.
run_parallel	Run the given tasks in parallel. A new process will be forked for each task. Each task must have a unique name.
run_profiler	Run profiler on the given tasks.
run_sequential	Run the given tasks in sequence. Each task must have a unique name.
save_file	Save the given input binary to a file.
save_images_from_dataset	Save images from a dataset to disk.
save_state_dict	Save the model's state_dict to the specified file.
search_grid_sequential	Grid search hyperparameters. Tasks are run in sequence.
serialize_tensor	Serialize a pytorch tensor.
switch_pick	Pick from vals based on conditions. The task returns the first val whose condition is True.
upload_azure_blob	Upload a binary file to Azure Storage Blob.

irisml-tasks-training

This package contains tasks related to PyTorch training.

Task	Description
append_classifier	Append a classifier model to a given model. A predictor and a loss module will be added, too.
benchmark_dataset	Benchmark dataset loading and preprocessing.
benchmark_model	Benchmark a given model using a given dataset.
benchmark_model_with_grad_cache	Benchmark a given model using a given dataset with grad caching. Useful for cases which require sub batching.
build_classification_prompt_dataset	Create a classification prompt dataset.
build_zero_shot_classifier	Create a zero-shot classification layer.
concatenate_datasets	Concatenate the given two datasets together.
create_classification_prompt_generator	Create a prompt generator for a classification task.
evaluate_accuracy	Calculate accuracy of the given prediction results.
evaluate_detection_average_precision	Calculate mean average precision for object detection task results.
exclude_negative_samples_from_classification_dataset	Exclude negative samples from classification dataset.
export_onnx	Export the given model as ONNX.
get_targets_from_dataset	Extract only targets from a given Dataset.
load_simple_classification_dataset	Load a simple classification dataset from a directory of images and an index file.
make_classification_dataset_from_object_detection	Convert an object detection dataset into a classification dataset.
make_classification_dataset_from_predictions	Make a classification dataset from predictions.
make_feature_extractor_model	Make a wrapper model to extract a feature vector from a vision model.
make_fixed_prompt_image_transform	Make a transform function for image and a fixed prompt.
make_image_text_contrastive_model	Make a model for image-text contrastive training.
make_image_text_transform	Make a transform function for image-text classification.
make_oversampled_dataset	Make an oversampled dataset.
num_iters_to_epochs	Convert number of iterations to number of epochs. Min value is 1.
predict	Predict using a given model.
remove_empty_images_from_dataset	Remove empty images from dataset.
sample_few_shot_dataset	Few-shot sampling of an IC/OD dataset.
split_image_text_model	Split an image-text model into an image model and a text model.
train	Train a pytorch model.
train_with_gradient_cache	Train a model using gradient cache. Useful for contrastive learning with a large model.

irisml-tasks-torchvision

Adapter tasks for torchvision library.

Task	Description
create_torchvision_model	Create a torchvision model.
create_torchvision_transform	Create transform objects in torchvision library.
load_torchvision_dataset	Load a dataset from torchvision package.

irisml-tasks-transformers

Adapter tasks for HuggingFace transformers library.

Task	Description
create_transformers_model	Create a model using the transformers library.
create_transformers_tokenizer	Create a tokenizer using the transformers library.

irisml-tasks-timm

Adapter for models in timm library.

Task	Description
create_timm_model	Create a model using the timm library.
create_timm_transform	Create a preprocessing function using the timm library.

irisml-tasks-onnx

Adapter tasks for OnnxRuntime library.

Task	Description
predict_onnx	Run inference for an ONNX model.

irisml-tasks-azureml

Task	Description
run_azureml_child	Run tasks as a new child AzureML Run.
add_aml_tag	Tag the AML Run with a string key and optional value.

irisml-tasks-fiftyone

Task	Description
launch_fiftyone	Launch a fiftyone interface.

Development

Create a new task

To create a Task, you must define a module that contains a "Task" class. Here is a simple example:

# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core

@dataclasses.dataclass
class ChildConfig:  # Illustrative nested config; the class name and field are placeholders.
  value: int = 0

class Task(irisml.core.TaskBase):  # The class name must be "Task".
  VERSION = '1.0.0'
  CACHE_ENABLED = True  # (default: True) This is optional.

  @dataclasses.dataclass
  class Inputs:  # You can remove this class if the task doesn't require inputs.
    int_value: int
    float_value: float

  @dataclasses.dataclass
  class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
    another_float: float
    child_dataclass: ChildConfig  # For a nested config, use another dataclass as the field type.

  @dataclasses.dataclass
  class Outputs:  # Can be removed if the task doesn't have outputs.
    float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

  def execute(self, inputs: Inputs) -> Outputs:
    return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

  def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
    return self.Outputs(0)  # Must return immediately without actual processing.

Each Task must define an "execute" method. The base class provides empty implementations for Inputs, Config, Outputs, and dry_run(). For details, please see the documentation for the TaskBase class.
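
The example above notes that Outputs fields need a default value or default factory when dry_run() is not implemented. A quick way to check that requirement for a dataclass is sketched below using only the standard dataclasses module. The helper name fields_without_defaults and the GoodOutputs and BadOutputs classes are hypothetical and are not part of irisml.

```python
import dataclasses
from typing import List


def fields_without_defaults(cls: type) -> List[str]:
    """Return the names of dataclass fields that have neither a default nor a default factory."""
    return [
        f.name
        for f in dataclasses.fields(cls)
        if f.default is dataclasses.MISSING and f.default_factory is dataclasses.MISSING
    ]


@dataclasses.dataclass
class GoodOutputs:
    float_value: float = 0.0
    history: list = dataclasses.field(default_factory=list)


@dataclasses.dataclass
class BadOutputs:
    float_value: float  # No default, so BadOutputs() cannot be built without arguments.


print(fields_without_defaults(GoodOutputs))  # []
print(fields_without_defaults(BadOutputs))   # ['float_value']
```

A dataclass that passes this check can be instantiated with no arguments, which is presumably what a default dry_run() needs in order to return placeholder outputs.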

Related repositories
