Integrate customer side ML application with the Alectio Platform


Getting started with AlectioSDK

Info Hub Who are we »
Setup docs »
Alectio Examples »
How to create a Project »
How to create an Experiment »
How to do Active learning »
Request a Feature »

Table of Contents
  1. About The Project
  2. Getting Started
  3. Custom Usage
  4. Open issues
  5. Contributing
  6. Contact
  7. Acknowledgements

About The Project


AlectioSDK is an open-source Python package that lets end users perform data curation for training machine learning models. The package is model-, dataset-, and framework-agnostic. It is simple, efficient, and runs on-prem, so your ML training can be carried out smoothly, securely, and, most importantly, efficiently.


Getting Started

Prerequisites

  • Python3 (Required)
  • PIP3 (Required)
  • Ubuntu 16.04+ / macOS / Windows 10
  • GCC / C++ (depends on the OS you are using: Ubuntu and macOS ship with them by default, but some flavours of Linux distributions such as Amazon Linux or Red Hat Linux might not have GCC or the C++ related libraries installed)

Installation

Please follow the instructions here to get started with the installation. If you want a quick trial run with one of our ready-to-go examples, follow the instructions here.

Custom Usage

Once you are done installing the above, you are good to go. If you want to run one of our examples, follow the remaining installation instructions detailed in the examples directory. We cover examples for topic classification, image classification, and object detection.

Required implementations

To use AlectioSDK with a custom model/dataset, you need to create the following Python files inside your model implementation folder (config.yaml is optional). Let's say you have your model implemented in a folder called Detector:

  .
  ├── Detector
  │   ├── processes.py
  │   ├── main.py
  │   ├── config.yaml
  │   └── <other model dependency files/folders>

processes.py

AlectioSDK needs you to implement 4 main processes so that the SDK can be used to perform model and data optimizations using your own machine learning model and your data:

  • Training process - A process to train the model
  • Testing process - A process to test the model
  • Infer process - A process to apply the model to infer on unlabeled data
  • Datamap process - A process to assign each data point in the dataset to a unique index (Refer to one of the examples to know how)
Training process

The logic for training the model should be implemented by you in this process. The function should look like the following and should contain these items:

def train(args, labeled, resume_from, ckpt_file):
    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

The name of the function can be anything you like. It takes the following arguments:

  • args: the model arguments, read from a YAML config file in your model folder
  • labeled: a list of indices of the samples selected to train the model in this loop
  • resume_from: a string specifying which checkpoint to resume from; we suggest clearing the weights rather than resuming from the previous loop's checkpoint
  • ckpt_file: a string specifying the name of the checkpoint to save for the current loop; use it to save whatever your logic considers the best checkpoint

Depending on your situation, the samples indicated in labeled might not be labeled (despite the variable name). We call it labeled because in the active learning setting, this list represents the pool of samples iteratively labeled by the human oracle. Make sure you label these data points when the SDK pauses after each loop.
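
As a hypothetical sketch of how the `labeled` index list selects this loop's training subset (the plain list here stands in for your real dataset object):

```python
# Illustrative only: `dataset` is a plain list standing in for your real
# dataset object; `labeled` is the index list handed to train().
dataset = ["img_000.jpg", "img_001.jpg", "img_002.jpg", "img_003.jpg"]
labeled = [0, 2]  # indices selected by the SDK for this loop

# Build the training subset for this loop from the selected indices
train_subset = [dataset[i] for i in labeled]
print(train_subset)  # ['img_000.jpg', 'img_002.jpg']
```
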

Testing process

The logic for testing the model should be implemented in this process. The function representing this process should look like:

def test(args, ckpt_file):
    # the checkpoint to test
    ckpt_file = ckpt_file

    # implement your testing logic here


    # put the predictions and labels into
    # two dictionaries

    # lbs <- dictionary of indices of test data and their ground-truth

    # prd <- dictionary of indices of test data and their prediction

    return {'predictions': prd, 'labels': lbs}

The test function takes the following 2 arguments:

  • args: the model arguments, read from a YAML config file in your model folder
  • ckpt_file: a string specifying the checkpoint to use for testing

The test function needs to return a dictionary with two keys:

  • predictions: a dictionary mapping each test sample's index to its prediction
  • labels: a dictionary mapping each test sample's index to its ground-truth label

The format of the values depends on the type of ML problem. Please refer to the examples directory for details regarding what is needed for each use case.
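
As an illustration of the classification case only (indices and class ids below are made up), the returned dictionaries might look like this:

```python
# Illustrative only: a 3-sample classification test set.
# Keys are test-sample indices; values are class ids.
prd = {0: 1, 1: 0, 2: 2}  # index -> predicted class
lbs = {0: 1, 1: 1, 2: 2}  # index -> ground-truth class

result = {"predictions": prd, "labels": lbs}
print(result["predictions"][0])  # 1
```
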

Infer process

The logic for applying the model to infer on the unlabeled data should be implemented in this process. The function representing this process should look like:

def infer(args, unlabeled, ckpt_file):
    """
    Get the list of indices of unlabeled data. Use these indices in your 
    dataloader to infer on unlabeled pool of data
    """
    unlabeledindices = unlabeled

    # get the checkpoint file to be used for applying inference
    ckpt_file = ckpt_file

    # implement your inference logic here

    """
    outputs <- save the output from the model on the unlabeled data 
    as a dictionary
    """
    return {'outputs': outputs}

The infer function takes the following 2 arguments:

  • unlabeled: a list of indices of unlabeled data in the training set
  • ckpt_file: a string specifying which checkpoint to use to infer on the unlabeled data

The infer function needs to return a dictionary with one key:

  • outputs: a dictionary mapping each sample's index to the model's output before an activation function is applied

For example, if it is a classification problem, return the output before applying softmax. For more details about the format of the output, please refer to the examples directory.
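
As a concrete sketch, assuming a hypothetical 3-class classifier, the returned value would carry raw logits (not probabilities) keyed by the indices of the unlabeled samples:

```python
# Illustrative only: raw (pre-softmax) logits for two unlabeled samples
# of a hypothetical 3-class classifier.
outputs = {
    4: [2.3, -0.7, 0.1],   # index 4 -> logits for each of the 3 classes
    9: [-1.2, 0.4, 3.8],   # index 9 -> logits for each of the 3 classes
}
result = {"outputs": outputs}
```
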

Datamap process

The logic that creates a reference table for all of your data should be implemented in this process. The function representing this process should look like:

def getdatasetstate(args):
    """
    Create a loader object over 100% of your data, i.e. all of the labeled
    and unlabeled data that you plan to use for this experiment
    """
    # Note: this can be done in any framework you choose; this is just an example
    loader = DataLoader(datasetobject)
    trainreference = {}
    for ix, pathtorecord, _ in loader:
        trainreference[ix] = pathtorecord

    return trainreference

The getdatasetstate function needs to return a dictionary that maps each index to the path of its record.
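
A sketch of the expected return value, assuming the records are image files on disk (the paths below are made up):

```python
# Illustrative only: index -> path for the full dataset,
# covering both labeled and unlabeled records.
trainreference = {
    0: "data/images/img_000.jpg",
    1: "data/images/img_001.jpg",
    2: "data/images/img_002.jpg",
}
```
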

main.py

Use main.py to feed all 4 processes above into AlectioSDK. Refer to one of our examples and it will be easier for you to mimic the same structure. All you need now is an experiment name that you come up with and a unique experiment token, which you can get from Alectio's front end.

# Sample format
# Import necessary modules
import argparse

import yaml

from alectio_sdk.sdk import Pipeline
from processes import train, test, infer, getdatasetstate

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config", "-c",
        type=str,
        required=True,
        default="./config.yaml",
        help="Config to use to trigger AlectioSDK",
    )
    params, _ = parser.parse_known_args()
    with open(params.config, "r") as f:
        args = yaml.safe_load(f)

    AlectioPipeline = Pipeline(
        name=args["exp_name"],                # your experiment name, can be anything of type str
        train_fn=train,
        test_fn=test,
        infer_fn=infer,
        getstate_fn=getdatasetstate,
        args=args,
        token="<alectio-experiment-token>",   # your unique token of type str
    )
    AlectioPipeline()

config.yaml (optional)

Put in all the settings the model needs to train. These will be read and used in processes.py when the model trains. For example, if config.yaml looks like this:

exp_name:     "ManualAL"
# Model configs
backbone:     "Resnet101"
description:  "Pedestrian detection"
...

you can access them inside any of the above 4 processes as, say, args["backbone"], args["description"], etc.
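
A quick sketch of how those values come through, using PyYAML's safe_load (the same yaml call the main.py sample uses); the inline string below stands in for reading config.yaml from disk:

```python
import yaml

# Inline stand-in for the contents of config.yaml
raw = """
backbone: "Resnet101"
description: "Pedestrian detection"
"""

# This is the dict that gets passed to your processes as `args`
args = yaml.safe_load(raw)
print(args["backbone"])     # Resnet101
print(args["description"])  # Pedestrian detection
```
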

Experiment reproducibility

AlectioSDK is just an orchestrator for your on-prem training process. Please be advised that randomness has a significant impact on your results. We have included utilities that you can use to reduce randomness significantly, but in some cases, such as PyTorch, randomness in certain network parameters still remains. We advise you to follow the guidelines below to reduce it as much as possible if you want reproducible results.

PyTorch

To ensure reproducibility in PyTorch, use the following:

  • Set num_workers = 0 in your dataloader.
  • Inside your processes.py file, ensure the following exists. (Note: this may slow down your training; we trust you to assess the trade-off between speed and reproducibility. Using distributed training can also introduce randomness into your training, affecting the end result of your experiments for better or worse.)
from alectio_sdk.torch_utils.utils import setpytorchreproduceability

def train(args, labeled, resume_from, ckpt_file):

    # Ensuring reproducibility
    setpytorchreproduceability(seed=42)

    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

TensorFlow

To ensure reproducibility in TensorFlow, use the following:

  • Inside your processes.py file, ensure the following exists. (Note: this may slow down your training; we trust you to assess the trade-off between speed and reproducibility. Using distributed training can also introduce randomness into your training, affecting the end result of your experiments for better or worse.)
from alectio_sdk.tensorflow_utils.utils import settfreproduceability

def train(args, labeled, resume_from, ckpt_file):

    # Ensuring reproducibility
    settfreproduceability(seed=42)

    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

Open issues

See the open issues for a list of known issues.

Contributing

Contributions are greatly appreciated.

Contact

Twitter - @alectiolessdata
Email - info@alectio.com

Acknowledgements

  • Almost all of our examples are well-known open-source models. We don't claim ownership of any of them.
