Integrate customer side ML application with the Alectio Platform


Getting started with AlectioSDK

Info Hub Who are we »
Setup docs »
Alectio Examples »
How to create a Project »
How to create an Experiment »
How to do Active learning »
Request a Feature »

Table of Contents
  1. About The Project
  2. Getting Started
  3. Custom Usage
  4. Open issues
  5. Contributing
  6. Contact
  7. Acknowledgements

About The Project


AlectioSDK is an open-source Python package that lets end users perform data curation for training machine learning models. The package is model-, dataset-, and framework-agnostic. It is simple, efficient, and runs on-prem, so your ML training can be carried out smoothly, securely, and, most importantly, efficiently.


Getting Started

Prerequisites

  • Python3 (Required)
  • PIP3 (Required)
  • Ubuntu 16.04+ / macOS / Windows 10
  • GCC / C++ (depends on the OS you are using: Ubuntu and macOS ship with them by default, but some flavours of Linux distributions such as Amazon Linux or Red Hat Linux might not have GCC or the C++ related libraries installed)

Installation

Please follow the instructions here to get started with the installation. If you want a quick trial run with one of our ready-to-go examples, follow the instructions here.

Custom Usage

Once you are done installing the above, you are good to go. If you want to run one of our examples, follow the remaining installation instructions detailed in the examples directory. We cover examples for topic classification, image classification, and object detection.

Required implementations

To use AlectioSDK with a custom model/dataset, you need to create the following Python files inside your model implementation folder (config.yaml is optional). Let's say you have your model implemented in a folder called Detector:

  .
  ├── Detector
  │   ├── processes.py
  │   ├── main.py
  │   ├── config.yaml
  │   └── <other model dependency files/folders>

processes.py

AlectioSDK needs you to implement 4 main processes so that the SDK can be used to perform model and data optimizations using your own machine learning model and your data:

  • Training process - A process to train the model
  • Testing process - A process to test the model
  • Infer process - A process to apply the model to infer on unlabeled data
  • Datamap process - A process to assign each data point in the dataset to a unique index (Refer to one of the examples to know how)
Training process

The logic for training the model should be implemented by you in this process. The function should look like the following and should contain these items:

def train(args, labeled, resume_from, ckpt_file):
    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

The name of the function can be anything you like. It takes the following arguments:

  • args: the model arguments, read from a YAML config file in your model folder
  • labeled: a list of indices of the samples selected to train the model in this loop
  • resume_from: a string specifying which checkpoint to resume from; we suggest clearing the weights rather than resuming from the previous loop's checkpoint
  • ckpt_file: a string specifying the name of the checkpoint to save for the current loop; use it to save whatever your logic considers the best checkpoint

Depending on your situation, the samples indicated in labeled might not be labeled (despite the variable name). We call it labeled because in the active learning setting, this list represents the pool of samples iteratively labeled by the human oracle. Make sure you label these data points when the SDK pauses after each loop.
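
As a hypothetical sketch of how the `labeled` index list selects this loop's training subset (the plain list here stands in for your real dataset object):

```python
# Illustrative only: `dataset` is a plain list standing in for your real
# dataset object; `labeled` is the index list handed to train().
dataset = ["img_000.jpg", "img_001.jpg", "img_002.jpg", "img_003.jpg"]
labeled = [0, 2]  # indices selected by the SDK for this loop

# Build the training subset for this loop from the selected indices
train_subset = [dataset[i] for i in labeled]
print(train_subset)  # ['img_000.jpg', 'img_002.jpg']
```
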

Testing process

The logic for testing the model should be implemented in this process. The function representing this process should look like:

def test(args, ckpt_file):
    # the checkpoint to test
    ckpt_file = ckpt_file

    # implement your testing logic here


    # put the predictions and labels into
    # two dictionaries

    # lbs <- dictionary of indices of test data and their ground-truth

    # prd <- dictionary of indices of test data and their prediction

    return {'predictions': prd, 'labels': lbs}

The test function takes the following 2 arguments:

  • args: the model arguments, read from a YAML config file in your model folder
  • ckpt_file: a string specifying the checkpoint to use for testing

The test function needs to return a dictionary with two keys:

  • predictions: a dictionary mapping each test sample's index to its prediction
  • labels: a dictionary mapping each test sample's index to its ground-truth label

The format of the values depends on the type of ML problem. Please refer to the examples directory for details regarding what is needed for each use case.
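
As an illustration of the classification case only (indices and class ids below are made up), the returned dictionaries might look like this:

```python
# Illustrative only: a 3-sample classification test set.
# Keys are test-sample indices; values are class ids.
prd = {0: 1, 1: 0, 2: 2}  # index -> predicted class
lbs = {0: 1, 1: 1, 2: 2}  # index -> ground-truth class

result = {"predictions": prd, "labels": lbs}
print(result["predictions"][0])  # 1
```
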

Infer process

The logic for applying the model to infer on the unlabeled data should be implemented in this process. The function representing this process should look like:

def infer(args, unlabeled, ckpt_file):
    """
    Get the list of indices of unlabeled data. Use these indices in your 
    dataloader to infer on unlabeled pool of data
    """
    unlabeledindices = unlabeled

    # get the checkpoint file to be used for applying inference
    ckpt_file = ckpt_file

    # implement your inference logic here

    """
    outputs <- save the output from the model on the unlabeled data 
    as a dictionary
    """
    return {'outputs': outputs}

The infer function takes the following 2 arguments:

  • unlabeled: a list of indices of unlabeled data in the training set
  • ckpt_file: a string specifying which checkpoint to use to infer on the unlabeled data

The infer function needs to return a dictionary with one key:

  • outputs: a dictionary mapping each sample's index to the model's output before an activation function is applied

For example, if it is a classification problem, return the output before applying softmax. For more details about the format of the output, please refer to the examples directory.
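
As a concrete sketch, assuming a hypothetical 3-class classifier, the returned value would carry raw logits (not probabilities) keyed by the indices of the unlabeled samples:

```python
# Illustrative only: raw (pre-softmax) logits for two unlabeled samples
# of a hypothetical 3-class classifier.
outputs = {
    4: [2.3, -0.7, 0.1],   # index 4 -> logits for each of the 3 classes
    9: [-1.2, 0.4, 3.8],   # index 9 -> logits for each of the 3 classes
}
result = {"outputs": outputs}
```
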

Datamap process

The logic that creates a reference table for all of your data should be implemented in this process. The function representing this process should look like:

def getdatasetstate(args):
    """
    Create a loader object over 100% of your data, i.e. all of the labeled
    and unlabeled data that you plan to use for this experiment
    """
    # Note: this can be done in any framework you choose; this is just an example
    loader = DataLoader(datasetobject)
    trainreference = {}
    for ix, pathtorecord, _ in loader:
        trainreference[ix] = pathtorecord

    return trainreference

The getdatasetstate function needs to return a dictionary that maps each index to the path of its record.
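
A sketch of the expected return value, assuming the records are image files on disk (the paths below are made up):

```python
# Illustrative only: index -> path for the full dataset,
# covering both labeled and unlabeled records.
trainreference = {
    0: "data/images/img_000.jpg",
    1: "data/images/img_001.jpg",
    2: "data/images/img_002.jpg",
}
```
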

main.py

Use main.py to feed all 4 processes above into AlectioSDK. Refer to one of our examples and it will be easier for you to mimic the same structure. All you need now is an experiment name that you come up with and a unique experiment token, which you can get from Alectio's front end.

# Sample format
# Import necessary modules
import argparse

import yaml

from alectio_sdk.sdk import Pipeline
from processes import train, test, infer, getdatasetstate

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config", "-c",
        type=str,
        required=True,
        default="./config.yaml",
        help="Config to use to trigger AlectioSDK",
    )
    params, _ = parser.parse_known_args()
    with open(params.config, "r") as f:
        args = yaml.safe_load(f)

    AlectioPipeline = Pipeline(
        name=args["exp_name"],                # your experiment name, can be anything of type str
        train_fn=train,
        test_fn=test,
        infer_fn=infer,
        getstate_fn=getdatasetstate,
        args=args,
        token="<alectio-experiment-token>",   # your unique token of type str
    )
    AlectioPipeline()

config.yaml (optional)

Put in all the settings the model needs to train. These will be read and used in processes.py when the model trains. For example, if config.yaml looks like this:

exp_name:     "ManualAL"
# Model configs
backbone:     "Resnet101"
description:  "Pedestrian detection"
...

you can access them inside any of the above 4 processes as, say, args["backbone"], args["description"], etc.
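
A quick sketch of how those values come through, using PyYAML's safe_load (the same yaml call the main.py sample uses); the inline string below stands in for reading config.yaml from disk:

```python
import yaml

# Inline stand-in for the contents of config.yaml
raw = """
backbone: "Resnet101"
description: "Pedestrian detection"
"""

# This is the dict that gets passed to your processes as `args`
args = yaml.safe_load(raw)
print(args["backbone"])     # Resnet101
print(args["description"])  # Pedestrian detection
```
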

Experiment reproducibility

AlectioSDK is just an orchestrator for your on-prem training process. Please be advised that randomness has a significant impact on your results. We have included utilities that you can use to reduce randomness significantly, but in some cases, such as PyTorch, randomness in certain network parameters still remains. We advise you to follow the guidelines below to reduce it as much as possible if you want reproducible results.

PyTorch

To ensure reproducibility in PyTorch, use the following:

  • Set num_workers = 0 in your dataloader.
  • Inside your processes.py file, ensure the following exists. (Note: this may slow down your training; we trust you to assess the trade-off between speed and reproducibility. Using distributed training can also introduce randomness into your training, affecting the end result of your experiments for better or worse.)
from alectio_sdk.torch_utils.utils import setpytorchreproduceability

def train(args, labeled, resume_from, ckpt_file):

    # Ensuring reproducibility
    setpytorchreproduceability(seed=42)

    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

TensorFlow

To ensure reproducibility in TensorFlow, use the following:

  • Inside your processes.py file, ensure the following exists. (Note: this may slow down your training; we trust you to assess the trade-off between speed and reproducibility. Using distributed training can also introduce randomness into your training, affecting the end result of your experiments for better or worse.)
from alectio_sdk.tensorflow_utils.utils import settfreproduceability

def train(args, labeled, resume_from, ckpt_file):

    # Ensuring reproducibility
    settfreproduceability(seed=42)

    """
    Get the list of indices of the data to be trained on in your dataloader.
    Note the SDK will pause for you to label these before you can train
    """
    labeledindices = labeled

    """
    Get the checkpoint to resume from. Since the training is incremental
    you may or may not choose to resume from the previous checkpoint.
    We suggest not resuming from the previous checkpoint, to remove
    previous model biases during training
    """
    resumeckpt = resume_from

    """
    Use the name given by the SDK here to save your best model; this
    checkpoint is what will be used by the other processes in this file
    """
    ckpt_file = ckpt_file

    # implement your logic to train the model
    # with the selected data indexed by `labeled`
    return

Open issues

See the open issues for a list of known issues.

Contributing

Contributions are greatly appreciated.

Contact

Twitter - @alectiolessdata
Email - info@alectio.com

Acknowledgements

  • Almost all of our examples are well-known open-source models. We don't claim ownership of any of them.
