Integrate customer side ML application with the Alectio Platform
Project description
Getting started with AlectioSDK
Setup docs »
Alectio Examples »
How to create a Project »
How to create an Experiment »
How to do Active learning »
Request a Feature »
Table of Contents
About The Project
AlectioSDK is an open source python package that provides end users the facility to perform Data curation for training Machine Learning models. The package is model/dataset/framework agnostic. It is simple , efficient and works on-prem so that your ML training can be carried out smoothly, securely and most importantly efficiently.
Built With
The SDK is mainly built using the following
- Python
- [GRPC] (https://github.com/grpc/grpc)
- [PyTorch] (https://pytorch.org/docs/stable/index.html)
Getting Started
Prerequisites
- Python3 (Required)
- PIP3 (Required)
- Ubuntu 16.04+ / MacOS / Windows 10
- GCC / C++ (Will depend on OS you are using. Ubuntu, MacOS it comes default. Some falvours of linux distribution like Amazon Linux/RED Hat linux might not have GCC or C++ realted libraries installed)
Installation
Please follow instructions here to get started with the installation. If you want to give a quick trial run with our ready to go example, follow the instructions here
Custom Usage
Once you are done installing the above you are good to go. If you want to run one of our examples you should follow the remaining installation instructions detailed in the examples directory. We cover examples for topic classification, for image classification and for object detection.
Required implementations
To use AlectioSDK on a custom model/dataset you are required to create the following two python files inside your model implementation folder. Lets say for you have your model implemented in folder Detector
.
├── Detector
│ ├── processes.py
│ ├── main.py
│ └── config.yaml
│ └── < other model dependancy files/Folders >
processes.py
AlectioSDK needs you to implement 4 main processes so that the SDK can be used to perform model and data optimizations using your own machine learning model and your data:
- Training process - A process to train the model
- Testing process - A process to test the model
- Infer process - A process to apply the model to infer on unlabeled data
- Datamap process - A process to assign each data point in the dataset to a unique index (Refer to one of the examples to know how)
Training process
The training logic to train your ML model needs to implemented here. The structure of the function should look like below The logic for training the model should be implemented by you in this process. The function should look like below and should contain the following items:
def train(args, labeled, resume_from, ckpt_file):
"""
Get list of indices of the data to be trained in your dataloader.
Note the SDK will pause for you to label these before you can train
"""
labeledindices = labeled
"""
Get checkpoint to resume from. Since the training is incremental
you may or maynot choose to resume from the previous checkpoint.
We suggest you not to resume from previous checkpoint to remove
previous model biases during training
"""
resumeckpt = resume_from
"""
Use the name given by the SDK here to save your best model , this
checkpoint is what will be used by other processes in this file
"""
ckpt_file = ckpt_file
# implement your logic to train the model
# with the selected data indexed by `labeled`
return
The name of the function can be anything you like. It takes the following arguments
Arguments | Description |
---|---|
args | a yaml file set up at your model folder to take model arguments |
resume_from | a string that specifies which checkpoint to resume from, we suggest you to clear weights and not resume from previous loop checkpoint |
ckpt_file | a string that specifies the name of checkpoint to be saved for the current loop, use this string to save your best checkpoint using your logic on what you consider as the best checkpoint |
labeled | a list of indices of selected samples used to train the model in this loop |
Depending on your situation, the samples indicated in labeled
might not be labeled (despite the variable
name). We call it labeled
because in the active learning setting, this list represents the pool of
samples iteratively labeled by the human oracle. Make sure you label these data points when the SDK pauses after each loop.
Testing process
The logic for testing the model should be implemented in this process. The function representing this process should look like:
def test(args, ckpt_file):
# the checkpoint to test
ckpt_file = ckpt_file
# implement your testing logic here
# put the predictions and labels into
# two dictionaries
# lbs <- dictionary of indices of test data and their ground-truth
# prd <- dictionary of indices of test data and their prediction
return {'predictions': prd, 'labels': lbs}
The test function takes 2 arguments as below
Arguments | Description |
---|---|
args | a yaml file set up at your model folder to take model arguments |
ckpt_file | a string that specifies checkpoint that should be used for testing |
The test function needs to return a dictionary with two keys
key | value |
---|---|
predictions | a dictionary of an index and a prediction for each test sample |
labels | a dictionary of an index and a ground truth label for each test sample |
The format of the values depends on the type of ML problem. Please refer to the examples directory for details regarding what is needed for each usecase.
Infer process
The logic for applying the model to infer on the unlabeled data should be implemented in this process. The function representing this process should look like:
def infer(args, unlabeled,ckpt_file):
"""
Get the list of indices of unlabeled data. Use these indices in your
dataloader to infer on unlabeled pool of data
"""
unlabeledindices = unlabeled
# get the checkpoint file to be used for applying inference
ckpt_file = ckpt_file
# implement your inference logic here
"""
outputs <- save the output from the model on the unlabeled data
as a dictionary
"""
return {'outputs': outputs}
The infer function takes an argument the following 2 arguments
key | value |
---|---|
ckpt_file | a string that specifies which checkpoint to use to infer on the unlabeled data |
unlabeled | a list of of indices of unlabeled data in the training set |
The infer
function needs to return a dictionary with one key
key | value |
---|---|
outputs | a dictionary of indexes mapped to the models output before an activation function is applied |
For example, if it is a classification problem, return the output before applying softmax. For more details about the format of the output, please refer to the examples directory.
Datamap process
The logic that creates a reference table to refer to all of your data The function representing this process should look like:
def getdatasetstate(args):
"""
Create a loader object with 100 % of your data i.e all of labelled, unlabelled
data that you are planning to use for this experiment
"""
loader = DataLoader(datasetobject) #Note this can be done in any framework that you choose to use , this is just an example
trainreference = {}
for ix, pathtorecord, _ in loader:
trainreference[ix] = pathtorecord
return trainreference
The getdatasetstate
function needs to return a dictionary which contains index mapped to path of the record
main.py
Use the main.py to feed in all 4 above processes into AlectioSDK Refer to one of our examples and it will be easier for you to mimic the same. All you need now is an experiment name that you can come up with and unique experiment token that you can get from Alectio's FE
#Sample format
#Import necessary modules
from alectio_sdk.sdk import Pipeline
from processes import train, test, infer, getdatasetstate
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--config','-c',
type= str,
required =True,
default = './config.yaml',
help = 'Config to use to trigger AlectioSDK')
params, _ = parser.parse_known_args()
args = yaml.safe_load(open(params.config,'r'))
AlectioPipeline = Pipeline(
name=args["exp_name"], # your experiment name , can be anything of type str
train_fn=train,
test_fn=test,
infer_fn=infer,
getstate_fn=getdatasetstate,
args=args,
token="<alectio-experiment-token>", # your unique token of type str
)
AlectioPipeline()
config.yaml (optional)
Put in all the requirements that are required for the model to train. This will be read and used in processes.py when the model trains. For example if config.yaml looks like this
exptname: "ManualAL"
# Model configs
backbone: "Resnet101"
description: "Pedestrian detection"
...
..
you can access them inside your any of the above 4 processes as lets say args["backbone"] , args["description"] etc
Experiment reproduceability
AlectioSDK is just an orchestrator to orchestrate your training process on-prem. Please be adviced that randomness has a significant impact on your results. We have included utilities that you can use to reduce randomness significantly. But in some cases like Pytorch randomness in certain network parameters still exists. We advice you to use the following guidelines to reduce this as much as possible if you want to see reproduceable results
Pytorch
To ensure reproduceability in Pytorch use the following
- set num_workers = 0 in your dataloader
- Inside your processes.py file ensure the following things exist (Note: This may make your training slow, we trust you to assess the tradeoff between speed and reproduceability. Using distributed training can also induce randomness in your training thereby affecting the end result of your experiments for the better or the worst)
from alectio_sdk.torch_utils.utils import setpytorchreproduceability
def train(args, labeled, resume_from, ckpt_file):
# Ensuring reproduceability
setpytorchreproduceability(seed= 42)
"""
Get list of indices of the data to be trained in your dataloader.
Note the SDK will pause for you to label these before you can train
"""
labeledindices = labeled
"""
Get checkpoint to resume from. Since the training is incremental
you may or maynot choose to resume from the previous checkpoint.
We suggest you not to resume from previous checkpoint to remove
previous model biases during training
"""
resumeckpt = resume_from
"""
Use the name given by the SDK here to save your best model , this
checkpoint is what will be used by other processes in this file
"""
ckpt_file = ckpt_file
# implement your logic to train the model
# with the selected data indexed by `labeled`
return
Tensorflow
To ensure reproduceability in Tensorflow use the following
- Inside your processes.py file ensure the following things exist (Note: This may make your training slow, we trust you to assess the tradeoff between speed and reproduceability. Using distributed training can also induce randomness in your training thereby affecting the end result of your experiments for the better or the worst)
from alectio_sdk.tensorflow_utils.utils import settfreproduceability
def train(args, labeled, resume_from, ckpt_file):
# Ensuring reproduceability
settfreproduceability(seed= 42)
"""
Get list of indices of the data to be trained in your dataloader.
Note the SDK will pause for you to label these before you can train
"""
labeledindices = labeled
"""
Get checkpoint to resume from. Since the training is incremental
you may or maynot choose to resume from the previous checkpoint.
We suggest you not to resume from previous checkpoint to remove
previous model biases during training
"""
resumeckpt = resume_from
"""
Use the name given by the SDK here to save your best model , this
checkpoint is what will be used by other processes in this file
"""
ckpt_file = ckpt_file
# implement your logic to train the model
# with the selected data indexed by `labeled`
return
Open issues
See the open issues for a list of known issues.
Contributing
Contributions are greatly appreciated.
Contact
Twitter - @alectiolessdata
Email - info@alectio.com
Acknowledgements
- Almost all of our Examples are open sourced well known models. We don't claim ownership for any of those.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Hashes for alectio_sdk-0.6.12-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 492b0b167161d76aaa60b7c9637d250101fb27520ae9f53c5e0bf178a20992a7 |
|
MD5 | 26acf4c568d1138fb1616b130c4ba6af |
|
BLAKE2b-256 | e2a221aef26650883d90f2afbbb7f18ec2e6cb3338f75feadc73cc8e6f550f0e |