OctopusCL
Introduction
OctopusCL is a framework for building and experimenting with multimodal models in continual learning scenarios. It is composed of the following components:
- Dataset Manager: A tool for managing datasets in a centralized repository called Dataset Repository.
- Experiment Manager: A tool for running and tracking experiments in a centralized registry called Experiment Registry.
- Model Manager (not available yet)
Installation
You can install OctopusCL with pip:
pip install octopuscl-lib
Usage
This section provides all the necessary instructions and examples to effectively use OctopusCL, covering dataset building and management, and experiment building, execution, and tracking.
Environments
OctopusCL can run in three different environments:
- Development: The local environment for developing, debugging and testing new code.
- Staging: The staging environment is a mirror of the production environment. It is used for testing new code before deploying it to production, providing a final check to ensure that everything behaves as expected in a production-like environment.
- Production: The production environment is the live environment where the final code is deployed.
Dataset Manager
The Dataset Manager allows for uploading and downloading datasets to and from the Dataset Repository. It is accessible through the octopuscl/scripts/run_dataset_manager.py script (see the Managing datasets section below for details on the arguments).
Concepts
- Dataset. A collection of data that is used to train and evaluate AI models. It is composed of the following parts:
- Schema. The dataset schema is a vital component that guarantees the structure and format of data are consistent. Beyond providing basic information such as the dataset name or description, the schema specifies the inputs, outputs, and metadata fields that AI models will use to train and make predictions.
- Examples. The actual data to be used to train and evaluate AI models. They must adhere to the dataset schema.
- Files: Optional files that may be referenced by the examples. They can be images, audio files, documents, or any other types of files that AI models need to process.
- Splits: Optional pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).
- Dataset Repository. A centralized storage solution, hosted on Amazon S3, where all datasets are stored. It serves as the backbone for dataset management, providing a scalable and secure location for storing and retrieving datasets.
Building datasets
A dataset must be stored in a dedicated directory that contains the following files and directories:
- schema.json: The JSON file with the dataset schema.
- examples.csv or examples.db: The CSV file or SQLite database containing the examples of the dataset.
- files (optional): The directory that contains all the files referenced by the examples.
- splits (optional): A directory containing pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).
See docs/datasets.md for detailed instructions on how to build datasets.
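For orientation, a schema.json for a simple image-classification dataset might look roughly like the sketch below. Every field name here is an illustrative assumption, not the authoritative format, which is specified in docs/datasets.md.

```json
{
  "name": "example-dataset",
  "description": "A hypothetical image-classification dataset.",
  "inputs": [
    {"name": "image", "type": "image_file", "required": true}
  ],
  "outputs": [
    {"name": "label", "type": "category", "values": ["cat", "dog"]}
  ],
  "metadata": [
    {"name": "source", "type": "text"}
  ]
}
```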
Managing datasets
To upload or download a dataset, run octopuscl/scripts/run_dataset_manager.py with the following arguments:
- -e, --environment (required): The environment where the dataset will be uploaded or downloaded. It can be development, staging, or production.
- -a, --action (required): Either upload or download.
- -l, --local_path (required): Path to the local file or directory.
- -d, --dataset (optional): Name of the dataset. Required when downloading.
- -r, --remote_path (optional): Path to the remote file or directory. Required when downloading. Ensure it ends with a / for directories.
Examples:
- Uploading a dataset:

  cd octopuscl/scripts
  ./run_dataset_manager.py -e production -a upload -l /path/to/local/dataset/

- Downloading a dataset:

  cd octopuscl/scripts
  ./run_dataset_manager.py -e production -a download -l /local/path/ -d dataset_name -r /remote/path/
Experiment Manager
The Experiment Manager allows for running and tracking experiments. It is accessible through the octopuscl/scripts/run_experiments.py script (see the Running experiments section below for details on the arguments).
Concepts
The Experiment Manager is built upon the following concepts:
- Experiment: The process of evaluating a set of AI models, pipelines, or workflows under specific conditions. An experiment should define a clear objective that is common to all the executions belonging to it.
  - Datasets: The set of datasets used in the experiment.
  - Splitter: The method used to split the datasets into training, validation, and test sets.
  - Metrics: The metrics used to evaluate the models.
  - Artifacts: The artifacts generated after training or evaluating a model (e.g., training curves, ROC curves).
- Trials: The set of ML pipelines and workflows that will be run on each dataset. A trial is defined by the following parts:
  - Pipeline: The sequence of steps that will be executed to train and evaluate the model.
  - Model: The AI model.
  - Transformations: The sequence of transformations that will be applied to the data before training or evaluating the model.
  - Data loaders: The method used to load data from the datasets. Data loaders handle batching, shuffling, loading parallelization, etc.
- Runs: The actual execution of trials. The number of runs in a trial depends on the splitting strategy chosen for the experiment (e.g., in a 5-fold cross-validation, there will be 5 runs for each trial).
- Experiment plan: The set of experiments to be conducted.
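As a small worked illustration of the relationship between trials, folds, and runs described above (illustrative arithmetic only, not an OctopusCL API):

```python
# With 5-fold cross-validation as the splitting strategy, each trial
# is executed once per fold, so a trial yields 5 runs.
n_trials = 3   # e.g., three ML pipelines to compare
n_folds = 5    # 5-fold cross-validation
runs_per_trial = n_folds
total_runs = n_trials * runs_per_trial
print(f"{total_runs} runs in total")  # 15 runs in total
```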
Building experiments
An experiment is defined in a YAML file that must follow the structure described in docs/experiments.md. An experiment plan is given by a dedicated directory that contains the YAML files defining the experiments.
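For orientation only, an experiment YAML might look roughly like the sketch below. Every key and value here is a hypothetical placeholder; the actual structure is defined in docs/experiments.md.

```yaml
# Hypothetical keys throughout -- see docs/experiments.md for the real schema
name: baseline-comparison
datasets:
  - example-dataset
splitter:
  type: k_fold
  folds: 5
metrics:
  - accuracy
trials:
  - pipeline:
      model: baseline_model
      transformations: [resize, normalize]
      data_loaders: {batch_size: 32, shuffle: true}
```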
In addition to the definition and configuration of the experiments, many functionalities and components can be customized, including AI models, transformations, metrics, and artifacts, among others. See docs/customization.md for detailed instructions on how to implement custom classes.
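Custom components typically follow the pattern of subclassing a framework base class and overriding a small interface. The standalone sketch below shows what a custom accuracy metric might look like; the Metric base class and the compute method are illustrative assumptions, not OctopusCL's actual API, which is documented in docs/customization.md.

```python
from typing import Sequence


class Metric:
    """Minimal stand-in for a framework metric base class (hypothetical)."""

    name: str = "metric"

    def compute(self, predictions: Sequence, targets: Sequence) -> float:
        raise NotImplementedError


class Accuracy(Metric):
    """Fraction of predictions that exactly match their targets."""

    name = "accuracy"

    def compute(self, predictions: Sequence, targets: Sequence) -> float:
        if len(predictions) != len(targets):
            raise ValueError("predictions and targets must have the same length")
        correct = sum(p == t for p, t in zip(predictions, targets))
        return correct / len(targets)


metric = Accuracy()
print(metric.compute([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```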
Running experiments
To run experiments, simply run octopuscl/scripts/run_experiments.py with the following arguments:
- -e, --environment (required): The environment where the experiments will be run. It can be development, staging, or production.
- -d, --directory (required): Path to the directory that contains the YAML files defining the experiments.
In staging and production environments, trials can be run either locally or on AWS EC2.
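For example, mirroring the Dataset Manager examples above (the plan directory path is a placeholder):

  cd octopuscl/scripts
  ./run_experiments.py -e development -d /path/to/experiment/plan/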
Tracking experiments
The Experiment Manager delegates experiment tracking to MLflow, which provides a web-based UI called MLflow Tracking UI.
Requirements
General
Regardless of the environment in which you are running the trials, the following requirements must be met:
- Python 3.10+ and dependencies (see requirements.txt)
Additionally, you will need to set specific environment variables (see octopuscl.env) depending on the environment in which you are running the trials.
Development Environment
If you need to download or upload datasets from or to the Dataset Repository, you must have:
Staging & Production Environments
In the staging and production environments, the following requirements must be met:
- AWS setup (AWS CLI, keys, users, roles, policies, S3 bucket)
- Prebuilt Docker image with OctopusCL installed, accessible via AWS ECR
- Publicly accessible MLflow tracking server
If you are running trials locally, you must also have:
Contributions
Maintainers
OctopusCL is maintained by the following individuals (in alphabetical order):
- Enrique Hernández Calabrés (@ehcalabres)
- Marco D'Alessandro (@IoSonoMarco)
- Mikel Elkano Ilintxeta (@melkilin)
File details

Details for the file octopuscl_lib-0.1.0b0.tar.gz.

File metadata

- Download URL: octopuscl_lib-0.1.0b0.tar.gz
- Size: 90.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9610b6ef0be375b7da164d4fd234fce5da65cf5dbaeb1d490d446a9ded5afa8c |
| MD5 | 198659e0548bb67aab6683a0e7188c51 |
| BLAKE2b-256 | 1d1832f3e97ebd9e9d97664517bd5d5097faf5bc0dfbee84db4e49554d6c3201 |
Provenance

The following attestation bundles were made for octopuscl_lib-0.1.0b0.tar.gz:

Publisher: publish-to-pypi.yml on neuraptic/octopuscl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octopuscl_lib-0.1.0b0.tar.gz
- Subject digest: 9610b6ef0be375b7da164d4fd234fce5da65cf5dbaeb1d490d446a9ded5afa8c
- Sigstore transparency entry: 153581562
- Permalink: neuraptic/octopuscl@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Branch / Tag: refs/heads/fix/pypi-publication-workflow
- Owner: https://github.com/neuraptic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Trigger Event: push
File details

Details for the file octopuscl_lib-0.1.0b0-py3-none-any.whl.

File metadata

- Download URL: octopuscl_lib-0.1.0b0-py3-none-any.whl
- Size: 99.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4358085797ea08c523d4f40dc6def4f3d6b7f7205a051a2d4da9923c77f5bc28 |
| MD5 | 1f257914d867439a6dfef95e1e5ebca6 |
| BLAKE2b-256 | 46b7042316ac8ae4d4e3f8ca6ff6d8f7d44a793b7616fb16de66c7987314223d |
Provenance

The following attestation bundles were made for octopuscl_lib-0.1.0b0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on neuraptic/octopuscl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octopuscl_lib-0.1.0b0-py3-none-any.whl
- Subject digest: 4358085797ea08c523d4f40dc6def4f3d6b7f7205a051a2d4da9923c77f5bc28
- Sigstore transparency entry: 153581563
- Permalink: neuraptic/octopuscl@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Branch / Tag: refs/heads/fix/pypi-publication-workflow
- Owner: https://github.com/neuraptic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Trigger Event: push