OctopusCL
Introduction
OctopusCL is a framework for building and experimenting with multimodal models in continual learning scenarios. It is composed of the following components:
- Dataset Manager: A tool for managing datasets in a centralized repository called Dataset Repository.
- Experiment Manager: A tool for running and tracking experiments in a centralized registry called Experiment Registry.
- Model Manager (not available yet)
Installation
You can install OctopusCL with pip:
pip install octopuscl-lib
Usage
This section provides all the necessary instructions and examples to effectively use OctopusCL, covering dataset building and management, and experiment building, execution, and tracking.
Environments
OctopusCL can run in three different environments:
- Development: The local environment for developing, debugging and testing new code.
- Staging: The staging environment is a mirror of the production environment. It is used for testing new code before deploying it to production, providing a final check to ensure that everything behaves as expected in a production-like environment.
- Production: The production environment is the live environment where the final code is deployed.
Dataset Manager
The Dataset Manager allows for uploading and downloading datasets to and from the Dataset Repository. It is accessible through the octopuscl/scripts/run_dataset_manager.py script (see the Managing datasets section below for details on the arguments).
Concepts
- Dataset. A collection of data that is used to train and evaluate AI models. It is composed of the following parts:
- Schema. The dataset schema is a vital component that guarantees the structure and format of data are consistent. Beyond providing basic information such as the dataset name or description, the schema specifies the inputs, outputs, and metadata fields that AI models will use to train and make predictions.
- Examples. The actual data to be used to train and evaluate AI models. They must adhere to the dataset schema.
- Files: Optional files that may be referenced by the examples. They can be images, audio files, documents, or any other types of files that AI models need to process.
- Splits: Optional pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).
- Dataset Repository. A centralized storage solution, hosted on Amazon S3, where all datasets are stored. It serves as the backbone for dataset management, providing a scalable and secure location for storing and retrieving datasets.
Building datasets
A dataset must be stored in a dedicated directory that contains the following files and directories:
- schema.json: The JSON file with the dataset schema.
- examples.csv or examples.db: The CSV file or SQLite database containing the examples of the dataset.
- files (optional): The directory that contains all the files referenced by the examples.
- splits (optional): A directory containing pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).
See docs/datasets.md for detailed instructions on how to build datasets.
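For orientation, a schema.json for a simple image-classification dataset might look roughly like the sketch below. Every field name here is an illustrative assumption, not the authoritative format, which is specified in docs/datasets.md.

```json
{
  "name": "example-dataset",
  "description": "A hypothetical image-classification dataset.",
  "inputs": [
    {"name": "image", "type": "image_file", "required": true}
  ],
  "outputs": [
    {"name": "label", "type": "category", "values": ["cat", "dog"]}
  ],
  "metadata": [
    {"name": "source", "type": "text"}
  ]
}
```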
Managing datasets
To upload or download a dataset, run octopuscl/scripts/run_dataset_manager.py with the following arguments:
- -e, --environment (required): The environment where the dataset will be uploaded or downloaded. It can be development, staging, or production.
- -a, --action (required): Either upload or download.
- -l, --local_path (required): Path to the local file or directory.
- -d, --dataset (optional): Name of the dataset. Required when downloading.
- -r, --remote_path (optional): Path to the remote file or directory. Required when downloading. Ensure it ends with a / for directories.
Examples:
- Uploading a dataset:

  cd octopuscl/scripts
  ./run_dataset_manager.py -e production -a upload -l /path/to/local/dataset/

- Downloading a dataset:

  cd octopuscl/scripts
  ./run_dataset_manager.py -e production -a download -l /local/path/ -d dataset_name -r /remote/path/
Experiment Manager
The Experiment Manager allows for running and tracking experiments. It is accessible through the octopuscl/scripts/run_experiments.py script (see the Running experiments section below for details on the arguments).
Concepts
The Experiment Manager is built upon the following concepts:
- Experiment: The process of evaluating a set of AI models, pipelines, or workflows under specific conditions. An experiment should define a clear objective that is common to all the executions belonging to it.
  - Datasets: The set of datasets used in the experiment.
  - Splitter: The method used to split the datasets into training, validation, and test sets.
  - Metrics: The metrics used to evaluate the models.
  - Artifacts: The artifacts generated after training or evaluating a model (e.g., training curves, ROC curves).
- Trials: The set of ML pipelines and workflows that will be run on each dataset. A trial is defined by the following parts:
  - Pipeline: The sequence of steps that will be executed to train and evaluate the model.
  - Model: The AI model.
  - Transformations: The sequence of transformations that will be applied to the data before training or evaluating the model.
  - Data loaders: The method used to load data from the datasets. Data loaders handle batching, shuffling, loading parallelization, etc.
- Runs: The actual execution of trials. The number of runs in a trial depends on the splitting strategy chosen for the experiment (e.g., in a 5-fold cross-validation, there will be 5 runs for each trial).
- Experiment plan: The set of experiments to be conducted.
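As a small worked illustration of the relationship between trials, folds, and runs described above (illustrative arithmetic only, not an OctopusCL API):

```python
# With 5-fold cross-validation as the splitting strategy, each trial
# is executed once per fold, so a trial yields 5 runs.
n_trials = 3   # e.g., three ML pipelines to compare
n_folds = 5    # 5-fold cross-validation
runs_per_trial = n_folds
total_runs = n_trials * runs_per_trial
print(f"{total_runs} runs in total")  # 15 runs in total
```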
Building experiments
An experiment is defined in a YAML file that must follow the structure described in docs/experiments.md. An experiment plan is given by a dedicated directory that contains the YAML files defining the experiments.
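For orientation only, an experiment YAML might look roughly like the sketch below. Every key and value here is a hypothetical placeholder; the actual structure is defined in docs/experiments.md.

```yaml
# Hypothetical keys throughout -- see docs/experiments.md for the real schema
name: baseline-comparison
datasets:
  - example-dataset
splitter:
  type: k_fold
  folds: 5
metrics:
  - accuracy
trials:
  - pipeline:
      model: baseline_model
      transformations: [resize, normalize]
      data_loaders: {batch_size: 32, shuffle: true}
```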
In addition to the definition and configuration of the experiments, many functionalities and components can be customized, including AI models, transformations, metrics, and artifacts, among others. See docs/customization.md for detailed instructions on how to implement custom classes.
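Custom components typically follow the pattern of subclassing a framework base class and overriding a small interface. The standalone sketch below shows what a custom accuracy metric might look like; the Metric base class and the compute method are illustrative assumptions, not OctopusCL's actual API, which is documented in docs/customization.md.

```python
from typing import Sequence


class Metric:
    """Minimal stand-in for a framework metric base class (hypothetical)."""

    name: str = "metric"

    def compute(self, predictions: Sequence, targets: Sequence) -> float:
        raise NotImplementedError


class Accuracy(Metric):
    """Fraction of predictions that exactly match their targets."""

    name = "accuracy"

    def compute(self, predictions: Sequence, targets: Sequence) -> float:
        if len(predictions) != len(targets):
            raise ValueError("predictions and targets must have the same length")
        correct = sum(p == t for p, t in zip(predictions, targets))
        return correct / len(targets)


metric = Accuracy()
print(metric.compute([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```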
Running experiments
To run experiments, simply run octopuscl/scripts/run_experiments.py with the following arguments:
- -e, --environment (required): The environment where the experiments will be run. It can be development, staging, or production.
- -d, --directory (required): Path to the directory that contains the YAML files defining the experiments.
In staging and production environments, trials can be run either locally or on AWS EC2.
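For example, mirroring the Dataset Manager examples above (the plan directory path is a placeholder):

  cd octopuscl/scripts
  ./run_experiments.py -e development -d /path/to/experiment/plan/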
Tracking experiments
The Experiment Manager delegates experiment tracking to MLflow, which provides a web-based UI called MLflow Tracking UI.
Requirements
General
Regardless of the environment in which you are running the trials, the following requirements must be met:
- Python 3.10+ and dependencies (see requirements.txt)
Additionally, you will need to set specific environment variables (see octopuscl.env) depending on the environment in which you are running the trials.
Development Environment
If you need to download or upload datasets from or to the Dataset Repository, you must have:
Staging & Production Environments
In the staging and production environments, the following requirements must be met:
- AWS setup (AWS CLI, keys, users, roles, policies, S3 bucket)
- Prebuilt Docker image with OctopusCL installed, accessible via AWS ECR
- Publicly accessible MLflow tracking server
If you are running trials locally, you must also have:
Contributions
Maintainers
OctopusCL is maintained by the following individuals (in alphabetical order):
- Enrique Hernández Calabrés (@ehcalabres)
- Marco D'Alessandro (@IoSonoMarco)
- Mikel Elkano Ilintxeta (@melkilin)
File details

Details for the file octopuscl_lib-0.1.0b0.tar.gz.

File metadata

- Download URL: octopuscl_lib-0.1.0b0.tar.gz
- Size: 90.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9610b6ef0be375b7da164d4fd234fce5da65cf5dbaeb1d490d446a9ded5afa8c |
| MD5 | 198659e0548bb67aab6683a0e7188c51 |
| BLAKE2b-256 | 1d1832f3e97ebd9e9d97664517bd5d5097faf5bc0dfbee84db4e49554d6c3201 |
Provenance

The following attestation bundles were made for octopuscl_lib-0.1.0b0.tar.gz:

Publisher: publish-to-pypi.yml on neuraptic/octopuscl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octopuscl_lib-0.1.0b0.tar.gz
- Subject digest: 9610b6ef0be375b7da164d4fd234fce5da65cf5dbaeb1d490d446a9ded5afa8c
- Sigstore transparency entry: 153581562
- Permalink: neuraptic/octopuscl@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Branch / Tag: refs/heads/fix/pypi-publication-workflow
- Owner: https://github.com/neuraptic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Trigger Event: push
File details

Details for the file octopuscl_lib-0.1.0b0-py3-none-any.whl.

File metadata

- Download URL: octopuscl_lib-0.1.0b0-py3-none-any.whl
- Size: 99.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4358085797ea08c523d4f40dc6def4f3d6b7f7205a051a2d4da9923c77f5bc28 |
| MD5 | 1f257914d867439a6dfef95e1e5ebca6 |
| BLAKE2b-256 | 46b7042316ac8ae4d4e3f8ca6ff6d8f7d44a793b7616fb16de66c7987314223d |
Provenance

The following attestation bundles were made for octopuscl_lib-0.1.0b0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on neuraptic/octopuscl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octopuscl_lib-0.1.0b0-py3-none-any.whl
- Subject digest: 4358085797ea08c523d4f40dc6def4f3d6b7f7205a051a2d4da9923c77f5bc28
- Sigstore transparency entry: 153581563
- Permalink: neuraptic/octopuscl@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Branch / Tag: refs/heads/fix/pypi-publication-workflow
- Owner: https://github.com/neuraptic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@be5e90a065e21008f4ffd478a45cc0871ad14eac
- Trigger Event: push