An orchestration framework for End-to-End Machine Learning Serving with Resource Optimization on Heterogeneous Edge

ROHE



ROHE is a platform for orchestrating end-to-end machine learning inference pipelines on heterogeneous edge clusters. It provides quality-aware orchestration, runtime observation, and contract-driven SLA enforcement.

Features:


High-level view

Fig. ROHE High-level View

Installation

# Install with uv (recommended)
git clone https://github.com/rdsea/ROHE.git
cd ROHE
uv sync

# Or from PyPI
pip install rohe

Note: Because the required Python libraries are under continuous development, installation may run into dependency conflicts.

Structure of the repository

The repository is structured as follows:

ROHE/
├── src/rohe/                 # Core platform
│   ├── api/                  # FastAPI endpoints
│   ├── cli/                  # Typer CLI
│   ├── common/               # Data models, abstractions, utilities
│   ├── export/               # Experiment data export
│   ├── experiment/           # Experiment lifecycle management
│   ├── external/             # External integrations (YOLO models)
│   ├── lib/                  # Deployment utilities
│   ├── messaging/            # Message bus abstractions
│   ├── models/               # Pydantic domain models (ExecutionPlan, contracts, metrics)
│   ├── monitoring/           # rohe-sdk, inference reporter, OTel
│   ├── observation/          # Observation agents, metric collection
│   ├── orchestration/        # Inference orchestration (v2, adaptive, DREAM, LLF)
│   ├── quality/              # Quality evaluation (rules, anomaly, LLM diagnosis)
│   ├── registry/             # Service discovery (K8s, HTTP)
│   ├── repositories/         # Data access (MongoDB, Redis)
│   ├── service/              # FastAPI service factories
│   ├── service_registry/     # Consul integration
│   └── storage/              # Storage connectors (MongoDB, MinIO, S3)
├── examples/applications/    # 4 reference applications
│   ├── bts/                  # Building Time Series (4 models)
│   ├── cctvs/                # CCTV Surveillance (5 models)
│   ├── object_classification/# Image Classification (4 models)
│   ├── smart_building/       # Multi-modal Activity Recognition (8 models)
│   └── common/               # Shared service factories
├── experiments/              # Experiment scenarios and analysis
├── deployment/               # Infrastructure (Redis, Grafana)
├── userModule/               # User-extensible algorithms
├── datasets/                 # Experiment data
├── docs/                     # Documentation
└── tests/                    # Unit tests (203+ tests)

Publications

On Optimizing Resources for Real-Time End-to-End Machine Learning in Heterogeneous Edges: Pdf

Implementation:

Note: Other publications reuse most parts of this implementation.

Citation:

@article{nguyen2025optimizing,
  title={On Optimizing Resources for Real-Time End-to-End Machine Learning in Heterogeneous Edges},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh},
  journal={Software: Practice and Experience},
  volume={55},
  number={3},
  pages={541--558},
  year={2025},
  publisher={Wiley Online Library}
}

Novel contract-based runtime explainability framework for end-to-end ensemble machine learning serving: Pdf

  • This publication uses ROHE as the orchestration framework with Observation Service for monitoring and explainability.
  • The core abstraction of ML contract can be found in QoA4ML
  • Example Application: Malware Detection and CCTVS
  • Sample Data

Citation:

@inproceedings{nguyen2024novel,
  title={Novel contract-based runtime explainability framework for end-to-end ensemble machine learning serving},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh and Truong-Huu, Tram},
  booktitle={Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI},
  pages={234--244},
  year={2024}
}

Security orchestration with explainability for digital twins-based smart systems: Pdf

  • This publication also uses ROHE as the orchestration framework with Observation Service for monitoring and explainability.
  • The core abstraction of ML contract can be found in RXOMS
  • Example Application: Security in Digital Twins Network

Citation:

@inproceedings{nguyen2024security,
  title={Security orchestration with explainability for digital twins-based smart systems},
  author={Nguyen, Minh-Tri and Lam, An Ngoc and Nguyen, Phu and Truong, Hong-Linh},
  booktitle={2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)},
  pages={1194--1203},
  year={2024},
  organization={IEEE}
}

Optimizing Multiple Consumer-specific Objectives in End-to-End Ensemble Machine Learning Serving: Pdf

Implementation:

Citation:

@inproceedings{nguyen2024optimizing,
  title={Optimizing Multiple Consumer-specific Objectives in End-to-End Ensemble Machine Learning Serving},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh and Arcaini, Paolo and Ishikawa, Fuyuki},
  booktitle={2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC)},
  pages={103--108},
  year={2024},
  organization={IEEE}
}

IoT Journal submission (ongoing)

1. Observation Service

1.1 User Guide

  • Prerequisite: before using Observation Agent, users need:
    • Database service (e.g., MongoDB)
    • Communication service (e.g., AMQP message broker)
    • Container environment (e.g., Docker)
    • Visualization service (e.g., Prometheus, Grafana; optional)
  • Observation Service includes a registration service and an agent manager. Users can modify Observation Service configurations in $ROHE_PATH/config/observationConfigLocal.yaml. The configuration defines:
    • Protocols, with default configurations for publishing (connector) and consuming (collector) metrics
    • Database configuration where metrics and application data/metadata are stored
    • Container image of the Observation Agent
    • Logging level (debug, warning, etc.)
  • To deploy Observation Service, use rohe:
$ rohe start observation-service
  • Application Registration
    • Users can register an application using rohe. Application metadata and related configurations are saved to the database.
    • When registering an end-to-end ML application, users must provide the application name (app_name, string), run ID (run_id, string), and user ID (user_id, string), and send the registration request to the Observation Service via its URL.
    • The Observation Service will generate:
      • Application ID: appID
      • Database name: db, for saving metric reports at runtime
      • QoA configuration: qoa_config, for reporting metrics

Example

$ rohe observation register-app --app <application_name> --run <run_ID> --user <user_ID> --url <registration_service_url> --output_dir <folder_path_to_save_app_metadata>
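
When registration is scripted (for example, once per experiment run), the command above can be assembled programmatically. The following is a minimal sketch that only uses the flags shown in the example; the helper name `build_register_command` and all argument values, including the URL, are placeholders and not part of ROHE.

```python
import shlex


def build_register_command(
    app: str, run: str, user: str, url: str, output_dir: str
) -> list[str]:
    """Assemble the argv list for `rohe observation register-app`."""
    return [
        "rohe", "observation", "register-app",
        "--app", app,
        "--run", run,
        "--user", user,
        "--url", url,
        "--output_dir", output_dir,
    ]


# Placeholder values (assumptions); replace them with your own details.
command = build_register_command(
    app="my_app",
    run="run_001",
    user="user_01",
    url="http://localhost:5010/registration",
    output_dir="./app_metadata",
)
print(shlex.join(command))
# To execute it for real: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when names contain spaces.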
  • Then, users must manually implement QoA probes in the application. Probes use this metadata to register with the Observation Service. The metadata can be extended with information such as stage_id, microserviceID, method, role, etc. After registration, the probes receive the communication protocol and configurations for reporting metrics.

  • While the application is running, the reported metrics are processed by an Observation Agent. The Agent must be configured with the application name, command, and stream configuration, including:
    • Processing window: interval, size
    • Processing module: the parser and function names used to process metric reports. Users must define these processing modules in $ROHE_PATH/userModule (e.g., userModule/common), including a metric parser for parsing metric reports and a function for window processing.

  • To start the Agent, the user can use rohe:

$ rohe observation start-observation-agent --app <application_name> --conf <path_to_agent_configuration> --url <registration_service_url>
  • The Observation Service will start the Agent in a container (e.g., a Docker container). Depending on the Agent configuration, metric processing results are saved to files, a database, a message broker (in development), or Prometheus/Grafana (in development).

  • To stop the Agent, the user can also use rohe:

$ rohe observation stop-observation-agent --app <application_name> --conf <path_to_agent_configuration> --url <registration_service_url>
  • To delete/unregister an application using rohe:
$ rohe observation delete-app --app <application_name> --url <registration_service_url>
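
The processing module mentioned in this guide (a metric parser plus a window-processing function) could look roughly like the sketch below. This is only an illustration: the report fields `timestamp` and `response_time`, the function names, and their signatures are assumptions, since the schema and interface that ROHE expects are not specified here.

```python
from statistics import mean


def parse_metric(report: dict) -> tuple[float, float]:
    """Hypothetical parser: extract (timestamp, response_time) from a report."""
    return float(report["timestamp"]), float(report["response_time"])


def process_window(reports: list[dict], now: float, size: float) -> dict:
    """Hypothetical window function: aggregate the last `size` seconds."""
    values = [
        value
        for timestamp, value in map(parse_metric, reports)
        if now - size <= timestamp <= now
    ]
    if not values:
        return {"count": 0, "mean_response_time": None, "max_response_time": None}
    return {
        "count": len(values),
        "mean_response_time": mean(values),
        "max_response_time": max(values),
    }


# Made-up reports for illustration only.
reports = [
    {"timestamp": 1.0, "response_time": 0.20},
    {"timestamp": 6.0, "response_time": 0.30},
    {"timestamp": 9.0, "response_time": 0.50},
]
# With size=5 and now=10, only the reports at t=6 and t=9 are in the window.
result = process_window(reports, now=10.0, size=5.0)
print(result)
```

In this sketch the window `size` bounds how far back reports are considered, while the `interval` from the Agent configuration would decide how often `process_window` is called.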

1.2 Development Guide

1.2.1 Registration Service

  • This service allows users to register and unregister applications. The service receives commands via REST; developers can modify the core.observation.restService module to support more commands for editing/updating applications.
  • Currently this service supports MongoDB as the database and AMQP as the communication protocol. Support for other communication protocols and databases is planned.

1.2.2 Observation Agent

  • Agents are currently deployed in the local Docker environment: core.observation.containerizedAgent.
  • Remote deployment is supported via K8s manifests in each application's k8s/ directory.

2. Orchestration Service

2.1 User Guide

  • Prerequisite: before using Orchestration Service, users need:
    • Database service (e.g., MongoDB)
  • The Orchestration Service allocates service instances on edge nodes based on a specific orchestration algorithm (currently a scoring algorithm). Users can modify Orchestration Service configurations in $ROHE_PATH/config/orchestrationConfigLocal.yaml. The configuration defines:
    • Database configuration where metrics and application data/metadata are stored
    • Service queue priority
    • Orchestration algorithm
  • To deploy Orchestration Service, use rohe:
$ rohe start orchestration-service
  • Add nodes to the orchestration system
    • Users can add nodes by using rohe. The node metadata will be saved to the database.
    • When adding nodes, users must provide the path to the node configuration file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the node configuration is in $ROHE_PATH/templates/orchestration_command/add_node.yaml

Example

$ rohe orchestration add-node --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Add service to the orchestration system
    • Users can add services by using rohe. The service metadata will be saved to the database.
    • When adding a service, users must provide the path to the service configuration file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the service configuration is in $ROHE_PATH/templates/orchestration_command/add_service.yaml

Example

$ rohe orchestration add-service --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Get node information from the orchestration system
    • Users can get node information by using rohe.
    • To get node information, users must provide the path to the get command file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the command is in $ROHE_PATH/templates/orchestration_command/get_node.yaml

Example

$ rohe orchestration get-node --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Get service information from the orchestration system
    • Users can get service information by using rohe.
    • To get service information, users must provide the path to the get command file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the command is in $ROHE_PATH/templates/orchestration_command/get_service.yaml

Example

$ rohe orchestration get-service --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Remove nodes from the orchestration system
    • Users can remove nodes by using rohe.
    • To remove nodes, users must provide the path to the remove command file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the command is in $ROHE_PATH/templates/orchestration_command/remove_node.yaml

Example

$ rohe orchestration remove-node --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Start Orchestration Agent
    • Users can start the agent by using rohe in the /bin folder.
    • Users must provide the path to the start command file (--conf) and the URL of the Orchestration Service (--url).
    • The agent constantly checks the service queue for services waiting to be allocated. If the queue is not empty, the agent finds a location among the available nodes to allocate each service.
    • The template of the command is in $ROHE_PATH/templates/orchestration_command/start_orchestration.yaml

Example

$ rohe orchestration start-agent --app <application_name> --conf <configuration_path> --url <orchestration_service_url>
  • Stop Orchestration Agent
    • Users can stop the agent by using rohe in the /bin folder.
    • Users must provide the path to the stop command file (--conf) and the URL of the Orchestration Service (--url).
    • The template of the command is in $ROHE_PATH/templates/orchestration_command/stop_orchestration.yaml

Example

$ rohe orchestration stop-agent --app <application_name> --conf <configuration_path> --url <orchestration_service_url>

2.2 Development Guide

2.2.1 Resource Management

The module provides abstract classes/objects to manage infrastructure resources by Node, applications by Deployment, network routing by Service, and environment variables by ConfigMap.

  • Node: physical node
  • Deployment: each application has multiple microservices. Each microservice has its own Deployment setup that specifies the image, resource requirements, replicas, etc.
  • Microservice: each microservice is advertised under a microservice name within the K3s network so that other microservices can communicate with it.
  • ConfigMap: provides the initial environment variables for the Docker containers of each deployment at startup.
  • resource: provides abstract, high-level classes to manage resources (Microservice Queue and Node Collection).

2.2.2 Deployment Management

  • Provides utilities for generating deployment files from a template ($ROHE_PATH/templates/deployment_template.yaml)
  • Deploys microservices and pods based on the generated deployment files
  • See orchestration/allocation/algorithms/ for pluggable algorithm implementations.
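
To illustrate the idea of generating deployment files from a template, here is a minimal sketch using Python's standard `string.Template`. The template text below is a hypothetical, trimmed Kubernetes Deployment written for this example; the real template in $ROHE_PATH/templates/deployment_template.yaml and ROHE's own generation utilities may look different. The function name `render_deployment` is also an assumption.

```python
from string import Template

# Hypothetical, heavily trimmed template (an assumption for illustration).
DEPLOYMENT_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  selector:
    matchLabels:
      app: $name
  template:
    metadata:
      labels:
        app: $name
    spec:
      containers:
        - name: $name
          image: $image
          ports:
            - containerPort: $port
""")


def render_deployment(name: str, image: str, replicas: int, port: int) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return DEPLOYMENT_TEMPLATE.substitute(
        name=name, image=image, replicas=replicas, port=port
    )


manifest = render_deployment(
    name="object-detection-web-service",
    image="rdsea/od_web:2.0",
    replicas=2,
    port=4002,
)
print(manifest)
```

Using `substitute()` rather than `safe_substitute()` makes a missing field fail loudly instead of leaving a `$placeholder` in the generated file.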

2.2.3 Algorithm

This module provides functions to select the resources on which microservices are allocated.

Current implementation: Scoring Algorithm

  • Input:

    • Microservice from a Microservice Queue (the queue of microservices waiting to be allocated). Each microservice in the queue includes:
      • the number of instances (replicas/scales)
      • CPU requirement (array of CPU requirements on every CPU core). Example: [100,50,50,50] means the microservice uses 4 CPU cores with 100, 50, 50, and 50 millicpu on each core, respectively.
      • Memory requirement (rss, vms - MByte)
      • Accelerator requirement (GPU - %)
      • Sensitivity: 0 - Not sensitive; 1 - CPU sensitive; 2 - Memory sensitive; 3 - CPU & Memory sensitive
      • Other metadata: microservice name, ID, status, node (existing deployment), running (existing running instance), container image, ports configuration

    Example:

    {
        "EW:VE:TW:WQ:01":{
            "microservice_name":"object_detection_web_service",
            "node": {},
            "status": "queueing",
            "instance_ids": [],
            "running": 0,
            "image": "rdsea/od_web:2.0",
            "ports": [4002],
            "port_mapping": [{
                "con_port": 4002,
                "phy_port": 4002
            },{
                "con_port": 4003,
                "phy_port": 4003
            }],
            "cpu": 550,
            "accelerator": {
                "gpu": 0
            },
            "memory": {
                "rss": 200,
                "vms": 500
            },
            "processor": [500,50],
            "sensitivity": 0,
            "replicas": 2
        }
    }
    
    • Node Collection: the list of nodes available for allocating microservices. Each node includes its capacity and used resources:
      • CPU (millicpu)
      • Memory (rss, vms - MByte)
      • Accelerator (GPU - %)

    Example

    "node1":{
        "node_name":"RaspberryPi_01",
        "MAC":"82:ae:30:11:38:01",
        "status": "running",
        "frequency": 1.5,
        "accelerator":{},
        "cpu": {
            "capacity": 4000,
            "used": 0
        },
        "memory": {
            "capacity": {
                "rss": 4096,
                "vms": 4096
            },
            "used": {
                "rss": 0,
                "vms": 0
            }
        },
        "processor": {
            "capacity": [1000,1000,1000,1000],
            "used": [0,0,0,0]
        }
    }, ...
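
A small sketch of how these structures might be read in Python: it decodes the documented sensitivity codes and computes the free resources of a node as capacity minus used. The helper names `decode_sensitivity` and `free_resources` are assumptions made for this example, and the "used" values are made up; only the field names come from the examples above.

```python
def decode_sensitivity(level: int) -> dict:
    """Map the documented codes 0-3 to flags (1 = CPU, 2 = memory, 3 = both)."""
    if level not in (0, 1, 2, 3):
        raise ValueError(f"unknown sensitivity level: {level}")
    return {"cpu": level in (1, 3), "memory": level in (2, 3)}


def free_resources(node: dict) -> dict:
    """Compute capacity minus used for CPU, memory, and each core."""
    memory = node["memory"]
    processor = node["processor"]
    return {
        "cpu": node["cpu"]["capacity"] - node["cpu"]["used"],
        "memory": {
            key: memory["capacity"][key] - memory["used"][key]
            for key in ("rss", "vms")
        },
        "processor": [
            cap - used
            for cap, used in zip(processor["capacity"], processor["used"])
        ],
    }


# Node in the documented shape, with made-up "used" values.
node = {
    "cpu": {"capacity": 4000, "used": 550},
    "memory": {
        "capacity": {"rss": 4096, "vms": 4096},
        "used": {"rss": 200, "vms": 500},
    },
    "processor": {"capacity": [1000, 1000, 1000, 1000], "used": [500, 50, 0, 0]},
}
free = free_resources(node)
print(free)
print(decode_sensitivity(3))
```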
    

Workflow of Scoring Algorithm:

  • Updating Microservice Queue
  • Filtering Nodes from the Node Collection
  • Scoring the filtered nodes
  • Selecting node based on the score, applying different strategies: first/best/worst-fit
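
The filter, score, and select steps can be sketched as follows. This is not ROHE's implementation: the score used here (the fraction of CPU left free after placement) and the mapping of first/best/worst-fit onto that score are assumptions chosen for illustration, and the sketch places a single instance and ignores replicas, per-core requirements, and accelerators.

```python
def free_cpu(node: dict) -> int:
    return node["cpu"]["capacity"] - node["cpu"]["used"]


def fits(node: dict, service: dict) -> bool:
    """Filter step: keep nodes with enough free CPU and memory."""
    memory = node["memory"]
    return free_cpu(node) >= service["cpu"] and all(
        memory["capacity"][key] - memory["used"][key] >= service["memory"][key]
        for key in ("rss", "vms")
    )


def score(node: dict, service: dict) -> float:
    """Assumed score: fraction of CPU capacity left free after placement."""
    return (free_cpu(node) - service["cpu"]) / node["cpu"]["capacity"]


def select_node(nodes: dict, service: dict, strategy: str = "best"):
    """Return the chosen node name, or None if no node can host the service."""
    scores = {
        name: score(node, service)
        for name, node in nodes.items()
        if fits(node, service)
    }
    if not scores:
        return None
    if strategy == "first":
        return next(iter(scores))  # first node that passed the filter
    if strategy == "best":
        return min(scores, key=scores.get)  # tightest fit, least leftover
    if strategy == "worst":
        return max(scores, key=scores.get)  # loosest fit, most leftover
    raise ValueError(f"unknown strategy: {strategy}")


def make_node(used_cpu: int) -> dict:
    """Build a 4000-millicpu node with empty memory usage (example data)."""
    return {
        "cpu": {"capacity": 4000, "used": used_cpu},
        "memory": {
            "capacity": {"rss": 4096, "vms": 4096},
            "used": {"rss": 0, "vms": 0},
        },
    }


nodes = {"node1": make_node(0), "node2": make_node(3000), "node3": make_node(3800)}
service = {"cpu": 550, "memory": {"rss": 200, "vms": 500}}
for strategy in ("first", "best", "worst"):
    print(strategy, select_node(nodes, service, strategy))
```

Here node3 is filtered out because it has only 200 millicpu free; best-fit picks the node that would be left with the least spare CPU, and worst-fit picks the one left with the most.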

The v2 orchestrator supports multiple algorithms selectable at runtime:

  • v2: Async production orchestrator with ExecutionPlan and DataHub
  • adaptive: Original multimodal orchestrator
  • dream: DREAM deadline-aware allocation
  • llf: Least-laxity-first scheduling
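
For readers unfamiliar with least-laxity-first, the general idea is to serve the task whose laxity (time remaining until its deadline minus its remaining execution time) is smallest. The sketch below shows that textbook rule only; the task fields `deadline` and `remaining` are assumptions, and ROHE's llf algorithm may differ in its inputs and details.

```python
def laxity(task: dict, now: float) -> float:
    """Laxity = time until the deadline minus the remaining execution time."""
    return task["deadline"] - now - task["remaining"]


def llf_order(tasks: list[dict], now: float) -> list[dict]:
    """Return tasks sorted so the least-laxity task is served first."""
    return sorted(tasks, key=lambda task: laxity(task, now))


# Made-up tasks; times are in seconds.
tasks = [
    {"name": "a", "deadline": 10.0, "remaining": 2.0},  # laxity 8
    {"name": "b", "deadline": 6.0, "remaining": 5.0},   # laxity 1
    {"name": "c", "deadline": 8.0, "remaining": 4.0},   # laxity 4
]
order = [task["name"] for task in llf_order(tasks, now=0.0)]
print(order)  # ['b', 'c', 'a']
```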

Example applications can be deployed on K8s:

bash examples/applications/bts/scripts/build.sh rdsea 0.0.1 true
bash examples/applications/bts/scripts/deploy.sh --local --load-images

Authors/Contributors

  • Minh-Tri Nguyen
  • Hong-Linh Truong
  • Vuong Nguyen
  • Anh-Dung Nguyen

License

Apache License
