Omnibenchmark core utilities: Setup and running of continuous benchmarking modules as part of omnibenchmark

Omnibenchmark

Generate and manage omnibenchmark modules for open and continuous benchmarking. Each module represents a single building block of a benchmark, e.g., dataset, method, metric. Omnibenchmark-py provides a structure to generate modules and automatically run them.

Installation

You can install omnibenchmark from PyPI:

pip install omnibenchmark

The module is supported on Python >= 3.8 and requires renku >= 1.5.0.

How to use omnibenchmark

For detailed documentation and tutorials, see the omnibenchmark documentation.

Quick start

Omnibenchmark uses the renku platform to run open and continuous benchmarks. To contribute an independent module to one of the existing benchmarks, start by creating a new renku project. Each module consists of a Docker image that defines its environment, a dataset that stores outputs and metadata, and a workflow that describes how to generate the outputs. If the module uses inputs or parameters, it also has input and parameter datasets containing the input files and parameter definitions. Each module is therefore an independent part of the benchmark and can be run, used and modified on its own. Modules are connected by importing (result) datasets from other modules as input datasets, and they are updated automatically when those datasets change.
All information needed to run a specific module is stored in an OmniObject. The most convenient way to generate an OmniObject instance is to build it from a config.yaml file:

## modules
from omnibenchmark.utils.build_omni_object import get_omni_object_from_yaml

## Load object
omni_obj = get_omni_object_from_yaml('src/config.yaml')

The config.yaml defines all module-specific information, such as inputs, outputs, the script that runs the module, the benchmark the module belongs to, and much more. A simple config.yaml file could look like the example below; see the section The config.yaml file for more details.

---
data:
    name: "module-name"
    title: "A new module"
    description: "A new module for omnibenchmark, e.g., a dataset, method, metric,..."
    keywords: ["module-type-key"]
script: "path/to/module_script"
outputs:
    template: "data/${name}/${name}_${out_name}.${out_end}"
    files:
        counts: 
            end: "mtx.gz"
        data_info:
            end: "json"
        meta:
            end: "json"
benchmark_name: "an-omnibenchmark"
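
The template entry uses ${...} placeholders that are filled in once per output file type. As a rough illustration of what such an expansion produces, the sketch below uses Python's standard string.Template. This is an assumption made for illustration, not necessarily how omnibenchmark performs the substitution internally, and the helper name expand_outputs is hypothetical.

```python
from string import Template


def expand_outputs(template: str, name: str, files: dict) -> dict:
    """Fill the output template once per declared output file type."""
    pattern = Template(template)
    return {
        out_name: pattern.substitute(name=name, out_name=out_name, out_end=spec["end"])
        for out_name, spec in files.items()
    }


# Values copied from the config.yaml example above.
paths = expand_outputs(
    "data/${name}/${name}_${out_name}.${out_end}",
    name="module-name",
    files={
        "counts": {"end": "mtx.gz"},
        "data_info": {"end": "json"},
        "meta": {"end": "json"},
    },
)
for out_name, path in paths.items():
    print(out_name, "->", path)
```

Under these assumptions the counts file would be written to data/module-name/module-name_counts.mtx.gz.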

Once you have an OmniObject instance, you can check that it looks as expected:

## Check object
print(omni_obj.__dict__)
print(omni_obj.outputs.file_mapping)
print(omni_obj.command.command_line)

If all inputs, outputs and the command line call look as expected, you can run your module:

## create output dataset that stores all result/output files
omni_obj.create_dataset()

## Update inputs from other modules 
omni_obj.update_object()

## Run your script with all defined inputs and outputs.
## This also generates a workflow description (plan) and is tracked as activity.
omni_obj.run_renku()

## Link output files to output dataset 
omni_obj.update_result_dataset()

## Save and commit to GitLab
## (assumes: from omnibenchmark.renku_commands.general import renku_save)
renku_save()
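
The order of these calls matters: the dataset has to exist before results can be linked to it, and inputs should be refreshed before the workflow runs. A minimal sketch of wrapping the sequence in one helper is shown below. The helper name run_module is hypothetical, the method names follow the OmniObject method list in this document, and a stand-in object from unittest.mock replaces a real OmniObject so the sketch runs without renku installed.

```python
from unittest.mock import MagicMock


def run_module(omni_obj, save):
    """Run the module steps in the order described above."""
    omni_obj.create_dataset()         # create the output dataset
    omni_obj.update_object()          # pull in new or updated inputs
    omni_obj.run_renku()              # run the script, record plan and activity
    omni_obj.update_result_dataset()  # link output files to the dataset
    save()                            # commit and push, e.g. renku_save


# Dry run with stand-in objects instead of a real OmniObject and renku_save.
fake_obj = MagicMock()
fake_save = MagicMock()
run_module(fake_obj, fake_save)

# Each recorded call is a (name, args, kwargs) tuple; keep only the names.
called = [call[0] for call in fake_obj.method_calls]
print(called)
```

With a real module you would pass the object returned by get_omni_object_from_yaml and the renku_save function instead of the stand-ins.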

Once these steps have run successfully and your outputs have been generated, the module is ready to be submitted to become part of omnibenchmark.

What is renku?

Renku is a platform and set of tools for reproducible and collaborative data analysis from the Swiss Data Science Center. Among other features, renku provides a framework to create and run data analysis projects, which come with their own Docker container, datasets and workflows. By storing the metadata of projects and datasets in a knowledge graph, renku facilitates provenance tracking and project interactions. To do so, renku combines a set of microservices:

GitLab, for version control and project management
Git LFS, for file storage
Kubernetes/Docker, to manage containerized environments
Jupyter server, to provide interactive sessions
Apache Jena, to generate, store and manage triples and the triple store (knowledge graph)

Details on how to use renku can be found in the renku documentation. Omnibenchmark uses renku to build and run collaborative and continuous benchmarks.

Create a new renku project

Omnibenchmark modules are built as separate renku projects. Contributions to one of the existing benchmarks start by creating a new project on the renku platform. You can sign up by registering directly or by using a GitHub account, an ORCID iD or a SWITCH edu-ID. A new project can be created with a few clicks, as described in the renku documentation. Choose a template that matches the project's code, or use the Basic Python template. The project can then be populated or changed in an interactive renku session (see the session tab of the project), or within the GitLab instance or a clone of the project (Overview tab --> View in GitLab).

Project requirements

Project requirements are defined by adapting the Dockerfile, listing all required R packages with their versions in the install.R file, and listing all required Python modules with their versions in the requirements.txt file. The latter must contain at least omnibenchmark. If you work in an interactive session, save and commit your changes by running either renku save or git add/commit/push, then close and restart the session once the new Docker image has been built. The build is triggered automatically when changes are committed, but can take a while depending on the requirements.
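
Because a missing omnibenchmark entry only shows up after the image has been rebuilt, it can be worth checking requirements.txt before committing. The sketch below is one way to do that. The function name is hypothetical, and the parsing only covers simple requirement lines (a package name optionally followed by a version specifier, extras or a marker).

```python
import re


def requires_omnibenchmark(requirements_text: str) -> bool:
    """Return True if any requirement line names the omnibenchmark package."""
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # The package name ends at the first specifier, extras, marker or space.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        if name.lower() == "omnibenchmark":
            return True
    return False


example = """
# Python dependencies of the module
numpy
omnibenchmark
"""
print(requires_omnibenchmark(example))
```

In a project you would pass the contents of the real file, for example by reading requirements.txt with pathlib.Path.read_text().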

The config.yaml file

Usually all specific information about a benchmark project can be specified in a config.yaml file. Below is an example with all standard fields and their explanations. Many fields are optional and do not apply to all modules; all unnecessary fields can be skipped. Further optional fields for specific edge cases are described in a separate config.yaml example below. In general, the config.yaml file consists of a data, an inputs, an outputs and a parameter section, as well as a few extra fields that define the main benchmark script and benchmark type. All sections except the data section are optional. Multiple values can be passed as lists.

# Data section to describe the object and the associated (result) dataset
data:
    # Name of the dataset
    name: "out_dataset"
    # Title of the dataset (Optional)
    title: "Output of an example OmniObject"
    # Description of the dataset (Optional)
    description: "This dataset is supposed to store the output files from the example omniobject"
    # Dataset keyword(s) to make this dataset reachable for other projects/benchmark components
    keywords: ["example_dataset"]
# Script to be run by the workflow associated to the project
script: "path/to/method/dataset/metric/script.py"
# Interpreter to run the script (Optional, automatic detection)
interpreter: "python"
# Benchmark that the object is associated to.
benchmark_name: "omni_celltype"
# Orchestrator url of the benchmark (Optional, automatic detection)
orchestrator: "https://www.orchestrator_url.com"
# Input section to describe input file types. (Optional)
inputs:
    # Keywords used to find the input datasets that shall be imported
    keywords: ["import_this", "import_that"]
    # Input file types
    files: ["count_file", "dim_red_file"]
    # Prefix (part of the filename is sufficient) to automatically detect file types by their names
    prefix:
        count_file: "counts"
        dim_red_file: ["features", "genes"]
# Output section to describe output file types. (Optional)
outputs:
    # Output filetypes and their endings
    files:
        corrected_counts: 
            end: ".mtx.gz"
        meta:
            end: ".json"
# Parameter section to describe the parameter dataset, values and filter. (Optional)
parameter:
    # Names of the parameters to use
    names: ["param1", "param2"]
    # Keyword(s) used to import the parameter dataset
    keywords: ["param_dataset"]
    # Filters that specify limits, values or combinations to exclude
    filter:
        param1:
            upper: 50
            lower: 3
            exclude: 12
        param2:
            "path/to/file/with/parameter/combinations/to/exclude.json"
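
To make the filter fields more concrete, the sketch below applies the param1 rule from the example to a list of candidate values. It is an illustration only: it assumes that upper and lower are inclusive bounds and that exclude holds a single value or a list of values, which this document does not state explicitly. The function name passes_filter and the candidate values are hypothetical.

```python
def passes_filter(value, rule: dict) -> bool:
    """Return True if a parameter value survives the filter rule.

    Assumption: 'upper' and 'lower' are inclusive bounds and 'exclude'
    is a single value or a list of values to drop.
    """
    if "lower" in rule and value < rule["lower"]:
        return False
    if "upper" in rule and value > rule["upper"]:
        return False
    excluded = rule.get("exclude", [])
    if not isinstance(excluded, (list, tuple, set)):
        excluded = [excluded]
    return value not in excluded


# Filter rule for param1, copied from the config.yaml example above.
param1_rule = {"upper": 50, "lower": 3, "exclude": 12}
candidates = [1, 3, 12, 25, 50, 75]  # hypothetical parameter values
kept = [v for v in candidates if passes_filter(v, param1_rule)]
print(kept)
```

Under these assumptions only 3, 25 and 50 remain: 1 and 75 fall outside the bounds and 12 is excluded explicitly.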

Some fields are only relevant for edge cases. These fields have counterparts in the generated OmniObject: changing the corresponding attributes of the OmniObject instance has the same effect, but comes with the flexibility of Python code.

# Command to generate the workflow with (Optional, automatic detection)
command_line: "python path/to/method/dataset/metric/script.py --count_file data/import_this_dataset/...mtx.gz"
inputs:
    # Datasets and manual file type specifications (automatic detection!)
    input_files:
        import_this_dataset:
            count_file: "data/import_this_dataset/import_this_dataset__counts.mtx.gz"
            dim_red_file: "data/import_this_dataset/import_this_dataset__dim_red_file.json"
    # (Dataset) name that default input files belong to (Optional, automatic detection)
    default: "import_this_dataset"
    # Input dataset names that should be ignored (even if they have one of the specified input keywords associated)
    filter_names: ["data1", "data2"]
outputs:
    # Template to automatically generate output filenames (Optional - recommended for advanced users only)
    template: "data/${name}/${name}_${unique_values}_${out_name}.${out_end}"
    # Variables used for automatic output filename generation (Optional - recommended for advanced users only)
    template_vars:
        vars1: "random"
        vars2: "variable"
    # Manual specification of mapping for output files and their corresponding input files and parameter values (automatic detection!)
    file_mapping:
        mapping1: 
            output_files:
                corrected_counts: "data/out_dataset/out_dataset_import_this__param1_10__param2_test_corrected_counts.mtx.gz"
                meta: "data/out_dataset/out_dataset_import_this__param1_10__param2_test_meta.json"
            input_files:
                count_file: "data/import_this_dataset/import_this_dataset__counts.mtx.gz"
                dim_red_file: "data/import_this_dataset/import_this_dataset__dim_red_file.json"
            parameter:
                param1: 10
                param2: "test"
    # Default output files (Optional, automatic detection)
    default:
        corrected_counts: "data/out_dataset/out_dataset_import_this__param1_10__param2_test_corrected_counts.mtx.gz"
        meta: "data/out_dataset/out_dataset_import_this__param1_10__param2_test_meta.json"
parameter:
    default:
        param1: 10
        param2: "test"
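
The output names in file_mapping follow from the template: every combination of input dataset and parameter values gets its own set of output files. The sketch below reproduces the example paths above by enumerating parameter combinations. It is not the library's implementation: the way unique_values is assembled here (input name and key_value pairs joined by double underscores) is inferred from the example paths, and build_file_mapping as well as the second param1 value are hypothetical.

```python
from itertools import product
from string import Template


def build_file_mapping(template, name, input_name, parameters, out_files):
    """Enumerate one mapping entry per parameter combination."""
    pattern = Template(template)
    keys = list(parameters)
    mappings = []
    for combo in product(*(parameters[k] for k in keys)):
        values = dict(zip(keys, combo))
        # Assumption: unique_values joins the input name and "key_value"
        # pairs with a double underscore, as in the example paths.
        unique = "__".join([input_name] + [f"{k}_{v}" for k, v in values.items()])
        outputs = {
            out_name: pattern.substitute(
                name=name, unique_values=unique, out_name=out_name, out_end=end
            )
            for out_name, end in out_files.items()
        }
        mappings.append({"parameter": values, "output_files": outputs})
    return mappings


mappings = build_file_mapping(
    "data/${name}/${name}_${unique_values}_${out_name}.${out_end}",
    name="out_dataset",
    input_name="import_this",
    parameters={"param1": [10, 20], "param2": ["test"]},  # 20 is a made-up extra value
    out_files={"corrected_counts": "mtx.gz", "meta": "json"},
)
for entry in mappings:
    print(entry["parameter"], entry["output_files"]["meta"])
```

The first entry matches the corrected_counts and meta paths listed under mapping1, which is why manual specification of file_mapping is normally unnecessary.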

Omnibenchmark classes

Classes to manage omnibenchmark modules and their interactions. The main class is OmniObject, which consolidates all relevant information and functions of a module. This object has further subclasses that define inputs, outputs, commands and the workflow.

OmniObject

Main class to manage an omnibenchmark module. It takes the following arguments:

name (str): Module name
keyword (Optional[List[str]], optional): Keyword associated with the module's output dataset.
title (Optional[str], optional): Title of the module's output dataset.
description (Optional[str], optional): Description of the module's output dataset.
script (Optional[PathLike], optional): Script to generate the module's workflow for.
command (Optional[OmniCommand], optional): Workflow command - will be automatically generated if missing.
inputs (Optional[OmniInput], optional): Definitions of the workflow inputs.
parameter (Optional[OmniParameter], optional): Definitions of the workflow parameter.
outputs (Optional[OmniOutput], optional): Definitions of the workflow outputs.
omni_plan (Optional[OmniPlan], optional): The workflow description.
benchmark_name (Optional[str], optional): Name of the benchmark the module is associated to.
orchestrator (Optional[str], optional): Orchestrator url of the benchmark the module is associated to. Automatic detection.
wflow_name (Optional[str], optional): Workflow name. Will be set to the module name if none.
dataset_name (Optional[str], optional): Dataset name. Will be set to the module name if none.
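
The fallback rule for wflow_name and dataset_name can be pictured with a small stand-in class. The sketch below is not the OmniObject implementation: ModuleSpec is a hypothetical dataclass that mirrors only a few of the arguments listed above, to show how both names default to the module name when they are left unset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModuleSpec:
    """Hypothetical stand-in mirroring a few OmniObject arguments."""

    name: str
    keyword: Optional[List[str]] = None
    benchmark_name: Optional[str] = None
    wflow_name: Optional[str] = None
    dataset_name: Optional[str] = None

    def __post_init__(self):
        # Both names fall back to the module name when left unset.
        if self.wflow_name is None:
            self.wflow_name = self.name
        if self.dataset_name is None:
            self.dataset_name = self.name


spec = ModuleSpec(name="out_dataset", keyword=["example_dataset"], dataset_name="results")
print(spec.wflow_name, spec.dataset_name)
```

Here wflow_name falls back to the module name, while the explicitly given dataset_name is kept.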

The following methods can be called on an OmniObject instance:

create_dataset(): Creates a renku dataset in the current renku project, using the attributes specified in the object.
update_object(): Checks for new imports or updates in the input and parameter datasets and updates the object attributes accordingly.
run_renku(): Generates or updates the workflow and all output files as specified in the object.
update_result_dataset(): Updates the dataset specified in the object and adds all output files to it.