
A Python framework for structuring, managing, processing and FAIR-ising scientific marine image datasets.


A Python framework for structuring, managing and processing FAIR scientific image datasets





Overview

Marimba is a Python framework designed for the efficient processing of FAIR (Findable, Accessible, Interoperable, and Reusable) scientific image datasets. Developed collaboratively by CSIRO and MBARI, Marimba provides core functionality for structuring, processing, and ensuring the FAIR compliance of scientific image data.

The framework features a Typer Command Line Interface (CLI) enhanced by Rich for an improved user experience. Marimba offers a well-defined API (Application Programming Interface) that enables seamless integration with external scripts and Graphical User Interfaces (GUIs).

Marimba is particularly well-suited for researchers, data scientists, and engineers working in marine science and other fields that require large-scale and streamlined image dataset management. Typical use cases include automating the processing of imagery from underwater vehicles, integrating multi-instrument data for comprehensive analysis, and preparing datasets for publication in FAIR-compliant repositories.



Design

Marimba defines three core concepts:

  • Project: A Marimba Project is a standardised, high-level structure designed to manage the entire processing workflow for producing FAIR image datasets. It serves as the primary context for importing, processing, packaging and distributing these datasets, with all high-level operations managed by the core Marimba system.

  • Pipeline: A Marimba Pipeline encapsulates the implementation of all processing stages for a single-instrument or multi-instrument system. Each Pipeline operates in isolation, containing all necessary logic to fully process image data, which may include multiple image or video sources, associated navigational data, and other ancillary information. The core Marimba system manages Pipeline execution, and developing a custom Pipeline is the only requirement for processing FAIR image datasets for new instruments or systems with Marimba.

  • Collection: A Marimba Collection is a set of data that is imported into a Marimba Project and can include a diverse aggregation of data from a single-instrument or multi-instrument system. Each Collection is isolated within the context of Marimba's core processing environment. During execution, Marimba Pipelines operate on each Collection in parallel, applying the specialised processing to the data contained within each Collection.
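
As a rough illustration of how the three concepts relate, the sketch below models them with plain Python dataclasses. It is not the Marimba API: the class names, the `run` method and the `rename_with_prefix` function are hypothetical, and the sketch assumes that every Pipeline is applied to every Collection, with one parallel task per pair.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Collection:
    """Hypothetical stand-in for a Marimba Collection: a named set of imported files."""
    name: str
    files: list[str]


@dataclass
class Pipeline:
    """Hypothetical stand-in for a Marimba Pipeline: named, self-contained processing logic."""
    name: str
    process: Callable[[Collection], list[str]]


@dataclass
class Project:
    """Hypothetical stand-in for a Marimba Project: the context holding Pipelines and Collections."""
    pipelines: list[Pipeline] = field(default_factory=list)
    collections: list[Collection] = field(default_factory=list)

    def run(self) -> dict[tuple[str, str], list[str]]:
        """Apply every Pipeline to every Collection, one parallel task per pair."""
        pairs = [(p, c) for p in self.pipelines for c in self.collections]
        with ThreadPoolExecutor() as pool:
            outputs = pool.map(lambda pair: pair[0].process(pair[1]), pairs)
            return {(p.name, c.name): out for (p, c), out in zip(pairs, outputs)}


def rename_with_prefix(collection: Collection) -> list[str]:
    """Example processing step: prefix each file name with its Collection name."""
    return [f"{collection.name}_{name}" for name in collection.files]
```

Each task only sees its own Collection, which mirrors the isolation described above; a real Pipeline would of course read and write files rather than return renamed strings.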



Features

The Marimba framework offers a number of advanced features designed for the specific needs of scientific image processing:

  • Project Structuring and Management:

    • Marimba enables a systematic approach to structuring and managing scientific image data projects throughout the entire processing workflow
    • Core features of Marimba manage the parallelised execution of isolated Pipelines on sandboxed Collections, enabling full automation of the processing workflow
    • Marimba supports the use of hard links during processing to prevent data duplication and optimise storage efficiency
    • Marimba provides a unified interface for importing, processing, packaging, and distributing datasets, ensuring consistency and efficiency across all stages
  • File and Metadata Management:

    • Custom Marimba Pipelines support the implementation of specific naming conventions to automatically rename image files
    • Marimba can prompt users to manually enter Pipeline-level and Collection-level metadata
    • Metadata configuration dictionaries can optionally be passed via the CLI to bypass the manual input stages
    • Marimba provides extensive capabilities for managing image metadata, including:
      • Complying with the iFDO (image FAIR Digital Object) standard to ensure interoperability and reusability
      • Integrating image datasets with corresponding navigation and sensor data, when available
      • Embedding metadata directly into image EXIF tags for greater accessibility
  • Standard Image and Video Library:

    • Marimba provides a comprehensive standard library of image and video processing modules that can:
      • Convert, compress and resize imagery using Pillow
      • Transcode, segment and extract frames from videos using FFmpeg (to be integrated)
      • Automatically generate thumbnails for images and videos and create composite overview images for rapid assessment of image datasets
      • Detect duplicate, blurry, or improperly exposed images using CleanVision (to be integrated)
  • Dataset Packaging and Distribution:

    • Marimba offers a standardised approach for packaging processed FAIR image datasets, including:
      • Collating all processing logs to archive the entire dataset provenance, ensuring transparency and traceability
      • Generating file manifests to facilitate dataset validation
      • Dynamically generating summaries of image and video dataset statistics
    • Marimba also provides mechanisms for distributing packaged FAIR image datasets including:
      • Uploading FAIR image datasets to S3 buckets
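
Two of the features listed above, hard links and file manifests, can be illustrated with the Python standard library alone. The following is a minimal sketch and not Marimba's implementation: the helper names `link_or_copy`, `sha256_of` and `build_manifest` are hypothetical, and the manifest layout (relative path mapped to SHA-256 digest) is an assumption rather than Marimba's actual manifest format.

```python
import hashlib
import os
import shutil
from pathlib import Path


def link_or_copy(source: Path, target: Path) -> None:
    """Hard-link source to target, falling back to a copy if linking fails."""
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # A hard link adds a second name for the same data, so no bytes are duplicated.
        os.link(source, target)
    except OSError:
        # Linking can fail, for example across file systems; copy instead.
        shutil.copy2(source, target)


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root, POSIX style) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Recomputing the manifest later and comparing it with the stored one is one simple way to validate that a packaged dataset has not changed.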



Installation

Marimba can be installed using the Python pip package manager. Ensure that Python version 3.10 or greater is installed in your environment before proceeding.

To install Marimba, open your terminal or command prompt and run the following command:

pip install marimba

This will download and install the latest version of Marimba along with its required dependencies. After installation, you can verify the installation by running Marimba and displaying the default help menu:

marimba

Marimba has a small number of system-level dependencies, such as FFmpeg, which are required for its operation. On Ubuntu, you can install FFmpeg with:

sudo apt install ffmpeg
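
The prerequisites described in this section can also be checked from Python. The sketch below is illustrative only: the function name `check_environment` is hypothetical, and it assumes that the installed distribution is named `marimba` (as in the pip command above) and that the FFmpeg executable is called `ffmpeg` on the PATH.

```python
import shutil
import sys
from importlib import metadata


def check_environment(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether each prerequisite named in this section is present."""
    try:
        metadata.version("marimba")  # raises if the distribution is not installed
        marimba_installed = True
    except metadata.PackageNotFoundError:
        marimba_installed = False
    return {
        "python": sys.version_info >= min_python,
        "marimba": marimba_installed,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```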

To set up a Marimba development environment, please refer to the Environment Setup Guide, which provides detailed instructions and guidelines for configuring your development environment.



Getting Started

Marimba offers a streamlined CLI that encompasses the entire post-acquisition data processing workflow. Below is a minimal demonstration of the key CLI commands required to progress through all the Marimba processing stages.

  1. Create a new Marimba Project:

    marimba new project MY-PROJECT
    cd MY-PROJECT
    
  2. Create a new Marimba Pipeline:

    marimba new pipeline MY-INSTRUMENT https://path.to/my-instrument-pipeline.git
    
  3. Import new Marimba Collections:

    marimba import COLLECTION-ONE '/path/to/collection/one/'
    marimba import COLLECTION-TWO '/path/to/collection/two/'
    
  4. Process the imported Collections with the installed Pipelines:

    marimba process
    
  5. Package the FAIR image dataset:

    marimba package MY-FAIR-DATASET --version 1.0 --contact-name "Keiko Abe" --contact-email "keiko.abe@email.com"
    

For additional details and advanced usage, please refer to the Overview and CLI Usage Guide.

Note: Keiko Abe is a renowned Japanese marimba player and composer, widely recognised for her role in establishing the marimba as a respected concert instrument.
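
The five CLI steps above can also be driven from a script. The following is a minimal sketch that only reuses the commands and options shown in this section; the helper names `build_commands` and `run_workflow` are hypothetical, and the sketch assumes that the `marimba` executable is on the PATH and that every command after the first must run inside the Project directory (mirroring the `cd MY-PROJECT` step).

```python
import shlex
import subprocess


def build_commands(
    project: str,
    pipeline: str,
    pipeline_url: str,
    collections: dict[str, str],
    dataset: str,
    version: str,
    contact_name: str,
    contact_email: str,
) -> list[list[str]]:
    """Return the argument lists for steps 1 to 5, in order."""
    commands = [
        ["marimba", "new", "project", project],
        ["marimba", "new", "pipeline", pipeline, pipeline_url],
    ]
    # One import command per Collection name and source path.
    commands += [["marimba", "import", name, path] for name, path in collections.items()]
    commands.append(["marimba", "process"])
    commands.append([
        "marimba", "package", dataset,
        "--version", version,
        "--contact-name", contact_name,
        "--contact-email", contact_email,
    ])
    return commands


def run_workflow(commands: list[list[str]], project: str, dry_run: bool = True) -> None:
    """Print the commands, or run them: the first in the current directory, the rest in the Project."""
    for index, argv in enumerate(commands):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, cwd=None if index == 0 else project, check=True)


if __name__ == "__main__":
    steps = build_commands(
        "MY-PROJECT",
        "MY-INSTRUMENT",
        "https://path.to/my-instrument-pipeline.git",
        {"COLLECTION-ONE": "/path/to/collection/one/", "COLLECTION-TWO": "/path/to/collection/two/"},
        "MY-FAIR-DATASET",
        "1.0",
        "Keiko Abe",
        "keiko.abe@email.com",
    )
    run_workflow(steps, "MY-PROJECT", dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to review the sequence before running it for real. Passing argument lists to `subprocess.run` avoids shell quoting problems with values such as the contact name.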



Documentation

Marimba offers extensive documentation to support both users and developers:

Users

If you're interested in creating your own Pipelines to process image data, Marimba provides a comprehensive guide to help you get started. This documentation covers everything from setting up a Pipeline git repository to implementing custom processing pipelines.

  • Overview and CLI Usage Guide: Gain an architectural understanding of Marimba and explore the CLI commands and options available for managing and executing Pipelines.

  • Pipeline Implementation Guide: This guide offers a step-by-step tutorial on how to design and tailor Marimba Pipelines to suit your data processing requirements. From initial setup to advanced customisation techniques, it covers what you need to use Marimba efficiently for your specific projects.

Developers

For developers who want to script Marimba using the CLI or leverage the Marimba API for more advanced integrations, we offer detailed documentation that covers all aspects of Marimba’s capabilities.

  • CLI Scripting Guide: Learn how to automate data processing workflows using Marimba's CLI. This guide provides detailed instructions and examples to help you streamline your data processing operations.

  • API Reference: Explore the Marimba API to integrate its functionalities into your applications or workflows. The reference includes detailed descriptions of Python API endpoints and their usage.

These resources are designed to help you make the most of Marimba, whether you are processing large datasets or integrating Marimba into your existing systems.



Contributing

Marimba is an open-source project, and we welcome feedback and contributions from the community. If you have ideas or suggestions to improve Marimba, please submit them using our GitHub issue tracker. For enhancements or new features, fork the repository and submit a pull request. Please refer to the Contributing Guide for detailed guidelines on how to contribute.



License

This project is distributed under the CSIRO BSD/MIT license.



Contact

For inquiries related to this repository, please contact:



Acknowledgments

Marimba was developed as a collaborative effort between CSIRO and MBARI, two leading institutions in marine science and technology. The conceptual foundation of Marimba was formulated at CSIRO in late 2022. Substantial elements of its initial design and implementation were developed during the CSIRO Image Data Collection and Delivery Hackathon in early 2023, with further collaborative advancements between CSIRO and MBARI in late 2023. Marimba was open-sourced on GitHub and PyPI in mid-2024 and officially launched at the Marine Imaging Workshop 2024.

The development of this project has greatly benefited from the contributions of the following people:

  • Chris Jackett - CSIRO Environment
  • Kevin Barnard - MBARI
  • Nick Mortimer - CSIRO Environment
  • David Webb - CSIRO NCMI
  • Aaron Tyndall - CSIRO NCMI
  • Franzis Althaus - CSIRO Environment
  • Candice Untiedt - CSIRO Environment
  • Carlie Devine - CSIRO Environment
  • Bec Gorton - CSIRO Environment
  • Ben Scoulding - CSIRO Environment


