A Python framework for structuring, managing, processing and FAIR-ising scientific marine image datasets.
Contents
- Overview
- Design
- Features
- Installation
- Getting Started
- Documentation
- Contributing
- License
- Contact
- Acknowledgments
Overview
Marimba is a Python framework designed for the efficient processing of FAIR (Findable, Accessible, Interoperable, and Reusable) scientific image datasets. Developed collaboratively by CSIRO and MBARI, Marimba provides core functionality for structuring, processing, and ensuring the FAIR compliance of scientific image data.
The framework features a Command Line Interface (CLI) built with Typer and enhanced by Rich for an improved user experience. Marimba also offers a well-defined Application Programming Interface (API) that enables seamless integration with external scripts and Graphical User Interfaces (GUIs).
Marimba is particularly well-suited for researchers, data scientists, and engineers working in marine science and other fields that require large-scale and streamlined image dataset management. Typical use cases include automating the processing of imagery from underwater vehicles, integrating multi-instrument data for comprehensive analysis, and preparing datasets for publication in FAIR-compliant repositories.
Design
Marimba defines three core concepts:
- Project: A Marimba Project is a standardised, high-level structure designed to manage the entire processing workflow for producing FAIR image datasets. It serves as the primary context for importing, processing, packaging and distributing these datasets, with all high-level operations managed by the core Marimba system.
- Pipeline: A Marimba Pipeline encapsulates the implementation of all processing stages for a single or multi-instrument system. Each Pipeline operates in isolation, containing all necessary logic to fully process image data, which may include multiple image or video sources, associated navigational data, and other ancillary information. The core Marimba system manages Pipeline execution, and developing a custom Pipeline is the only requirement for processing FAIR image datasets for new instruments or systems with Marimba.
- Collection: A Marimba Collection is a set of data that is imported into a Marimba Project and can include a diverse aggregation of data from a single or multi-instrument system. Each Collection is isolated within the context of Marimba's core processing environment. During execution, Marimba Pipelines operate on each Collection in parallel, applying the specialised processing to the data contained within each Collection.
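The idea of running the same processing logic over isolated Collections in parallel can be illustrated with a minimal sketch. This is not Marimba's implementation or API: the function names `process_collection` and `process_all`, the directory layout, and the file-counting placeholder step are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_collection(collection_dir: Path) -> tuple[str, int]:
    """Hypothetical per-Collection step: count the files in one Collection.

    A real Pipeline would rename, convert or annotate files here instead.
    """
    file_count = sum(1 for path in collection_dir.rglob("*") if path.is_file())
    return collection_dir.name, file_count


def process_all(collections_root: Path, max_workers: int = 4) -> dict[str, int]:
    """Apply process_collection to every Collection directory in parallel.

    Each worker only touches its own directory, so Collections stay isolated.
    """
    collection_dirs = sorted(p for p in collections_root.iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_collection, collection_dirs))
```

Because each call receives only its own Collection directory, one Collection failing or being slow does not affect the inputs of another, which is the property the isolation described above is meant to provide.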
Features
The Marimba framework offers a number of advanced features designed for the specific needs of scientific image processing:
- Project Structuring and Management:
  - Marimba enables a systematic approach to structuring and managing scientific image data projects throughout the entire processing workflow
  - Core features of Marimba manage the parallelised execution of isolated Pipelines on sandboxed Collections, enabling full automation of the processing workflow
  - Marimba supports the use of hard links during processing to prevent data duplication and optimise storage efficiency
  - Marimba provides a unified interface for importing, processing, packaging, and distributing datasets, ensuring consistency and efficiency across all stages
- File and Metadata Management:
  - Custom Marimba Pipelines support the implementation of specific naming conventions to automatically rename image files
  - Marimba supports user prompting to manually input Pipeline and Collection-level metadata
  - Metadata configuration dictionaries can be optionally passed via the CLI to automate manual input stages
  - Marimba provides extensive capabilities for managing image metadata, including:
    - Complying with the iFDO (image FAIR Digital Object) standard for interoperability and reusability
    - Integrating image datasets with corresponding navigation and sensor data, when available
    - Embedding metadata directly into image EXIF tags for greater accessibility
- Standard Image and Video Library:
  - Marimba provides a comprehensive standard library of image and video processing modules that can:
    - Convert, compress and resize imagery using Pillow
    - Transcode, segment and extract frames from videos using FFmpeg (to be integrated)
    - Automatically generate thumbnails for images and videos and create composite overview images for rapid assessment of image datasets
    - Detect duplicate, blurry, or improperly exposed images using CleanVision (to be integrated)
- Dataset Packaging and Distribution:
  - Marimba offers a standardised approach for packaging processed FAIR image datasets, including:
    - Collating all processing logs to archive the entire dataset provenance, ensuring transparency and traceability
    - Generating file manifests to facilitate dataset validation
    - Dynamically generating summaries of image and video dataset statistics
  - Marimba also provides mechanisms for distributing packaged FAIR image datasets, including:
    - Uploading FAIR image datasets to S3 buckets
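Two of the features above, hard links and file manifests, can be illustrated with standard-library Python. The sketch below is not Marimba's code: `link_or_copy`, `sha256_of` and `build_manifest` are hypothetical helpers, and the manifest layout (relative path mapped to SHA-256 digest) is an assumption rather than Marimba's actual manifest format.

```python
import hashlib
import os
import shutil
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> str:
    """Place src at dst as a hard link, falling back to a copy if needed."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists():
        dst.unlink()  # replace any previous file at the destination
    try:
        os.link(src, dst)  # both names now share the same data on disk
        return "hardlink"
    except OSError:
        # Hard links fail across filesystems or where they are unsupported.
        shutil.copy2(src, dst)
        return "copy"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images or videos are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

A hard link adds a second name for the same data, so a renamed image in a processing directory takes no extra disk space. A recipient can validate a dataset by rebuilding the manifest and comparing it with the one that was shipped.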
Installation
Marimba can be installed using the Python pip package manager. Ensure that Python version 3.10 or greater is installed in your environment before proceeding.
To install Marimba, open your terminal or command prompt and run the following command:
pip install marimba
This will download and install the latest version of Marimba along with its required dependencies. After installation, you can verify the installation by running Marimba and displaying the default help menu:
marimba
Marimba has a small number of system-level dependencies, such as ffmpeg, which are required for its operation. On Ubuntu, you can install ffmpeg with:
sudo apt install ffmpeg
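The prerequisites mentioned above (Python 3.10 or greater and ffmpeg) can be checked before installing. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper and is not part of Marimba.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum Python version stated in this README


def check_environment(required_tools: tuple[str, ...] = ("ffmpeg",)) -> dict[str, bool]:
    """Report whether the interpreter and required system tools are available."""
    results = {"python>=3.10": sys.version_info >= MIN_PYTHON}
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```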
To set up a Marimba development environment, please refer to the Environment Setup Guide, which provides detailed instructions and guidelines for configuring your development environment.
Getting Started
Marimba offers a streamlined CLI that encompasses the entire post-acquisition data processing workflow. Below is a minimal demonstration of the key CLI commands required to progress through all the Marimba processing stages.
- Create a new Marimba Project:
  marimba new project MY-PROJECT
  cd MY-PROJECT
- Create a new Marimba Pipeline:
  marimba new pipeline MY-INSTRUMENT https://path.to/my-instrument-pipeline.git
- Import new Marimba Collections:
  marimba import COLLECTION-ONE '/path/to/collection/one/'
  marimba import COLLECTION-TWO '/path/to/collection/two/'
- Process the imported Collections with the installed Pipelines:
  marimba process
- Package the FAIR image dataset:
  marimba package MY-FAIR-DATASET --version 1.0 --contact-name "Keiko Abe" --contact-email "keiko.abe@email.com"
For additional details and advanced usage, please refer to the Overview and CLI Usage Guide.
Note: Keiko Abe is a renowned Japanese marimba player and composer, widely recognised for her role in establishing the marimba as a respected concert instrument.
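The CLI steps above can also be driven from a Python script. The sketch below only assembles and runs the commands exactly as they appear in this section; it does not use the Marimba Python API. The helper names `build_workflow` and `run_workflow` are hypothetical. It assumes that `marimba new project` creates the project directory in the current working directory, as the `cd MY-PROJECT` step suggests.

```python
import subprocess
from pathlib import Path


def build_workflow(project, pipeline_name, pipeline_url, collections,
                   dataset_name, version, contact_name, contact_email):
    """Return (argv, working_directory) pairs for each CLI step, in order."""
    project_dir = Path(project)
    # The project is created from the parent directory; later steps run inside it.
    steps = [(["marimba", "new", "project", project], Path("."))]
    steps.append((["marimba", "new", "pipeline", pipeline_name, pipeline_url], project_dir))
    for name, source in collections.items():
        steps.append((["marimba", "import", name, source], project_dir))
    steps.append((["marimba", "process"], project_dir))
    steps.append((
        ["marimba", "package", dataset_name,
         "--version", version,
         "--contact-name", contact_name,
         "--contact-email", contact_email],
        project_dir,
    ))
    return steps


def run_workflow(steps, dry_run=True):
    """Print each command, or execute it and stop at the first failure."""
    for argv, cwd in steps:
        if dry_run:
            print(f"[{cwd}] {' '.join(argv)}")
        else:
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    workflow = build_workflow(
        "MY-PROJECT", "MY-INSTRUMENT", "https://path.to/my-instrument-pipeline.git",
        {"COLLECTION-ONE": "/path/to/collection/one/",
         "COLLECTION-TWO": "/path/to/collection/two/"},
        "MY-FAIR-DATASET", "1.0", "Keiko Abe", "keiko.abe@email.com",
    )
    run_workflow(workflow, dry_run=True)  # set dry_run=False to execute
```

Passing the arguments as a list avoids shell quoting problems with paths and contact names that contain spaces. The dry-run default lets you review the command sequence before anything is executed.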
Documentation
Marimba offers extensive documentation to support both users and developers:
Users
If you're interested in creating your own Pipelines to process image data, Marimba provides a comprehensive guide to help you get started. This documentation covers everything from setting up a Pipeline git repository to implementing custom processing pipelines.
- Overview and CLI Usage Guide: Gain an architectural understanding of Marimba and explore the CLI commands and options available for managing and executing Pipelines.
- Pipeline Implementation Guide: A step-by-step tutorial on designing and tailoring Marimba Pipelines to suit your data processing requirements, from initial setup to advanced customisation techniques.
Developers
For developers who want to script Marimba using the CLI or leverage the Marimba API for more advanced integrations, we offer detailed documentation that covers all aspects of Marimba’s capabilities.
- CLI Scripting Guide: Learn how to automate data processing workflows using Marimba's CLI. This guide provides detailed instructions and examples to help you streamline your data processing operations.
- API Reference: Explore the Marimba API to integrate its functionality into your applications or workflows. The reference includes detailed descriptions of the Python API and its usage.
These resources are designed to help you make the most of Marimba, whether you are processing large datasets or integrating Marimba into your existing systems.
Contributing
Marimba is an open-source project, and we welcome feedback and contributions from the community. If you have ideas or suggestions to improve Marimba, please submit them using our GitHub issue tracker. For enhancements or new features, please fork the repository and submit a pull request. Refer to the Contributing Guide for detailed guidelines on how to contribute.
License
This project is distributed under the CSIRO BSD/MIT license.
Contact
For inquiries related to this repository, please contact:
- Chris Jackett
  Software Engineer, CSIRO
  Email: chris.jackett@csiro.au
- Kevin Barnard
  Software Engineer, MBARI
  Email: kbarnard@mbari.org
Acknowledgments
Marimba was developed as a collaborative effort between CSIRO and MBARI, two leading institutions in marine science and technology. The conceptual foundation of Marimba was formulated at CSIRO in late 2022. Substantial elements of its initial design and implementation were developed during the CSIRO Image Data Collection and Delivery Hackathon in early 2023, with further collaborative advancements between CSIRO and MBARI in late 2023. Marimba was open-sourced on GitHub and PyPI in mid-2024 and officially launched at the Marine Imaging Workshop 2024.
The development of this project has greatly benefited from the contributions of the following people:
- Chris Jackett - CSIRO Environment
- Kevin Barnard - MBARI
- Nick Mortimer - CSIRO Environment
- David Webb - CSIRO NCMI
- Aaron Tyndall - CSIRO NCMI
- Franzis Althaus - CSIRO Environment
- Candice Untiedt - CSIRO Environment
- Carlie Devine - CSIRO Environment
- Bec Gorton - CSIRO Environment
- Ben Scoulding - CSIRO Environment