
A system to integrate data from multiple workflows.

Project description

Badges: Build | PyPI | Tests | Code Formatting | License: MIT | Code style: black

FlowCept

FlowCept is a runtime data integration system that empowers any data generation system to capture workflow provenance data through data observability, with minimal (often no) changes to the target system's code. It can therefore integrate data from multiple workflows, enabling users to understand complex, heterogeneous, large-scale data coming from various sources in federated environments.

FlowCept addresses scenarios where multiple workflows in a science campaign or an enterprise run and generate important data to be analyzed in an integrated manner. Because these workflows may use different data generation tools or run on different parallel computing systems (e.g., Dask, Spark, workflow management systems), FlowCept's key differentiator is its ability to seamlessly integrate multi-workflow data from various sources using data observability. At runtime, it builds an integrated view of these multi-workflow data, following the W3C PROV recommendations for its data schema. It does not require changes to user code or systems (i.e., no instrumentation); all users need to do is create an adapter for their system or tool, if one is not available yet.

Currently, FlowCept provides adapters for: Dask, MLFlow, TensorBoard, and Zambeze.

See the Jupyter Notebooks for utilization examples.

See the Contributing file for guidelines on contributing new adapters. Note that the codebase may use the term 'plugin' as a synonym for adapter; future releases should standardize the terminology to use adapter.

Install and Setup:

  1. Install FlowCept:

Run pip install .[full] in this directory (or pip install flowcept[full]).

For convenience, this installs the dependencies for all adapters, including adapters you may never use. To install only what you need, list the adapter keys explicitly, e.g., pip install .[adapter_key1,adapter_key2] or pip install .[dask]. See extra_requirements if you want to install the dependencies individually.

  2. Start MongoDB and Redis:

To take full advantage of FlowCept, you need to run Redis (FlowCept's message queue system) and MongoDB (FlowCept's main database system). The easiest way to start both is with the provided docker-compose file for FlowCept's dependent services. RabbitMQ is only needed if you also want to observe Zambeze messages.

  3. Define the settings (e.g., routes and ports) in the settings.yaml file.

  4. Start the observation using the Controller API, as shown in the Jupyter Notebooks and sketched in the example below this list.

  5. To use FlowCept's Query API, see the utilization examples in the notebooks.
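To make steps 4 and 5 concrete, here is a minimal sketch of driving the observation and querying the collected data from Python. The class and method names (FlowceptConsumerAPI, TaskQueryAPI, query) and the filter keys are assumptions and may not match this exact release; the Jupyter Notebooks are the authoritative reference.

# Hedged sketch: start observation, run workflows, then query provenance.
# ASSUMPTIONS: FlowceptConsumerAPI, TaskQueryAPI, and query() are assumed
# names; verify them against the Jupyter Notebooks for your version.
from flowcept import FlowceptConsumerAPI  # assumed controller class
from flowcept.flowcept_api.task_query_api import TaskQueryAPI  # assumed query API

# Start the controller: adapter interceptors publish captured task metadata
# to Redis, and the consumer persists the integrated view in MongoDB.
consumer = FlowceptConsumerAPI()
consumer.start()

# ... run the workflows you want to observe here (Dask, MLFlow, etc.) ...

consumer.stop()  # flush buffers and stop consuming

# Query the integrated multi-workflow data.
query_api = TaskQueryAPI()
tasks = query_api.query({"status": "FINISHED"})  # hypothetical filter document
print(f"Retrieved {len(tasks)} task documents")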

Performance Tuning for Performance Evaluation

In the settings.yaml file, the following variables might impact interception performance:

main_redis:
  buffer_size: 50
  insertion_buffer_time_secs: 5

plugin:
  enrich_messages: false

Other variables may also matter, depending on the plugin. For instance, in Dask, timestamp creation by the workers adds interception overhead.

Plugin-specific info

You can run pip install flowcept[plugin_name] to install the requirements for a specific plugin only, instead of installing the dependencies for all of them.
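As an illustration of how an adapter attaches to its target system without changing user code, below is a sketch for the Dask plugin. The Dask calls (LocalCluster, Client, add_plugin, register_worker_plugin) are standard Dask distributed APIs; the FlowCept adapter class names and their import location are assumptions, so check the Dask notebook for the exact identifiers.

# Sketch: attaching FlowCept's Dask adapter to a local Dask cluster.
# ASSUMPTION: the FlowCept class names and import path below are guesses;
# see the Dask notebook for the identifiers shipped with your version.
from dask.distributed import Client, LocalCluster
from flowcept import FlowceptDaskSchedulerAdapter, FlowceptDaskWorkerAdapter

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Register the observability plugins on the scheduler and the workers so task
# metadata is intercepted and forwarded to Redis without touching user code.
cluster.scheduler.add_plugin(FlowceptDaskSchedulerAdapter(cluster.scheduler))
client.register_worker_plugin(FlowceptDaskWorkerAdapter())

# From here on, ordinary Dask computations are observed transparently.
result = client.submit(sum, [1, 2, 3]).result()
print(result)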

RabbitMQ for Zambeze plugin

$ docker run -it --rm --name rabbitmq -d -p 5672:5672 -p 15672:15672 rabbitmq:3.11-management

TensorBoard

If you are on macOS, pip install may not work out of the box because of the TensorFlow library. You may need to pip install tensorflow-macos instead of the tensorflow library listed in the tensorboard-requirements.

Acknowledgement

This research uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flowcept-0.1.2.tar.gz (34.8 kB)

Uploaded Source

Built Distribution

flowcept-0.1.2-py3-none-any.whl (47.5 kB)

Uploaded Python 3

File details

Details for the file flowcept-0.1.2.tar.gz.

File metadata

  • Download URL: flowcept-0.1.2.tar.gz
  • Upload date:
  • Size: 34.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for flowcept-0.1.2.tar.gz
  • SHA256: 47ed1d1b512d3bb5c950d3d757578d65cf57a8b0b69276c3a1e009ccc38604c5
  • MD5: 270d3f5a28e17e4d9d9bde516369774a
  • BLAKE2b-256: b10ca88d97ae179b24ee8d252b80447f25ac297f18e7b97972ac0f04333c6e82

See more details on using hashes here.

File details

Details for the file flowcept-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: flowcept-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 47.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for flowcept-0.1.2-py3-none-any.whl
  • SHA256: 7fc4e0e2243196fc608cb196a5563cce35d8dfaa0e2bfacac6b9c0c048059428
  • MD5: 20eb9609b899a58b9300b73e6ae3ddce
  • BLAKE2b-256: 58e0c292dff3227ca21230bcbf1843827e8835e3a1bf9b356b121c4346951e6f

See more details on using hashes here.
