License: MIT

FlowCept

FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.

FlowCept is intended for scenarios where multiple workflows in a science campaign or an enterprise run and generate important data that must be analyzed in an integrated manner. Because these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or run on different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), FlowCept's key differentiator is its ability to seamlessly and automatically integrate data from various workflows using data observability. It builds an integrated data view at runtime, enabling end-to-end exploratory data analysis and monitoring. Its data schema follows the W3C PROV recommendations. It does not require changes to user code or systems (i.e., instrumentation). Users only need to create an adapter for their system or tool if one is not yet available.

Currently, FlowCept provides adapters for Dask, MLflow, TensorBoard, and Zambeze.

See the Jupyter Notebooks for utilization examples.

See the Contributing file for guidelines on contributing new adapters. Note that the codebase may use the term 'plugin' as a synonym for adapter. Future releases should standardize the terminology on adapter.

Install and Setup:

  1. Install FlowCept:

Run pip install .[full] in this directory (or pip install flowcept[full]).

For convenience, this installs the dependencies for all adapters, including adapters you may not use. To install only what you need, use pip install .[adapter_key1,adapter_key2] with the keys of the adapters we have implemented, e.g., pip install .[dask]. See extra_requirements if you want to install the dependencies individually.

  2. Start MongoDB and Redis:

To take full advantage of FlowCept, you need to run Redis, which FlowCept uses as its message queue system, and MongoDB, which it uses as its main database system. The easiest way to start both is with the docker-compose file provided for these dependent services. You only need RabbitMQ if you also want to observe Zambeze messages.

  3. Define the settings (e.g., routes and ports) accordingly in the settings.yaml file.

  4. Start the observation using the Controller API, as shown in the Jupyter Notebooks.

  5. To use FlowCept's Query API, see utilization examples in the notebooks.
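
Before starting the observation, it can help to confirm that the services from the setup steps are actually accepting connections. The following is a minimal sketch that uses only the Python standard library and is not part of FlowCept. It assumes Redis and MongoDB run on localhost with their stock default ports (6379 and 27017); the names DEFAULT_SERVICES, is_reachable, and check_services are hypothetical helpers introduced for illustration. Adjust the hosts and ports to whatever you defined in settings.yaml.

```python
import socket

# Assumption: Redis and MongoDB run locally on their stock default ports.
# Change these to match the routes and ports in your settings.yaml.
DEFAULT_SERVICES = {
    "redis": ("localhost", 6379),
    "mongodb": ("localhost", 27017),
}


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_services(services=DEFAULT_SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_reachable(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

This only checks that a port is open, not that the service behind it is healthy or correctly configured, so treat it as a quick sanity check rather than a full validation.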

Performance Tuning for Performance Evaluation

In the settings.yaml file, the following variables might impact interception performance:

main_redis:
  buffer_size: 50
  insertion_buffer_time_secs: 5

plugin:
  enrich_messages: false

Other variables may also affect performance, depending on the plugin (adapter). For instance, in Dask, timestamp creation by workers adds interception overhead.
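
To build intuition for why buffering settings like these affect overhead, here is a minimal sketch of a size-and-time bounded buffer. It is not FlowCept's implementation. It assumes that buffer_size is the number of messages that triggers a flush and that insertion_buffer_time_secs is the maximum time between flushes; the class FlushBuffer and its clock parameter are hypothetical names introduced for illustration.

```python
import time


class FlushBuffer:
    """Hypothetical buffer that flushes when full or when it gets too old."""

    def __init__(self, flush, buffer_size=50,
                 insertion_buffer_time_secs=5.0, clock=time.monotonic):
        self._flush = flush              # callable that receives a list of messages
        self._buffer_size = buffer_size  # assumed meaning: flush at this many messages
        self._max_age = insertion_buffer_time_secs  # assumed meaning: max seconds between flushes
        self._clock = clock
        self._items = []
        self._last_flush = clock()

    def add(self, message):
        """Store a message and flush if either limit has been reached."""
        self._items.append(message)
        too_full = len(self._items) >= self._buffer_size
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_full or too_old:
            self.flush()

    def flush(self):
        """Hand all buffered messages to the flush callable in one batch."""
        if self._items:
            self._flush(list(self._items))
            self._items.clear()
        self._last_flush = self._clock()


if __name__ == "__main__":
    batches = []
    buf = FlushBuffer(batches.append, buffer_size=3, insertion_buffer_time_secs=60)
    for i in range(7):
        buf.add({"task_id": i})
    buf.flush()  # drain whatever is left
    print([len(b) for b in batches])  # prints [3, 3, 1]
```

Under these assumptions, a larger buffer size or a longer buffer time means fewer, larger writes to the database, at the cost of data becoming visible later. Note that this sketch only checks the time limit when a new message arrives; a real system would more likely use a background timer.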

Acknowledgement

This research uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
