FlowCept

FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.

FlowCept is intended for scenarios where multiple workflows in a science campaign or an enterprise run and generate important data that must be analyzed in an integrated manner. Because these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or run within different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), FlowCept's key differentiator is its ability to seamlessly and automatically integrate data from various workflows using data observability. It builds an integrated data view at runtime, enabling end-to-end exploratory data analysis and monitoring. It follows W3C PROV recommendations for its data schema. It does not require changes in user code or systems (i.e., instrumentation). All users need to do is create adapters for their systems or tools, if one is not available yet. In addition to observability, we provide instrumentation options for convenience. For example, when a @flowcept_task decorator is added to functions, FlowCept observes their executions when they run. We also provide special features for PyTorch modules: adding @torch_task to them enables extra model inspection data to be captured and integrated in the database at runtime.
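To make the decorator idea concrete, here is a minimal sketch of how a decorator can observe a function's execution. It is not FlowCept's implementation of @flowcept_task: the names observe_task and captured_tasks are hypothetical, and an in-memory list stands in for the provenance database. The record keys "used" and "generated" borrow W3C PROV vocabulary for illustration only.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory store standing in for a provenance database.
captured_tasks: List[Dict[str, Any]] = []


def observe_task(func: Callable) -> Callable:
    """Illustrative decorator (NOT FlowCept's @flowcept_task)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Record what the task used and when it started.
        record: Dict[str, Any] = {
            "activity": func.__name__,
            "used": {"args": args, "kwargs": kwargs},
            "started_at": time.time(),
        }
        try:
            result = func(*args, **kwargs)
            record["generated"] = result
            record["status"] = "FINISHED"
            return result
        except Exception as exc:
            record["status"] = "ERROR"
            record["error"] = repr(exc)
            raise
        finally:
            # Always store the record, whether the task succeeded or failed.
            record["ended_at"] = time.time()
            captured_tasks.append(record)

    return wrapper


@observe_task
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor


if __name__ == "__main__":
    scale(3.0, factor=4.0)
    last = captured_tasks[-1]
    print(last["activity"], last["generated"], last["status"])
```

The decorated function keeps its original behavior; the only side effect is one record per call, which is the property that lets this style of instrumentation be added with minimal code changes.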

Currently, FlowCept provides adapters for: Dask, MLFlow, TensorBoard, and Zambeze.

See the Jupyter Notebooks for utilization examples.

See the Contributing file for guidelines on contributing new adapters. Note that the codebase may use the term 'plugin' as a synonym for adapter. Future releases should standardize the terminology on adapter.

Install and Setup:

  1. Install FlowCept:

Run pip install .[full] in this directory (or pip install flowcept[full]).

For convenience, this installs the dependencies for all adapters, including adapters you may not use. To install only what you need, run pip install .[adapter_key1,adapter_key2] with the keys of the adapters we have implemented, e.g., pip install .[dask]. See extra_requirements if you want to install the dependencies individually.

  2. Start MongoDB and Redis:

To take full advantage of FlowCept, you need to start Redis and MongoDB instances. FlowCept uses Redis as its message queue system and MongoDB as its persistent database. For convenience, we provide a docker-compose deployment file for this. Run docker-compose -f deployment/compose.yml up. RabbitMQ is only needed if Zambeze messages are observed; otherwise, feel free to comment out the RabbitMQ service in the compose file.

  3. Define the settings (e.g., routes and ports) accordingly in the settings.yaml file. You may need to set the environment variable FLOWCEPT_SETTINGS_PATH to the absolute path of the settings file.

  4. Start the observation using the Controller API, as shown in the Jupyter Notebooks.

  5. To use FlowCept's Query API, see the utilization examples in the notebooks.
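Before starting the observation, it can help to confirm that the settings path is set and that the services are reachable. The sketch below is a generic pre-flight check using only the Python standard library, not part of FlowCept. The helper names are hypothetical, and it assumes that Redis and MongoDB listen on their usual default ports (6379 and 27017) on localhost and that FlowCept reads FLOWCEPT_SETTINGS_PATH when it loads its configuration; adjust both to match your settings.yaml and compose file.

```python
import os
import socket
from pathlib import Path


def set_settings_path(settings_file: str) -> str:
    """Export FLOWCEPT_SETTINGS_PATH as an absolute path and return it."""
    absolute = str(Path(settings_file).expanduser().resolve())
    os.environ["FLOWCEPT_SETTINGS_PATH"] = absolute
    return absolute


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Settings:", set_settings_path("settings.yaml"))
    # Assumed default ports; change them if your deployment differs.
    for name, port in {"Redis": 6379, "MongoDB": 27017}.items():
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port} is {state}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify credentials or that the service is the expected one.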

Performance Tuning for Performance Evaluation

In the settings.yaml file, the following variables might impact interception performance:

main_redis:
  buffer_size: 50
  insertion_buffer_time_secs: 5

plugin:
  enrich_messages: false

Other variables may also matter, depending on the plugin. For instance, in Dask, timestamp creation by workers adds interception overhead. As we evolve the software, other variables that impact overhead may appear, and we might not have stated them in this README file yet. If you are doing extensive performance evaluation experiments using this software, please reach out to us (e.g., create an issue in the repository) for hints on how to reduce the overhead of our software.
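To illustrate why buffer settings such as buffer_size and insertion_buffer_time_secs affect overhead, here is a minimal sketch of a buffer that flushes either when it holds enough messages or when enough time has passed since the last flush. It is an assumption about what such settings typically control, not FlowCept's actual buffering code; the class MessageBuffer and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, List


class MessageBuffer:
    """Hypothetical buffer that flushes on size or on elapsed time."""

    def __init__(
        self,
        flush_fn: Callable[[List[Any]], None],
        buffer_size: int = 50,
        max_wait_secs: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._buffer_size = buffer_size
        self._max_wait_secs = max_wait_secs
        self._clock = clock
        self._items: List[Any] = []
        self._last_flush = clock()

    def add(self, message: Any) -> None:
        self._items.append(message)
        too_many = len(self._items) >= self._buffer_size
        too_old = self._clock() - self._last_flush >= self._max_wait_secs
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        # Send everything buffered so far as one batch, then reset the timer.
        if self._items:
            self._flush_fn(list(self._items))
            self._items.clear()
        self._last_flush = self._clock()


if __name__ == "__main__":
    batches: List[List[int]] = []
    buffer = MessageBuffer(batches.append, buffer_size=3, max_wait_secs=60.0)
    for i in range(7):
        buffer.add(i)
    buffer.flush()  # flush the remainder
    print([len(batch) for batch in batches])  # prints [3, 3, 1]
```

Larger buffers and longer wait times mean fewer, bigger writes and therefore less per-message overhead, at the cost of data reaching the database later. This sketch only checks the elapsed time when a message is added; a production system would more likely use a background timer.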

Install AMD GPU Lib

On machines that have AMD GPUs, we use the official AMD ROCm library to capture GPU runtime data. Unfortunately, this library is not available as a PyPI or conda package, so you must install it manually. See the instructions at: https://rocm.docs.amd.com/projects/amdsmi/en/latest/
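After a manual installation, a quick way to confirm that the library is visible to your Python environment is to look it up without importing it. This sketch assumes the Python module is named amdsmi, as in the AMD SMI documentation linked above; the helper name is hypothetical and the check is not part of FlowCept.

```python
import importlib.util


def amdsmi_available(module_name: str = "amdsmi") -> bool:
    """Return True if the given module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if amdsmi_available():
        print("amdsmi found: AMD GPU data capture should be possible.")
    else:
        print("amdsmi not found: install it manually on AMD GPU machines.")
```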

See also

Cite us

If you used FlowCept for your research, consider citing our paper.

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability
R. Souza, T. Skluzacek, S. Wilkinson, M. Ziatdinov, and R. da Silva
19th IEEE International Conference on e-Science, 2023.

Bibtex:

@inproceedings{souza2023towards,  
  author = {Souza, Renan and Skluzacek, Tyler J and Wilkinson, Sean R and Ziatdinov, Maxim and da Silva, Rafael Ferreira},
  booktitle = {IEEE International Conference on e-Science},
  doi = {10.1109/e-Science58273.2023.10254822},
  link = {https://doi.org/10.1109/e-Science58273.2023.10254822},
  pdf = {https://arxiv.org/pdf/2308.09004.pdf},
  title = {Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability},
  year = {2023}
}

Disclaimer & Get in Touch

Please note that this is research software. We encourage you to give it a try and use it with your own stack. We are continuously working on improving the documentation and adding more examples and notebooks, but we are still far from having good documentation that covers the whole system. If you are interested in working with FlowCept in your own scientific project, we can give you a jump start if you reach out to us. Feel free to create an issue, create a new discussion thread, or drop us an email (we trust you'll find a way to reach out to us).

Acknowledgement

This research uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
