A package for machine learning metadata management and analysis

Project description

ArangoML Pipeline

ArangoML Pipeline is a common and extensible Metadata Layer for Machine Learning Pipelines which allows Data Scientists and DataOps to manage all information related to their ML pipeline in one place.

News: ArangoML Pipeline Cloud offers a no-setup, free-to-try managed service for ArangoML Pipeline. An ArangoML Pipeline Cloud tutorial is also available and requires no installation or signup.

Quick Start

To get started without installing anything (using ArangoML Pipeline Cloud), click: Open In Colab

The examples folder contains notebooks that illustrate the features of Arangopipe.

Overview

When machine learning pipelines are created, for example with TensorFlow Extended or Kubeflow, capturing metadata across the pipeline, and being able to access it, is vital. Typically, each component of such an ML pipeline produces or requires metadata, for example:

  • Data storage: size, location, creation date, checksum, ...
  • Feature Store (processed dataset): transformation, version, base datasets, ...
  • Model Training: training/validation performance, training duration, ...
  • Model Serving: model lineage, serving performance, ...
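
As a rough illustration of the kinds of metadata listed above, the following sketch builds plain Python dictionaries for a stored dataset and for a training run. It does not use the Arangopipe API. The field names, the helper functions `dataset_metadata` and `training_metadata`, and the example location are assumptions made for this example only.

```python
import hashlib
from datetime import datetime, timezone


def dataset_metadata(name: str, location: str, payload: bytes) -> dict:
    """Build a metadata document for a stored dataset (hypothetical schema)."""
    return {
        "type": "dataset",
        "name": name,
        "location": location,
        "size_bytes": len(payload),
        "created_at": datetime.now(timezone.utc).isoformat(),
        # A checksum lets later pipeline stages detect that the data changed.
        "checksum_sha256": hashlib.sha256(payload).hexdigest(),
    }


def training_metadata(run_id: str, dataset_name: str, training_score: float,
                      validation_score: float, duration_seconds: float) -> dict:
    """Build a metadata document for a training run (hypothetical schema)."""
    return {
        "type": "run",
        "run_id": run_id,
        "dataset": dataset_name,
        "training_score": training_score,
        "validation_score": validation_score,
        "duration_seconds": duration_seconds,
    }


# Illustrative values only.
raw = b"a,b\n1,2\n"
dataset_doc = dataset_metadata("dataset_y", "s3://example-bucket/dataset_y.csv", raw)
run_doc = training_metadata("run_34", "dataset_y", 0.95, 0.91, 120.0)
```

Because each document is a free-form dictionary, different components can record different fields without agreeing on a fixed schema up front.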

Instead of each component storing its own metadata separately, a common metadata layer simplifies data management and permits querying the entire pipeline. As a multi-model database that supports both document and graph data models efficiently within a single database engine, ArangoDB is a great fit for such a metadata layer, for the following reasons:

  • The metadata produced by each component is typically unstructured (e.g., TensorFlow's training metadata is different from PyTorch's metadata) and hence a great fit for document databases
  • The relationship between the different entities (i.e., metadata) can be neatly expressed as graphs (e.g., this model has been trained by run_34 on dataset_y)
  • Metadata queries are easily expressed as graph traversals (e.g., all models which have been derived from dataset_y)
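
To make the graph view above concrete, here is a minimal in-memory sketch in plain Python. It does not use ArangoDB or the Arangopipe API. It stores relationships as directed edges and answers "all models derived from dataset_y" with a breadth-first traversal. Apart from `run_34` and `dataset_y`, all node names and the `derived_models` helper are hypothetical.

```python
from collections import defaultdict, deque

# Each edge points from an upstream entity to something derived from it.
EDGES = [
    ("dataset_y", "featureset_1"),
    ("featureset_1", "run_34"),
    ("run_34", "model_a"),
    ("dataset_y", "run_35"),
    ("run_35", "model_b"),
    ("dataset_z", "run_36"),
    ("run_36", "model_c"),
]

NODE_TYPES = {
    "dataset_y": "dataset",
    "dataset_z": "dataset",
    "featureset_1": "featureset",
    "run_34": "run",
    "run_35": "run",
    "run_36": "run",
    "model_a": "model",
    "model_b": "model",
    "model_c": "model",
}


def derived_models(start, edges, node_types):
    """Return the sorted names of all models reachable from `start`."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)

    seen = {start}
    queue = deque([start])
    models = []
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append(neighbour)
            if node_types.get(neighbour) == "model":
                models.append(neighbour)
    return sorted(models)


models_from_y = derived_models("dataset_y", EDGES, NODE_TYPES)
```

In a graph database the same question would be expressed as a traversal query rather than hand-written search code, which is the convenience the list above refers to.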

Use Cases

ArangoML Pipeline can benefit many scenarios, such as:

  • Capture of lineage information (e.g., Which dataset influences which model?)
  • Capture of audit information (e.g., A given model was trained two months ago with the following training/validation performance)
  • Reproducible model training
  • Model serving policy (e.g., Which model should be deployed in production based on training statistics?)
  • Extension of existing ML pipelines through a simple Python/HTTP API
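
The model serving policy use case can be sketched as a small selection rule over recorded training statistics. This is one possible policy, not something prescribed by ArangoML Pipeline: it picks the model with the highest validation score among those that clear a minimum threshold. The record fields, the model names, and the function `select_model_for_serving` are assumptions for illustration.

```python
from typing import List, Optional


def select_model_for_serving(runs: List[dict],
                             min_validation_score: float) -> Optional[str]:
    """Pick the model with the best validation score above a threshold.

    Returns None when no recorded run meets the threshold.
    """
    eligible = [r for r in runs if r["validation_score"] >= min_validation_score]
    if not eligible:
        return None
    best = max(eligible, key=lambda r: r["validation_score"])
    return best["model"]


# Illustrative training statistics, as they might be read from a metadata store.
recorded_runs = [
    {"model": "model_a", "run_id": "run_34", "validation_score": 0.88},
    {"model": "model_b", "run_id": "run_35", "validation_score": 0.93},
    {"model": "model_c", "run_id": "run_36", "validation_score": 0.79},
]

chosen = select_model_for_serving(recorded_runs, min_validation_score=0.85)
```

Keeping the statistics in a shared metadata layer means a rule like this can be evaluated without asking each training component for its own records.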

Documentation

Please refer to the Arangopipe documentation for further information.

Download files

Source Distribution: arangopipe-0.0.6.9.5.tar.gz (18.5 kB)

Built Distribution: arangopipe-0.0.6.9.5-py3-none-any.whl (20.9 kB, Python 3)