
Package supports all machine learning functionality for TileDB Embedded and TileDB Cloud

TileDB-ML

TileDB-ML is the repository that contains all the machine-learning-oriented functionality that TileDB supports. In this repo we explain how TileDB can be used for machine learning data management problems and outline the next steps we have in mind. We first want to highlight our perspective on how TileDB relates to these problems, and how the TileDB engine could be the solution for efficiently storing any kind of machine learning data, from raw images, text, audio, time series and SAR to features and machine learning models. Before you proceed, please take a quick look at our Medium blog post, which explains in detail how TileDB addresses many machine learning data format requirements and overcomes the drawbacks of other candidate formats. We also take this opportunity to solicit feedback and contributions from the community.

Description

As mentioned above, this repository contains all the machine-learning-oriented functionality that TileDB supports. Specifically, it contains code that can (or will be able to):

  • Save machine learning models as TileDB arrays. (At the moment we provide implementations for saving TensorFlow Keras, PyTorch and Scikit-Learn models.)

  • Load machine learning models from TileDB arrays.

  • Read features for training machine learning models from TileDB arrays directly into the data APIs of machine learning frameworks. We already support the TensorFlow and PyTorch data APIs, using Python generators, for both dense and sparse TileDB arrays. We are working on similar support for Scikit-Learn Pipelines, which will be out soon.
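
The generator-based reading described in the last item can be pictured with a minimal sketch. This is not the TileDB-ML implementation: `iter_batches` is a hypothetical helper, and plain Python lists stand in for the arrays. It only assumes that the data source supports `len()` and slicing, so that each step reads one slice instead of the whole dataset.

```python
from typing import Any, Iterator, Sequence, Tuple


def iter_batches(
    features: Sequence[Any], labels: Sequence[Any], batch_size: int
) -> Iterator[Tuple[Sequence[Any], Sequence[Any]]]:
    """Yield (features, labels) batches by slicing two aligned array-likes.

    Hypothetical helper for illustration; not part of the tiledb-ml API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    for start in range(0, len(features), batch_size):
        stop = min(start + batch_size, len(features))
        # Only the requested slice is read on each step.
        yield features[start:stop], labels[start:stop]


# Plain lists stand in for the feature and label arrays (assumption).
x = [[i, i + 1] for i in range(5)]
y = [i % 2 for i in range(5)]
batches = list(iter_batches(x, y, batch_size=2))
```

A generator of this shape is the kind of object that framework data APIs can typically wrap, which is why the last batch is allowed to be shorter than `batch_size` rather than being dropped.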

Examples

We provide detailed notebook examples that show how to save and load machine learning models as TileDB arrays (also on TileDB-Cloud) and explain why this is useful for building simple and flexible model registries with TileDB.

We also provide detailed notebook examples that show how to train TensorFlow and PyTorch models using our Data API support for dense and sparse TileDB arrays.

Installation

Quick Installation

TileDB-ML can be installed:

  • from source by cloning the Git repository:

    git clone https://github.com/TileDB-Inc/TileDB-ML.git
    cd TileDB-ML
    
    # In case you want to install and check all frameworks. If you
    # use zsh replace .[full] with .\[full\]
    pip install -e .[full]
    
    # In case you want to install and check Tensorflow only. If you
    # use zsh replace .[tensorflow] with .\[tensorflow\]
    pip install -e .[tensorflow]
    
    # In case you want to install and check PyTorch only. If you
    # use zsh replace .[pytorch] with .\[pytorch\]
    pip install -e .[pytorch]
    
    # In case you want to install and check Scikit-Learn only. If you
    # use zsh replace .[sklearn] with .\[sklearn\]
    pip install -e .[sklearn]  
    
    # In case you want to try any of the aforementioned machine learning frameworks
    # on TileDB-Cloud, try one of the following.
    pip install -e .[tensorflow_cloud]
    pip install -e .[pytorch_cloud]
    pip install -e .[sklearn_cloud]
    
  • with pip from git:

    pip install git+https://github.com/TileDB-Inc/TileDB-ML.git@master
    
  • from PyPI:

    pip install tiledb-ml

The above command installs only the base dependency of tiledb-ml, namely tiledb. To install the integration for a specific framework, use:

    pip install tiledb-ml[pytorch] # e.g. for the PyTorch integration only

To install all the supported frameworks, use:

    pip install tiledb-ml[full]

The above commands apply to the bash shell. If you use zsh, you need to escape the bracket characters, for example:

    pip install tiledb-ml\[pytorch\]

  • You may run the test suite with:

    python setup.py test
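
After installing, it can be handy to check which optional frameworks are actually importable. The sketch below uses only the Python standard library. The helper names are hypothetical, and the mapping from extras to import names (`tensorflow`, `torch`, `sklearn`) is an assumption about what each extra pulls in, not something taken from the package metadata.

```python
from importlib import metadata, util
from typing import Dict, Optional

# Assumed import names of the frameworks behind each extra.
FRAMEWORK_MODULES = {
    "tensorflow": "tensorflow",
    "pytorch": "torch",
    "sklearn": "sklearn",
}


def installed_version(distribution: str = "tiledb-ml") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def available_frameworks() -> Dict[str, bool]:
    """Report, per extra, whether the framework module can be located."""
    return {
        extra: util.find_spec(module) is not None
        for extra, module in FRAMEWORK_MODULES.items()
    }


if __name__ == "__main__":
    print("tiledb-ml version:", installed_version())
    print("frameworks:", available_frameworks())
```

`find_spec` only locates the module without importing it, so the check stays fast even when a heavy framework is installed.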
    

Roadmap

We are already working on the following:

  • C++ integration of TileDB with the Tensorflow Data API through tensorflow-io.
  • Readers from TileDB arrays to other popular machine learning framework Data APIs, as mentioned above.
  • Model save/load support for other popular machine learning frameworks like XGBoost and CatBoost.

Our ultimate goal is for ALL machine learning data, from raw data (text, images, audio) to features (Feature Store) and models (Model Registry), to be represented, stored and managed in one data engine, i.e., TileDB.

Note

Here we would like to highlight that our current implementations are not optimal and do not fully support the aforementioned machine learning frameworks. For example, model parts such as numpy arrays are serialized with Pickle, which is far from optimal because it is Python-only and insecure, as described here. We mainly provide a proof of concept that shows the universal data management ability of TileDB and how elegantly it applies to machine learning data of any kind. Optimizations will follow as soon as possible.
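
To make the Pickle caveat concrete, here is a minimal sketch of the general idea of turning a model part into a flat array of bytes and back. It is an illustration only, not the TileDB-ML code path: the helper names are hypothetical, a plain dict stands in for model weights, and the standard-library `array` type stands in for a one-dimensional byte array.

```python
import pickle
from array import array
from typing import Any


def to_byte_array(obj: Any) -> array:
    """Pickle obj and return the payload as a 1-D array of unsigned bytes."""
    return array("B", pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def from_byte_array(payload: array) -> Any:
    """Rebuild the object from its pickled bytes.

    Unpickling can execute arbitrary code, so only load trusted payloads.
    """
    return pickle.loads(payload.tobytes())


# A plain dict stands in for model weights (assumption for illustration).
weights = {"layer_0": [0.1, -0.2, 0.3], "bias": [0.0]}
payload = to_byte_array(weights)
restored = from_byte_array(payload)
```

The round trip works, but the payload can only be read from Python, and loading it from an untrusted source is unsafe. Both points are the limitations named in the note above.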

In any case, note that the TileDB-ML repository is under development, and the API is subject to change.

Contributing

We welcome all contributions! Please read the contributing guidelines before submitting pull requests.

Copyright

The TileDB-ML package is Copyright 2018-2021 TileDB, Inc

License

MIT
