Cassandra data loader for ML pipelines

Cassandra plugin for NVIDIA DALI

Overview

This plugin enables data loading from an Apache Cassandra NoSQL database into the NVIDIA Data Loading Library (DALI), which can be used to load and preprocess images for PyTorch or TensorFlow.

DALI compatibility

The plugin has been tested and is compatible with DALI v2.0.

Running the Docker container

The easiest way to test the cassandra-dali-plugin is by using the provided Docker images, which can be built and run easily with Docker Compose. The setup includes two containers:

To build and run the containers, use the following commands:

docker compose up --build -d
docker compose exec dali-cassandra fish

Note that Cassandra DB may take 1-2 minutes to become fully operational, even though the container starts almost immediately. Also, for improved performance and for data persistence, consider modifying docker-compose.yml to mount a host directory for Cassandra on a fast disk.
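
Since the database may need a minute or two before it accepts connections, a client script can poll the Cassandra port before starting work. The following is a minimal sketch using only the Python standard library; the helper name wait_for_port and the hostname "cassandra_host" are illustrative assumptions, and treating an open TCP port as a sign of readiness is an approximation rather than a guarantee.

```python
import socket
import time


def wait_for_port(host, port=9042, timeout=180.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that the port accepts connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, host not resolvable yet, etc.: retry later.
            time.sleep(interval)
    return False


# Example (placeholder hostname):
#     if not wait_for_port("cassandra_host"):
#         raise SystemExit("Timed out waiting for Cassandra")
```

The default port 9042 matches the cassandra_port default listed in the parameters below.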

How to call the plugin

Once installed, the plugin can be loaded with:

import crs4.cassandra_utils
import nvidia.dali.plugin_manager as plugin_manager
import nvidia.dali.fn as fn
import pathlib

plugin_path = pathlib.Path(crs4.cassandra_utils.__path__[0])
plugin_path = plugin_path.parent.parent.joinpath("libcrs4cassandra.so")
plugin_path = str(plugin_path)
plugin_manager.load_library(plugin_path)

At this point the plugin can be integrated into a DALI pipeline, for example by replacing a call to fn.readers.file with:

images, labels = fn.crs4.cassandra(
    name="Reader", cassandra_ips=["cassandra_host"],
    table="imagenet.train_data", label_col="label", label_type="int",
    data_col="data", id_col="img_id",
    source_uuids=train_uuids, prefetch_buffers=2,
)

The following sections summarize the meaning of each parameter. If you prefer to skip them, working examples are listed in the Examples section.

Basic parameters

  • name: name of the module to be passed to DALI (e.g. "Reader")
  • cassandra_ips: list of IPs or hostnames pointing to the DB (e.g., ["cassandra_host"])
  • cassandra_port: Cassandra TCP port (default: 9042)
  • table: data table (e.g., imagenet.train_data)
  • label_col: name of the label column (e.g., label)
  • label_type: type of label: "int", "blob" or "none" ("int" is typically used for classification, "blob" for segmentation)
  • data_col: name of the data column (e.g., data)
  • id_col: name of the UUID column (e.g., img_id)
  • source_uuids: full list of UUIDs, as strings, to be retrieved
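
As an illustration of how these parameters fit together, the sketch below collects them in a dictionary and converts UUID objects to the strings expected by source_uuids. The helper make_reader_args is hypothetical (it is not part of the plugin), and the random UUIDs only stand in for IDs that would normally be read from the database.

```python
import uuid

ALLOWED_LABEL_TYPES = {"int", "blob", "none"}


def make_reader_args(uuids, table, cassandra_ips, label_type="int",
                     cassandra_port=9042):
    """Collect the basic reader parameters in a dict (hypothetical helper)."""
    if label_type not in ALLOWED_LABEL_TYPES:
        raise ValueError(f"label_type must be one of {sorted(ALLOWED_LABEL_TYPES)}")
    return {
        "name": "Reader",
        "cassandra_ips": list(cassandra_ips),
        "cassandra_port": cassandra_port,
        "table": table,
        "label_col": "label",
        "label_type": label_type,
        "data_col": "data",
        "id_col": "img_id",
        # source_uuids expects strings, so uuid.UUID objects are converted here.
        "source_uuids": [str(u) for u in uuids],
    }


# Random UUIDs stand in for IDs that would normally come from the database.
example_uuids = [uuid.uuid4() for _ in range(3)]
reader_args = make_reader_args(example_uuids, "imagenet.train_data", ["cassandra_host"])
```

With the plugin loaded, such a dictionary could presumably be unpacked into the reader call, as in fn.crs4.cassandra(**reader_args).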

Authentication and authorization

The Cassandra server provides a wide range of (non-mandatory) options for configuring authentication and authorization. The plugin supports them through the following parameters:

  • username: username for Cassandra
  • password: password for Cassandra
  • use_ssl: whether to encrypt transfers with SSL (True or False)
  • ssl_certificate: public key of the Cassandra server (e.g., "server.crt")
  • ssl_own_certificate: public key of the client (e.g., "client.crt")
  • ssl_own_key: private key of the client (e.g., "client.key")
  • ssl_own_key_pass: password protecting the private key of the client (e.g., "blablabla")
  • cloud_config: Astra-like configuration (e.g., {'secure_connect_bundle': 'secure-connect-blabla.zip'})

Their use is demonstrated in the private_data.py file, which is used by the examples.
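
One common way to keep credentials out of source code is to read them from environment variables. The sketch below is not the content of private_data.py; the helper auth_args_from_env and the variable names CASSANDRA_USER, CASSANDRA_PASS and CASSANDRA_SERVER_CERT are assumptions made for illustration. Only the resulting keys (username, password, use_ssl, ssl_certificate) come from the parameter list above.

```python
import os


def auth_args_from_env(env=os.environ):
    """Build optional authentication keyword arguments (hypothetical helper)."""
    args = {}
    # CASSANDRA_USER / CASSANDRA_PASS are illustrative variable names.
    if "CASSANDRA_USER" in env:
        args["username"] = env["CASSANDRA_USER"]
        args["password"] = env.get("CASSANDRA_PASS", "")
    # Enable SSL only when a server certificate path is supplied.
    if "CASSANDRA_SERVER_CERT" in env:
        args["use_ssl"] = True
        args["ssl_certificate"] = env["CASSANDRA_SERVER_CERT"]
    return args
```

The returned dictionary is empty when no variables are set, so it can be merged with the basic reader parameters without changing the unauthenticated default.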

Performance tuning

This plugin offers extensive internal parallelism that can be adjusted to enhance pipeline performance. Refer for example to this discussion on how to improve the throughput over a long fat network.

Data model

The main idea behind this plugin is that relatively small files can be efficiently stored and retrieved as BLOBs in a NoSQL DB. This enables scalable data loading through prefetching and pipelining. It also allows data to be stored in a separate location, potentially far from where it is processed, and to be kept alongside a comprehensive set of associated metadata, which is convenient during machine learning.

For convenience and improved performance, data and metadata are stored in separate tables within the database. The metadata table is used to select the images to be processed. These images are identified by UUIDs and stored as BLOBs in the data table. During training, only the data table is accessed. Below, you will find examples of functional code for creating and populating these tables in the database.
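
As a rough illustration of this two-table layout, the sketch below builds CQL statements as Python strings. It is not the project's actual schema: the table and column names of the data table are taken from the reader example above (imagenet.train_data, img_id, label, data), while the metadata table name, its columns and the replication settings are assumptions.

```python
# Keyspace name taken from the table used in the reader example.
KEYSPACE = "imagenet"

CREATE_KEYSPACE = (
    f"CREATE KEYSPACE IF NOT EXISTS {KEYSPACE} "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
)

# Data table: the only table read during training (UUID -> label and BLOB).
CREATE_DATA_TABLE = (
    f"CREATE TABLE IF NOT EXISTS {KEYSPACE}.train_data ("
    "img_id uuid PRIMARY KEY, label int, data blob);"
)

# Metadata table (hypothetical columns): used to select which UUIDs to load.
CREATE_METADATA_TABLE = (
    f"CREATE TABLE IF NOT EXISTS {KEYSPACE}.train_metadata ("
    "img_id uuid PRIMARY KEY, filename text, split text);"
)

for statement in (CREATE_KEYSPACE, CREATE_DATA_TABLE, CREATE_METADATA_TABLE):
    print(statement)
```

The statements could then be executed with any CQL client, for instance cqlsh or the Cassandra Python driver.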

Further details

A technical report which describes in detail our plugin performance, with a focus on high-latency connections, is available here.

Citation

@misc{versaci2025hidinglatenciesnetworkbasedimage,
      title={Hiding Latencies in Network-Based Image Loading for Deep Learning},
      author={Francesco Versaci and Giovanni Busonera},
      year={2025},
      eprint={2503.22643},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2503.22643},
}

Examples

Classification

See the following annotated example for details on how to use this plugin:

A variant of the same example implemented with PyTorch Lightning is available in:

Segmentation

A less extensively annotated example for segmentation can be found in:

Multilabel

An example showing how to save and decode multilabels as serialized numpy tensors can be found in:

Split-file

An example of how to automatically create a single file with data split to feed the training application:
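
Independently of the project's own example, the general idea of a split file can be sketched as follows: shuffle the UUIDs reproducibly, divide them into training and validation subsets, and save both lists in a single file. The function names, the JSON format and the 80/20 ratio are assumptions for illustration and may differ from the format the plugin's examples actually use.

```python
import json
import random
import uuid


def split_uuids(uuids, train_fraction=0.8, seed=0):
    """Split UUIDs into reproducible train/val lists of strings."""
    # Sorting first makes the result independent of the input order.
    shuffled = sorted(str(u) for u in uuids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return {"train": shuffled[:cut], "val": shuffled[cut:]}


def save_split(split, path):
    """Write the split to a single JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(split, fh)


# Random UUIDs stand in for IDs selected from the metadata table.
split = split_uuids([uuid.uuid4() for _ in range(10)])
```

The "train" list could then be passed as source_uuids to the reader used for training, and the "val" list to a second reader used for validation.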

Inference with NVIDIA Triton

This plugin also supports efficient inference via NVIDIA Triton server:

Installation

You can install the plugin via pip:

pip install cassandra-dali-plugin

Building from source

cassandra-dali-plugin requires:

  • NVIDIA DALI
  • Cassandra C/C++ driver
  • Cassandra Python driver

Build prerequisites

The C++ plugin links against the following system libraries, which must be installed before building. The Cassandra C++ driver is now automatically fetched and compiled by the build system if not found on the system.

Library    Debian/Ubuntu package    Source
libuv      libuv1-dev               libuv.org
OpenSSL    libssl-dev               openssl.org
CMake      cmake (>= 3.25.2)        cmake.org

On Debian/Ubuntu, install the prerequisites with:

sudo apt-get install -y libuv1-dev libssl-dev cmake build-essential

Note: Build using clang or gcc-12. Other compilers may not be supported.

Once the prerequisites are in place, build and install the plugin from the source directory using pip or uv:

# Install the plugin
pip install .

Authors

Cassandra Data Loader is developed by

License

cassandra-dali-plugin is licensed under the Apache License, Version 2.0. See LICENSE for further details.

Acknowledgment
