vineyard: an in-memory immutable data manager

Vineyard (v6d) is an innovative in-memory immutable data manager that offers out-of-the-box high-level abstractions and zero-copy in-memory sharing for distributed data in various big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

Vineyard is a CNCF sandbox project, made successful by its community.

What is vineyard

Vineyard is specifically designed to facilitate zero-copy data sharing among big data systems. To illustrate this, let’s consider a typical machine learning task of time series prediction with LSTM. This task can be broken down into several steps:

  • First, we read the data from the file system as a pandas.DataFrame.

  • Next, we apply various preprocessing tasks, such as eliminating null values, to the dataframe.

  • Once the data is preprocessed, we define the model and train it on the processed dataframe using PyTorch.

  • Finally, we evaluate the performance of the model.

In a single-machine environment, pandas and PyTorch, despite being two distinct systems designed for different tasks, can share data efficiently with minimal overhead. This works because the whole pipeline runs end to end within a single Python script, so both libraries operate on data in the same process.

[Figure: comparing the workflow with and without vineyard]

What if the input data is too large to be processed on a single machine?

As depicted on the left side of the figure, a common approach is to store the data as tables in a distributed file system (e.g., HDFS) and replace pandas with ETL processes written in SQL on a big data system such as Hive or Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. However, this approach introduces several challenges for developers:

  1. For the same task, users must program for multiple systems (SQL & Python).

  2. Data can be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope) are becoming increasingly common. Tables and SQL may not be the most efficient way to store, exchange, or process them. Transforming the data from/to “tables” between different systems can result in significant overhead.

  3. Saving/loading the data to/from external storage incurs substantial memory-copy and I/O costs.

Vineyard addresses these issues by providing:

  1. In-memory distributed data sharing in a zero-copy fashion, which avoids additional I/O costs by leveraging a shared memory manager derived from Plasma.

  2. Built-in out-of-the-box high-level abstractions to share distributed data with complex structures (e.g., distributed graphs) with minimal extra development cost, while eliminating transformation costs.

As depicted on the right side of the above figure, we demonstrate how to integrate vineyard to address the task in a big data context.

First, we use Mars (a tensor-based unified framework for large-scale data computation that scales NumPy, pandas, and scikit-learn) to preprocess the raw data, similar to the single-machine solution, and store the preprocessed dataframe in vineyard.

single

import pandas as pd
data_csv = pd.read_csv('./data.csv', usecols=[1])

distributed

import mars.dataframe as md
dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
# after preprocessing, save the dataset to vineyard
vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from vineyard. Here, vineyard makes sharing distributed data between Mars and PyTorch as simple as passing a local variable in the single-machine solution.

single

data_X, data_Y = create_dataset(dataset)

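distributed

import vineyard
# vineyard_ipc_socket is the path of the local vineyard instance's IPC socket
client = vineyard.connect(vineyard_ipc_socket)
# fetch the part of the distributed dataset that is local to this worker
dataset = client.get(vineyard_distributed_tensor_id).local_partition()
data_X, data_Y = create_dataset(dataset)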

Finally, we execute the training phase in a distributed manner across the cluster.

From this example, it is evident that with vineyard, the task in the big data context can be addressed with only minor adjustments to the single-machine solution. Compared to existing approaches, vineyard effectively eliminates I/O and transformation overheads.
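The snippets above call create_dataset without showing it. A minimal sketch follows, assuming it builds sliding-window samples for next-step prediction, a common choice for LSTM time-series models. The look_back parameter and the window layout are assumptions made for illustration; they are not taken from the original example.

```python
def create_dataset(series, look_back=2):
    """Split a sequence into (window, next value) training pairs."""
    data_X, data_Y = [], []
    for start in range(len(series) - look_back):
        end = start + look_back
        data_X.append(list(series[start:end]))  # look_back consecutive values
        data_Y.append(series[end])              # the value to predict
    return data_X, data_Y


data_X, data_Y = create_dataset([10, 20, 30, 40, 50], look_back=2)
print(data_X)  # [[10, 20], [20, 30], [30, 40]]
print(data_Y)  # [30, 40, 50]
```

Because the function only needs len() and slicing, the same sketch would accept a plain list on a single machine or a locally fetched partition, which is the point of the comparison above.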

Features

Efficient In-Memory Immutable Data Sharing

Vineyard serves as an in-memory immutable data manager, enabling efficient data sharing across different systems via shared memory without additional overheads. By eliminating serialization/deserialization and IO costs during data exchange between systems, Vineyard significantly improves performance.
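To illustrate the general idea of sharing through shared memory, here is a small sketch using Python's standard multiprocessing.shared_memory module. It is not Vineyard's implementation or API; the share_bytes helper is hypothetical, and both "producer" and "consumer" run in one process to keep the example short.

```python
from multiprocessing import shared_memory


def share_bytes(payload: bytes) -> bytes:
    """Write payload into a shared segment once, then read it via a second handle."""
    size = len(payload)
    # "Producer": allocate a named shared-memory segment and fill it.
    producer = shared_memory.SharedMemory(create=True, size=size)
    try:
        producer.buf[:size] = payload
        # "Consumer": attach to the same segment by name; no data is transferred.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            view = consumer.buf[:size]  # memoryview over the shared pages
            try:
                result = bytes(view)    # copied only to return a plain value
            finally:
                view.release()          # views must be released before close()
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()               # free the segment
    return result


print(share_bytes(b"vineyard"))  # b'vineyard'
```

In a real deployment the consumer would be a different process that receives only the segment name, so the payload itself is never serialized or sent over a socket.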

Out-of-the-Box High-Level Data Abstractions

Computation frameworks often have their own data abstractions for high-level concepts. For example, tensors can be represented as torch.Tensor, tf.Tensor, mxnet.ndarray.NDArray, etc. Moreover, every graph processing engine has its own graph structure representation.

The diversity of data abstractions complicates data sharing. Vineyard addresses this issue by providing out-of-the-box high-level data abstractions over in-memory blobs, using hierarchical metadata to describe objects. Various computation systems can leverage these built-in high-level data abstractions to exchange data with other systems in a computation pipeline concisely and efficiently.
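The idea of describing objects with hierarchical metadata over in-memory blobs can be sketched in plain Python. Everything below is hypothetical: the field names (typename, buffer, columns), the blob identifiers, and the resolve helper are invented for illustration and do not reflect Vineyard's actual metadata schema.

```python
# Hypothetical blob store: blob id -> immutable raw payload.
blobs = {
    "blob-0": bytes([1, 2, 3]),
    "blob-1": bytes([4, 5, 6]),
}

# Hypothetical metadata tree: a dataframe is described by its columns,
# and each column is a tensor that points at one blob.
dataframe_meta = {
    "typename": "dataframe",
    "columns": {
        "a": {"typename": "tensor", "dtype": "uint8", "buffer": "blob-0"},
        "b": {"typename": "tensor", "dtype": "uint8", "buffer": "blob-1"},
    },
}


def resolve(meta):
    """Rebuild a Python-side view of an object by walking its metadata."""
    if meta["typename"] == "tensor":
        # Wrap the blob in a memoryview instead of copying its payload.
        return memoryview(blobs[meta["buffer"]])
    if meta["typename"] == "dataframe":
        return {name: resolve(col) for name, col in meta["columns"].items()}
    raise ValueError(f"unknown typename: {meta['typename']}")


frame = resolve(dataframe_meta)
print(list(frame["a"]))  # [1, 2, 3]
print(list(frame["b"]))  # [4, 5, 6]
```

The design point is that only the small metadata tree needs to be interpreted by each system; the payloads stay where they are and are referenced rather than converted.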

Stream Pipelining for Enhanced Performance

A computation doesn’t need to wait for all preceding results to arrive before starting its work. Vineyard provides a stream as a special kind of immutable data for pipelining scenarios. The preceding job can write immutable data chunk by chunk to Vineyard while maintaining data structure semantics. The successor job reads shared-memory chunks from Vineyard’s stream without extra copy costs and triggers its work. This overlapping reduces the overall processing time and memory consumption.
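The overlap described above can be illustrated with a generic producer/consumer sketch built on Python's standard queue and threading modules. This is an analogy only, not Vineyard's stream API; the function names and the bounded queue size are assumptions chosen for the example.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def producer(chunks, stream):
    """Write chunks one by one, as the preceding job would."""
    for chunk in chunks:
        stream.put(chunk)  # blocks while the bounded stream is full
    stream.put(_END)


def consumer(stream, results):
    """Process each chunk as soon as it arrives."""
    while True:
        chunk = stream.get()
        if chunk is _END:
            break
        results.append(sum(chunk))


def run_pipeline(chunks):
    stream = queue.Queue(maxsize=2)  # at most two chunks in flight
    results = []
    worker = threading.Thread(target=consumer, args=(stream, results))
    worker.start()
    producer(chunks, stream)
    worker.join()
    return results


print(run_pipeline([(1, 2), (3, 4), (5, 6)]))  # [3, 7, 11]
```

The consumer starts working after the first chunk is written, and the bounded queue keeps only a few chunks resident at a time, which mirrors the reduction in both latency and memory consumption mentioned above.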

Versatile Drivers for Common Tasks

Many big data analytical tasks involve numerous boilerplate routines that are unrelated to the computation itself, such as various IO adapters, data partition strategies, and migration jobs. Since data structure abstractions usually differ between systems, these routines cannot be easily reused.

Vineyard provides common manipulation routines for immutable data as drivers. In addition to sharing high-level data abstractions, Vineyard extends the capability of data structures with drivers, enabling out-of-the-box reusable routines for the boilerplate parts in computation jobs.

Try Vineyard

Vineyard is available as a Python package and can be installed using pip:

pip3 install vineyard

For comprehensive and up-to-date documentation, please visit https://v6d.io.

If you wish to build vineyard from source, please consult the Installation guide. For instructions on building and running unit tests locally, refer to the Contributing section.

After installation, you can initiate a vineyard instance using the following command:

python3 -m vineyard

For further details on connecting to a locally deployed vineyard instance, please explore the Getting Started guide.
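As a rough sketch of what working with a running instance might look like, the helpers below store a value and read it back. They assume the Python client exposes put alongside the connect and get calls shown earlier; the socket path must be supplied by you, and InMemoryClient is a hypothetical stand-in that lets the sketch run without a vineyard server.

```python
import itertools


def connect_local(ipc_socket):
    """Connect to a local vineyard instance at the given IPC socket path."""
    import vineyard  # imported lazily so this sketch loads without vineyard
    return vineyard.connect(ipc_socket)


def round_trip(client, value):
    """Store a value, then fetch it back by its object id."""
    object_id = client.put(value)
    return client.get(object_id)


class InMemoryClient:
    """Hypothetical stand-in with put/get, for dry runs without a server."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def put(self, value):
        object_id = next(self._ids)
        self._objects[object_id] = value
        return object_id

    def get(self, object_id):
        return self._objects[object_id]


print(round_trip(InMemoryClient(), [1, 2, 3]))  # [1, 2, 3]
```

With a real instance, round_trip(connect_local(path), value) would follow the same shape, where path is the IPC socket of your instance; consult the Getting Started guide for the exact path and the supported value types.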

Deploying on Kubernetes

Vineyard is designed to efficiently share immutable data between different workloads, making it a natural fit for cloud-native computing. On Kubernetes, Vineyard enables efficient distributed data sharing while leveraging the platform's scaling and scheduling capabilities.

To effectively manage all components of Vineyard within a Kubernetes cluster, we have developed the Vineyard Operator. For more information, please refer to the Vineyard Operator documentation.

FAQ

Vineyard shares many similarities with other open-source projects, yet it also has distinct features. We often receive the following questions about Vineyard:

  • Q: Can clients access the data while the stream is being filled?

    Sharing one piece of data among multiple clients is a target scenario for Vineyard, as the data stored in Vineyard is immutable. Multiple clients can safely consume the same piece of data through memory sharing, without incurring extra costs or additional memory usage from copying data back and forth.

  • Q: How does Vineyard avoid serialization/deserialization between systems in different languages?

    Vineyard provides high-level data abstractions (e.g., ndarrays, dataframes) that can be naturally shared between different processes, eliminating the need for serialization and deserialization between systems in different languages.

  • … …

For more detailed information, please refer to our FAQ page.

Get Involved

  • Join the CNCF Slack and participate in the #vineyard channel for discussions and collaboration.

  • Familiarize yourself with our contribution guide to understand the process of contributing to vineyard.

  • If you encounter any bugs or issues, please report them by submitting a GitHub issue or start a conversation on GitHub Discussions.

  • We welcome and appreciate your contributions! Submit them using pull requests.

Thank you in advance for your valuable contributions to vineyard!

Publications

  • Wenyuan Yu, Tao He, Lei Wang, Ke Meng, Ye Cao, Diwen Zhu, Sanhong Li, Jingren Zhou. Vineyard: Optimizing Data Sharing in Data-Intensive Analytics. ACM SIG Conference on Management of Data (SIGMOD), industry, 2023.

Acknowledgements

We thank the following excellent open-source projects:

  • apache-arrow, a cross-language development platform for in-memory analytics.

  • boost-leaf, a C++ lightweight error augmentation framework.

  • cityhash, a family of hash functions for strings.

  • ctti, a C++ compile-time type information library.

  • dlmalloc, Doug Lea’s memory allocator.

  • etcd-cpp-apiv3, a C++ API for etcd’s v3 client API.

  • flat_hash_map, an efficient hashmap implementation.

  • libcuckoo, a high-performance, concurrent hash table.

  • mimalloc, a general purpose allocator with excellent performance characteristics.

  • nlohmann/json, a JSON library for modern C++.

  • pybind11, a library for seamless operability between C++11 and Python.

  • s3fs, a library that provides a convenient Python filesystem interface for S3.

  • skywalking-infra-e2e, an end-to-end testing framework.

  • skywalking-swck, a Kubernetes operator for Apache SkyWalking.

  • wyhash, C++ wrapper around wyhash and wyrand.

License

Vineyard is distributed under the Apache License 2.0. Please note that third-party libraries may not have the same license as vineyard.
