vineyard: an in-memory immutable data manager

Vineyard (v6d) is an innovative in-memory immutable data manager that offers out-of-the-box high-level abstractions and zero-copy in-memory sharing for distributed data in various big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

Vineyard is a CNCF sandbox project, and its success is owed to its community.

What is vineyard

Vineyard is specifically designed to facilitate zero-copy data sharing among big data systems. To illustrate this, let’s consider a typical machine learning task of time series prediction with LSTM. This task can be broken down into several steps:

  • First, we read the data from the file system as a pandas.DataFrame.

  • Next, we apply various preprocessing tasks, such as eliminating null values, to the dataframe.

  • Once the data is preprocessed, we define the model and train it on the processed dataframe using PyTorch.

  • Finally, we evaluate the performance of the model.

In a single-machine environment, pandas and PyTorch, despite being two distinct systems designed for different tasks, can efficiently share data with minimal overhead. This is achieved through an end-to-end process within a single Python script.

Figure: Comparing the workflow with and without vineyard

What if the input data is too large to be processed on a single machine?

As depicted on the left side of the figure, a common approach is to store the data as tables in a distributed file system (e.g., HDFS) and replace pandas with ETL processes using SQL over a big data system such as Hive and Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. However, this can introduce challenges for developers.

  1. For the same task, users must program for multiple systems (SQL & Python).

  2. Data can be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope) are becoming increasingly common. Tables and SQL may not be the most efficient way to store, exchange, or process them. Transforming the data from/to “tables” between different systems can result in significant overhead.

  3. Saving/loading the data to/from external storage incurs substantial memory-copy and I/O costs.

Vineyard addresses these issues by providing:

  1. In-memory distributed data sharing in a zero-copy fashion to avoid introducing additional I/O costs by leveraging a shared memory manager derived from plasma.

  2. Built-in out-of-the-box high-level abstractions to share distributed data with complex structures (e.g., distributed graphs) with minimal extra development cost, while eliminating transformation costs.
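To make the zero-copy idea concrete, here is a minimal sketch that uses Python's standard `multiprocessing.shared_memory` module rather than vineyard itself. The helper names (`publish`, `attach`, `zero_copy_demo`) are illustrative assumptions and are not part of vineyard's API; the point is only that a second reader maps the same memory by name instead of receiving a copy.

```python
from multiprocessing import shared_memory


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Copy the payload once into a new shared-memory segment (producer side)."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    return segment


def attach(name: str) -> shared_memory.SharedMemory:
    """Map an existing segment by name; no payload bytes are copied here."""
    return shared_memory.SharedMemory(name=name)


def zero_copy_demo() -> bytes:
    payload = b"immutable block of sensor readings"
    producer = publish(payload)
    consumer = attach(producer.name)
    try:
        # A memoryview slice is a window onto the shared pages, not a copy.
        view = consumer.buf[:len(payload)]
        # Copy only to hand a plain value back to the caller of this demo.
        result = bytes(view)
        # Views must be released before the mapping can be closed.
        view.release()
        return result
    finally:
        consumer.close()
        producer.close()
        producer.unlink()


print(zero_copy_demo())
```

In a real deployment the producer and consumer would be separate processes, and treating the segment as read-only after publication is what makes concurrent readers safe.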

As depicted on the right side of the above figure, we demonstrate how to integrate vineyard to address the task in a big data context.

First, we use Mars (a tensor-based unified framework for large-scale data computation that scales NumPy, pandas, and scikit-learn) to preprocess the raw data, similar to the single-machine solution, and store the preprocessed dataframe in vineyard.

single

import pandas as pd
data_csv = pd.read_csv('./data.csv', usecols=[1])

distributed

import mars.dataframe as md
dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
# after preprocessing, save the dataset to vineyard
vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from vineyard. Vineyard makes sharing distributed data between Mars and PyTorch feel like passing a local variable in the single-machine solution.

single

data_X, data_Y = create_dataset(dataset)

distributed

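import vineyard
# connect to the local vineyard instance through its IPC socket
client = vineyard.connect(vineyard_ipc_socket)
# fetch the shared object and keep only the partition local to this worker
dataset = client.get(vineyard_distributed_tensor_id).local_partition()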
data_X, data_Y = create_dataset(dataset)

Finally, we execute the training phase in a distributed manner across the cluster.

From this example, it is evident that with vineyard, the task in the big data context can be addressed with only minor adjustments to the single-machine solution. Compared to existing approaches, vineyard effectively eliminates I/O and transformation overheads.
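The snippets above call `create_dataset` without defining it. As a minimal sketch, assuming it builds sliding-window (input, target) pairs for time series prediction, it could look like the following. The `look_back` parameter and the use of plain Python sequences instead of a dataframe are assumptions made for illustration, not the original tutorial's code.

```python
from typing import List, Sequence, Tuple


def create_dataset(
    series: Sequence[float], look_back: int = 2
) -> Tuple[List[List[float]], List[float]]:
    """Turn a 1-D series into (window, next value) training pairs."""
    if look_back < 1:
        raise ValueError("look_back must be at least 1")
    data_x: List[List[float]] = []
    data_y: List[float] = []
    for start in range(len(series) - look_back):
        # Inputs: look_back consecutive values; target: the value that follows.
        data_x.append(list(series[start:start + look_back]))
        data_y.append(series[start + look_back])
    return data_x, data_y


data_X, data_Y = create_dataset([10.0, 20.0, 30.0, 40.0, 50.0], look_back=2)
print(data_X)  # [[10.0, 20.0], [20.0, 30.0], [30.0, 40.0]]
print(data_Y)  # [30.0, 40.0, 50.0]
```

Because the function only depends on the sequence it receives, the same code works whether the input comes from a local file or from a partition fetched from vineyard.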

Features

Efficient In-Memory Immutable Data Sharing

Vineyard serves as an in-memory immutable data manager, enabling efficient data sharing across different systems via shared memory without additional overheads. By eliminating serialization/deserialization and IO costs during data exchange between systems, Vineyard significantly improves performance.

Out-of-the-Box High-Level Data Abstractions

Computation frameworks often have their own data abstractions for high-level concepts. For example, tensors can be represented as torch.tensor, tf.Tensor, mxnet.ndarray, etc. Moreover, every graph processing engine has its unique graph structure representation.

The diversity of data abstractions complicates data sharing. Vineyard addresses this issue by providing out-of-the-box high-level data abstractions over in-memory blobs, using hierarchical metadata to describe objects. Various computation systems can leverage these built-in high-level data abstractions to exchange data with other systems in a computation pipeline concisely and efficiently.
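The idea of hierarchical metadata over blobs can be sketched in plain Python. Everything below (the `blobs` dictionary, the `example::` type names, and the metadata fields) is a hypothetical layout chosen for illustration and is not vineyard's actual metadata schema. It shows how a high-level object is described by a small metadata tree whose leaves point at raw byte buffers.

```python
import struct
from typing import Dict

# Hypothetical blob store: object id -> raw immutable bytes.
blobs: Dict[str, bytes] = {
    "blob-1": struct.pack("<3q", 1, 2, 3),        # three little-endian int64 values
    "blob-2": struct.pack("<3d", 0.5, 1.5, 2.5),  # three little-endian float64 values
}

# Hypothetical metadata tree: a dataframe lists columns, each column names a blob.
dataframe_meta = {
    "typename": "example::DataFrame",
    "columns": [
        {"typename": "example::Column", "name": "id",
         "format": "q", "length": 3, "buffer": "blob-1"},
        {"typename": "example::Column", "name": "score",
         "format": "d", "length": 3, "buffer": "blob-2"},
    ],
}


def resolve_column(meta: dict) -> list:
    """Decode one column by following its metadata to the underlying blob."""
    layout = "<{}{}".format(meta["length"], meta["format"])
    return list(struct.unpack(layout, blobs[meta["buffer"]]))


def resolve_dataframe(meta: dict) -> Dict[str, list]:
    """Rebuild a column-name -> values mapping from the metadata tree."""
    return {col["name"]: resolve_column(col) for col in meta["columns"]}


print(resolve_dataframe(dataframe_meta))
# {'id': [1, 2, 3], 'score': [0.5, 1.5, 2.5]}
```

Any system that understands the metadata can rebuild its own native representation from the same blobs, which is what lets different frameworks exchange the data without an intermediate file format.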

Stream Pipelining for Enhanced Performance

A computation doesn’t need to wait for all preceding results to arrive before starting its work. Vineyard provides a stream as a special kind of immutable data for pipelining scenarios. The preceding job can write immutable data chunk by chunk to Vineyard while maintaining data structure semantics. The successor job reads shared-memory chunks from Vineyard’s stream without extra copy costs and triggers its work. This overlapping reduces the overall processing time and memory consumption.
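The overlap described above can be illustrated with a small producer/consumer sketch built on Python's standard `queue` and `threading` modules. This is not vineyard's stream API; the names `producer`, `consumer`, and `run_pipeline` are assumptions for illustration. The consumer starts working on each immutable chunk as soon as it arrives, and the bounded queue caps how many chunks are held in memory at once.

```python
import queue
import threading
from typing import Iterable, List, Tuple

END_OF_STREAM = None  # sentinel telling the consumer that no more chunks follow


def producer(stream: "queue.Queue", chunks: Iterable[Tuple[int, ...]]) -> None:
    for chunk in chunks:
        stream.put(chunk)  # each chunk is an immutable tuple
    stream.put(END_OF_STREAM)


def consumer(stream: "queue.Queue", partial_sums: List[int]) -> None:
    while True:
        chunk = stream.get()
        if chunk is END_OF_STREAM:
            break
        # Work begins per chunk, without waiting for the full dataset.
        partial_sums.append(sum(chunk))


def run_pipeline(chunks: Iterable[Tuple[int, ...]]) -> List[int]:
    stream: "queue.Queue" = queue.Queue(maxsize=2)  # bounded buffer
    partial_sums: List[int] = []
    workers = [
        threading.Thread(target=producer, args=(stream, chunks)),
        threading.Thread(target=consumer, args=(stream, partial_sums)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return partial_sums


print(run_pipeline([(1, 2, 3), (4, 5), (6,)]))  # [6, 9, 6]
```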

Versatile Drivers for Common Tasks

Many big data analytical tasks involve numerous boilerplate routines that are unrelated to the computation itself, such as various IO adapters, data partition strategies, and migration jobs. Since data structure abstractions usually differ between systems, these routines cannot be easily reused.

Vineyard provides common manipulation routines for immutable data as drivers. In addition to sharing high-level data abstractions, Vineyard extends the capability of data structures with drivers, enabling out-of-the-box reusable routines for the boilerplate parts in computation jobs.
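One way to picture drivers is as a registry of reusable routines keyed by data type. The sketch below is a hypothetical illustration in plain Python; `register_driver`, `run_driver`, and the `example::Table` type name are assumptions and do not reflect vineyard's actual driver API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: (type name, routine name) -> reusable function.
_drivers: Dict[Tuple[str, str], Callable] = {}


def register_driver(typename: str, routine: str) -> Callable:
    """Decorator that attaches a routine to a data type."""
    def decorator(func: Callable) -> Callable:
        _drivers[(typename, routine)] = func
        return func
    return decorator


def run_driver(typename: str, routine: str, obj, *args):
    """Look up and run the routine registered for this type."""
    try:
        driver = _drivers[(typename, routine)]
    except KeyError:
        raise LookupError(
            "no driver {!r} registered for {!r}".format(routine, typename)
        ) from None
    return driver(obj, *args)


@register_driver("example::Table", "partition")
def partition_rows(rows: tuple, parts: int) -> list:
    """Split immutable rows into `parts` round-robin partitions."""
    return [rows[i::parts] for i in range(parts)]


@register_driver("example::Table", "to_csv")
def rows_to_csv(rows: tuple) -> str:
    """Serialize rows as comma-separated lines."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


table = ((1, "a"), (2, "b"), (3, "c"))
print(run_driver("example::Table", "partition", table, 2))
# [((1, 'a'), (3, 'c')), ((2, 'b'),)]
print(run_driver("example::Table", "to_csv", table))
```

Because the routines are keyed by the shared data abstraction rather than by a particular engine, any job that produces or consumes that type can reuse them.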

Try Vineyard

Vineyard is available as a Python package and can be installed using pip:

pip3 install vineyard

For comprehensive and up-to-date documentation, please visit https://v6d.io.

If you wish to build vineyard from source, please consult the Installation guide. For instructions on building and running unit tests locally, refer to the Contributing section.

After installation, you can initiate a vineyard instance using the following command:

python3 -m vineyard

For further details on connecting to a locally deployed vineyard instance, please explore the Getting Started guide.
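As a rough sketch of what connecting looks like, the following stores an object and reads it back through a vineyard client using `vineyard.connect`, `client.put`, and `client.get`. The socket path `/var/run/vineyard.sock` is assumed to be the one your instance listens on (adjust it if your instance reports a different path), and the helper names `round_trip` and `demo_with_vineyard` are illustrative. The example also assumes NumPy is installed.

```python
def round_trip(client, value):
    """Store a value through the client and read it back by its object id."""
    object_id = client.put(value)
    return client.get(object_id)


def demo_with_vineyard(socket_path: str = "/var/run/vineyard.sock"):
    """Run the round trip against a locally running vineyard instance."""
    # Imported lazily so that round_trip itself has no hard dependencies.
    import numpy as np
    import vineyard

    # Assumption: socket_path matches the IPC socket of your instance.
    client = vineyard.connect(socket_path)
    return round_trip(client, np.arange(8))
```

With an instance started via `python3 -m vineyard`, calling `demo_with_vineyard()` from another Python process should return an array equal to the one that was stored.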

Deploying on Kubernetes

Vineyard is designed to efficiently share immutable data between different workloads, making it a natural fit for cloud-native computing. By embracing cloud-native big data processing and Kubernetes, Vineyard enables efficient distributed data sharing in cloud-native environments while leveraging the scaling and scheduling capabilities of Kubernetes.

To effectively manage all components of Vineyard within a Kubernetes cluster, we have developed the Vineyard Operator. For more information, please refer to the Vineyard Operator documentation.

FAQ

Vineyard shares many similarities with other open-source projects, yet it also has distinct features. We often receive the following questions about Vineyard:

  • Q: Can clients access the data while the stream is being filled?

    Sharing one piece of data among multiple clients is a target scenario for Vineyard, as the data stored in Vineyard is immutable. Multiple clients can safely consume the same piece of data through memory sharing, without incurring extra costs or additional memory usage from copying data back and forth.

  • Q: How does Vineyard avoid serialization/deserialization between systems in different languages?

    Vineyard provides high-level data abstractions (e.g., ndarrays, dataframes) that can be naturally shared between different processes, eliminating the need for serialization and deserialization between systems in different languages.

  • … …

For more detailed information, please refer to our FAQ page.

Get Involved

  • Join the CNCF Slack and participate in the #vineyard channel for discussions and collaboration.

  • Familiarize yourself with our contribution guide to understand the process of contributing to vineyard.

  • If you encounter any bugs or issues, please report them by submitting a GitHub issue or by starting a conversation on GitHub Discussions.

  • We welcome and appreciate your contributions! Submit them using pull requests.

Thank you in advance for your valuable contributions to vineyard!

Publications

If you use this software, please cite our paper using the following metadata:

@article{yu2023vineyard,
   author = {Yu, Wenyuan and He, Tao and Wang, Lei and Meng, Ke and Cao, Ye and Zhu, Diwen and Li, Sanhong and Zhou, Jingren},
   title = {Vineyard: Optimizing Data Sharing in Data-Intensive Analytics},
   year = {2023},
   issue_date = {June 2023},
   publisher = {Association for Computing Machinery},
   address = {New York, NY, USA},
   volume = {1},
   number = {2},
   url = {https://doi.org/10.1145/3589780},
   doi = {10.1145/3589780},
   journal = {Proc. ACM Manag. Data},
   month = {jun},
   articleno = {200},
   numpages = {27},
   keywords = {data sharing, in-memory object store}
}

Acknowledgements

We thank the following excellent open-source projects:

  • apache-arrow, a cross-language development platform for in-memory analytics.

  • boost-leaf, a C++ lightweight error augmentation framework.

  • cityhash, a family of hash functions for strings.

  • dlmalloc, Doug Lea’s memory allocator.

  • etcd-cpp-apiv3, a C++ API for etcd’s v3 client API.

  • flat_hash_map, an efficient hashmap implementation.

  • gulrak/filesystem, an implementation of C++17 std::filesystem.

  • libcuckoo, a high-performance, concurrent hash table.

  • mimalloc, a general purpose allocator with excellent performance characteristics.

  • nlohmann/json, a JSON library for modern C++.

  • pybind11, a library for seamless operability between C++11 and Python.

  • s3fs, a library that provides a convenient Python filesystem interface for S3.

  • skywalking-infra-e2e, a next-generation end-to-end testing framework.

  • skywalking-swck, a Kubernetes operator for Apache SkyWalking.

  • wyhash, C++ wrapper around wyhash and wyrand.

  • BBHash, a fast, minimal-memory perfect hash function.

  • rax, an ANSI C radix tree implementation.

  • MurmurHash3, a fast non-cryptographic hash function.

License

Vineyard is distributed under the Apache License 2.0. Please note that third-party libraries may not have the same license as vineyard.


