Python bindings for Vortex, an Apache Arrow-compatible toolkit for working with compressed array data.

Reason this release was yanked:

Renamed to vortex-data

Project description

Vortex


Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.

Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation.

Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster), while preserving approximately the same compression ratio and write throughput. It will also support very wide tables (at least tens of thousands of columns) and (eventually) on-device decompression on GPUs.

[!CAUTION] This library is still under rapid development and is a work in progress!

Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking ways, and we cannot yet guarantee correctness in all cases.

The major features of Vortex are:

  • Logical Types - a schema definition that makes no assertions about physical layout.
  • Zero-Copy to Arrow - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from Apache Arrow arrays.
  • Extensible Encodings - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible encodings, the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are implemented as extensions. While arbitrary encodings can be implemented as extensions, we have intentionally chosen a small set of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access reads, and (in the future) decompression on GPUs.
  • Cascading Compression - data can be recursively compressed with multiple nested encodings.
  • Pluggable Compression Strategies - the built-in Compressor is based on BtrBlocks, but other strategies can trivially be used instead.
  • Compute - basic compute kernels that can operate over encoded data (e.g., for filter pushdown).
  • Statistics - each array carries around lazily computed summary statistics, optionally populated at read-time. These are available to compute kernels as well as to the compressor.
  • Serialization - Zero-copy serialization of arrays, both for IPC and for file formats.
  • Columnar File Format (in progress) - A modern file format that uses the Vortex serde library to store compressed array data. Optimized for random access reads and extremely fast scans; an aspiring successor to Apache Parquet.
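As a concrete illustration of cascading compression, here is a minimal plain-Python sketch (not the Vortex API): a UTF8 column is dictionary-encoded, and the resulting integer codes, which tend to form runs, are then run-end encoded on top.

```python
def dict_encode(values):
    """Dictionary encoding: unique values plus one integer code per row."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_end_encode(codes):
    """Run-end encoding: one (value, end offset) pair per run."""
    runs = []
    for i, c in enumerate(codes):
        if runs and runs[-1][0] == c:
            runs[-1] = (c, i + 1)
        else:
            runs.append((c, i + 1))
    return runs

def decode(dictionary, runs):
    out, start = [], 0
    for value, end in runs:
        out.extend([dictionary[value]] * (end - start))
        start = end
    return out

column = ["ny", "ny", "ny", "sf", "sf", "ny"]
dictionary, codes = dict_encode(column)      # first encoding layer
runs = run_end_encode(codes)                 # second layer, nested on the codes
assert decode(dictionary, runs) == column    # decoding unwinds both layers
```

The two layers compose freely: each encoding only needs to know the encoding of its child, not the full stack.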

Overview: Logical vs Physical

One of the core design principles in Vortex is strict separation of logical and physical concerns.

For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding (the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., sparse and chunked) that are useful building blocks for other encodings. The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or dictionary encoding.

Analogously, vortex-serde is designed to handle the low-level physical details of reading and writing Vortex arrays. Choices about which encodings to use or how to logically chunk data are left up to the Compressor implementation.

One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.
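The idea of a self-describing footer can be sketched as follows. This is a toy illustration, not the actual Vortex file format: data buffers are written first, then a small metadata footer describing the physical layout, then the footer length, so a reader can locate the footer by seeking to the end of the file.

```python
import json
import struct

def write_file(buffers, layout):
    """Data buffers, then a JSON footer describing the layout,
    then a 4-byte little-endian footer length."""
    body = b"".join(buffers)
    footer = json.dumps(layout).encode()
    return body + footer + struct.pack("<I", len(footer))

def read_layout(blob):
    """Read the footer length from the end, then parse the footer."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    return json.loads(blob[-4 - footer_len:-4])

blob = write_file(
    [b"\x01\x02\x03", b"abc"],
    {"columns": [
        {"name": "id",   "encoding": "bitpacked", "offset": 0, "len": 3},
        {"name": "city", "encoding": "fsst",      "offset": 3, "len": 3},
    ]},
)
assert read_layout(blob)["columns"][1]["encoding"] == "fsst"
```

Because the layout travels with the data rather than being fixed by the specification, new encodings can appear in files without a format version bump.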

Components

Logical Types

The Vortex type-system is still in flux. The current set of logical types is:

  • Null
  • Bool
  • Integer(8, 16, 32, 64)
  • Float(16, b16, 32, 64)
  • Binary
  • UTF8
  • Struct
  • List (partially implemented)
  • Date/Time/DateTime/Duration (implemented as an extension type)
  • Decimal: TODO
  • FixedList: TODO
  • Tensor: TODO
  • Union: TODO

Canonical/Flat Encodings

Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical representations of each of the logical data types. The canonical encodings currently supported are:

  • Null
  • Bool
  • Primitive (Integer, Float)
  • Struct
  • VarBin (Binary, UTF8)
  • VarBinView (Binary, UTF8)
  • Extension
  • ...with more to come

Compressed Encodings

Vortex includes a set of highly data-parallel, vectorized encodings. These encodings each correspond to a compressed in-memory array implementation, allowing us to defer decompression. Currently, these are:

  • Adaptive Lossless Floating Point (ALP)
  • BitPacked (FastLanes)
  • Constant
  • Chunked
  • Delta (FastLanes)
  • Dictionary
  • Fast Static Symbol Table (FSST)
  • Frame-of-Reference
  • Run-end Encoding
  • RoaringUInt
  • RoaringBool
  • Sparse
  • ZigZag
  • ...with more to come
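As a flavor of how lightweight these encodings are, here is a plain-Python sketch of frame-of-reference encoding (illustrative only; the real implementation is vectorized and bit-packs the deltas):

```python
def for_encode(values):
    """Frame-of-reference: store the minimum once, plus small deltas
    that fit in far fewer bits than the original values."""
    reference = min(values)
    deltas = [v - reference for v in values]
    bit_width = max(d.bit_length() for d in deltas) or 1
    return reference, bit_width, deltas

def for_decode(reference, deltas):
    return [reference + d for d in deltas]

values = [1_000_003, 1_000_001, 1_000_007, 1_000_000]
reference, bit_width, deltas = for_encode(values)
assert bit_width == 3                         # deltas 0..7 need 3 bits, not 64
assert for_decode(reference, deltas) == values
```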

Compression

Vortex's default compression strategy is based on the BtrBlocks paper.

Roughly, for each chunk of data, a sample of at least ~1% of the values is taken. Compression is then attempted (recursively) with a set of lightweight encodings, and the best-performing combination of encodings is chosen to encode the entire chunk. This sounds expensive, but given basic statistics about a chunk it is possible to cheaply prune many encodings and keep the search space from exploding.
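A toy version of this strategy might look like the following (plain Python; the real compressor samples multiple ranges, uses richer cost models, and recurses into nested encodings):

```python
import math
import random

def rle_size(sample):
    """Estimated run-end-encoded size: two slots per run."""
    runs = 1 + sum(a != b for a, b in zip(sample, sample[1:]))
    return runs * 2

def dict_size(sample):
    """Estimated dictionary-encoded size: dictionary plus one code per value."""
    return len(set(sample)) + len(sample)

def plain_size(sample):
    return len(sample)

ENCODINGS = {"rle": rle_size, "dict": dict_size, "plain": plain_size}

def choose_encoding(chunk, sample_ratio=0.01, rng=random.Random(0)):
    """Estimate each candidate's cost on a small contiguous sample,
    then pick the cheapest encoding for the whole chunk."""
    k = max(64, math.ceil(len(chunk) * sample_ratio))
    if len(chunk) <= k:
        sample = chunk
    else:
        start = rng.randrange(len(chunk) - k)   # contiguous, so runs survive
        sample = chunk[start:start + k]
    return min(ENCODINGS, key=lambda name: ENCODINGS[name](sample))

chunk = [7] * 90_000 + [8] * 10_000   # long runs, so run-end encoding wins
assert choose_encoding(chunk) == "rle"
```

Pruning (e.g., skipping run-end encoding when statistics show a high run count) is what keeps the recursive search tractable.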

Compute

Vortex provides the ability for each encoding to specialize the implementation of a compute function to avoid decompressing where possible. For example, filtering a dictionary-encoded UTF8 array can be more cheaply performed by filtering the dictionary first.
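The dictionary example can be sketched in plain Python (illustrative only, not the Vortex compute API): the predicate runs once per dictionary entry rather than once per row.

```python
def filter_dict_encoded(dictionary, codes, predicate):
    """Evaluate the predicate once per dictionary entry (not once per row),
    then keep rows via a cheap lookup into the precomputed mask."""
    keep = [predicate(value) for value in dictionary]   # |dictionary| calls
    return [code for code in codes if keep[code]]       # |rows| lookups

dictionary = ["apple", "banana", "cherry"]
codes = [0, 1, 1, 2, 0, 1]                   # six rows over three distinct values
surviving = filter_dict_encoded(dictionary, codes, lambda s: s.startswith("b"))
assert [dictionary[c] for c in surviving] == ["banana", "banana", "banana"]
```

When the dictionary is much smaller than the array, this avoids both decompression and most of the predicate work.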

Note--as mentioned above--that Vortex does not intend to become a full-fledged compute engine, but rather to implement basic compute operations as may be required for efficient scanning & pushdown.

Statistics

Vortex arrays carry lazily-computed summary statistics. Unlike other array libraries, these statistics can be populated from disk formats such as Parquet and preserved all the way into a compute engine. Statistics are available to compute kernels as well as to the compressor.

The current statistics are:

  • BitWidthFreq
  • TrailingZeroFreq
  • IsConstant
  • IsSorted
  • IsStrictSorted
  • Max
  • Min
  • RunCount
  • TrueCount
  • NullCount
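A minimal sketch of lazily computed, memoized statistics (plain Python, not the Vortex API) might look like:

```python
class LazyStats:
    """Summary statistics computed on first request and cached thereafter,
    mirroring the lazily-populated statistics described above."""

    def __init__(self, values):
        self._values = values
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._compute(name)
        return self._cache[name]

    def _compute(self, name):
        non_null = [v for v in self._values if v is not None]
        if name == "min":
            return min(non_null)
        if name == "max":
            return max(non_null)
        if name == "null_count":
            return len(self._values) - len(non_null)
        if name == "is_sorted":
            return all(a <= b for a, b in zip(non_null, non_null[1:]))
        raise KeyError(name)

stats = LazyStats([3, 5, None, 9])
assert stats.get("min") == 3
assert stats.get("null_count") == 1
assert stats.get("is_sorted") is True
```

Statistics loaded from a file footer would simply pre-populate the cache, so consumers never recompute what the writer already knew.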

Serialization / Deserialization (Serde)

The goals of the vortex-serde implementation are:

  • Support scanning (column projection + row filter) with zero-copy and zero heap allocation.
  • Support random access in constant or near-constant time.
  • Forward statistical information (such as sortedness) to consumers.
  • Provide IPC format for sending arrays between processes.
  • Provide an extensible, best-in-class file format for storing columnar data on disk or in object storage.

TODO: insert diagram here

Integration with Apache Arrow

Apache Arrow is the de facto standard for interoperating on columnar array data. Naturally, Vortex is designed to be maximally compatible with Apache Arrow. All Arrow arrays can be converted into Vortex arrays with zero-copy, and a Vortex array constructed from an Arrow array can be converted back to Arrow, again with zero-copy.

It is important to note that Vortex and Arrow have different--albeit complementary--goals.

Vortex explicitly separates logical types from physical encodings, distinguishing it from Arrow. This allows Vortex to model more complex arrays while still exposing a logical interface. For example, Vortex can model a UTF8 ChunkedArray where the first chunk is run-length encoded and the second chunk is dictionary encoded. In Arrow, RunLengthArray and DictionaryArray are separate incompatible types, and so cannot be combined in this way.
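A plain-Python sketch of that ChunkedArray example (illustrative; the class names here are invented, not the Vortex API): both chunks decode to the same logical UTF8 values while using different physical representations.

```python
class RunEndChunk:
    """Physical representation: (value, end offset) runs."""
    def __init__(self, runs):
        self.runs = runs
    def to_list(self):
        out, start = [], 0
        for value, end in self.runs:
            out.extend([value] * (end - start))
            start = end
        return out

class DictChunk:
    """Physical representation: dictionary plus integer codes."""
    def __init__(self, dictionary, codes):
        self.dictionary, self.codes = dictionary, codes
    def to_list(self):
        return [self.dictionary[c] for c in self.codes]

# One logical UTF8 column built from two physically different chunks:
chunked = [RunEndChunk([("ny", 3), ("sf", 5)]),
           DictChunk(["ny", "sf"], [1, 0, 1])]
logical = [v for chunk in chunked for v in chunk.to_list()]
assert logical == ["ny", "ny", "ny", "sf", "sf", "sf", "ny", "sf"]
```

Consumers see one logical sequence of strings; only the encodings themselves know how each chunk is stored.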

Usage

For best performance we recommend using MiMalloc as the application's allocator.

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL_ALLOC: MiMalloc = MiMalloc;

Contributing

Please see CONTRIBUTING.md.

Setup

To build Vortex, you may also need to install the FlatBuffers compiler (flatc):

Mac

brew install flatbuffers

This repo uses rye to manage the combined Rust/Python monorepo build. First, make sure to run:

# Install Rye from https://rye-up.com, and setup the virtualenv
rye sync

License

Licensed under the Apache License, Version 2.0 (the "License").

Governance

Vortex is and will remain an open-source project. Our intent is to model its governance structure after the Substrait project, which in turn is based on the model of the Apache Software Foundation. Expect more details on this in Q4 2024.

Acknowledgments 🏆

This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers and OSS developers.

In particular, the academic papers behind BtrBlocks, FastLanes, ALP, and FSST greatly influenced the development.

Additionally, we benefited greatly from:

  • the existence, ideas, & implementation of Apache Arrow.
  • likewise for the excellent Apache DataFusion project.
  • the parquet2 project by Jorge Leitao.
  • the public discussions around choices of compression codecs, as well as the C++ implementations thereof, from duckdb.
  • the Velox and Nimble projects, and discussions with their maintainers.

Thanks to all of the aforementioned for sharing their work and knowledge with the world! 🚀

Project details


Download files

Download the file for your platform.

Source Distribution

vortex_array-0.11.0.tar.gz (317.6 kB)

Uploaded: Source

Built Distributions


vortex_array-0.11.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)

Uploaded: CPython 3.11+, manylinux (glibc 2.17+), x86-64

vortex_array-0.11.0-cp311-abi3-macosx_11_0_arm64.whl (4.4 MB)

Uploaded: CPython 3.11+, macOS 11.0+, ARM64

vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.whl (4.6 MB)

Uploaded: CPython 3.11+, macOS 10.12+, x86-64

vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (9.0 MB)

Uploaded: CPython 3.11+, macOS 10.12+ universal2 (ARM64, x86-64)

File details

Details for the file vortex_array-0.11.0.tar.gz.

File metadata

  • Download URL: vortex_array-0.11.0.tar.gz
  • Upload date:
  • Size: 317.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for vortex_array-0.11.0.tar.gz:

  • SHA256: 4fec9906d37f72dc5b6ec508d2e5ecd74a233fb714a1e3b7f12b096addc16b28
  • MD5: 319c62824e99362663b616807a51eb00
  • BLAKE2b-256: 0285eb40f826e8b72567885f186df51fec64268f770a5563cc8f77325ff6399e

See more details on using hashes here.

File details

Details for the file vortex_array-0.11.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for vortex_array-0.11.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

  • SHA256: 6900cf3b4f4de93062407714d2d216f53451c7da37559ef64189fe86db25fb2b
  • MD5: 9d2aa3cabb931d25b4401639521905cc
  • BLAKE2b-256: 5c2f28080fd29415eb749ef199309af1fe38da1b9a16d9e0916aa64d95728be5


File details

Details for the file vortex_array-0.11.0-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for vortex_array-0.11.0-cp311-abi3-macosx_11_0_arm64.whl:

  • SHA256: 4565a5f52dcc2e5d8a34f90015bfdd75cb98c89967701ebe051a585ede1666b5
  • MD5: 08e823d0012662110481e15792b23cf4
  • BLAKE2b-256: 689a34a8d4a46d55a7e6ac84123a1196736be4110a3c7ad4d56ffa17f06c2cde


File details

Details for the file vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.whl:

  • SHA256: 0cd7290ce59108c61bb387cdd8c32ec010a1238d7cfac8da56e409dba94cd1a8
  • MD5: 9b34a22f9957d0cc7210d2f994fc3f99
  • BLAKE2b-256: ae22ef43c38cfafe986c204f1e5dc86317d0d69f6bd51f3be9b41c28dbd10a0d


File details

Details for the file vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

File hashes

Hashes for vortex_array-0.11.0-cp311-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

  • SHA256: 689f76fa5bb0b7c05a0b7702ebaae91a89e6114f43cc9172a191e24eced47f24
  • MD5: ab0e473b7bbc620badcf7d7cbd3d7fc7
  • BLAKE2b-256: ef8a83cbfef603d13740f893f0012c2b8872c01877867c28cc21320e45ff222e

