Python bindings for Vortex, an Apache Arrow-compatible toolkit for working with compressed array data.

Reason this release was yanked:

Renamed to vortex-data

Vortex

Vortex is an Apache Arrow-compatible toolkit for working with compressed array data. We are using Vortex to develop a next-generation columnar file format for multidimensional arrays.

[!CAUTION] This library is still under rapid development and is very much a work in progress!

Some key features are not yet implemented, the API will almost certainly change in breaking ways, and we cannot yet guarantee correctness in all cases.

The major components of Vortex are (will be!):

  • Logical Types - a schema definition that makes no assertions about physical layout.
  • Encodings - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight compression codecs that have the potential to support GPU decompression.
  • Compression - recursive compression based on stratified samples of the input.
  • Compute - basic compute kernels that can operate over compressed data. Note that Vortex does not intend to become a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be required for efficient scanning & pushdown operations.
  • Statistics - each array carries around lazily computed summary statistics, optionally populated at read-time. These are available to compute kernels as well as to the compressor.
  • Serde - zero-copy serialization. Useful as a building block in creating IPC or file formats that contain compressed arrays.

Overview: Logical vs Physical

One of the core principles in Vortex is separation of the logical from the physical.

A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding (the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., sparse and chunked) that are useful building blocks for other encodings. The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or dictionary encoding.
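
As a rough illustration of this separation, the sketch below models one logical type with two physical encodings. The names `DType` and `Int64Encoding` are hypothetical and are not taken from the Vortex API; they only show that callers can read the same logical values regardless of layout.

```rust
/// Logical type: what each scalar element is, independent of layout.
/// (Hypothetical names, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    Int64,
    Utf8,
}

/// Physical encoding: how an Int64 array is laid out in memory.
#[derive(Debug)]
enum Int64Encoding {
    /// Flat buffer with one slot per element.
    Primitive(Vec<i64>),
    /// A single value logically repeated `len` times.
    Constant { value: i64, len: usize },
}

impl Int64Encoding {
    /// Every encoding of this array reports the same logical type.
    fn dtype(&self) -> DType {
        DType::Int64
    }

    fn len(&self) -> usize {
        match self {
            Int64Encoding::Primitive(values) => values.len(),
            Int64Encoding::Constant { len, .. } => *len,
        }
    }

    /// Element access goes through the logical interface.
    fn get(&self, index: usize) -> Option<i64> {
        match self {
            Int64Encoding::Primitive(values) => values.get(index).copied(),
            Int64Encoding::Constant { value, len } => (index < *len).then_some(*value),
        }
    }

    /// Decode to a flat vector of logical values.
    fn to_vec(&self) -> Vec<i64> {
        (0..self.len()).filter_map(|i| self.get(i)).collect()
    }
}
```

Here `Int64Encoding::Primitive(vec![7, 7, 7])` and `Int64Encoding::Constant { value: 7, len: 3 }` decode to the same logical values even though their memory footprints differ.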

Components

Logical Types

The Vortex type system is still in flux. The current set of logical types is:

  • Null
  • Bool
  • Integer(8, 16, 32, 64)
  • Float(16, b16, 32, 64)
  • Binary
  • UTF8
  • Struct
  • Decimal: TODO
  • Date/Time/DateTime/Duration: TODO (in-progress, currently partially supported)
  • List: TODO
  • FixedList: TODO
  • Union: TODO

Canonical/Flat Encodings

Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical representations of each of the logical data types. The canonical encodings currently supported are:

  • Null
  • Bool
  • Primitive (Integer, Float)
  • Struct
  • VarBin
  • VarBinView
  • ...with more to come

Compressed Encodings

Vortex includes a set of highly data-parallel, vectorized encodings. These encodings each correspond to a compressed in-memory array implementation, allowing us to defer decompression. Currently, these are:

  • Adaptive Lossless Floating Point (ALP)
  • BitPacked (FastLanes)
  • Constant
  • Chunked
  • Delta (FastLanes)
  • Dictionary
  • Frame-of-Reference
  • Run-end Encoding
  • RoaringUInt
  • RoaringBool
  • Sparse
  • ZigZag
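
To make "deferring decompression" concrete, here is a minimal sketch of run-end encoding with random access on the compressed form. It is written against the standard library only; the `RunEnd` struct and `run_end_encode` function are illustrative assumptions, not the Vortex implementation.

```rust
/// Run-end encoded i64 array (hypothetical type, for illustration).
/// `ends[k]` is the exclusive end index of run `k`; `values[k]` is its value.
struct RunEnd {
    ends: Vec<usize>,
    values: Vec<i64>,
}

/// Collapse consecutive equal values into (value, run end) pairs.
fn run_end_encode(data: &[i64]) -> RunEnd {
    let mut ends: Vec<usize> = Vec::new();
    let mut values: Vec<i64> = Vec::new();
    for (i, &v) in data.iter().enumerate() {
        if values.last() == Some(&v) {
            // Same value as the current run: move its end forward.
            *ends.last_mut().unwrap() = i + 1;
        } else {
            // New run starts here.
            values.push(v);
            ends.push(i + 1);
        }
    }
    RunEnd { ends, values }
}

impl RunEnd {
    /// Logical length of the array.
    fn len(&self) -> usize {
        self.ends.last().copied().unwrap_or(0)
    }

    /// Random access without decompressing: binary search over run ends.
    fn get(&self, index: usize) -> Option<i64> {
        if index >= self.len() {
            return None;
        }
        // First run whose end is strictly greater than `index`.
        let run = self.ends.partition_point(|&end| end <= index);
        Some(self.values[run])
    }
}
```

Lookups cost a binary search over the number of runs rather than a scan of the decoded data.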

Compression

Vortex's top-level compression strategy is based on the BtrBlocks paper.

Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is possible to cheaply prune many encodings and ensure the search space does not explode in size.
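
The sample-then-choose idea can be sketched as follows. This is a deliberately simplified, non-recursive illustration: the `Scheme` enum, the byte-cost estimates, and the sample shape (10 slices of 16 values) are all assumptions made for the example and do not reflect Vortex's actual compressor or its pruning rules.

```rust
use std::collections::HashSet;

/// Candidate encodings considered by this sketch (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scheme {
    Plain,
    RunEnd,
    Dictionary,
}

const CANDIDATES: [Scheme; 3] = [Scheme::Plain, Scheme::RunEnd, Scheme::Dictionary];

/// Take `n_slices` contiguous slices of `slice_len` values at evenly spaced
/// offsets, so that run structure inside each slice is preserved.
fn take_sample(data: &[i64], n_slices: usize, slice_len: usize) -> Vec<i64> {
    if n_slices == 0 || data.len() <= n_slices * slice_len {
        return data.to_vec();
    }
    let stride = data.len() / n_slices;
    let mut out = Vec::with_capacity(n_slices * slice_len);
    for s in 0..n_slices {
        let start = s * stride;
        out.extend_from_slice(&data[start..start + slice_len]);
    }
    out
}

/// Crude size estimate, in bytes, of encoding `sample` with `scheme`.
fn estimated_bytes(scheme: Scheme, sample: &[i64]) -> usize {
    let n = sample.len();
    match scheme {
        // 8 bytes per value.
        Scheme::Plain => 8 * n,
        // One 8-byte value plus one 8-byte end offset per run.
        Scheme::RunEnd => {
            let runs = if n == 0 {
                0
            } else {
                sample.windows(2).filter(|w| w[0] != w[1]).count() + 1
            };
            runs * 16
        }
        // 8 bytes per distinct value plus a 4-byte code per element.
        Scheme::Dictionary => {
            let distinct: HashSet<i64> = sample.iter().copied().collect();
            distinct.len() * 8 + n * 4
        }
    }
}

/// Pick the scheme with the smallest estimate on the sample.
fn choose_scheme(data: &[i64]) -> Scheme {
    let sampled = take_sample(data, 10, 16);
    CANDIDATES
        .iter()
        .copied()
        .min_by_key(|&scheme| estimated_bytes(scheme, &sampled))
        .unwrap_or(Scheme::Plain)
}
```

With these cost estimates, data dominated by long runs selects `RunEnd`, low-cardinality data without runs selects `Dictionary`, and all-distinct data falls back to `Plain`. A recursive version would apply the same selection again to the children of the chosen encoding, such as the dictionary codes.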

Compute

Vortex provides the ability for each encoding to specialize the implementation of a compute function to avoid decompressing where possible. For example, filtering a dictionary-encoded UTF8 array can be more cheaply performed by filtering the dictionary first.
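
The dictionary example can be sketched like this: the predicate runs once per distinct dictionary entry, and the per-row work reduces to a lookup in a boolean table. `DictArray` and `filter_dict` are hypothetical names for illustration, not Vortex kernels.

```rust
/// Dictionary-encoded UTF8 array (hypothetical type, for illustration).
/// Row `i` has the logical value `dictionary[codes[i]]`.
struct DictArray {
    dictionary: Vec<String>,
    codes: Vec<u32>,
}

/// Return the row indices whose value satisfies `predicate`,
/// without materializing the decoded strings.
fn filter_dict<F: Fn(&str) -> bool>(array: &DictArray, predicate: F) -> Vec<usize> {
    // Evaluate the predicate once per distinct value.
    let keep: Vec<bool> = array
        .dictionary
        .iter()
        .map(|entry| predicate(entry.as_str()))
        .collect();

    // Per row, only a table lookup on the code is needed.
    array
        .codes
        .iter()
        .enumerate()
        .filter_map(|(row, &code)| keep[code as usize].then_some(row))
        .collect()
}
```

For an array with millions of rows and a few hundred distinct strings, the string comparison cost is paid a few hundred times instead of millions of times.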

Note that Vortex does not intend to become a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be required for efficient scanning & operation pushdown.

Statistics

Vortex arrays carry lazily-computed summary statistics. Unlike other array libraries, these statistics can be populated from disk formats such as Parquet and preserved all the way into a compute engine. Statistics are available to compute kernels as well as to the compressor.

The current statistics are:

  • BitWidthFreq
  • TrailingZeroFreq
  • IsConstant
  • IsSorted
  • IsStrictSorted
  • Max
  • Min
  • RunCount
  • TrueCount
  • NullCount
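
One way to picture lazily computed statistics is a per-array cache that is filled on first use or pre-populated at read time. The sketch below uses `std::cell::OnceCell` from the standard library; the `StatsArray` type and its methods are assumptions for illustration and do not mirror the Vortex statistics API.

```rust
use std::cell::OnceCell;

/// An i64 array that caches summary statistics (hypothetical type).
struct StatsArray {
    values: Vec<i64>,
    min: OnceCell<Option<i64>>,
    is_sorted: OnceCell<bool>,
    run_count: OnceCell<usize>,
}

impl StatsArray {
    /// No statistics known yet; each is computed on first request.
    fn new(values: Vec<i64>) -> Self {
        Self {
            values,
            min: OnceCell::new(),
            is_sorted: OnceCell::new(),
            run_count: OnceCell::new(),
        }
    }

    /// Pre-populate `min`, e.g. from metadata read alongside the data.
    fn with_min(values: Vec<i64>, min: Option<i64>) -> Self {
        Self {
            min: OnceCell::from(min),
            ..Self::new(values)
        }
    }

    /// Smallest value, or `None` for an empty array.
    fn min(&self) -> Option<i64> {
        *self.min.get_or_init(|| self.values.iter().copied().min())
    }

    /// True if the values are in non-decreasing order.
    fn is_sorted(&self) -> bool {
        *self
            .is_sorted
            .get_or_init(|| self.values.windows(2).all(|w| w[0] <= w[1]))
    }

    /// Number of runs of consecutive equal values.
    fn run_count(&self) -> usize {
        *self.run_count.get_or_init(|| {
            if self.values.is_empty() {
                0
            } else {
                self.values.windows(2).filter(|w| w[0] != w[1]).count() + 1
            }
        })
    }
}
```

A pre-populated statistic is returned as stored and never recomputed, which is what lets a value read from file metadata flow through to a kernel or compressor without touching the data.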

Serialization / Deserialization (Serde)

Vortex serde is currently in the design phase. The goals of this implementation are:

  • Support scanning (column projection + row filter) with zero-copy and zero heap allocation.
  • Support random access in constant time.
  • Forward statistical information (such as sortedness) to consumers.
  • Provide a building block for file format authors to store compressed array data.

Integration with Apache Arrow

Apache Arrow is the de facto standard for interoperating on columnar array data. Naturally, Vortex is designed to be maximally compatible with Apache Arrow. All Arrow arrays can be converted into Vortex arrays with zero-copy, and a Vortex array constructed from an Arrow array can be converted back to Arrow, again with zero-copy.

It is important to note that Vortex and Arrow have different--albeit complementary--goals.

Vortex explicitly separates logical types from physical encodings, distinguishing it from Arrow. This allows Vortex to model more complex arrays while still exposing a logical interface. For example, Vortex can model a UTF8 ChunkedArray where the first chunk is run-length encoded and the second chunk is dictionary encoded. In Arrow, RunLengthArray and DictionaryArray are separate incompatible types, and so cannot be combined in this way.
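
The mixed-encoding example above can be sketched as a chunked UTF8 array whose chunks use different physical encodings behind one logical accessor. `Chunk` and `ChunkedUtf8` are hypothetical types written for this illustration; they are not the Vortex or Arrow APIs.

```rust
/// One chunk of a UTF8 array, in one of two encodings (hypothetical).
enum Chunk {
    /// `ends[k]` is the exclusive end index of run `k` within the chunk.
    RunEnd { ends: Vec<usize>, values: Vec<String> },
    /// Row `i` has the value `dictionary[codes[i]]`.
    Dict { dictionary: Vec<String>, codes: Vec<u32> },
}

impl Chunk {
    fn len(&self) -> usize {
        match self {
            Chunk::RunEnd { ends, .. } => ends.last().copied().unwrap_or(0),
            Chunk::Dict { codes, .. } => codes.len(),
        }
    }

    fn get(&self, index: usize) -> Option<&str> {
        match self {
            Chunk::RunEnd { ends, values } => {
                // First run whose end is strictly greater than `index`.
                let run = ends.partition_point(|&end| end <= index);
                values.get(run).map(String::as_str)
            }
            Chunk::Dict { dictionary, codes } => {
                let code = *codes.get(index)? as usize;
                dictionary.get(code).map(String::as_str)
            }
        }
    }
}

/// A logical UTF8 array made of independently encoded chunks.
struct ChunkedUtf8 {
    chunks: Vec<Chunk>,
}

impl ChunkedUtf8 {
    fn len(&self) -> usize {
        self.chunks.iter().map(Chunk::len).sum()
    }

    /// Logical element access; the caller never sees the encodings.
    fn get(&self, mut index: usize) -> Option<&str> {
        for chunk in &self.chunks {
            if index < chunk.len() {
                return chunk.get(index);
            }
            index -= chunk.len();
        }
        None
    }
}
```

Callers index into `ChunkedUtf8` by logical position only, so each chunk is free to use whichever encoding suits its data.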

Usage

For best performance we recommend using MiMalloc as the application's allocator.

// Requires the `mimalloc` crate as a dependency.
use mimalloc::MiMalloc;
#[global_allocator]
static GLOBAL_ALLOC: MiMalloc = MiMalloc;

Contributing

Please see CONTRIBUTING.md.

Setup

To build Vortex, you may also need to install the FlatBuffers compiler (flatc):

Mac

brew install flatbuffers

This repo uses rye to manage the combined Rust/Python monorepo build. First, make sure to run:

# Install Rye from https://rye-up.com, then set up the virtualenv
rye sync

License

Licensed under the Apache License, Version 2.0 (the "License").

Acknowledgments 🏆

This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers and OSS developers.

In particular, the following academic papers greatly influenced the development:

Additionally, we benefited greatly from:

Thanks to all of the aforementioned for sharing their work and knowledge with the world! 🚀
