lakeshack

Query Parquet files using pyarrow or S3 Select by first gathering file metadata into a database.

Project description

Lakeshack

A small rustic shack on the shores of a big lake

A simplified data lakehouse, more of a data lakeshack, optimized for retrieving filtered records from Parquet files. Similar to the various lakehouse solutions (Iceberg, Hudi, Delta Lake), Lakeshack gathers the min/max values for specified columns from each Parquet file and stores them in a database (the Metastore). When you query for a set of records, it first checks the Metastore to get the list of Parquet files that might contain the desired records, and then queries only those files. The files may be stored locally or in S3. You may query using either native pyarrow or S3 Select.
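The pruning idea described above can be sketched with the standard library alone. This is not Lakeshack's API: the table layout, the `ts` column, the file names, and the helper functions `build_metastore` and `candidate_files` are all hypothetical, chosen only to illustrate how stored min/max values narrow down the files worth reading.

```python
import sqlite3

# Hypothetical per-file statistics: (file path, min of "ts", max of "ts").
# In practice these would be read from each Parquet file's metadata.
FILE_STATS = [
    ("part-0.parquet", 0, 99),
    ("part-1.parquet", 100, 199),
    ("part-2.parquet", 200, 299),
]


def build_metastore(stats):
    """Store per-file min/max values in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metastore ("
        "filepath TEXT PRIMARY KEY, ts_min INTEGER, ts_max INTEGER)"
    )
    conn.executemany("INSERT INTO metastore VALUES (?, ?, ?)", stats)
    conn.commit()
    return conn


def candidate_files(conn, low, high):
    """Return files whose [ts_min, ts_max] range overlaps [low, high]."""
    rows = conn.execute(
        "SELECT filepath FROM metastore "
        "WHERE ts_max >= ? AND ts_min <= ? ORDER BY filepath",
        (low, high),
    )
    return [row[0] for row in rows]


conn = build_metastore(FILE_STATS)
# Only files that might hold records with 150 <= ts <= 210 are returned.
print(candidate_files(conn, 150, 210))
```

Only the returned files would then be opened and filtered; a file whose range cannot overlap the query is never read.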

To achieve optimal performance, the partitioning and clustering strategy (which specifies how records are written to the Parquet files) should align with the main query pattern expected for the data. See the documentation for more information.
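To see why clustering matters for min/max pruning, consider the following toy sketch. It does not use Lakeshack; the `file_ranges` and `files_to_scan` helpers and the choice of 1,000 integer ids split into 100-row "files" are assumptions made purely for illustration.

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into consecutive 'files' and return each file's (min, max)."""
    ranges = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def files_to_scan(ranges, low, high):
    """Count files whose (min, max) range overlaps the query range [low, high]."""
    return sum(1 for lo, hi in ranges if hi >= low and lo <= high)


rng = random.Random(0)
shuffled = list(range(1000))
rng.shuffle(shuffled)

# Same records, written in arrival order versus sorted by the queried column.
unclustered = file_ranges(shuffled, 100)
clustered = file_ranges(sorted(shuffled), 100)

scan_unclustered = files_to_scan(unclustered, 250, 260)
scan_clustered = files_to_scan(clustered, 250, 260)
print("unclustered files to scan:", scan_unclustered)
print("clustered files to scan:", scan_clustered)
```

When records are sorted on the queried column before being written, each file covers a narrow range, so a range query touches a single file here. With shuffled records, nearly every file's min/max range spans the query and little can be pruned.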

Installation

Lakeshack may be installed using pip:

pip install lakeshack

Documentation

Documentation can be found at https://mhendrey.github.io/lakeshack

Release history

0.2.3 (this version): Nov 18, 2023
0.2.2: Jun 1, 2023
0.2.1: Apr 4, 2023
0.2.0: Mar 26, 2023
0.1.0: Mar 24, 2023

Download files

Source Distribution

lakeshack-0.2.3.tar.gz (29.1 kB), uploaded Nov 18, 2023

Built Distribution

lakeshack-0.2.3-py3-none-any.whl (32.9 kB), Python 3, uploaded Nov 18, 2023

Hashes for lakeshack-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`2adfc4838e5e691534e8a73e072c4c84f83d72f30a282a70a931463dc6abb0ef`
MD5	`cfd48ba7039db591de2c12655f40b106`
BLAKE2b-256	`75352b9b341df5282f8d821a60f3db1ea0ed8fe64cef83e2af3c1816dead13cf`

Hashes for lakeshack-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a373af67761f41f6a18f549913b8939267d294e9ff9ba8a62af58d453a6ecd76`
MD5	`9901ed7503c7a6f8539553bd7dab9f17`
BLAKE2b-256	`bf7d657b45890d8a57cacf244f458cdc37e1407851298fcf6929591baf3c9d2c`