
Query parquet files using pyarrow or S3 Select by first gathering file metadata into a database

Project description

Lakeshack

A small rustic shack on the shores of a big lake

A simplified data lakehouse, more of a data lakeshack, optimized for retrieving filtered records from Parquet files. Similar to the various lakehouse solutions (Iceberg, Hudi, Delta Lake), Lakeshack gathers up the min/max values for specified columns from each Parquet file and stores them into a database (Metastore). When you want to query for a set of records, it first checks the Metastore to get the list of Parquet files that might have the desired records, and then only queries those Parquet files. The files may be stored locally or in S3. You may query using either native pyarrow or leverage S3 Select.

To achieve optimal performance, the partitioning and clustering strategy (which determines how records are written to the Parquet files) should align with the main query pattern expected on the data. See the documentation for more information.
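Why clustering matters can be shown with a toy example (no Lakeshack API involved; `files_to_scan` is a hypothetical helper): if records are sorted by the filter column before being split into files, each file's min/max range is narrow and disjoint, so a point query touches fewer files.

```python
# Hypothetical helper: count how many files' [min, max] ranges contain a value,
# i.e. how many files a metastore lookup would force us to read.
def files_to_scan(chunks, target):
    return sum(1 for chunk in chunks if min(chunk) <= target <= max(chunk))

ids = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]

# Unclustered: records land in files in arrival order -> wide, overlapping ranges.
unclustered = [ids[:5], ids[5:]]

# Clustered: sort by the query column first -> tight, disjoint ranges per file.
sorted_ids = sorted(ids)
clustered = [sorted_ids[:5], sorted_ids[5:]]

print(files_to_scan(unclustered, 4))  # 2: both files' ranges cover 4
print(files_to_scan(clustered, 4))    # 1: only one file can contain 4
```

The same effect applies to range queries: clustering on the dominant query column is what lets the min/max Metastore prune aggressively.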

Installation

Lakeshack may be installed using pip:

pip install lakeshack

Documentation

Documentation can be found at https://mhendrey.github.io/lakeshack

