Query parquet files using pyarrow or S3 Select by first gathering file metadata into a database
Lakeshack
A simplified data lakehouse, more of a data lakeshack, optimized for retrieving filtered records from Parquet files. Similar to the various lakehouse solutions (Iceberg, Hudi, Delta Lake), Lakeshack gathers the min/max values for specified columns from each Parquet file and stores them in a database (the Metastore). When you query for a set of records, it first checks the Metastore to get the list of Parquet files that might contain the desired records, and then queries only those files. The files may be stored locally or in S3. You may query using either native pyarrow or S3 Select.
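The file-pruning idea described above can be illustrated with a minimal sketch. This is not Lakeshack's API: the table name `file_stats`, the column `id`, the file paths, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the Metastore. It only shows how stored min/max values narrow a query down to the files whose value range overlaps the requested range.

```python
import sqlite3

# Hypothetical per-file statistics: (file path, min id, max id).
# In practice these would be read from each Parquet file's metadata.
FILE_STATS = [
    ("part-0.parquet", 0, 99),
    ("part-1.parquet", 100, 199),
    ("part-2.parquet", 200, 299),
]


def build_metastore(stats):
    """Store each file's min/max values in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_stats ("
        "path TEXT PRIMARY KEY, id_min INTEGER, id_max INTEGER)"
    )
    conn.executemany("INSERT INTO file_stats VALUES (?, ?, ?)", stats)
    conn.commit()
    return conn


def candidate_files(conn, lo, hi):
    """Return paths whose [id_min, id_max] range overlaps [lo, hi]."""
    rows = conn.execute(
        "SELECT path FROM file_stats "
        "WHERE id_min <= ? AND id_max >= ? ORDER BY path",
        (hi, lo),
    ).fetchall()
    return [row[0] for row in rows]


metastore = build_metastore(FILE_STATS)
# Only files that might hold ids between 150 and 210 need to be read.
print(candidate_files(metastore, 150, 210))
# prints ['part-1.parquet', 'part-2.parquet']
```

Only the files returned by the lookup would then be scanned with pyarrow or S3 Select; `part-0.parquet` is skipped without being opened.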
To achieve optimal performance, the partitioning and clustering strategy (which specifies how records are written to the Parquet files) should align with the main query pattern you expect to use on the data. See the documentation for more information.
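A rough, self-contained sketch of why clustering matters for min/max pruning follows. It does not use Lakeshack or Parquet; the helpers `write_files` and `files_to_scan` are hypothetical, and plain Python lists stand in for files. When records are sorted on the query column before being split into files, each file covers a narrow value range, so a lookup touches few files. When records arrive in random order, most files span nearly the whole range and cannot be skipped.

```python
import random


def write_files(records, rows_per_file):
    """Split records into fixed-size chunks; return each chunk's (min, max)."""
    chunks = (
        records[i : i + rows_per_file]
        for i in range(0, len(records), rows_per_file)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_scan(stats, value):
    """Count the files whose [min, max] range could contain the value."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)


random.seed(0)
shuffled = list(range(1000))
random.shuffle(shuffled)

# Clustered: records are sorted on the query column before writing.
clustered = write_files(sorted(shuffled), 100)
# Unclustered: records are written in arrival (random) order.
unclustered = write_files(shuffled, 100)

print("clustered files to scan:", files_to_scan(clustered, 555))
print("unclustered files to scan:", files_to_scan(unclustered, 555))
```

With the sorted layout exactly one file can contain the key 555; with the shuffled layout the count is typically much higher, because each file's min/max range is wide.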
Installation
Lakeshack may be installed using pip:
pip install lakeshack
Documentation
Documentation can be found at https://mhendrey.github.io/lakeshack
Built Distribution
Hashes for lakeshack-0.2.3-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | a373af67761f41f6a18f549913b8939267d294e9ff9ba8a62af58d453a6ecd76 |
MD5 | 9901ed7503c7a6f8539553bd7dab9f17 |
BLAKE2b-256 | bf7d657b45890d8a57cacf244f458cdc37e1407851298fcf6929591baf3c9d2c |