Faster parquet metadata reading

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is not efficient for files with numerous columns and row groups, as it requires parsing the entire metadata section each time the file is opened. The size of this metadata section is proportional to the number of columns and row groups in the file.

PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that a given read actually needs.
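To illustrate why an indexed layout helps, here is a toy sketch that uses only the Python standard library. It is not PalletJack's actual file format: the layout (an entry count, an offset table, then one JSON blob per row group) and the names `build_index` and `read_row_groups` are hypothetical. The point is only that a reader can jump straight to the entries it needs instead of decoding everything.

```python
import json
import struct


def build_index(row_group_metadata):
    """Serialize per-row-group dicts into one indexed byte string.

    Hypothetical layout: uint32 count, (count + 1) uint64 offsets,
    then the concatenated JSON blobs.
    """
    blobs = [json.dumps(m).encode("utf-8") for m in row_group_metadata]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs))
    table = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + table + b"".join(blobs)


def read_row_groups(index_bytes, wanted):
    """Decode only the requested entries, leaving the rest untouched."""
    (count,) = struct.unpack_from("<I", index_bytes, 0)
    table_start = 4
    body_start = table_start + 8 * (count + 1)
    result = {}
    for i in wanted:
        if not 0 <= i < count:
            raise IndexError(f"row group {i} out of range (0..{count - 1})")
        # Two adjacent offsets give the start and end of entry i.
        start, end = struct.unpack_from("<2Q", index_bytes, table_start + 8 * i)
        result[i] = json.loads(index_bytes[body_start + start:body_start + end])
    return result


# 200 fake row-group entries, mirroring the sample file used later on.
toy_metadata = [{"row_group": i, "num_rows": 1000} for i in range(200)]
index_bytes = build_index(toy_metadata)
subset = read_row_groups(index_bytes, [5, 7])
print(subset)
```

Only two of the 200 entries are decoded here; the cost of a lookup depends on how many entries are requested, not on how many the file contains.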

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns

Required:

  • pyarrow ~= 16.0

PalletJack is built on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
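Because the binary packages are tied to a specific pyarrow major version, it can help to check the installed version before importing PalletJack. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper names `major_version` and `pyarrow_matches` are hypothetical, and the value 16 is taken from the `pyarrow ~= 16.0` requirement listed above, so other PalletJack releases may need a different number.

```python
from importlib import metadata as importlib_metadata

# Assumption: taken from the "pyarrow ~= 16.0" requirement on this page.
REQUIRED_MAJOR = 16


def major_version(version_string):
    """Return the leading integer of a version string such as '16.1.0'."""
    return int(version_string.split(".")[0])


def pyarrow_matches(required_major=REQUIRED_MAJOR):
    """Return True or False, or None when pyarrow is not installed."""
    try:
        installed = importlib_metadata.version("pyarrow")
    except importlib_metadata.PackageNotFoundError:
        return None
    return major_version(installed) == required_major


status = pyarrow_matches()
if status is None:
    print("pyarrow is not installed")
elif status:
    print("pyarrow major version matches")
else:
    print("pyarrow major version differs from the required one")
```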

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
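The steps above can be combined into one small helper. This is a sketch, not part of PalletJack: the names `build_read_kwargs` and `read_subset` are hypothetical, and it assumes only the calls shown in the examples above (`pj.generate_metadata_index`, `pj.read_metadata` with `row_groups` and `column_indices`, and `pq.ParquetReader`). Filters that the caller does not supply are simply not passed on, so no assumption is made about the library's default values.

```python
import os


def build_read_kwargs(row_groups=None, column_indices=None):
    """Collect only the filters the caller actually supplied."""
    kwargs = {}
    if row_groups is not None:
        kwargs["row_groups"] = list(row_groups)
    if column_indices is not None:
        kwargs["column_indices"] = list(column_indices)
    return kwargs


def read_subset(path, row_groups=None, column_indices=None):
    """Read part of a parquet file, creating '<path>.index' if it is missing."""
    # Imported lazily so build_read_kwargs stays usable without these packages.
    import palletjack as pj
    import pyarrow.parquet as pq

    index_path = path + ".index"
    if not os.path.exists(index_path):
        pj.generate_metadata_index(path, index_path)

    metadata = pj.read_metadata(
        index_path, **build_read_kwargs(row_groups, column_indices)
    )
    reader = pq.ParquetReader()
    reader.open(path, metadata=metadata)
    return reader.read_all()


# Example call (requires palletjack, pyarrow and the sample file from above):
# table = read_subset("my.parquet", row_groups=[5, 7], column_indices=[1, 3])
```

Note that the helper only checks whether the index file exists. If the parquet file is rewritten, the index has to be regenerated separately, since a stale index is not detected here.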

Download files

Source Distribution

palletjack-2.3.0.tar.gz (254.9 kB)

Uploaded Source

Built Distributions

palletjack-2.3.0-cp312-cp312-win_amd64.whl (180.1 kB)

Uploaded CPython 3.12 Windows x86-64

palletjack-2.3.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB)

Uploaded CPython 3.12 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.0-cp311-cp311-win_amd64.whl (179.3 kB)

Uploaded CPython 3.11 Windows x86-64

palletjack-2.3.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.5 MB)

Uploaded CPython 3.11 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.0-cp310-cp310-win_amd64.whl (179.0 kB)

Uploaded CPython 3.10 Windows x86-64

palletjack-2.3.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB)

Uploaded CPython 3.10 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.0-cp39-cp39-win_amd64.whl (179.5 kB)

Uploaded CPython 3.9 Windows x86-64

palletjack-2.3.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB)

Uploaded CPython 3.9 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64
