
Faster parquet metadata reading


PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is not efficient for files with numerous columns and row groups, as it requires parsing the entire metadata section each time the file is opened. The size of this metadata section is proportional to the number of columns and row groups in the file.

PalletJack reduces the number of metadata bytes that need to be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that is actually needed.
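
To illustrate the general idea of an indexed layout, here is a toy sketch in plain Python. It is not PalletJack's actual file format: the names `build_index` and `read_entries`, and the layout itself (a count, an offset table, then the concatenated blobs), are assumptions made purely for illustration. The point is only that an offset table lets a reader locate and decode the entries it needs and skip the rest.

```python
import struct


def build_index(blobs: list[bytes]) -> bytes:
    """Pack blobs as: count (u32), then count + 1 offsets (u64), then the blobs."""
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs))
    table = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + table + b"".join(blobs)


def read_entries(index: bytes, wanted: list[int]) -> list[bytes]:
    """Return only the requested blobs, using the offset table to locate them."""
    (count,) = struct.unpack_from("<I", index, 0)
    table_start = 4
    data_start = table_start + 8 * (count + 1)
    result = []
    for i in wanted:
        if not 0 <= i < count:
            raise IndexError(f"entry {i} out of range (0..{count - 1})")
        # Two adjacent offsets give the start and end of entry i.
        start, end = struct.unpack_from("<2Q", index, table_start + 8 * i)
        result.append(index[data_start + start : data_start + end])
    return result


# Hypothetical per-row-group metadata blobs; real Parquet metadata is Thrift-encoded.
row_group_blobs = [f"row-group-{i}-metadata".encode() for i in range(200)]
index = build_index(row_group_blobs)
subset = read_entries(index, [5, 7])
print(subset)
```

Only the two requested entries are sliced out of the buffer; the other 198 are never touched.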

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns

Required:

  • pyarrow ~= 19.0.0

PalletJack operates on top of pyarrow, making it an essential requirement for both building and using PalletJack. While our source package is compatible with recent versions of pyarrow, the binary distribution package specifically requires the latest major version of pyarrow.
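
The `~=` operator in the requirement above is the "compatible release" specifier from Python packaging: `~= 19.0.0` accepts any `19.0.*` release at or above `19.0.0`. A minimal sketch of that rule for plain numeric versions follows. `is_compatible_release` is a hypothetical helper, and real tooling such as pip also handles pre-releases and other version suffixes that this sketch ignores.

```python
def is_compatible_release(installed: str, base: str = "19.0.0") -> bool:
    """Approximate `installed ~= base` for purely numeric dotted versions."""
    inst = tuple(int(part) for part in installed.split("."))
    req = tuple(int(part) for part in base.split("."))
    if len(req) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = req[:-1]  # all but the last component must match exactly
    # Pad the installed version so that "19.0" compares like "19.0.0".
    inst = inst + (0,) * max(0, len(req) - len(inst))
    return inst[: len(prefix)] == prefix and inst >= req


for candidate in ["19.0.0", "19.0.1", "19.1.0", "18.1.0"]:
    print(candidate, is_compatible_release(candidate))
```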

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)
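
As a rough back-of-the-envelope check for the sample file above: Parquet file metadata holds one column-chunk entry per (row group, column) pair, so the number of entries a full metadata parse has to decode grows with the product of the two. The sketch below only does that arithmetic. The selection sizes mirror the `row_groups = [5, 7]` and `column_indices = [1, 3]` examples used further down, and the counts ignore the schema and other parts of the footer.

```python
row_groups = 200
columns = 200

# One column-chunk metadata entry per (row group, column) pair.
total_entries = row_groups * columns

# Entries needed when only 2 row groups and 2 columns are requested.
selected_entries = len([5, 7]) * len([1, 3])

print(total_entries)     # 40000
print(selected_entries)  # 4
print(f"{selected_entries / total_entries:.4%}")  # 0.0100%
```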

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

# The context manager closes the stream, so the data is flushed to disk
with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups = [5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data = index_data, row_groups = [5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names = ['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups = [5, 7], column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
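
To check whether the index helps for a particular file, the two metadata paths can be timed side by side. The helper below is a generic sketch that uses only the standard library. `best_time` is a hypothetical name, and no timings are claimed here, since results depend on the file and the machine.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the sketch runs without a parquet file.
elapsed = best_time(lambda: sum(range(10_000)))
print(f"{elapsed:.6f} s")
```

With the sample file from above, one might compare `best_time(lambda: pq.read_metadata(path))` against `best_time(lambda: pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3]))`.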


