Faster parquet metadata reading

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is inefficient for files with many columns and row groups because it must parse the entire metadata section every time the file is opened. The size of that section grows with the number of columns multiplied by the number of row groups, since every row group stores metadata for each of its column chunks.
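
As a rough illustration of that growth, the sketch below counts the per-column-chunk metadata entries for the dimensions of the sample file generated later in this README (200 row groups, 200 columns) and for a two-row-group, two-column subset. The helper name `column_chunk_entries` is hypothetical, and the count ignores schema and file-level fields.

```python
def column_chunk_entries(row_groups: int, columns: int) -> int:
    """Count column-chunk metadata entries: one per (row group, column) pair."""
    return row_groups * columns


# Dimensions of the sample file used in this README.
full_file = column_chunk_entries(row_groups=200, columns=200)

# A read that only needs two row groups and two columns.
subset = column_chunk_entries(row_groups=2, columns=2)

print(f"full metadata: {full_file} entries, subset: {subset} entries")
```

The standard reader decodes every entry of the full file even when only the handful in the subset is needed.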

PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that a given read actually needs.
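
The general idea of an indexed layout can be sketched with a toy format. This is not PalletJack's actual file format: the layout (a count, an offset table, then JSON blobs) and the names `build_index` and `read_entries` are assumptions made purely for illustration.

```python
import json
import struct


def build_index(entries):
    """Pack metadata dicts as: 4-byte count, (count + 1) 8-byte offsets, then blobs."""
    blobs = [json.dumps(e, sort_keys=True).encode("utf-8") for e in entries]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs))
    header += struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + b"".join(blobs)


def read_entries(index_bytes, wanted):
    """Decode only the entries at the positions listed in `wanted`."""
    (count,) = struct.unpack_from("<I", index_bytes, 0)
    offsets = struct.unpack_from(f"<{count + 1}Q", index_bytes, 4)
    payload_start = 4 + 8 * (count + 1)
    result = []
    for i in wanted:
        start = payload_start + offsets[i]
        end = payload_start + offsets[i + 1]
        # Only this slice is decoded; all other blobs are skipped.
        result.append(json.loads(index_bytes[start:end].decode("utf-8")))
    return result


# Toy per-row-group metadata for 200 row groups of 1000 rows each.
row_group_meta = [{"row_group": i, "num_rows": 1000} for i in range(200)]
index_bytes = build_index(row_group_meta)
selected = read_entries(index_bytes, [5, 7])
print(selected)
```

Because the offset table gives the byte range of every entry, a reader can jump straight to the entries it wants instead of decoding the whole section sequentially.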

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns

Required:

  • pyarrow ~= 16.0

PalletJack operates on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
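
The `~= 16.0` requirement listed above is a PEP 440 compatible-release clause, equivalent to `>= 16.0, == 16.*`. The following is a simplified sketch of that rule for plain numeric versions only (no pre-release or local segments); the function name `satisfies_compatible_release` is hypothetical, and real installers use a full PEP 440 implementation.

```python
def satisfies_compatible_release(installed: str, spec: str) -> bool:
    """Simplified check of `installed ~= spec` for purely numeric versions."""
    inst = [int(part) for part in installed.split(".")]
    base = [int(part) for part in spec.split(".")]
    if len(base) < 2:
        raise ValueError("a compatible-release clause needs at least two segments")

    # Pad both versions with zeros so they compare segment by segment.
    width = max(len(inst), len(base))
    inst_padded = inst + [0] * (width - len(inst))
    base_padded = base + [0] * (width - len(base))

    # Every segment of the spec except the last must match exactly.
    prefix = base[:-1]
    return inst_padded >= base_padded and inst[: len(prefix)] == prefix


for version in ("15.0.2", "16.0.0", "16.1.0", "17.0.0"):
    print(version, satisfies_compatible_release(version, "16.0"))
```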

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

# Use a context manager so the stream is flushed and closed
with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Download files

Source Distribution

  • palletjack-2.4.0.tar.gz (255.8 kB): Source

Built Distributions

  • palletjack-2.4.0-cp312-cp312-win_amd64.whl (180.3 kB): CPython 3.12, Windows x86-64
  • palletjack-2.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.12, manylinux glibc 2.24+ / 2.28+ x86-64
  • palletjack-2.4.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.12, manylinux glibc 2.24+ / 2.28+ ARM64
  • palletjack-2.4.0-cp311-cp311-win_amd64.whl (179.4 kB): CPython 3.11, Windows x86-64
  • palletjack-2.4.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ / 2.28+ x86-64
  • palletjack-2.4.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ / 2.28+ ARM64
  • palletjack-2.4.0-cp310-cp310-win_amd64.whl (179.2 kB): CPython 3.10, Windows x86-64
  • palletjack-2.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.10, manylinux glibc 2.24+ / 2.28+ x86-64
  • palletjack-2.4.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.10, manylinux glibc 2.24+ / 2.28+ ARM64
  • palletjack-2.4.0-cp39-cp39-win_amd64.whl (179.7 kB): CPython 3.9, Windows x86-64
  • palletjack-2.4.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.9, manylinux glibc 2.24+ / 2.28+ x86-64
  • palletjack-2.4.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.9, manylinux glibc 2.24+ / 2.28+ ARM64
