Faster parquet metadata reading

Project description

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard Parquet reader is inefficient for files with many columns and row groups, because it must parse the entire metadata section every time the file is opened. The size of that section grows with both the number of columns and the number of row groups, since it holds an entry for every column chunk in every row group.
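As a rough illustration of that growth, the sketch below counts column-chunk metadata entries for a file with 200 row groups and 200 columns, and for a read that only needs 2 row groups and 2 columns. The helper name column_chunk_entries is made up for this example, and the count ignores the schema and other file-level fields.

```python
def column_chunk_entries(row_groups: int, columns: int) -> int:
    """One column-chunk metadata entry exists per (row group, column) pair."""
    return row_groups * columns


# Entries a standard reader parses when opening the whole file
full = column_chunk_entries(200, 200)
# Entries actually needed to read 2 row groups and 2 columns
subset = column_chunk_entries(2, 2)

print(full, subset)  # 40000 4
```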

PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that a given read actually needs.
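To make the idea of an indexed layout concrete, here is a toy sketch that uses only the Python standard library. It is not PalletJack's actual file format: the layout (an entry count, an offset table, then JSON blobs) and the names build_index and read_subset are assumptions chosen purely for illustration.

```python
import io
import json
import struct


def build_index(row_group_meta):
    """Serialize per-row-group metadata with an offset table in front."""
    blobs = [json.dumps(m).encode("utf-8") for m in row_group_meta]
    out = io.BytesIO()
    out.write(struct.pack("<I", len(blobs)))      # number of entries
    offset = 4 + 8 * len(blobs)                   # blobs start after the table
    for blob in blobs:
        out.write(struct.pack("<II", offset, len(blob)))  # (offset, length)
        offset += len(blob)
    for blob in blobs:
        out.write(blob)
    return out.getvalue()


def read_subset(index_bytes, row_groups):
    """Decode only the requested entries; other blobs are never parsed."""
    (count,) = struct.unpack_from("<I", index_bytes, 0)
    result = []
    for rg in row_groups:
        if not 0 <= rg < count:
            raise IndexError(f"row group {rg} out of range (0..{count - 1})")
        offset, length = struct.unpack_from("<II", index_bytes, 4 + 8 * rg)
        result.append(json.loads(index_bytes[offset:offset + length]))
    return result


# Toy metadata for 200 row groups; only two entries are decoded below
toy_meta = [{"row_group": i, "num_rows": 1000} for i in range(200)]
index_bytes = build_index(toy_meta)
subset = read_subset(index_bytes, [5, 7])
```

The point of the offset table is that the cost of read_subset depends on how many entries are requested, not on how many the file contains.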

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns

Required:

  • pyarrow ~= 16.0

PalletJack is built on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
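Before installing a binary package, it can be useful to check which pyarrow version is present. The sketch below approximates the "~= 16.0" requirement (same major version, minor version 0 or later) using only the standard library. The helper names are hypothetical, and pre-release version strings such as "16.0rc1" are deliberately treated as incompatible to keep the parsing simple.

```python
from importlib import metadata


def is_compatible_release(installed: str, major: int = 16, minor: int = 0) -> bool:
    """Approximate `~= major.minor`: same major version, minor at least `minor`."""
    parts = installed.split(".")
    try:
        inst_major, inst_minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        # Unparseable or pre-release strings are rejected in this sketch
        return False
    return inst_major == major and inst_minor >= minor


def installed_pyarrow_version():
    """Return the installed pyarrow version string, or None if it is missing."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


version = installed_pyarrow_version()
if version is None:
    print("pyarrow is not installed")
else:
    print(version, "compatible:", is_compatible_release(version))
```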

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

# Random float64 data, one pyarrow array per column
data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
# Dictionary encoding, statistics and the embedded Arrow schema are switched off
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False,
               write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
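For the sample file generated earlier, the names 'column_1' and 'column_3' sit at positions 1 and 3, so the name-based and index-based examples refer to the same columns. A minimal sketch of that mapping, assuming the same column_{i} naming scheme (the position dictionary is only for illustration and is not part of PalletJack):

```python
# Same naming scheme as the sample file: column_0 ... column_199
column_names = [f"column_{i}" for i in range(200)]
position = {name: i for i, name in enumerate(column_names)}

wanted = ["column_1", "column_3"]
column_indices = [position[name] for name in wanted]
print(column_indices)  # [1, 3]
```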

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Download files

Source Distribution

palletjack-2.3.1.tar.gz (255.0 kB view details)

Uploaded Source

Built Distributions

palletjack-2.3.1-cp312-cp312-win_amd64.whl (180.2 kB view details)

Uploaded CPython 3.12 Windows x86-64

palletjack-2.3.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.24+ ARM64 manylinux: glibc 2.28+ ARM64

palletjack-2.3.1-cp311-cp311-win_amd64.whl (179.3 kB view details)

Uploaded CPython 3.11 Windows x86-64

palletjack-2.3.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.24+ ARM64 manylinux: glibc 2.28+ ARM64

palletjack-2.3.1-cp310-cp310-win_amd64.whl (179.2 kB view details)

Uploaded CPython 3.10 Windows x86-64

palletjack-2.3.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.24+ ARM64 manylinux: glibc 2.28+ ARM64

palletjack-2.3.1-cp39-cp39-win_amd64.whl (179.6 kB view details)

Uploaded CPython 3.9 Windows x86-64

palletjack-2.3.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.24+ x86-64 manylinux: glibc 2.28+ x86-64

palletjack-2.3.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.24+ ARM64 manylinux: glibc 2.28+ ARM64

File details

Details for the file palletjack-2.3.1.tar.gz.

File metadata

  • Download URL: palletjack-2.3.1.tar.gz
  • Upload date:
  • Size: 255.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for palletjack-2.3.1.tar.gz
Algorithm Hash digest
SHA256 75adff84047c687343f49cf8fbb1ec47fcc0d65eaea4a823cc148a657a06ee6d
MD5 08e1fe290bbe95fd17a3d926a62a1fcb
BLAKE2b-256 3b3fb2d0d01104a48986b268605a87e9904bea86e5b519f1bb37e4ef97307032

See more details on using hashes here.

Provenance

The following attestation bundles were made for palletjack-2.3.1.tar.gz:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 d2e7f8fef59a702049898e8269cec3eaddf78e5569584e00db9e61223011519f
MD5 813a8165ae7ba168107948e42368ef9c
BLAKE2b-256 2f8ac637eb944f45e98d688a4e44ab92d84d24e367d30dae2f280727f54454c0

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp312-cp312-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b87cdfcdfae3a3795619e5d933f94e94731258a172c9f20483eccca83e08bcea
MD5 a0c6a8f59957d855d810beceea551cc9
BLAKE2b-256 7caf1dfa1b61e070c3c608bcb7071cf653329c3537fa23de5c8a8d5d6ed903f2

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 b644605de036639fa826c268b5640570840d9f54bddd15b7b214ce99476956c7
MD5 49b61f31e25846adeed9eb7174592f70
BLAKE2b-256 0d4456c2584e3bb0d0647d3f73d08cb536c8f1c28f2b20f684ec16f7dab1077f

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 5181c6bb54e3934143dfa0071bd4439631ca4753b54156882005d00c075fe7d3
MD5 b6ba2e966ddab235543de22dc5b99a98
BLAKE2b-256 79fcb92406a586bdcce8347b5b97f54191a0f65da0f2d89c1a89517974de4e84

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp311-cp311-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 bacfbf40940a75986269d39d17d922174e1de4570ff86fb1e5ceca7c1cd12607
MD5 75d42344ae85c87c603f4baa557c49f3
BLAKE2b-256 1a3fbc9c3b778e4aed64e5bb126dafea38d7ff4f0f70b05f5f520b8d20687eeb

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 7965d8601268eb12aa674aeb4d7c983bf4c1729df9ca864e6209220265b01de7
MD5 b665721c92ccbe7904f893c980313b91
BLAKE2b-256 fd9dd991d94196084f267d94026fc78d88d4656e11270346278ac8c8e44417dd

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 1a3d4a28587010ce5388986f174708fc72f92fcd9cd1cd37b60d7bb76f9df4d1
MD5 f438c27ead92e4453d62b791c78db60f
BLAKE2b-256 8e8d6fed4eb9b5364500dba006ad5e2ba266b0b33a9d587449dcc54bf56d525b

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp310-cp310-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 f71afe9fd7ff3793b25a3fa74bb4feaddca9d35b7e2f993b873f65fa72482ff9
MD5 5ad43fb939a53fafcfafd13eea8b0b21
BLAKE2b-256 337ea6b544980852aa2452e1effccf84eda2398b5c01be993cf20c07bbe02a76

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 ab29916d04fef0012a3a10d982993d774a3f7633e648decb8459e8318b3f2652
MD5 69c6afca58237cd1cff1aace59b6fce5
BLAKE2b-256 d07a855f30a5c7cd27a0e85491bbbc5a5f36b915de90f4e3ad9148556189c5e5

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.3.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 179.6 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for palletjack-2.3.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 86e9dd1b34677a1501395fea19ffdfceff2b29291f8cfe71dd654ab2fc45c264
MD5 7cbafe062cd37e2c605498cac78deca8
BLAKE2b-256 3cbfaf34135034acfb5afcae899945cee2aa1012fca8dfba2dc6eb5abd745bb1

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp39-cp39-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 8f3103e02e54d22a78796b9aad4ab076f9e13fa514675cdf810a0b9f7d410424
MD5 102377bfa5e4a167570e7df921c39974
BLAKE2b-256 62f3743b2a0350d68dd7a9654e0f117ee319831fb82903ae1b5e0d1d699abcc1

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:

File details

Details for the file palletjack-2.3.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.3.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 61c72a6cd7ef19f0bab3ea29d99b02c8d4cc901f8cefe89ff18db67b3482e653
MD5 70d4e9739ad32d081b00e954c6453bbc
BLAKE2b-256 2fac2f564cea45e55e0796441e12abb591a7f985d91d2951ec1209f36c6ae295

Provenance

The following attestation bundles were made for palletjack-2.3.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

Attestations:
