Faster parquet metadata reading

Project description

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is inefficient for files with many columns and row groups, because it parses the entire metadata section each time the file is opened. The size of this metadata section grows with the product of the number of columns and the number of row groups in the file, since it holds one entry per column in every row group.
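To make the scaling concrete, here is a small back-of-the-envelope sketch. It assumes only that parquet metadata holds one column-chunk entry per (row group, column) pair; the helper name `column_chunk_entries` and the file shape are illustrative and not part of PalletJack.

```python
def column_chunk_entries(row_groups: int, columns: int) -> int:
    """Number of column-chunk metadata entries for a given file shape."""
    return row_groups * columns


# Illustrative file shape: 200 row groups x 200 columns.
full = column_chunk_entries(200, 200)

# A read touching 2 row groups and 2 columns only needs these entries.
needed = column_chunk_entries(2, 2)

print(f"entries decoded by a full metadata parse: {full}")  # 40000
print(f"entries the read actually needs: {needed}")         # 4
print(f"fraction needed: {needed / full:.4%}")              # 0.0100%
```

The gap between the two counts is the work a full metadata parse spends on entries the read never uses.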

PalletJack reduces the amount of metadata bytes that need to be read and decoded by storing metadata in a different format. This approach enables reading only the essential subset of metadata as required.
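The general idea of an indexed layout can be sketched with the standard library alone. This is a toy illustration, not PalletJack's actual index format (which is not described here): the names `build_index` and `read_entries` are hypothetical, and JSON blobs stand in for the Thrift-encoded metadata a real parquet file carries.

```python
import json
import struct


def build_index(entries):
    """Pack byte blobs behind an offset table: [count][offsets...][blobs...]."""
    offsets, position = [], 0
    for blob in entries:
        offsets.append(position)
        position += len(blob)
    offsets.append(position)  # end sentinel: entry i spans offsets[i]:offsets[i + 1]
    header = struct.pack("<I", len(entries))
    header += struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + b"".join(entries)


def read_entries(index, wanted):
    """Return only the requested blobs; the others are never sliced or decoded."""
    (count,) = struct.unpack_from("<I", index, 0)
    offsets = struct.unpack_from(f"<{count + 1}Q", index, 4)
    body_start = 4 + 8 * (count + 1)
    return [
        index[body_start + offsets[i]: body_start + offsets[i + 1]]
        for i in wanted
    ]


# One JSON blob per row group stands in for real row group metadata.
row_group_blobs = [
    json.dumps({"row_group": i, "num_rows": 1000}).encode() for i in range(200)
]
index = build_index(row_group_blobs)

# Decode metadata for row groups 5 and 7 only.
selected = [json.loads(blob) for blob in read_entries(index, [5, 7])]
print(selected)
```

The offset table is what lets a reader jump straight to the entries it needs, so the decoding cost follows the size of the selection rather than the size of the file's metadata.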

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns

Required:

  • pyarrow ~= 20.0.0

PalletJack operates on top of pyarrow, making it an essential requirement for both building and using PalletJack. While our source package is compatible with recent versions of pyarrow, the binary distribution package specifically requires the latest major version of pyarrow.
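The `~=` operator in the requirement above is the "compatible release" specifier: `~= 20.0.0` accepts any `20.0.x` version at or above `20.0.0`. The sketch below mimics that rule for plain numeric versions only; `is_compatible_release` is a hypothetical helper, and real tooling should rely on pip or the `packaging` library rather than this simplification.

```python
def is_compatible_release(version: str, base: str) -> bool:
    """True if `version` satisfies `~= base` (plain numeric versions only)."""
    v = tuple(int(part) for part in version.split("."))
    b = tuple(int(part) for part in base.split("."))
    # Pad the candidate with zeros so "20.0" compares like "20.0.0".
    v += (0,) * (len(b) - len(v))
    # All but the last component of `base` must match, and `version`
    # must not be older than `base`.
    return v[: len(b) - 1] == b[: len(b) - 1] and v >= b


for candidate in ["20.0.0", "20.0.3", "20.1.0", "19.0.1", "21.0.0"]:
    print(candidate, is_compatible_release(candidate, "20.0.0"))
```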

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

# Build one float64 column per array, filled with random values
data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
# row_group_size=chunk_size splits the table into exactly `row_groups` row groups
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)
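An index file only helps if it matches the parquet file it was generated from. One possible way to keep the two in step is to regenerate the index whenever it is missing or older than the data file. The helper below is a sketch under that assumption: `ensure_index` is a hypothetical name, and the generator is passed in as a callable so that `pj.generate_metadata_index` (called with a path and an index path, as above) can be supplied by the caller.

```python
import os


def ensure_index(parquet_path, index_path, generate):
    """Regenerate index_path when it is missing or older than parquet_path.

    `generate` is any callable taking (parquet_path, index_path),
    for example pj.generate_metadata_index.
    Returns True if the index was rebuilt, False if it was already current.
    """
    stale = (
        not os.path.exists(index_path)
        or os.path.getmtime(index_path) < os.path.getmtime(parquet_path)
    )
    if stale:
        generate(parquet_path, index_path)
    return stale


# Intended usage (requires palletjack to be installed):
#   ensure_index(path, index_path, pj.generate_metadata_index)
```

Modification times are a cheap heuristic; a stricter check, such as storing a checksum of the parquet file next to the index, may be preferable where files are replaced in place.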

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

# The context manager closes the stream, so the data is flushed to disk
with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups = [5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data = index_data, row_groups = [5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names = ['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
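For the sample file generated earlier, the two column selections above refer to the same columns. The sketch below shows the correspondence, assuming column indices are 0-based positions in the file's schema order; `indices_for` is a hypothetical helper, not a PalletJack function.

```python
columns = 200
# Same naming scheme as the sample file
column_names = [f"column_{i}" for i in range(columns)]


def indices_for(names, schema_names):
    """Map column names to their 0-based positions in the schema order."""
    position = {name: i for i, name in enumerate(schema_names)}
    return [position[name] for name in names]


print(indices_for(["column_1", "column_3"], column_names))  # [1, 3]
```

Selecting by name is less fragile when the column order of a file may change; selecting by index avoids a name lookup when the positions are already known.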

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups = [5, 7], column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Download files

Source Distribution

  • palletjack-2.6.0.tar.gz (258.4 kB): Source

Built Distributions

  • palletjack-2.6.0-cp312-cp312-win_amd64.whl (180.7 kB): CPython 3.12, Windows x86-64
  • palletjack-2.6.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • palletjack-2.6.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • palletjack-2.6.0-cp311-cp311-win_amd64.whl (179.9 kB): CPython 3.11, Windows x86-64
  • palletjack-2.6.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • palletjack-2.6.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • palletjack-2.6.0-cp310-cp310-win_amd64.whl (179.8 kB): CPython 3.10, Windows x86-64
  • palletjack-2.6.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • palletjack-2.6.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • palletjack-2.6.0-cp39-cp39-win_amd64.whl (179.9 kB): CPython 3.9, Windows x86-64
  • palletjack-2.6.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • palletjack-2.6.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file palletjack-2.6.0.tar.gz.

File metadata

  • Download URL: palletjack-2.6.0.tar.gz
  • Upload date:
  • Size: 258.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.6.0.tar.gz
Algorithm Hash digest
SHA256 4e446f7c051f78985ff2331b0472724aa067c235a5549d79d292c04aac18bfa6
MD5 780945ab9729536dc11ffd3abf294d3e
BLAKE2b-256 47b39fe5083e61484d7e37cc2a255d2501ce2dc471029f803ca249e2cd7e20aa

Provenance

The following attestation bundles were made for palletjack-2.6.0.tar.gz:

Publisher: python.yml on G-Research/PalletJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file palletjack-2.6.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.6.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 180.7 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.6.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 7e10ee4f88f1eaba12d628369be97e97e5e9c653c68858dc0a66edf38a2751ae
MD5 48dc6242f63529b81194ff1b9335abe1
BLAKE2b-256 819e708e11d1c8c3b902c26f62b1bfd596649f320966b54330b174aca91c1842

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp312-cp312-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c2483b4cdb8b5f5622690be07552371115abf86e17ee791de28301d1f19b514e
MD5 5e66ba9da65e1ef1df8ef333075b5472
BLAKE2b-256 e717f88735b5c47585cec4df091c48692e4ea5f9e86ded4dd3fe694d247a46a7

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 306d286021af5d7712b3570f6be36e56e369334ad20199f085b641a32112638d
MD5 07ca6591f7b10a35d40c6acd484bec07
BLAKE2b-256 22bffc7c4378f35d233394ebebfc91edd69f342708cf683abb6f6e0a1229cd92

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.6.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 179.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.6.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 09b3f423541c87b6f4b959ec29bf0ece8e47610f11792f40f5dd8fa3f066857c
MD5 888a93bf36822bdafd95edbb622d4f3a
BLAKE2b-256 e08028acc7477fa9862c9d4a1ac46203dcbcde3f36c2bcfb5b58d7953fd544ca

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp311-cp311-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5a2a353e2e1acc2bc1f0372b551dee94e3e6c2082fbea83f3ba91da1f31e83a4
MD5 64e0bb8f386dd23de0e4821167933f6d
BLAKE2b-256 0c9e5dc7a40f7b80fed515e90f9ec0711f32fb7d1f1f332876763f6ad56478a6

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 32f2d6295bfc3d53d802f90940d0685137ffdd0b0fc2d9a87cd9bf0551bbd43a
MD5 5166cadf28432e4cddcd266a0502736f
BLAKE2b-256 9f8b0f5172963320974da31cf9031f1be2b62c83318fa51f5ba8e99b2ba732fc

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.6.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 179.8 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.6.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 8e1b9980f88c82a5c29a11c3da0399eaae4db637cffc7d1c90e3c3d3dfbfb005
MD5 0053a16ec814701f83422edc17232843
BLAKE2b-256 56f8dcdf3548dbba5ed83c5bfd9755024e510a8ab92f0ec2b6079803c6c5c56e

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp310-cp310-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 391290ed6afe010d6825b9da0379f80a6f744165fb29bc8ca1c945cb9f253fac
MD5 8a6446f9da6dac8aa4834077ec416fdd
BLAKE2b-256 3a2905ac3ead647e3729144d33683e32ad1b4f985660afbf2110c365bd41ece2

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 eb9c9d8507675fd3427424124a615bd8164f556ee46d8758fb33c69622290cea
MD5 77e60fa0304983cf7b8b0f10b4dc2e9b
BLAKE2b-256 b90a5bb1ed10a99b38f53d12c59d79078498681910a0e594dc2aaa7dd19810fb

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.6.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 179.9 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.6.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 1116134f762ed9b1c67fa47217c0ac87d579ffd9bab8eaf888517534107be994
MD5 41af58c27fda5fb88a9e21d81d0af7ee
BLAKE2b-256 84736e64cf1f20fea47ba97980be2c96b5a69ac96e2ba23da6c32ab6c2a89172

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp39-cp39-win_amd64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 21ea6996abbd29a7b97780184bcb8b1cf120aec5b23253e7c074e09a119fc854
MD5 42b7e932c7a3276c15adeb4f5144c18b
BLAKE2b-256 348a90fea471762a41bf9bf2d623fdc277533b0eb3e79131ea6adbeca902aaed

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on G-Research/PalletJack

File details

Details for the file palletjack-2.6.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for palletjack-2.6.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 955364cea0f68bbe7d0810309616f9ca3d67c86dbf27c761f9369b488f9aef5f
MD5 1aaced48605b47969ba25ace665f29fe
BLAKE2b-256 b9e2b2ff23902f3fb1c5888ef1bdc6cf5995f539e1100192f2136ee0e6fbc56f

Provenance

The following attestation bundles were made for palletjack-2.6.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on G-Research/PalletJack
