Faster parquet metadata reading

Project description

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is inefficient for files with many columns and row groups, because it parses the entire metadata section every time the file is opened. The size of that section grows with the number of row groups multiplied by the number of columns, since the metadata holds one entry per column in each row group.

PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that a given read actually needs.
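To get a feel for the scale involved, here is a back-of-the-envelope sketch that counts column-chunk metadata entries, assuming one entry per (row group, column) pair. The function name `column_chunk_entries` is made up for this illustration and is not part of PalletJack or pyarrow; the figures match the 200 x 200 sample file generated in the usage section.

```python
def column_chunk_entries(row_groups: int, columns: int) -> int:
    """Number of column-chunk metadata entries: one per (row group, column) pair."""
    return row_groups * columns

# Full metadata of the sample file: 200 row groups x 200 columns.
full = column_chunk_entries(200, 200)

# Subset needed when reading row groups [5, 7] and columns [1, 3].
subset = column_chunk_entries(2, 2)

print(full, subset)  # 40000 4
```

The count only approximates the work involved, because entries differ in size, but it shows why decoding the whole section for a small read is wasteful.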

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns
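The idea behind an indexed format can be sketched with a toy layout: a table of (offset, length) pairs followed by the entries themselves, so a reader can seek straight to the entries it wants. This is only an illustration under assumed names (`build_index`, `read_entries`) and an invented byte layout; it is not PalletJack's actual index format.

```python
import io
import struct

def build_index(blobs):
    """Pack blobs back to back, preceded by a count and an (offset, length) table."""
    header = struct.pack("<I", len(blobs))
    table = b""
    offset = 0
    for blob in blobs:
        table += struct.pack("<QQ", offset, len(blob))
        offset += len(blob)
    return header + table + b"".join(blobs)

def read_entries(stream, wanted):
    """Read only the requested entries by seeking to their recorded offsets."""
    (count,) = struct.unpack("<I", stream.read(4))
    data_start = 4 + 16 * count  # 4-byte count, then 16 bytes per table row
    result = []
    for i in wanted:
        stream.seek(4 + 16 * i)
        offset, length = struct.unpack("<QQ", stream.read(16))
        stream.seek(data_start + offset)
        result.append(stream.read(length))
    return result

# Stand-ins for per-row-group metadata; real parquet metadata is Thrift-encoded.
blobs = [f"row-group-{i}-metadata".encode() for i in range(200)]
index = build_index(blobs)
selected = read_entries(io.BytesIO(index), [5, 7])
print(selected)  # [b'row-group-5-metadata', b'row-group-7-metadata']
```

Only two table rows and two entries are touched here, regardless of how many entries the index holds.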

Required:

  • pyarrow ~= 19.0.0

PalletJack is built on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
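The `~=` operator is the PEP 440 compatible-release specifier, so `~= 19.0.0` is equivalent to `>= 19.0.0, == 19.0.*`. A minimal sketch of that rule in plain Python follows. `satisfies_compatible_release` is a hypothetical helper written for this illustration; it handles only plain dotted numeric versions, and real tooling should rely on pip itself.

```python
def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Check `version ~= spec` for plain dotted numeric versions (no pre-releases)."""
    v = tuple(int(part) for part in version.split("."))
    s = tuple(int(part) for part in spec.split("."))
    prefix = s[:-1]  # "19.0.0" pins the "19.0" prefix
    # Pad short versions so that "19.0" compares like "19.0.0".
    v_padded = v + (0,) * (len(s) - len(v))
    return v_padded[:len(prefix)] == prefix and v_padded >= s

for candidate in ["19.0.0", "19.0.1", "19.1.0", "18.1.0", "20.0.0"]:
    print(candidate, satisfies_compatible_release(candidate, "19.0.0"))
```

Only the first two candidates satisfy the requirement; a new minor or major release of pyarrow falls outside the pinned range.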

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

# Use a context manager so the stream is flushed and closed after writing
with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

# Use a context manager so the stream is closed after reading
with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Download files

Source Distribution

  • palletjack-2.5.1.tar.gz (257.2 kB): Source

Built Distributions

  • palletjack-2.5.1-cp312-cp312-win_amd64.whl (181.0 kB): CPython 3.12, Windows x86-64
  • palletjack-2.5.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.12, manylinux glibc 2.24+ and glibc 2.28+, x86-64
  • palletjack-2.5.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.12, manylinux glibc 2.24+ and glibc 2.28+, ARM64
  • palletjack-2.5.1-cp311-cp311-win_amd64.whl (180.0 kB): CPython 3.11, Windows x86-64
  • palletjack-2.5.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ and glibc 2.28+, x86-64
  • palletjack-2.5.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ and glibc 2.28+, ARM64
  • palletjack-2.5.1-cp310-cp310-win_amd64.whl (179.9 kB): CPython 3.10, Windows x86-64
  • palletjack-2.5.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.10, manylinux glibc 2.24+ and glibc 2.28+, x86-64
  • palletjack-2.5.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.10, manylinux glibc 2.24+ and glibc 2.28+, ARM64
  • palletjack-2.5.1-cp39-cp39-win_amd64.whl (180.1 kB): CPython 3.9, Windows x86-64
  • palletjack-2.5.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.9, manylinux glibc 2.24+ and glibc 2.28+, x86-64
  • palletjack-2.5.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.3 MB): CPython 3.9, manylinux glibc 2.24+ and glibc 2.28+, ARM64
