Faster parquet metadata reading

Project description

PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is inefficient for files with many columns and row groups because it parses the entire metadata section each time the file is opened. The size of this metadata section grows in proportion to the number of columns and the number of row groups in the file.

PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a different, indexed format. This makes it possible to read only the subset of metadata that a given read actually needs.

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns
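The idea behind an indexed metadata format can be sketched with a toy example. This is not PalletJack's actual file layout, which the description above does not specify; the functions `build_index` and `read_row_groups`, the JSON encoding, and the offset-table header are all assumptions made for illustration. The point is only that an offset table lets a reader decode the entries it wants and skip the rest.

```python
import json
import struct


def build_index(row_group_metadata):
    """Serialize each row group's metadata separately and prepend an offset table."""
    blobs = [json.dumps(m).encode("utf-8") for m in row_group_metadata]
    # Record the end offset of every blob, relative to the start of the body.
    ends = []
    position = 0
    for blob in blobs:
        position += len(blob)
        ends.append(position)
    # Header: entry count (uint32) followed by one uint64 end offset per entry.
    header = struct.pack(f"<I{len(ends)}Q", len(ends), *ends)
    return header + b"".join(blobs)


def read_row_groups(index_bytes, wanted):
    """Decode only the requested row groups; every other blob is never parsed."""
    (count,) = struct.unpack_from("<I", index_bytes, 0)
    ends = struct.unpack_from(f"<{count}Q", index_bytes, 4)
    body_start = 4 + 8 * count
    result = []
    for i in wanted:
        start = body_start + (ends[i - 1] if i > 0 else 0)
        end = body_start + ends[i]
        result.append(json.loads(index_bytes[start:end]))
    return result


# Toy metadata for 200 row groups, mirroring the sample file used later on.
metadata = [{"row_group": i, "num_rows": 1000} for i in range(200)]
index = build_index(metadata)
subset = read_row_groups(index, [5, 7])
print(subset)
```

With a plain serialized blob, the reader would have to decode all 200 entries to reach entries 5 and 7; with the offset table it slices out and decodes just those two.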

Required:

  • pyarrow ~= 21.0.0

PalletJack operates on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
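The `~=` operator in the requirement above is the "compatible release" specifier: `~= 21.0.0` accepts any version that is at least `21.0.0` and still matches `21.0.*`. A minimal sketch of that rule follows, for plain numeric versions only. The function names are hypothetical, and real tooling such as pip handles many more cases (pre-releases, local versions) than this does.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '21.0.3' into (21, 0, 3); pre-release and local suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def is_compatible_release(installed, required):
    """Check `installed ~= required` for plain numeric versions.

    `required` must have at least two components, e.g. '21.0' or '21.0.0'.
    """
    have = parse_version(installed)
    want = parse_version(required)
    # Pad the installed version with zeros so that '21.0' compares like '21.0.0'.
    have = have + (0,) * (len(want) - len(have))
    # '~= X.Y.Z' means '>= X.Y.Z' and '== X.Y.*': all but the last component must match.
    prefix = want[:-1]
    return have >= want and have[: len(prefix)] == prefix


def installed_pyarrow_is_compatible(required="21.0.0"):
    """Return True if the installed pyarrow satisfies `~= required`."""
    try:
        return is_compatible_release(version("pyarrow"), required)
    except (PackageNotFoundError, ValueError):
        # Not installed, or a version string this simple parser cannot read.
        return False


print(is_compatible_release("21.0.3", "21.0.0"))  # True
print(is_compatible_release("21.1.0", "21.0.0"))  # False
```

Calling `installed_pyarrow_is_compatible()` before importing PalletJack is one way to give a clearer error message than a failed binary import would.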

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Download files

Source Distribution

  • palletjack-2.7.0.tar.gz (262.7 kB): Source

Built Distributions

  • palletjack-2.7.0-cp312-cp312-win_amd64.whl (176.2 kB): CPython 3.12, Windows x86-64
  • palletjack-2.7.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.12, manylinux glibc 2.24+ x86-64, manylinux glibc 2.28+ x86-64
  • palletjack-2.7.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.12, manylinux glibc 2.24+ ARM64, manylinux glibc 2.28+ ARM64
  • palletjack-2.7.0-cp311-cp311-win_amd64.whl (174.6 kB): CPython 3.11, Windows x86-64
  • palletjack-2.7.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ x86-64, manylinux glibc 2.28+ x86-64
  • palletjack-2.7.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.11, manylinux glibc 2.24+ ARM64, manylinux glibc 2.28+ ARM64
  • palletjack-2.7.0-cp310-cp310-win_amd64.whl (174.7 kB): CPython 3.10, Windows x86-64
  • palletjack-2.7.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.10, manylinux glibc 2.24+ x86-64, manylinux glibc 2.28+ x86-64
  • palletjack-2.7.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.10, manylinux glibc 2.24+ ARM64, manylinux glibc 2.28+ ARM64
  • palletjack-2.7.0-cp39-cp39-win_amd64.whl (175.0 kB): CPython 3.9, Windows x86-64
  • palletjack-2.7.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (2.4 MB): CPython 3.9, manylinux glibc 2.24+ x86-64, manylinux glibc 2.28+ x86-64
  • palletjack-2.7.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (2.4 MB): CPython 3.9, manylinux glibc 2.24+ ARM64, manylinux glibc 2.28+ ARM64

File details

Details for the file palletjack-2.7.0.tar.gz.

File metadata

  • Download URL: palletjack-2.7.0.tar.gz
  • Upload date:
  • Size: 262.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.7.0.tar.gz
Algorithm Hash digest
SHA256 6be0e9d32e9fe1615f75c5448be01bf773902c160af5ad29e59e57488f7fa2ad
MD5 49cef5301acb12200a367fe8b795769a
BLAKE2b-256 d8909be4d71503974093e520c7cf02aeecd16f4d3605b1f61e513c81f43db653

See more details on using hashes here.

Provenance

The following attestation bundles were made for palletjack-2.7.0.tar.gz:

Publisher: python.yml on G-Research/PalletJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file palletjack-2.7.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.7.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 176.2 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.7.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 3f6d97456782a05c5de6b79126cbbe8cc2a2fa5c3cf8437986b56ad3c8a1de6b
MD5 fa09bea4ab850126f660545cad59f43f
BLAKE2b-256 cc2c6731fbc6bce7d13823c21bc4868bdffe38e0fc66cd0a1f8f85c3a45a366d

File details

Details for the file palletjack-2.7.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for palletjack-2.7.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 db53e9a1aa9d35781edf2f5dc2f8da4bdc66a75c8fbd41d99be937c711504a0a
MD5 d519429652363c8ef493aebc784e3846
BLAKE2b-256 344874708cce373a579c5d012942381d8ba8a9d82bf2dd7c4cfa301127b5bc67

File details

Details for the file palletjack-2.7.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for palletjack-2.7.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 5d58dec4d0b1b39e25359572db035cfdadc4d23c740fe72c66681daf48bd9796
MD5 032e768272c50e18ac8dcef13470bd43
BLAKE2b-256 884740b71eb4f55e2128d953a468c9a055ac3f642a24c83b35d864742fed0c14

File details

Details for the file palletjack-2.7.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.7.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 174.6 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.7.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 28200d8feeaab220edcfb9778bcbbf184048e5fc6986cab7248e9fea9baaff6c
MD5 96cdf8008767024889da4341ad587a5a
BLAKE2b-256 b9c9882e21bdd7f3546b98f192be45d4c06edf77e10e70ca59a6d8349843e0b6

File details

Details for the file palletjack-2.7.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for palletjack-2.7.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 73d39f95bd53db67f6d55a648df23cf7182b8006c5b9cf4041ac196f1cb2aa1a
MD5 109f6e6bd7b1a5a9c1afe5019283a562
BLAKE2b-256 6fb7f2992f4cf11dc625b5f38b54c03855967295af67f216fc89e31d0021e192

File details

Details for the file palletjack-2.7.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for palletjack-2.7.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 e0f765cdcea3f0a2c193cd1e60fd6ac9e2dc826a3dd285adab26aa8b5684b125
MD5 b8aadab55032c2e831fbd25ecd4a62f3
BLAKE2b-256 7ff0525564b93bd2499cae88f7c9cc7e9ad3aebda70c8b11cdf4e0511d937e0c

File details

Details for the file palletjack-2.7.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.7.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 174.7 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.7.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 42e64acc86de94c2da98ec34fefc96792d570516d08eaa8cfca1aec521e5826b
MD5 208782cf57261e6e0964cb92bfdce82c
BLAKE2b-256 56b0f7403be4f1c4335aa31e818a241779a6d68a7e7ca2ff0c3058a205163801

File details

Details for the file palletjack-2.7.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for palletjack-2.7.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c2c39852b3655115242f78513521e03c90322ac9481b481208e99108a7dc6a75
MD5 8e643dae27b18a8d2bb7a433e16f488b
BLAKE2b-256 86a09fbd86ea83cf83f0a923d60c383f944c744bda6960e2a432714c60c5077e

File details

Details for the file palletjack-2.7.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for palletjack-2.7.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 e56669d9acfb1246f848bcb678e57f74a5feba8ade75c0d860ed0699fbae4efe
MD5 33c9bcd4830bb1f6195f2d7b4713b554
BLAKE2b-256 3df39827825add2be9fa5589d75cc8d5bfc08f5654b86175076137ddaadd73d1

File details

Details for the file palletjack-2.7.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: palletjack-2.7.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 175.0 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for palletjack-2.7.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 d553e6aa561f564a390f315b4e7b395663a96513d98aaaf57276e11ae108b2f5
MD5 36a083070ed0a436fd12a692fdc252a9
BLAKE2b-256 4b329f8198b5c57ff80307e668d3a4d555934548d2687688a3e0d50d2f7fb047

File details

Details for the file palletjack-2.7.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for palletjack-2.7.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d836523e8547e1fa8a79299e26e6f6b5239e0d0891b58bdc024d4bf9aa77f11f
MD5 22e89a7d27c401e338f9e9a997b555aa
BLAKE2b-256 1cc2443b472391e86a57745ba43e97887abe59add3ec109594b8a991b264f0ea

File details

Details for the file palletjack-2.7.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for palletjack-2.7.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 ea89473aee31f4f0147f2fa3d9a5a8af5ea74d0dc33dd44b3c3c54678a88602a
MD5 5258216df7a0dc091ad3fd0a733d864b
BLAKE2b-256 fdeec388767b6c382b3879fd241eb5591c1ec19011daaf4505db6849bbe0fbc6
