
Faster parquet metadata reading


PalletJack

PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is not efficient for files with numerous columns and row groups, as it requires parsing the entire metadata section each time the file is opened. The size of this metadata section is proportional to the number of columns and row groups in the file.

PalletJack reduces the amount of metadata bytes that need to be read and decoded by storing metadata in a different format. This approach enables reading only the essential subset of metadata as required.

Features

  • Storing parquet metadata in an indexed format
  • Reading parquet metadata for a subset of row groups and columns
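To make the idea of an indexed metadata format concrete, here is a toy sketch using only the standard library. It is not PalletJack's actual file format: the layout (an entry count, an offset table, then one JSON blob per row group) and the names `build_index` and `read_entries` are assumptions chosen for illustration.

```python
import io
import json
import struct


def build_index(row_group_metadata):
    """Serialize entries as: entry count, offset table, concatenated JSON blobs."""
    blobs = [json.dumps(entry).encode("utf-8") for entry in row_group_metadata]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs)) + struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + b"".join(blobs)


def read_entries(index_bytes, wanted):
    """Decode only the requested entries, locating them through the offset table."""
    stream = io.BytesIO(index_bytes)
    (count,) = struct.unpack("<I", stream.read(4))
    offsets = struct.unpack(f"<{count + 1}Q", stream.read(8 * (count + 1)))
    body_start = stream.tell()
    result = {}
    for i in wanted:
        # Seek straight to the blob; the other entries are never decoded
        stream.seek(body_start + offsets[i])
        result[i] = json.loads(stream.read(offsets[i + 1] - offsets[i]))
    return result


all_row_groups = [
    {"row_group": i, "num_rows": 1000, "byte_offset": i * 8000} for i in range(200)
]
index_bytes = build_index(all_row_groups)
subset = read_entries(index_bytes, wanted=[5, 7])
```

The offset table in this toy layout is still read in full, but it is small and fixed-width, so only the two requested entries are actually decoded.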

Required:

  • pyarrow ~= 16.0

PalletJack is built on top of pyarrow, so pyarrow is required both to build and to use it. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.

Installation

pip install palletjack

How to use:

Generating a sample parquet file:

import palletjack as pj
import pyarrow.parquet as pq
import pyarrow.fs as fs
import pyarrow as pa
import numpy as np

row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

Generating the metadata index file:

index_path = path + '.index'
pj.generate_metadata_index(path, index_path)

Generating the in-memory metadata index:

index_data = pj.generate_metadata_index(path)

Writing the in-memory metadata index to a file using pyarrow's fs:

# The context manager closes the stream so the index is flushed to disk
with fs.LocalFileSystem().open_output_stream(index_path) as stream:
    stream.write(index_data)

Reading the in-memory metadata index from a file using pyarrow's fs:

with fs.LocalFileSystem().open_input_stream(index_path) as stream:
    index_data = stream.readall()

Reading data with the help of the index file:

metadata = pj.read_metadata(index_path, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading data with the help of the in-memory index:

metadata = pj.read_metadata(index_data=index_data, row_groups=[5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column indices:

metadata = pj.read_metadata(index_path, column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of columns using column names:

metadata = pj.read_metadata(index_path, column_names=['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()

Reading a subset of row groups and columns:

metadata = pj.read_metadata(index_path, row_groups=[5, 7], column_indices=[1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()


