Faster parquet metadata reading
PalletJack
PalletJack was created as a workaround for apache/arrow#38149. The standard parquet reader is inefficient for files with many columns and row groups because it parses the entire metadata section every time the file is opened, and the size of that section grows with both the number of columns and the number of row groups in the file.
PalletJack reduces the number of metadata bytes that must be read and decoded by storing the metadata in a separate, indexed format. This makes it possible to read only the subset of metadata needed for the requested row groups and columns.
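To get a feel for the scale involved, the sketch below counts metadata entries under a simplified model in which the footer holds one entry per (row group, column) pair. The function name `column_chunk_entries` is hypothetical, and the model is an assumption: real footers also carry schema and per-row-group overhead that is ignored here.

```python
def column_chunk_entries(row_groups: int, columns: int) -> int:
    """Entries in a simplified footer model: one per (row group, column) pair."""
    return row_groups * columns


# Shape of the sample file generated in the "How to use" section:
# 200 row groups x 200 columns.
full = column_chunk_entries(200, 200)

# Selecting 2 row groups and 2 columns only needs 2 x 2 of those entries.
subset = column_chunk_entries(2, 2)

print(f"full footer: {full} entries, subset: {subset} entries")
```

Under this model, the standard reader decodes every entry on each open, while a subset read only needs the handful of entries that were requested.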
Features
- Storing parquet metadata in an indexed format
- Reading parquet metadata for a subset of row groups and columns
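The idea behind an indexed format can be illustrated with a toy example. The layout below (a small header followed by fixed-size entries) is purely an assumption made for illustration and is not PalletJack's actual file format; the names `ENTRY`, `write_index` and `read_entries` are hypothetical.

```python
import io
import struct

# Toy column-chunk entry: (file_offset, byte_length), two signed 64-bit integers.
ENTRY = struct.Struct("<qq")
HEADER = struct.Struct("<II")  # (row_groups, columns)


def write_index(buf, entries, row_groups, columns):
    """Write the header, then fixed-size entries in row-group-major order."""
    buf.write(HEADER.pack(row_groups, columns))
    for rg in range(row_groups):
        for col in range(columns):
            buf.write(ENTRY.pack(*entries[rg][col]))


def read_entries(buf, row_groups, columns):
    """Seek straight to the requested entries instead of decoding all of them."""
    buf.seek(0)
    _, n_cols = HEADER.unpack(buf.read(HEADER.size))
    selected = {}
    for rg in row_groups:
        for col in columns:
            buf.seek(HEADER.size + (rg * n_cols + col) * ENTRY.size)
            selected[(rg, col)] = ENTRY.unpack(buf.read(ENTRY.size))
    return selected


# Build a toy index for 200 row groups x 200 columns with made-up offsets.
n_rg, n_col = 200, 200
entries = [
    [((rg * n_col + col) * 1000, 1000) for col in range(n_col)]
    for rg in range(n_rg)
]
index = io.BytesIO()
write_index(index, entries, n_rg, n_col)

selected = read_entries(index, row_groups=[5, 7], columns=[1, 3])
total_bytes = len(index.getvalue())
decoded_bytes = len(selected) * ENTRY.size
print(f"decoded {decoded_bytes} of {total_bytes} index bytes")
```

Because every entry has the same size, the position of any (row group, column) entry can be computed directly, which is what allows a subset read to skip the rest of the index.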
Required:
- pyarrow ~= 15.0
PalletJack operates on top of pyarrow, so pyarrow is required both to build and to use PalletJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
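The `~=` operator in the requirement above is the "compatible release" operator: `~= 15.0` accepts any 15.x release at or above 15.0 and rejects 14.x and 16.x. The sketch below is a simplified check for purely numeric version strings; the function name `satisfies_compatible_release` is hypothetical, and pre-release or local version suffixes are not handled.

```python
def satisfies_compatible_release(installed: str, spec: str = "15.0") -> bool:
    """Simplified `~= spec` check for purely numeric versions such as '15.0.2'."""
    have = tuple(int(part) for part in installed.split("."))
    want = tuple(int(part) for part in spec.split("."))
    # Pad the installed version with zeros so that '15' compares like '15.0'.
    have += (0,) * (len(want) - len(have))
    # `~= X.Y` means: all but the last component of the spec must match,
    # and the installed version must not be older than the spec.
    prefix = want[:-1]
    return have[: len(prefix)] == prefix and have >= want


for version in ("14.0.1", "15.0.0", "15.0.2", "16.0.0"):
    print(version, satisfies_compatible_release(version))
```

To check a real environment, the installed version string can be obtained with `importlib.metadata.version("pyarrow")`. For anything beyond a quick check, `packaging.specifiers.SpecifierSet` implements the full version specifier rules.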
Installation
pip install palletjack
How to use:
Generating a sample parquet file:
import palletjack as pj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np
row_groups = 200
columns = 200
chunk_size = 1000
rows = row_groups * chunk_size
path = "my.parquet"
data = np.random.rand(rows, columns)
pa_arrays = [pa.array(data[:, i]) for i in range(columns)]
column_names = [f'column_{i}' for i in range(columns)]
table = pa.Table.from_arrays(pa_arrays, names=column_names)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)
Generating the indexed metadata file:
index_path = path + '.index'
pj.generate_metadata_index(path, index_path)
Reading some row groups using the indexed metadata:
metadata = pj.read_metadata(index_path, row_groups = [5, 7])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
Reading some columns using the indexed metadata, selected either by index or by name:
metadata = pj.read_metadata(index_path, column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
metadata = pj.read_metadata(index_path, column_names = ['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
Reading some row groups and some columns using the indexed metadata, with columns selected either by index or by name:
metadata = pj.read_metadata(index_path, row_groups = [5, 7], column_indices = [1, 3])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
metadata = pj.read_metadata(index_path, row_groups = [5, 7], column_names = ['column_1', 'column_3'])
pr = pq.ParquetReader()
pr.open(path, metadata=metadata)
data = pr.read_all()
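The snippets above all repeat the same three steps: read the metadata subset, open the file with it, and read everything. A small helper can wrap that pattern. This is only a sketch: `read_subset` is a hypothetical name and not part of PalletJack, and the metadata function and reader class are passed in as arguments so that the helper itself imports nothing.

```python
def read_subset(path, index_path, read_metadata, reader_factory, **selection):
    """Read part of a parquet file using metadata loaded from an index file.

    `read_metadata` is called as read_metadata(index_path, **selection) and
    `reader_factory` is called with no arguments to create the reader.
    `selection` is forwarded unchanged, for example row_groups=[5, 7].
    """
    metadata = read_metadata(index_path, **selection)
    reader = reader_factory()
    reader.open(path, metadata=metadata)
    return reader.read_all()
```

With the names from the examples above, a call would look like `read_subset(path, index_path, pj.read_metadata, pq.ParquetReader, row_groups=[5, 7], column_names=['column_1', 'column_3'])`.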
Download files
Built Distributions
Hashes for palletjack-2.0.0rc2-cp312-cp312-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | bb19689ad4401a412448e5b706b34b0492f3fc1fb7e4beac18804708afbab614
MD5 | e798cfec968029cbcfdecba5821ae9b5
BLAKE2b-256 | 2b25a3b3130a6167ea8541d32dc4e51cbbc25cc31fa90d774a70cf0575231286
Hashes for palletjack-2.0.0rc2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 25b6192c9a5e32ecf924b39946830fc21b86af22b4d4c07932aa4c70317fb51e
MD5 | e4bd18cc168c97b8c018c9de545c7f86
BLAKE2b-256 | b3d7eb8476ceb26534703db944e29fbe9dbf661e81d67b0407a2d00fd91f1f7f
Hashes for palletjack-2.0.0rc2-cp311-cp311-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 045a497dbe9285e49f37d65abc0270afbfe12c1cea3e5ba059e5af918f4a4b54
MD5 | e2bc3452a2eb72a33913f1fa20faeed0
BLAKE2b-256 | 9a80c3d1b89e71a0b2210e604cca0f2e455ed9c3e413c1b1735c6712ca36eee0
Hashes for palletjack-2.0.0rc2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 46c1df96b3a2a2b6f805e3508041867c8a1b9829c0edcf00593933f17ba8025c
MD5 | 4e37d65db39dd677907901b4c5e652aa
BLAKE2b-256 | 15e53599b72cf9e7aff99f2f857177821090fd2adbf7c4f0f44b0f79e4f9f686
Hashes for palletjack-2.0.0rc2-cp310-cp310-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | af46106aa9070f5771f511daf0214f96ca98d5e2293d7fe3f2af0bd3b00a347f
MD5 | 699ffba35507e673a994e25fa8fac1eb
BLAKE2b-256 | 2e9009968c19fffab9ceed4c27b243ae2a68d07af58afd7fd913016138ad0bcd
Hashes for palletjack-2.0.0rc2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 914d2cfe65e4cb4a5bc320ef8c696eb09535d8cbd3be40525d2ecc0a4cd47dcc
MD5 | ad68a895c165f3e436fff06b3f2abc66
BLAKE2b-256 | 679fa79862e8aabdcecfc0952c4e061d5188bd51904d8b0e962ee51024214c0e
Hashes for palletjack-2.0.0rc2-cp39-cp39-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | f6ec842e73bd3453004209548503713b2eba98c58a4a928afc1d6b16eee28cc4
MD5 | 21455a3f0d48111284d23bd9a132d7b2
BLAKE2b-256 | 950eeffc0c953a1534fd03d86d4b5a4a454a98ccf86ad28b18172c54a4e0de71
Hashes for palletjack-2.0.0rc2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 5431615948cacb4a45296a774b22b70eaa129034acc5177e41f97994a6bb5f1d
MD5 | b3e209ed3c0233cafd731d0fba5a44c9
BLAKE2b-256 | 943580755f3140fcdc9ed761a563ee0c65f2751bbcd5fb4d1537e352d7c09fb1