Faster parquet metadata reading

Project description

PalletJack

How to use:

import palletjack as pj
import pyarrow.parquet as pq
import polars as pl
import numpy as np

rows = 5
columns = 10
chunk_size = 1  # One row per row group

path = "my.parquet"
table = pl.DataFrame(
    data=np.random.randn(rows, columns),
    schema=[f"c{i}" for i in range(columns)]).to_arrow()

pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=False, store_schema=False)

# Reading using the original metadata
pr = pq.ParquetReader()
pr.open(path)
res_data = pr.read_row_groups(list(range(pr.num_row_groups)), column_indices=[0, 1, 2], use_threads=False)
print(res_data)

# Reading using the indexed metadata
index_path = path + '.index'
pj.generate_metadata_index(path, index_path)
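for r in range(rows):
    # Load the metadata for row group r from the index file
    metadata = pj.read_row_group_metadata(index_path, r)
    pr = pq.ParquetReader()
    pr.open(path, metadata=metadata)

    # The reader was opened with the metadata of one row group, so that group is read as index 0
    res_data = pr.read_row_groups([0], column_indices=[0, 1, 2], use_threads=False)
    print(res_data)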

Release history

Version	Release date
2.2.1	May 15, 2024
2.2.0	May 8, 2024
2.1.2	Apr 12, 2024
2.1.0	Mar 22, 2024
2.0.0	Mar 13, 2024
2.0.0rc2 (pre-release)	Mar 12, 2024
2.0.0rc1 (pre-release)	Mar 5, 2024
1.0.2	Jan 24, 2024
1.0.1	Jan 23, 2024
1.0.0	Jan 23, 2024
0.2.3	Jan 18, 2024
0.2.2	Jan 9, 2024
0.2.1	Jan 9, 2024
0.1.2	Jan 4, 2024
0.1.1	Jan 3, 2024
0.0.9	Jan 3, 2024
0.0.8	Dec 29, 2023
0.0.7	Dec 29, 2023
0.0.6 (this version)	Dec 19, 2023
0.0.5	Dec 18, 2023
0.0.4	Dec 18, 2023

Download files

Source Distribution

palletjack-0.0.6.tar.gz (171.8 kB)

Uploaded Dec 19, 2023 Source

Hashes for palletjack-0.0.6.tar.gz
Algorithm	Hash digest
SHA256	`6b7649f025d6d7e111c660f36ee813b8e020a98ffa5cfdfa94901f7e7544efdf`
MD5	`78a214cfe56132353c6f15c69706d3b9`
BLAKE2b-256	`c8b1a80225373814c7d357336912d32f2c2d91337e8a96eca170aa8c77b9b3d8`