Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
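The `~= 17.0` specifier above is a "compatible release" constraint: it accepts 17.0 and any later 17.x release, but not 18.0. A minimal sketch of that rule for plain numeric version strings follows; `satisfies_compatible_release` is a hypothetical helper written for illustration, not part of JollyJack or pyarrow, and it does not handle pre-release or local version suffixes.

```python
def satisfies_compatible_release(version: str, required: str = "17.0") -> bool:
    """Approximate `version ~= required` for plain numeric versions such as "17.0.0"."""
    have = [int(part) for part in version.split(".")]
    need = [int(part) for part in required.split(".")]

    # `~= X.Y` pins everything except the last component of the requirement ...
    prefix = need[:-1]
    if have[:len(prefix)] != prefix:
        return False

    # ... and requires the version to be at least as new as the requirement.
    width = max(len(have), len(need))
    have += [0] * (width - len(have))
    need += [0] * (width - len(need))
    return have >= need


print(satisfies_compatible_release("17.0.0"))  # True
print(satisfies_compatible_release("18.0.0"))  # False
```

To check a real environment, the installed version string can be obtained with `importlib.metadata.version("pyarrow")` and passed to the helper.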

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(
    table,
    path,
    row_group_size=chunk_size,
    use_dictionary=False,
    write_statistics=True,
    store_schema=False,
    write_page_index=True,
)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
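The target array is created with `order='F'`. The README does not say why, but a plausible reason is that parquet stores data column by column, and in a column-major array each column occupies one contiguous run of memory. The sketch below computes the byte strides for both layouts by hand so the difference is visible; `strides` is a hypothetical helper for illustration only.

```python
def strides(shape, itemsize, order):
    """Return (row_stride, column_stride) in bytes for a dense 2-D array.

    order='C' is row-major, order='F' is column-major (Fortran-style).
    """
    n_rows, n_columns = shape
    if order == "F":
        # Stepping to the next row moves one element; a whole column is contiguous.
        return (itemsize, itemsize * n_rows)
    if order == "C":
        # Stepping to the next row skips a full row of n_columns elements.
        return (itemsize * n_columns, itemsize)
    raise ValueError(f"unknown order: {order!r}")


# Shape and dtype from the sample file: 6 rows, 5 float32 (4-byte) columns.
print(strides((6, 5), 4, "F"))  # (4, 24)
print(strides((6, 5), 4, "C"))  # (20, 4)
```

With the column-major layout, the values of one column sit 4 bytes apart, so a column can be filled with a single sequential write.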

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To select the subset of the numpy array to read into,
    # create a view that shares its underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(
        source=path,
        metadata=pr.metadata,
        np_array=subset_view,
        row_group_indices=[rg],
        column_indices=range(pr.metadata.num_columns),
    )

# Alternatively, read all row groups in one call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=range(pr.metadata.num_columns),
    )
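The loop above keeps a running `row_begin`/`row_end` pair so that each row group lands in its own slice of the target array. The same bookkeeping can be sketched as a cumulative sum over the row counts of the row groups; `row_group_slices` is a hypothetical helper for illustration and takes a plain list of row counts rather than parquet metadata.

```python
from itertools import accumulate


def row_group_slices(rows_per_group):
    """Return one (row_begin, row_end) pair per row group, with row_end exclusive."""
    ends = list(accumulate(rows_per_group))
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))


# The sample file has 2 row groups of 3 rows each.
print(row_group_slices([3, 3]))  # [(0, 3), (3, 6)]
```

Each pair can be used directly as `np_array[row_begin:row_end, :]` to build the view passed for that row group.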

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices={i: pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)},
    )

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=((3, 0), (3, 1)),
    )
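The examples pass `column_indices` in three shapes: a plain sequence, a dict, and a sequence of pairs. The sketch below shows one way to read all three as a list of (source column, destination column) pairs. It rests on assumptions inferred from the examples rather than from JollyJack's documentation: that pairs and dict items are ordered (source, destination), and that the i-th entry of a plain sequence is written to destination column i. `normalize_column_indices` is a hypothetical helper, not a JollyJack function.

```python
def normalize_column_indices(column_indices):
    """Return a list of (source_column, destination_column) pairs.

    Assumed interpretation (not confirmed by the library documentation):
    - dict: key is the source column, value is the destination column
    - sequence of tuples: each tuple is (source, destination)
    - plain sequence: the i-th listed source column goes to destination i
    """
    if isinstance(column_indices, dict):
        return list(column_indices.items())
    items = list(column_indices)
    if items and isinstance(items[0], tuple):
        return [(src, dst) for src, dst in items]
    return [(src, dst) for dst, src in enumerate(items)]


n_columns = 5
print(normalize_column_indices(range(3)))
# [(0, 0), (1, 1), (2, 2)]
print(normalize_column_indices({i: n_columns - i - 1 for i in range(n_columns)}))
# [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
print(normalize_column_indices(((3, 0), (3, 1))))
# [(3, 0), (3, 1)]
```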

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(
    source=path,
    metadata=pr.metadata,
    tensor=tensor,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    pre_buffer=True,
    use_threads=True,
)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
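To turn the timings above into relative speedups, divide the PyArrow time by the JollyJack time for the same configuration. The short sketch below does this for two rows copied from the table (single thread, no `use_threads`, no `pre_buffer`, no compression); the variable and function names are illustrative only.

```python
# (dtype, PyArrow seconds, JollyJack seconds), copied from the benchmark table
# for n_threads=1, use_threads=False, pre_buffer=False, compression=None.
rows = [
    ("float", 6.79, 3.55),
    ("halffloat", 7.21, 1.23),
]


def speedup(pyarrow_seconds, jollyjack_seconds):
    """Return how many times faster JollyJack was than PyArrow."""
    return pyarrow_seconds / jollyjack_seconds


for dtype, pyarrow_s, jollyjack_s in rows:
    print(f"{dtype}: {speedup(pyarrow_s, jollyjack_s):.2f}x")
# float: 1.91x
# halffloat: 5.86x
```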

Download files

Source Distribution

  • jollyjack-0.10.1.tar.gz (139.8 kB): Source

Built Distributions

  • jollyjack-0.10.1-cp312-cp312-win_amd64.whl (68.3 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.10.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.12, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.10.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB): CPython 3.12, manylinux (glibc 2.17+), ARM64
  • jollyjack-0.10.1-cp311-cp311-win_amd64.whl (68.0 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.10.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.11, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.10.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB): CPython 3.11, manylinux (glibc 2.17+), ARM64
  • jollyjack-0.10.1-cp310-cp310-win_amd64.whl (67.9 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.10.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.10.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB): CPython 3.10, manylinux (glibc 2.17+), ARM64
  • jollyjack-0.10.1-cp39-cp39-win_amd64.whl (67.9 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.10.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.10.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB): CPython 3.9, manylinux (glibc 2.17+), ARM64
