Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and less memory-hungry than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
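A quick way to see whether the installed pyarrow matches the `~= 17.0` requirement is to compare major versions. This is a minimal sketch using only the standard library; `major_version` and `REQUIRED_MAJOR` are illustrative names, not part of JollyJack.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '17.0.0'."""
    return int(version_string.split(".")[0])


REQUIRED_MAJOR = 17  # taken from the "pyarrow ~= 17.0" requirement above

try:
    installed = version("pyarrow")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("pyarrow is not installed")
elif major_version(installed) != REQUIRED_MAJOR:
    print(f"pyarrow {installed} found, but major version {REQUIRED_MAJOR} is expected")
else:
    print(f"pyarrow {installed} matches the requirement")
```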

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
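The `order='F'` argument makes every column of the array a contiguous block of memory. The README does not spell out why this layout is used; a plausible reading (an assumption) is that it lets each parquet column be written straight into its destination column. The sketch below only inspects the layout with NumPy.

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')

# The whole array is column-major: moving down a column advances by one
# float32 (4 bytes), moving to the next column skips n_rows * 4 bytes.
print(np_array.flags.f_contiguous)
print(np_array.strides)

# A single column is one contiguous run of memory
column = np_array[:, 0]
print(column.flags.c_contiguous)

# A row slice is no longer contiguous as a whole,
# but each of its columns still is.
row_slice = np_array[0:3, :]
print(row_slice.flags.f_contiguous, row_slice[:, 0].flags.c_contiguous)
```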

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To define which subset of the numpy array we want to read into,
    # we create a view that shares underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(source=path,
                       metadata=pr.metadata,
                       np_array=subset_view,
                       row_group_indices=[rg],
                       column_indices=range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=range(pr.metadata.num_columns))
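The row bookkeeping in the loop above can be illustrated with NumPy alone. This sketch does not call JollyJack: it computes row boundaries from the row group sizes and writes a placeholder value through each view, to show that writes into a view land in the target array.

```python
import numpy as np

row_group_sizes = [3, 3]  # sizes of the row groups in the sample file
n_columns = 5
np_array = np.zeros((sum(row_group_sizes), n_columns), dtype='f', order='F')

# Row group i covers rows bounds[i] up to (but excluding) bounds[i + 1]
bounds = np.concatenate(([0], np.cumsum(row_group_sizes)))

for rg in range(len(row_group_sizes)):
    subset_view = np_array[bounds[rg]:bounds[rg + 1], :]
    # The view shares memory with the target array, so no copy is made
    assert np.shares_memory(subset_view, np_array)
    subset_view[:] = rg + 1  # stand-in for the data a reader would write

print(np_array[:, 0])
```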

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices={i: pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})
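For the five-column sample file, the dictionary comprehension above evaluates as shown in this sketch. The second form lists the same mapping as pairs, assuming the dictionary keys play the same role as the first element of each pair in the next example.

```python
n_columns = 5

# One integer maps to another, reversing the column order
reversed_mapping = {i: n_columns - i - 1 for i in range(n_columns)}
print(reversed_mapping)

# The same mapping written as a tuple of pairs
as_pairs = tuple(reversed_mapping.items())
print(as_pairs)
```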

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=((3, 0), (3, 1)))
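When testing such a mapping, it can be useful to build the expected result with NumPy and compare it with what was read. The sketch below assumes each pair is (source column, destination column), as the heading suggests; `expected_result` is a hypothetical helper and the input data is made up for illustration.

```python
import numpy as np


def expected_result(data, column_pairs, n_dest_columns):
    """Build the array a (source, destination) column mapping should produce.

    Destination columns that are not mentioned stay at zero here; in a real
    read they would keep whatever the target array already contained.
    """
    expected = np.zeros((data.shape[0], n_dest_columns), dtype=data.dtype, order='F')
    for source_column, dest_column in column_pairs:
        expected[:, dest_column] = data[:, source_column]
    return expected


# Illustrative data: 6 rows, 5 columns, values 0..29 in row-major order
data = np.arange(30, dtype=np.float32).reshape(6, 5)
expected = expected_result(data, ((3, 0), (3, 1)), n_dest_columns=5)
print(expected[:, :2])
```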

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
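The transpose trick can be illustrated with NumPy, which has an explicit `order='F'` option to compare against. This sketch is a NumPy analogue (not torch code) showing that transposing a row-major array of shape (n_columns, n_rows) gives the same shape and strides as a Fortran-ordered array of shape (n_rows, n_columns).

```python
import numpy as np

n_rows, n_columns = 6, 5

# Row-major array with swapped dimensions, then transposed (no data is copied)
transposed = np.zeros((n_columns, n_rows), dtype=np.float32).T

# Array allocated directly in Fortran (column-major) order
fortran = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')

print(transposed.shape == fortran.shape)
print(transposed.strides == fortran.strides)
print(transposed.flags.f_contiguous)
```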

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(source=path,
                   metadata=pr.metadata,
                   tensor=tensor,
                   row_group_indices=range(pr.metadata.num_row_groups),
                   column_indices=range(pr.metadata.num_columns),
                   pre_buffer=True,
                   use_threads=True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
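The timings are easier to compare as speedup ratios (PyArrow time divided by JollyJack time). The sketch below computes them for four rows copied from the table above; it is plain arithmetic on the published numbers, not a rerun of the benchmark.

```python
# (n_threads, use_threads, pre_buffer, dtype, compression, pyarrow_s, jollyjack_s)
rows = [
    (1, False, False, "float", None, 6.79, 3.55),
    (2, True, True, "float", None, 3.36, 2.39),
    (1, False, False, "halffloat", None, 7.21, 1.23),
    (2, True, True, "halffloat", None, 3.39, 1.14),
]

speedups = []
for n_threads, use_threads, pre_buffer, dtype, compression, pyarrow_s, jollyjack_s in rows:
    speedup = pyarrow_s / jollyjack_s
    speedups.append(speedup)
    print(f"{dtype:<10} n_threads={n_threads} use_threads={use_threads!s:<5} "
          f"pre_buffer={pre_buffer!s:<5} speedup={speedup:.2f}x")
```

In these four rows the ratio is largest for halffloat data read without threads or pre-buffering.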

Download files

Download the file for your platform.

Source Distribution

  • jollyjack-0.10.2.tar.gz (139.8 kB)

Built Distributions

  • jollyjack-0.10.2-cp312-cp312-win_amd64.whl (68.3 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.10.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.10.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.10.2-cp311-cp311-win_amd64.whl (68.0 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.10.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.10.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.10.2-cp310-cp310-win_amd64.whl (67.9 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.10.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.10.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.10.2-cp39-cp39-win_amd64.whl (67.9 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.10.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.10.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
