Read Parquet data directly into NumPy arrays

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 21.0.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
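
The `~=` in `pyarrow ~= 21.0.0` is pip's compatible-release operator: it accepts any 21.0.x release at or above 21.0.0 and rejects 21.1 or 22.0. The helper below is a simplified sketch of that rule for plain numeric versions. The function name is hypothetical, and the sketch ignores pre-release and local version tags.

```python
def satisfies_compatible_release(installed: str, spec: str = "21.0.0") -> bool:
    """Approximate `installed ~= spec` for purely numeric version strings."""
    want = tuple(int(part) for part in spec.split("."))
    have = tuple(int(part) for part in installed.split("."))
    # Pad a shorter installed version with zeros so "21.0" equals "21.0.0".
    have += (0,) * (len(want) - len(have))
    # All components but the last must match, and the version must not be older.
    return have[: len(want) - 1] == want[:-1] and have >= want


for candidate in ("21.0.0", "21.0.3", "21.1.0", "20.0.0"):
    print(candidate, satisfies_compatible_release(candidate))
```

In practice the installed version string can be obtained with `importlib.metadata.version("pyarrow")`.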

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
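
The destination arrays in this README are allocated with `order='F'`. A likely reason, stated here as an assumption rather than taken from the JollyJack documentation, is that Parquet stores data column by column, and in Fortran order each column of the array is one contiguous block of memory. The NumPy-only sketch below compares the memory layout of a single column in both orders.

```python
import numpy as np

n_rows, n_columns = 6, 5

f_array = np.zeros((n_rows, n_columns), dtype="f", order="F")
c_array = np.zeros((n_rows, n_columns), dtype="f", order="C")

# Take the first column of each array as a 1-D view.
f_col = f_array[:, 0]
c_col = c_array[:, 0]

# Fortran order: consecutive rows of a column are 4 bytes (one float32) apart.
print("F order:", f_col.flags["C_CONTIGUOUS"], f_col.strides)
# C order: consecutive rows of a column are a whole row (5 * 4 bytes) apart.
print("C order:", c_col.flags["C_CONTIGUOUS"], c_col.strides)
```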

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To define which subset of the numpy array we want read into,
    # we need to create a view which shares underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :] 
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively, read every row group in one call from an open file handle
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
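
The three forms of `column_indices` used so far (a plain range, a dict, and a sequence of pairs) can be pictured with a NumPy-only emulation. The sketch assumes that each pair or dict item is (source column, destination column), which is what the multi-destination example suggests, and that a plain sequence places the i-th listed column into destination column i. `apply_column_mapping` is a hypothetical helper for illustration, not a JollyJack function.

```python
import numpy as np


def apply_column_mapping(src, dst, column_indices):
    """Copy columns of src into dst following a JollyJack-style mapping."""
    if isinstance(column_indices, dict):
        pairs = list(column_indices.items())
    else:
        items = list(column_indices)
        if items and isinstance(items[0], tuple):
            pairs = items
        else:
            # Plain sequence: i-th listed source column -> destination column i.
            pairs = [(src_col, pos) for pos, src_col in enumerate(items)]
    for src_col, dst_col in pairs:
        dst[:, dst_col] = src[:, src_col]
    return dst


n_rows, n_columns = 6, 5
src = np.arange(n_rows * n_columns, dtype="f").reshape(n_rows, n_columns)

# Dict form: reverse the column order.
reversed_cols = apply_column_mapping(
    src, np.zeros_like(src), {i: n_columns - i - 1 for i in range(n_columns)})

# Pair form: source column 3 goes to destination columns 0 and 1.
duplicated = apply_column_mapping(src, np.zeros_like(src), ((3, 0), (3, 1)))

print(reversed_cols[0])
print(duplicated[0])
```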

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = [0]
                        , row_ranges = [slice(0, 1), slice(4, 6)]
                        , column_indices = range(pr.metadata.num_columns))
print(np_array)
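
One way to read the sparse example, offered as an interpretation rather than a documented guarantee: row group 0 holds 3 rows, and the two slices in `row_ranges` together cover 3 destination rows (row 0, then rows 4 and 5), so the rows of the row group are scattered into those positions and the remaining rows stay zero. The NumPy-only sketch below emulates that reading with made-up data in place of the Parquet file.

```python
import numpy as np

chunk_size, n_rows, n_columns = 3, 6, 5

# Stand-in for the 3 rows stored in row group 0 (values are made up).
row_group_0 = np.arange(1, chunk_size * n_columns + 1, dtype="f").reshape(
    chunk_size, n_columns)

dst = np.zeros((n_rows, n_columns), dtype="f", order="F")
row_ranges = [slice(0, 1), slice(4, 6)]

# Consume source rows in order, writing them into each destination range.
src_row = 0
for dst_range in row_ranges:
    n = dst_range.stop - dst_range.start
    dst[dst_range, :] = row_group_0[src_row:src_row + n, :]
    src_row += n

print(dst)
```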

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
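
The timings are easier to compare as ratios. The short sketch below computes the PyArrow-to-JollyJack speedup for four rows copied from the table above; the row labels and the `speedup` helper are illustrative only.

```python
# (configuration, PyArrow seconds, JollyJack seconds), copied from the table
samples = [
    ("1 thread, float, no compression, no flags", 6.79, 3.55),
    ("2 threads, float, snappy, use_threads + pre_buffer", 3.14, 2.55),
    ("1 thread, halffloat, no compression, no flags", 7.21, 1.23),
    ("2 threads, halffloat, no compression, use_threads", 3.11, 0.57),
]


def speedup(pyarrow_seconds, jollyjack_seconds):
    """How many times faster JollyJack was than PyArrow for one row."""
    return pyarrow_seconds / jollyjack_seconds


for label, pyarrow_s, jollyjack_s in samples:
    print(f"{label}: {speedup(pyarrow_s, jollyjack_s):.2f}x")
```

For these rows the float cases land between roughly 1.2x and 1.9x, while the halffloat cases exceed 5x.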

Download files

Source Distribution

  • jollyjack-0.16.0.tar.gz (169.1 kB): Source

Built Distributions

  • jollyjack-0.16.0-cp312-cp312-win_amd64.whl (78.3 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.16.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.16.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.2 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.16.0-cp311-cp311-win_amd64.whl (77.9 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.16.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.16.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.2 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.16.0-cp310-cp310-win_amd64.whl (77.8 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.16.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.16.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.2 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.16.0-cp39-cp39-win_amd64.whl (77.5 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.16.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.16.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.2 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file jollyjack-0.16.0.tar.gz.

File metadata

  • Download URL: jollyjack-0.16.0.tar.gz
  • Upload date:
  • Size: 169.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for jollyjack-0.16.0.tar.gz
Algorithm Hash digest
SHA256 ce5ebb8792df4143de1f0b878fe626dcff5c6cdd5ad36916497dd9638ff3e535
MD5 babf52015aaacb625ce0463e893a359c
BLAKE2b-256 ba3deee5b92b0c04bc751420caebb2082064171520bd63a501dd3a5ab737be40

See more details on using hashes here.

Provenance

The following attestation bundles were made for jollyjack-0.16.0.tar.gz:

Publisher: python.yml on marcin-krystianc/JollyJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jollyjack-0.16.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.16.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 78.3 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for jollyjack-0.16.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 5e483eac8e158ca21e26efb60aa4713e08765cd3cc5e9fca8ca244e222e50bfe
MD5 7d5cd82d1a2ec263ccf3ad229c0f7872
BLAKE2b-256 a1e1310673f96ce0612ed2fa2945aab9dd47ce6fbe63957e92eaf29f0a0d338c

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp312-cp312-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e3b9f84eae56c2e9bd0b0fe6926bc7d410b5fbaa6dec96045630f313e659a9e1
MD5 50886b68794f3f0a8d59d431cf4e6763
BLAKE2b-256 0d3c427f6785896bca5212de10aaa1748dd2d5f72b307b4cccac856b37fa5dc3

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 07a1ac8b13a54d3de8ece24be36da279e7bfbe4d6f3265e9b572311862b53f15
MD5 b8d536b8bd9301cbb34f92031b1e6b2e
BLAKE2b-256 6b6fbfc37d882b24c516a53c8ac7bc80806230a713e577ac5db8e4479dc7cadf

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.16.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 77.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for jollyjack-0.16.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 47a8371fa2f642d76ee74c52453dfec5adf5a6a454c2bd0976b3b08a00a4440e
MD5 4f261d917aabef00c2da500531495678
BLAKE2b-256 11899e38b4cc5a0355e996897752d2868298e7c69536956903b7a73fd745b6b2

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp311-cp311-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b9b973ee0b59ecabe1f18f5c69e2f014bd85f958c94298f15c0583f1c09d6787
MD5 f195538989e6881699947272c7ad028f
BLAKE2b-256 a0986536e50287d35c9c98c1bf697ae9d978b47645c7db4b3065c4552512316a

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 4475c1a2c1af0bd963ae28bbe3a3e9d028b3d93f92e4ea4c34803ace0e4c20af
MD5 78f36aa8687eb17cf290f3274ddc7eb9
BLAKE2b-256 800a17a01fc42536699c5503f44a20d808dcf21e682ccf019fe540092da248cc

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.16.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 77.8 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for jollyjack-0.16.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 77e20cf014add60cf573290adbd63478723447b524a2f5ac012cc27dd46fbdf1
MD5 6ce843397dda06d4df08d1d951ff65c4
BLAKE2b-256 05614a367689356a8c1958fe1f9519f1a54152574836598df440bb0cfb552c7c

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp310-cp310-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b433a74656552f410d7aa9e61a769baffe35b0c96842c220770d3ba93795a6a5
MD5 a6575b516ecb979dc02248ccbf63a77d
BLAKE2b-256 084570f625fcbe91b421e78d54e2998f0177c322b922f9da26b0db8edfe1838e

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 19d9371d39c8d479c527452bbb3f78520d8d2f02327d59f0d2c7b5107abf3bc7
MD5 1bdc7963b299e4682b1186289539f3d2
BLAKE2b-256 6d791cbee196498a7f99937b4d3206349873d72e547ad78788388067b21987e2

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.16.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 77.5 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for jollyjack-0.16.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 6ff804a2616a70e1c4647e7e424f0b76e07b87c77aec2fed83b2c0a5b35aeda8
MD5 83d767004d4b3ccc442ceccb7e9c435a
BLAKE2b-256 6977565c9f523e2f562b0edbed054a7a40927a9a7ca11182444dbeff995cb313

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp39-cp39-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 785e1b0184726d7de0b04c73f214dba22027ec58b4ce44b09a1a0823d2aacc59
MD5 1939168fe9aed4dc268858c4b05068ea
BLAKE2b-256 6985793790be36fdc31d3a23fa3ef47b1e28c07376700e7bb5f58ecfd99a5401

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.16.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.16.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 91673770f4b9feadd7ee08e094d5c011334ba3334414742457ba4b63e9613a1e
MD5 ad9b3deb62c0ee484b45f20d09db8678
BLAKE2b-256 7047840a56c9b4fa3e8edf4e98a2fbbec96ae8120799235c37aeceb2c9ba52b9

Provenance

The following attestation bundles were made for jollyjack-0.16.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack
