Read parquet data directly into numpy array

Reason this release was yanked:

broken

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 19.0.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
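
The `~=` in the requirement above is the PEP 440 "compatible release" operator: `~= 19.0.0` accepts `19.0.1` but not `19.1.0`. As a rough illustration, the helper below (a hypothetical name, not part of JollyJack or pyarrow) checks plain three-part numeric versions only; it is a sketch and does not handle pre-releases or other PEP 440 forms.

```python
def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Return True if `version` satisfies the clause `~= spec`.

    Hypothetical helper: handles plain dotted numeric versions only.
    """
    def parse(text):
        return tuple(int(part) for part in text.split("."))

    v, s = parse(version), parse(spec)
    # `~= X.Y.Z` means `>= X.Y.Z` and `== X.Y.*`
    return v >= s and v[: len(s) - 1] == s[: len(s) - 1]


print(satisfies_compatible_release("19.0.1", "19.0.0"))  # True
print(satisfies_compatible_release("19.1.0", "19.0.0"))  # False
```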

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
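
The target array above is allocated with `order='F'`. The sketch below uses only NumPy to show what that layout means: every column occupies one contiguous run of memory. How JollyJack uses this internally is not described here, so treat the link to a column-oriented format like parquet as a plausible motivation rather than a documented fact.

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

# In Fortran order the row index varies fastest,
# so each column is one contiguous run of float32 values.
column = np_array[:, 2]

print(np_array.flags["F_CONTIGUOUS"])  # True
print(column.flags["C_CONTIGUOUS"])    # True
print(np_array.strides)                # (4, 24)
```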

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To define which subset of the numpy array we want read into,
    # we need to create a view which shares underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :] 
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
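
The per-row-group loop above depends on two things: running row offsets, and the fact that a basic slice of a NumPy array is a view, so writes through the view land in the target array. The sketch below shows both with NumPy alone. `rows_per_group` is a hypothetical stand-in for the `num_rows` values taken from the parquet metadata, and the assignment stands in for the actual read.

```python
import numpy as np

rows_per_group = [3, 3]  # hypothetical stand-in for pr.metadata.row_group(rg).num_rows
n_columns = 5
target = np.zeros((sum(rows_per_group), n_columns), dtype="f", order="F")

row_end = 0
for rg, num_rows in enumerate(rows_per_group):
    row_begin, row_end = row_end, row_end + num_rows
    subset_view = target[row_begin:row_end, :]
    # Basic slicing returns a view that shares memory with `target`
    assert np.shares_memory(subset_view, target)
    subset_view[:] = rg + 1  # stand-in for the read; the write lands in `target`

print(target[:, 0])  # [1. 1. 1. 2. 2. 2.]
```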

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
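
The two examples above pass `column_indices` as pairs rather than a plain range. The NumPy-only sketch below emulates that mapping, assuming each pair is (source column, destination column) as the "column 3 into multiple destination columns" example suggests. `copy_columns` is a hypothetical helper for illustration and is not part of JollyJack.

```python
import numpy as np


def copy_columns(source, destination, column_indices):
    """Emulate (source column, destination column) pairs with plain NumPy."""
    if isinstance(column_indices, dict):
        pairs = column_indices.items()
    else:
        pairs = column_indices
    for src, dst in pairs:
        destination[:, dst] = source[:, src]


source = np.arange(10, dtype="f").reshape(2, 5)  # first row is 0..4

# Dict form: reverse the column order
reversed_out = np.zeros_like(source)
copy_columns(source, reversed_out, {i: 5 - i - 1 for i in range(5)})

# Tuple form: fan column 3 out into destination columns 0 and 1
fanned_out = np.zeros_like(source)
copy_columns(source, fanned_out, ((3, 0), (3, 1)))

print(reversed_out[0])  # [4. 3. 2. 1. 0.]
print(fanned_out[0])    # [3. 3. 0. 0. 0.]
```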

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = [0]
                        , row_ranges = [slice(0, 1), slice(4, 6)]
                        , column_indices = range(pr.metadata.num_columns))
print(np_array)
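
The sparse read above takes one row group of 3 rows and two slices that together also cover 3 rows. One reading consistent with that, and it is an assumption rather than something stated here, is that `row_ranges` names the destination rows that receive the row group's rows in order. The NumPy-only sketch below emulates that interpretation with a hypothetical 3-row block.

```python
import numpy as np

# Hypothetical 3-row row group with 2 columns
row_group = np.array([[1, 1], [2, 2], [3, 3]], dtype="f")
destination = np.zeros((6, 2), dtype="f", order="F")
row_ranges = [slice(0, 1), slice(4, 6)]

# Assumption: each slice selects destination rows, filled in order
offset = 0
for rows in row_ranges:
    count = rows.stop - rows.start
    destination[rows, :] = row_group[offset:offset + count, :]
    offset += count

print(destination[:, 0])  # [1. 0. 0. 0. 2. 3.]
```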

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
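
The transpose trick above can be illustrated in NumPy terms, which avoids needing torch for the demonstration: allocating a row-major `(n_columns, n_rows)` buffer and transposing it yields an `(n_rows, n_columns)` view whose columns are contiguous, without copying data. This is an analogy under that assumption, not a statement about how JollyJack inspects tensors.

```python
import numpy as np

n_rows, n_columns = 6, 5
c_ordered = np.zeros((n_columns, n_rows), dtype=np.float32)
transposed = c_ordered.T  # a view, no data is copied

print(transposed.shape)                         # (6, 5)
print(transposed.flags["F_CONTIGUOUS"])         # True
print(np.shares_memory(transposed, c_ordered))  # True
```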

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
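
To read the timings as speedups, divide the PyArrow time by the JollyJack time. The snippet below does this for three rows copied from the benchmark figures above; the dictionary keys are informal labels chosen for this illustration.

```python
# (PyArrow seconds, JollyJack seconds) copied from three benchmark rows
timings = {
    "float, None, 1 thread, no flags": (6.79, 3.55),
    "float, snappy, 2 threads, both flags": (3.14, 2.55),
    "halffloat, None, 1 thread, no flags": (7.21, 1.23),
}

speedups = {label: pyarrow_s / jollyjack_s
            for label, (pyarrow_s, jollyjack_s) in timings.items()}

for label, ratio in speedups.items():
    print(f"{label}: {ratio:.2f}x")
# float, None, 1 thread, no flags: 1.91x
# float, snappy, 2 threads, both flags: 1.23x
# halffloat, None, 1 thread, no flags: 5.86x
```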


Download files

Download the file for your platform.

Source Distribution

  • jollyjack-0.14.1.tar.gz (158.9 kB): Source

Built Distributions

  • jollyjack-0.14.1-cp312-cp312-win_amd64.whl (81.1 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.14.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.14.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.14.1-cp311-cp311-win_amd64.whl (80.9 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.14.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.14.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.14.1-cp310-cp310-win_amd64.whl (80.5 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.14.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.14.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.14.1-cp39-cp39-win_amd64.whl (80.4 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.14.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.14.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file jollyjack-0.14.1.tar.gz.

File metadata

  • Download URL: jollyjack-0.14.1.tar.gz
  • Upload date:
  • Size: 158.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.1.tar.gz
Algorithm Hash digest
SHA256 e88edbf73cf1b3fa38d183d2c0cfda081ad7d94fd3ccd21b5898d9774f4b19e8
MD5 abbcb0b1221b3c31ebbc59622e7f1abf
BLAKE2b-256 577c6d285cfd18275a63d1119e3dc22ede9eb192e950fae7ada17b42c416a72b

See more details on using hashes here.

Provenance

The following attestation bundles were made for jollyjack-0.14.1.tar.gz:

Publisher: python.yml on marcin-krystianc/JollyJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jollyjack-0.14.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.14.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 81.1 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 e247a18ce6cd18f30530af90de133629bac860c269734fadd0ff239d2139b998
MD5 2573cef6a11752175ffbd6e0245a0353
BLAKE2b-256 da287b13c28ccf30abe211b2f802b80901d0a19ffa36ced01cfd177820d427f6

File details

Details for the file jollyjack-0.14.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 a9eac3e3cd88d8b4d68f187679ca7276f07f93c20b4aa1fe259bb99d7392d001
MD5 d00a84b93a49a0b0f16f44e8d91675e1
BLAKE2b-256 faa84511749d39b866fb50a23bfb03b13adac6023f50362fc97e7008ea9a069b

File details

Details for the file jollyjack-0.14.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 cbad6429a44e8866fed333f00b7fc2169b13837590af4badc6e3c855b81e7f03
MD5 97fa423a58cc38a1fe31a9b6a674cfac
BLAKE2b-256 7d5bde80ff5a8e0f4d15336642f5179d5a63334ec14d0beb42052977c6d952db

File details

Details for the file jollyjack-0.14.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.14.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 80.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 deccefbf281992a89981e3c5363bf96ec65f39b24c174a407faaff9127fde56d
MD5 5855f3ef211c96f129915c1a3ff5e845
BLAKE2b-256 880fc9983c70eb603f62a45e41dda8164ecd2cbf31bbed3a541d01067f4f2039

File details

Details for the file jollyjack-0.14.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 56ff603d1c3047831d819df6b31f1c3de86beebe6eea7365eaf4229eb2496bc0
MD5 5802198d8463494164be1b527a033cbe
BLAKE2b-256 6fcb9db1e1ddbf799f3260b9f45ddc9952b8d1fbc6e542cb0eca6ee5867cb614

File details

Details for the file jollyjack-0.14.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 5c968950ddf9b0d676f1ecbc3ff40d124bf21efd19ea4603edb72d48485f8a8c
MD5 657fbc51cb0ba7b322b76f10087dbf73
BLAKE2b-256 8c20b4ce0e1d1c241c4902987b37c78fbe1ccbaa503f911d7f1b6bbceb0c901e

File details

Details for the file jollyjack-0.14.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.14.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 80.5 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 9b1a7b80ac863f79e6b770ad678573c3d80dba49ae2f36c104c6838306510df9
MD5 75e8ba4eafbc0de9cd11606215d91100
BLAKE2b-256 fe73fe358c8bbd715713ecd37c0282c10c4a2b4815c702d17f835d183eb619f6

File details

Details for the file jollyjack-0.14.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 dcdd7ea32fc73a48f00b7df649154cd582ae3d10a06ac1fabd2e0c48885f4ee2
MD5 433c877bee021de78a7e22cb8624782a
BLAKE2b-256 e583f9dce569d8ecb97574d1b9258587c0bfe24543a8a4b991c6db83d8e7fdc9

File details

Details for the file jollyjack-0.14.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 b2fb989029ea05d91d04dfba1272ec064179a2367a5ddd46d3cd21961c5bd0c2
MD5 58b1ddbd3c8610d2384bb30302902a1e
BLAKE2b-256 547782d659b5261b2e91a13a72b7e0cb91e5167f302a060d009c7dd0537af625

File details

Details for the file jollyjack-0.14.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.14.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 80.4 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 50244e0b31f2ec8f255b7b2676bc002bd909b348994c7067bbff85d62ac82312
MD5 94e1c2602301112d4a70139799e3f8d5
BLAKE2b-256 832b2e534c3daf8f2be40ba467c8005eace135aa6c9de12e6453d5504625c2ba

File details

Details for the file jollyjack-0.14.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 867185504a960cf5a0287fc0bc6d5f0e8a85d38085319fec20a2464c88659def
MD5 f3ca83b45605425697cc48824a087734
BLAKE2b-256 7b9bd62213b1438007e5114b67736c2055c783d453a92670894af0ac6cae6117

File details

Details for the file jollyjack-0.14.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for jollyjack-0.14.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 0d8d487dbc3289951dcb75f8420b2d5dade6fa433f17ee51b48b906c8c5ef024
MD5 7c87d200e10209245612e8c0f959ac8b
BLAKE2b-256 513f77209e40d67166bb92f06c2d86518cf15ec45f04aca070ef685b9565e6f8
