Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use it. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
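
As a rough illustration of the version requirement, the sketch below checks whether an installed pyarrow matches the `~= 17.0` pin listed above (that is, at least 17.0 and within the 17.x series). Both helper names are hypothetical, and the check assumes the pin has not changed in a later JollyJack release.

```python
from importlib.metadata import PackageNotFoundError, version


def satisfies_pyarrow_requirement(installed):
    """Return True if a version string matches '~= 17.0' (any 17.x release)."""
    major = int(installed.split(".")[0])
    return major == 17


def installed_pyarrow_version():
    """Return the installed pyarrow version string, or None if it is missing."""
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None


found = installed_pyarrow_version()
if found is None:
    print("pyarrow is not installed")
elif satisfies_pyarrow_requirement(found):
    print(f"pyarrow {found} matches ~= 17.0")
else:
    print(f"pyarrow {found} does not match ~= 17.0")
```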

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
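
The memory layout of such an array can be inspected with plain NumPy. The sketch below does not call JollyJack. It only shows that in Fortran order each column is one contiguous block, and that slicing a range of rows gives a view onto the same memory. That this layout is what makes column-wise reads efficient is an assumption, based on Parquet storing data column by column.

```python
import numpy as np

n_rows, n_columns = 6, 5

# Same construction as above: float32 zeros in Fortran (column-major) order.
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')

# Each float32 takes 4 bytes. Stepping down a column moves 4 bytes,
# stepping across a row jumps over a whole column (n_rows * 4 bytes).
strides = np_array.strides

# A single column is therefore one contiguous block of memory.
column_is_contiguous = np_array[:, 2].flags['C_CONTIGUOUS']

# Slicing a range of rows yields a view, not a copy, so writes through
# the view land in the original array.
subset_view = np_array[0:3, :]
shares_memory = np.shares_memory(subset_view, np_array)
subset_view[:, 0] = 1.0

print(strides, column_is_contiguous, shares_memory)
print(np_array[:, 0])
```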

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into, create a view
    # that shares its underlying memory with the target numpy array.
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
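
The examples above pass `column_indices` in three shapes: a plain sequence, a dict, and a sequence of pairs. One way to read them is as source-to-destination mappings. The pure-Python sketch below models that reading without calling JollyJack. Both functions are hypothetical, and treating a dict as `{source: destination}` and a plain sequence as "source column goes to the destination at the same position" are assumptions inferred from the examples, not documented behaviour.

```python
def normalize_column_indices(column_indices):
    """Hypothetical helper: expand column_indices into (source, destination) pairs."""
    if isinstance(column_indices, dict):
        # Assumed to mean {source: destination}.
        return list(column_indices.items())
    pairs = []
    for position, item in enumerate(column_indices):
        if isinstance(item, tuple):
            # Already an explicit (source, destination) pair.
            pairs.append(item)
        else:
            # Plain index: assumed to land at its position in the sequence.
            pairs.append((item, position))
    return pairs


def apply_mapping(source_columns, pairs, n_destination):
    """Place source columns into destination slots; untouched slots stay None."""
    destination = [None] * n_destination
    for source, target in pairs:
        destination[target] = source_columns[source]
    return destination


columns = ["c0", "c1", "c2", "c3", "c4"]
n = len(columns)

in_order = apply_mapping(columns, normalize_column_indices(range(n)), n)
reverse_map = {i: n - i - 1 for i in range(n)}
reversed_order = apply_mapping(columns, normalize_column_indices(reverse_map), n)
duplicated = apply_mapping(columns, normalize_column_indices(((3, 0), (3, 1))), n)

print(in_order)
print(reversed_order)
print(duplicated)
```

In this model, destination columns that are never named in the mapping are left untouched, which is why the last case keeps `None` in the remaining slots.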

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
