Read parquet data directly into numpy array

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
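The `~= 17.0` specifier means "at least 17.0, but below 18". A small sketch of checking the installed pyarrow against it at runtime follows; the helper name `satisfies_compatible_release` is hypothetical, and the check compares only the major version, which is a simplification of the full version-specifier rules.

```python
from importlib.metadata import PackageNotFoundError, version


def satisfies_compatible_release(installed, major=17):
    """True if `installed` matches `~= <major>.0` (same major version)."""
    return int(installed.split(".")[0]) == major


try:
    installed = version("pyarrow")
    print(installed, satisfies_compatible_release(installed))
except PackageNotFoundError:
    print("pyarrow is not installed")
```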

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(
    table,
    path,
    row_group_size=chunk_size,
    use_dictionary=False,
    write_statistics=True,
    store_schema=False,
    write_page_index=True,
)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
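The array is allocated with order='F' (Fortran, column-major), so every column occupies one contiguous run of memory. A plausible reason, not stated in this README, is that parquet stores data column by column. The NumPy-only sketch below shows that a row slice of such an array is a view onto the same memory and that its columns stay contiguous, which is what makes reading into a subset of the array possible.

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')

# Slicing rows yields a view, not a copy
subset_view = np_array[3:6, :]
shares = np.shares_memory(np_array, subset_view)

# In Fortran order, a single column is contiguous, also inside the view
column_is_contiguous = subset_view[:, 0].flags['C_CONTIGUOUS']

# Strides are in bytes: 4 between rows, n_rows * 4 between columns
strides = np_array.strides

# Writing through the view changes the original array
subset_view[:, 0] = 1.0
print(shares, column_is_contiguous, strides)
print(np_array[:, 0])
```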

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into, create a view
    # that shares its underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(source=path,
                       metadata=pr.metadata,
                       np_array=subset_view,
                       row_group_indices=[rg],
                       column_indices=range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=range(pr.metadata.num_columns))

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices={i: pr.metadata.num_columns - i - 1
                                       for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=((3, 0), (3, 1)))
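The two mapping forms above can be pictured with plain NumPy. The sketch below is an emulation, not JollyJack code: the helper name `apply_column_mapping` is hypothetical, and it assumes that each pair (or dict item) is read as (source column, destination column), which is how the "column 3 into multiple destination columns" example reads.

```python
import numpy as np


def apply_column_mapping(source, column_indices, n_dest_columns):
    """Copy source columns into a new array according to (src, dst) pairs."""
    if isinstance(column_indices, dict):
        pairs = list(column_indices.items())
    else:
        pairs = list(column_indices)
    dest = np.zeros((source.shape[0], n_dest_columns), dtype=source.dtype, order='F')
    for src, dst in pairs:
        dest[:, dst] = source[:, src]
    return dest


n_columns = 5
source = np.arange(15, dtype=np.float32).reshape(3, n_columns)

# Dict form: column i lands in column n_columns - i - 1 (reversed order)
reversed_result = apply_column_mapping(
    source, {i: n_columns - i - 1 for i in range(n_columns)}, n_columns)

# Pair form: column 3 is copied into destination columns 0 and 1
fan_out_result = apply_column_mapping(source, ((3, 0), (3, 1)), n_columns)

print(reversed_result)
print(fan_out_result)
```

Destination columns that no pair targets keep their previous contents, which in this emulation are the initial zeros.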

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(source=path,
                   metadata=pr.metadata,
                   tensor=tensor,
                   row_group_indices=range(pr.metadata.num_row_groups),
                   column_indices=range(pr.metadata.num_columns),
                   pre_buffer=True,
                   use_threads=True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • jollyjack-0.11.2.tar.gz (147.6 kB): Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • jollyjack-0.11.2-cp312-cp312-win_amd64.whl (73.1 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.11.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.2-cp311-cp311-win_amd64.whl (73.0 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.11.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.2-cp310-cp310-win_amd64.whl (72.8 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.11.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.2-cp39-cp39-win_amd64.whl (72.7 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.11.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file jollyjack-0.11.2.tar.gz.

File metadata

  • Download URL: jollyjack-0.11.2.tar.gz
  • Upload date:
  • Size: 147.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for jollyjack-0.11.2.tar.gz
Algorithm Hash digest
SHA256 eb09691b8ca5a10867b3db04418b5bd9d9fdf30db30665535c03c30173bc9c6b
MD5 aefa3e9b9e32af26f1bf5e6cf177df36
BLAKE2b-256 b4ab87bbc89f6ee7ac0a82921b84ab968a680bb8c86539e7ba3d4507aceeb857

See more details on using hashes here.

Provenance

The following attestation bundles were made for jollyjack-0.11.2.tar.gz:

Publisher: python.yml on marcin-krystianc/JollyJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jollyjack-0.11.2-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.11.2-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 73.1 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for jollyjack-0.11.2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 503cb82a1acb743190a676ab9bcfcbd4a99207b97bada15d054e31f04a5f12aa
MD5 7070f141770683b8f9feebe68f3b6e65
BLAKE2b-256 681e55c4f4db3649b83a8bd05ecee2221d796b4c85460e7c708185c1c324524c

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp312-cp312-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d03968827ba241dda7c179df56fcc048cdd8251204c0864c3267de234c9678be
MD5 ee4b5137e0cb627ada27840fd9bf6cfe
BLAKE2b-256 d179c35c3349a62a85e5c7d134b39c0726a321dad0e2faf8707e76a5b2578188

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 1d33816cf8707845cca213503b4c0d21104f5e12c71e48a9de0dd0ffb3ea1167
MD5 cba04613b7ebf9c0b645b97bf6e41883
BLAKE2b-256 cd74951c38b1cc2df77a4787d9bac9d8cba5ed8c6389031eebd8baa54ffcda97

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.11.2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 73.0 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for jollyjack-0.11.2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 07651d7906edf0aa7d39ff964625f2a61b6c94c24823c069cc6e51866eac505f
MD5 52614b4ff105050af8cd471f8a2f276f
BLAKE2b-256 691614a086235aae7e0128cfa4be60b05f537c95337d61378e88b20edc7db28e

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp311-cp311-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0579879aafcf6363e95ed70f43c53a9963dafc1aa1944d067c650613910a566a
MD5 8dd1d1832147e383cb089a6fe9b5f08b
BLAKE2b-256 1357e00e16177c9d16502c57de34f0699c5d7be8cf2a02f3c694cdf2f97ba2e9

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 66cee8f4d56c620898321bfb6ceaef05709280cb45000b9cf46526580e4902e1
MD5 2af28190add4d6f0a51151663646a604
BLAKE2b-256 b3fad608ef588066cb0f0d92fedd297a635eadd3002befc2466425246853bfc6

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.11.2-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 72.8 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for jollyjack-0.11.2-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 92defebf6bf1e3b57d9066e28c2d579a2db1a30313b3e44718372ac980b5c5e3
MD5 0327b44470741920b5f36baab18fc8d4
BLAKE2b-256 dc33ff8ea6e69360bfb458d802d9f57efbdf2498364e9ced373fb2cc784ac2be

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp310-cp310-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e1c4d92eafba63ae786149c5d5cf69b61da7e11f57b9532fe3de713fd48e8a5c
MD5 5613a91c881812168666689f11adbd17
BLAKE2b-256 a111890b00f016cf2efe93b63098da7e2b80ba9ffa1f0dabbb0bd311d177c7c8

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 d46d6c547090034825bc8a493677072b7d6bc898c6dae916dbcbf0e47dd3ca11
MD5 7a3567e16852446dfa1b75fa5ed5ba8d
BLAKE2b-256 c5455b3ef0e53380edbce2cd21e9791d30ca4f2129e8affb8296b1c26dde21ee

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.11.2-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 72.7 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for jollyjack-0.11.2-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 458eb093c3ed13432d829ee7d9d453f239cc22ffbbca98bedcd7e12799157072
MD5 0de0132785cbe1cef7c687d955349f36
BLAKE2b-256 3401b31f71415f2f8638bef5f1b580be25116a764a8e6d0f06314c400412bffa

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp39-cp39-win_amd64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b724df9c6930613df84c4136fe2d1ab2baa2616c8b02eaa2769c5530dbe1c95b
MD5 073af6dc2083132b6fb263ad62315b37
BLAKE2b-256 00fdbe168187fa387dfc67f3121e88e7ee506673aecef11b0f71d620c56a0cd0

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack

File details

Details for the file jollyjack-0.11.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.11.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 5f3fb4e6a173ca505d8e763054182d54d0f896b91df68b385c5d3c5a0a8e34bf
MD5 bef819034f1023085455728b5b5ecfa9
BLAKE2b-256 4d46d45fa2fde4c0d2407222e2ab26a6140c936bfb89b782b0a6ebbf57d924c0

Provenance

The following attestation bundles were made for jollyjack-0.11.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl:

Publisher: python.yml on marcin-krystianc/JollyJack
