Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 19.0.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
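
The `~= 19.0.0` pin above means "at least 19.0.0, and still in the 19.0 series". As a rough, standard-library-only sketch of that rule, the hypothetical helper `satisfies_pin` below checks a version string. It ignores pre-release suffixes, so a tool such as `packaging.specifiers.SpecifierSet` is the better choice when exact PEP 440 behaviour matters.

```python
import re


def satisfies_pin(version, base=(19, 0, 0)):
    """Approximate the compatible-release check `~= 19.0.0`.

    True when `version` has the same major.minor as `base` and is not older
    than `base`. Pre-release suffixes such as "rc1" are ignored here.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(g) if g is not None else 0 for g in match.groups())
    return parts[:2] == base[:2] and parts >= base


checks = {v: satisfies_pin(v) for v in ["18.1.0", "19.0.0", "19.0.1", "19.1.0", "20.0.0"]}
```

In practice the installed version would be passed in, for example `satisfies_pin(pyarrow.__version__)`.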

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a Fortran-ordered (column-major) float32 array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
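
The target array is allocated with `order='F'`. The numpy-only sketch below shows what that changes in memory: in Fortran order each column is one contiguous run of values, which matches parquet's column-oriented storage. That this is the reason for the layout choice is an assumption here, not something the project states.

```python
import numpy as np

n_rows, n_columns = 6, 5
f_array = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')
c_array = np.zeros((n_rows, n_columns), dtype=np.float32, order='C')

# Strides are in bytes: (step to the next row, step to the next column).
f_strides = f_array.strides  # (4, 24): consecutive rows of a column are adjacent
c_strides = c_array.strides  # (20, 4): consecutive columns of a row are adjacent

# A single column is contiguous only in the Fortran-ordered array.
f_column_contiguous = f_array[:, 0].flags['C_CONTIGUOUS']
c_column_contiguous = c_array[:, 0].flags['C_CONTIGUOUS']

# A row-range view shares memory with the parent array,
# and each of its columns is still one contiguous run.
row_view = f_array[0:3, :]
view_shares_memory = np.shares_memory(row_view, f_array)
view_column_contiguous = row_view[:, 0].flags['C_CONTIGUOUS']
```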

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into,
    # create a view that shares memory with the target array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
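
The loop above keeps a running `row_begin`/`row_end` pair per row group. As a small numpy-only sketch of the same bookkeeping, the boundaries can be derived from the row-group sizes with a cumulative sum. The sizes `[3, 3]` mirror the sample file (two row groups of `chunk_size = 3`) and are hard-coded here rather than read from parquet metadata.

```python
import numpy as np

# Row counts per row group; in real code these would come from
# pr.metadata.row_group(rg).num_rows.
row_group_sizes = [3, 3]

ends = np.cumsum(row_group_sizes)
begins = ends - row_group_sizes
bounds = [(int(b), int(e)) for b, e in zip(begins, ends)]

np_array = np.zeros((int(ends[-1]), 5), dtype='f', order='F')

# Each view covers the rows of one row group and shares memory with np_array,
# so writing through a view fills the matching rows of the full array.
views = [np_array[b:e, :] for b, e in bounds]
views[1][:] = 1.0
```

Writing through `views[1]` changes rows 3 to 5 of `np_array`, which is the property the per-row-group read depends on.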

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
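
The examples above pass `column_indices` in three shapes: a plain sequence, a dict, and a sequence of pairs. The sketch below simulates the mapping with numpy only and does not call JollyJack. It assumes that a plain sequence sends the i-th listed source column to destination column i, that dict keys are source columns and values are destination columns, and that pairs are `(source, destination)`. The helper names are hypothetical.

```python
import numpy as np


def normalize_column_indices(column_indices):
    """Turn any of the three shapes into a list of (source, destination) pairs."""
    if isinstance(column_indices, dict):
        return list(column_indices.items())
    pairs = []
    for position, item in enumerate(column_indices):
        if isinstance(item, tuple):
            pairs.append(item)              # already (source, destination)
        else:
            pairs.append((item, position))  # i-th listed column -> column i
    return pairs


def copy_columns(source, destination, column_indices):
    """Copy source columns into destination columns according to the mapping."""
    for src, dst in normalize_column_indices(column_indices):
        destination[:, dst] = source[:, src]


source = np.arange(30, dtype=np.float32).reshape(6, 5)

reversed_dest = np.zeros((6, 5), dtype=np.float32, order='F')
copy_columns(source, reversed_dest, {i: 5 - i - 1 for i in range(5)})

duplicated_dest = np.zeros((6, 5), dtype=np.float32, order='F')
copy_columns(source, duplicated_dest, ((3, 0), (3, 1)))
```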

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = [0]
                        , row_ranges = [slice(0, 1), slice(4, 6)]
                        , column_indices = range(pr.metadata.num_columns))
print(np_array)
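
In the sparse example, row group 0 holds 3 rows and the two slices together cover 3 destination rows (row 0 and rows 4 to 5). The numpy-only sketch below simulates that placement under the assumption that `row_ranges` names rows of the destination array and that source rows fill them in order. It does not call JollyJack, and `scatter_rows` is a hypothetical helper.

```python
import numpy as np


def scatter_rows(source_rows, destination, row_ranges):
    """Write consecutive source rows into the destination rows named by row_ranges."""
    counts = [len(range(*r.indices(destination.shape[0]))) for r in row_ranges]
    if sum(counts) != source_rows.shape[0]:
        raise ValueError("row_ranges must cover exactly as many rows as the source")
    cursor = 0
    for row_range, count in zip(row_ranges, counts):
        destination[row_range, :] = source_rows[cursor:cursor + count, :]
        cursor += count


# Three source rows, standing in for the rows of row group 0.
source_rows = np.arange(1, 16, dtype=np.float32).reshape(3, 5)
destination = np.zeros((6, 5), dtype=np.float32, order='F')

scatter_rows(source_rows, destination, [slice(0, 1), slice(4, 6)])
```

Under this reading, rows 1 to 3 of the destination keep their initial zeros.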

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
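
To read the timings as speedups, each PyArrow time can be divided by the matching JollyJack time. The short sketch below does this for three rows copied from the table; the row labels are shorthand introduced here for illustration.

```python
# (label, PyArrow seconds, JollyJack seconds), copied from the benchmark table.
rows = [
    ("float, 1 thread, no use_threads, no pre_buffer, no compression", 6.79, 3.55),
    ("halffloat, 1 thread, no use_threads, no pre_buffer, no compression", 7.21, 1.23),
    ("float, 2 threads, use_threads, pre_buffer, snappy", 3.14, 2.55),
]

# Speedup = PyArrow time / JollyJack time, rounded to two decimals.
speedups = {label: round(pyarrow_s / jollyjack_s, 2) for label, pyarrow_s, jollyjack_s in rows}

for label, speedup in speedups.items():
    print(f"{label}: {speedup:.2f}x")
```

For these three rows the ratios are about 1.91x, 5.86x and 1.23x.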

Download files

Source Distribution

  • jollyjack-0.14.0.tar.gz (158.7 kB): Source

Built Distributions

  • jollyjack-0.14.0-cp312-cp312-win_amd64.whl (80.8 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.14.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.12, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.14.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.14.0-cp311-cp311-win_amd64.whl (80.7 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.14.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.11, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.14.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.14.0-cp310-cp310-win_amd64.whl (80.2 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.14.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.14.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.10, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.14.0-cp39-cp39-win_amd64.whl (80.1 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.14.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.14.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.9, manylinux glibc 2.24+ and 2.28+, ARM64
