Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
# One row group per chunk_size rows; dictionary encoding is disabled
pq.write_table(
    table,
    path,
    row_group_size=chunk_size,
    use_dictionary=False,
    write_statistics=True,
    store_schema=False,
    write_page_index=True,
)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
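Parquet stores data column by column, which is presumably why the example allocates the target with `order='F'`: in Fortran order each column of the array is one contiguous block of memory. The numpy-only sketch below (no JollyJack calls) compares the two layouts.

```python
import numpy as np

n_rows, n_columns = 6, 5
f_array = np.zeros((n_rows, n_columns), dtype=np.float32, order="F")
c_array = np.zeros((n_rows, n_columns), dtype=np.float32, order="C")

# Strides are (bytes to the next row, bytes to the next column).
print(f_array.strides)  # (4, 24)
print(c_array.strides)  # (20, 4)

# A single column is contiguous only in the Fortran-ordered array.
f_column_contiguous = f_array[:, 2].flags["C_CONTIGUOUS"]
c_column_contiguous = c_array[:, 2].flags["C_CONTIGUOUS"]
print(f_column_contiguous, c_column_contiguous)  # True False
```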

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To select which part of the numpy array to read into, create a view
    # that shares its underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(
        source=path,
        metadata=pr.metadata,
        np_array=subset_view,
        row_group_indices=[rg],
        column_indices=range(pr.metadata.num_columns),
    )

# Alternatively, read all row groups in one call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=range(pr.metadata.num_columns),
    )
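The row-group loop above depends on numpy basic slicing returning a view rather than a copy, so anything written into `subset_view` lands in `np_array`. The numpy-only sketch below shows that behavior; the plain assignment stands in for a JollyJack read.

```python
import numpy as np

n_rows, n_columns, chunk_size = 6, 5, 3
np_array = np.zeros((n_rows, n_columns), dtype=np.float32, order="F")

# View over the rows that would belong to the second row group.
subset_view = np_array[chunk_size:2 * chunk_size, :]
shares = np.shares_memory(subset_view, np_array)

# Writing through the view stands in for reading a row group into it.
subset_view[:] = 1.0

print(shares)          # True
print(np_array[:, 0])  # [0. 0. 0. 1. 1. 1.]
```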

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        # Map column i to position num_columns - i - 1
        column_indices={i: pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)},
    )

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(
        source=f,
        metadata=None,
        np_array=np_array,
        row_group_indices=range(pr.metadata.num_row_groups),
        column_indices=((3, 0), (3, 1)),
    )
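The two mapping forms above can be pictured with a numpy-only emulation. It assumes each dict item or tuple is a (source column, destination column) pair, which is how the headings describe them. `emulate_column_mapping` is a hypothetical helper for illustration, not JollyJack's implementation.

```python
import numpy as np

n_rows, n_columns = 6, 5
# Stand-in for the file contents: column j holds the value j in every row.
source = np.tile(np.arange(n_columns, dtype=np.float32), (n_rows, 1))


def emulate_column_mapping(source, column_indices):
    """Copy source columns into a fresh array according to (source, destination) pairs."""
    pairs = column_indices.items() if isinstance(column_indices, dict) else column_indices
    dest = np.zeros(source.shape, dtype=source.dtype, order="F")
    for src_col, dst_col in pairs:
        dest[:, dst_col] = source[:, src_col]
    return dest


reversed_cols = emulate_column_mapping(source, {i: n_columns - i - 1 for i in range(n_columns)})
duplicated = emulate_column_mapping(source, ((3, 0), (3, 1)))

print(reversed_cols[0])  # [4. 3. 2. 1. 0.]
print(duplicated[0])     # [3. 3. 0. 0. 0.]
```

In this emulation, destination columns that no pair refers to stay at zero. In a real read they would simply keep whatever the target array already held.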

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
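Torch has no `order='F'` argument, so the transpose trick is what produces a column-major layout. The short sketch below, which assumes torch is installed, inspects the strides of such a tensor to show that each column is contiguous.

```python
import torch

n_rows, n_columns = 6, 5
# Allocate as (columns, rows), then swap the axes without copying.
tensor = torch.zeros(n_columns, n_rows, dtype=torch.float32).transpose(0, 1)

print(tuple(tensor.shape))  # (6, 5)
print(tensor.stride())      # (1, 6): stepping down a column moves one element

whole_is_contiguous = tensor.is_contiguous()
column_is_contiguous = tensor[:, 2].is_contiguous()
print(whole_is_contiguous, column_is_contiguous)  # False True
```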

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(
    source=path,
    metadata=pr.metadata,
    tensor=tensor,
    row_group_indices=range(pr.metadata.num_row_groups),
    column_indices=range(pr.metadata.num_columns),
    pre_buffer=True,
    use_threads=True,
)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
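The timings are easier to compare as ratios. This sketch computes the PyArrow-to-JollyJack speedup for four rows copied from the table above; the choice of rows is only illustrative.

```python
# (n_threads, use_threads, pre_buffer, dtype, compression, pyarrow_s, jollyjack_s)
rows = [
    (1, False, False, "float", None, 6.79, 3.55),
    (2, True, True, "float", None, 3.36, 2.39),
    (1, False, False, "halffloat", None, 7.21, 1.23),
    (2, True, False, "halffloat", None, 3.11, 0.57),
]

# Speedup = PyArrow time divided by JollyJack time, rounded to two decimals.
speedups = [round(pyarrow_s / jollyjack_s, 2) for *_, pyarrow_s, jollyjack_s in rows]
print(speedups)  # [1.91, 1.41, 5.86, 5.46]
```

For these rows the gap is largest for halffloat data, where the measured JollyJack times are more than five times shorter.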

Download files

Source Distribution

  • jollyjack-0.11.4.tar.gz (149.2 kB): Source

Built Distributions

  • jollyjack-0.11.4-cp312-cp312-win_amd64.whl (74.0 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.11.4-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.4-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.4-cp311-cp311-win_amd64.whl (73.8 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.11.4-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.4-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.4-cp310-cp310-win_amd64.whl (73.5 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.11.4-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.4-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.4-cp39-cp39-win_amd64.whl (73.5 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.11.4-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.4-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
