Read Parquet data directly into a NumPy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(
    table,
    path,
    row_group_size=chunk_size,
    use_dictionary=False,
    write_statistics=True,
    store_schema=False,
    write_page_index=True,
)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
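
The target array above is created with order='F' (column-major). A plausible reason is that each destination column then occupies one contiguous run of memory, which suits column-oriented Parquet data. A minimal NumPy-only sketch of that layout property follows; the variable names are illustrative and no JollyJack call is involved:

```python
import numpy as np

n_rows, n_columns = 6, 5

# Column-major (Fortran-order) target, as in the example above
f_array = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')
# Row-major (C-order) array of the same shape, for comparison
c_array = np.zeros((n_rows, n_columns), dtype=np.float32, order='C')

# In F order each column is one contiguous block of n_rows float32 values
f_column_contiguous = f_array[:, 0].flags['C_CONTIGUOUS']
# In C order the elements of a column are n_columns values apart
c_column_contiguous = c_array[:, 0].flags['C_CONTIGUOUS']

print(f_array.strides)  # (4, 24): next row is 4 bytes away, next column 24 bytes
print(c_array.strides)  # (20, 4)
print(f_column_contiguous, c_column_contiguous)  # True False
```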

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into, create a view
    # that shares its underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
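
The per-row-group loop above tracks row_begin and row_end by hand. A minimal sketch of the same bookkeeping, assuming only a list of per-row-group row counts (the helper name row_group_bounds is hypothetical and not part of JollyJack):

```python
from itertools import accumulate

def row_group_bounds(row_counts):
    """Return (row_begin, row_end) for each row group, in file order."""
    ends = list(accumulate(row_counts))
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))

# Two row groups of three rows each, as in the sample file
bounds = row_group_bounds([3, 3])
print(bounds)  # [(0, 3), (3, 6)]
```

Each pair can then be used to slice the target array (np_array[row_begin:row_end, :]) before the view is passed to jj.read_into_numpy.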

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
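
The two examples above pass column_indices as a dict and as a sequence of pairs. Judging by the "column 3 into multiple destination columns" example, each entry reads as (source column in the file, destination column in the array). The NumPy-only emulation below illustrates that reading. apply_column_mapping is a hypothetical helper, not JollyJack's implementation, and the source-to-destination interpretation is an assumption drawn from the examples:

```python
import numpy as np

def apply_column_mapping(source, dest, column_indices):
    """Copy source columns into dest columns according to column_indices."""
    # Accept either a dict {source: destination} or an iterable of pairs
    pairs = column_indices.items() if isinstance(column_indices, dict) else column_indices
    for src_col, dst_col in pairs:
        dest[:, dst_col] = source[:, src_col]
    return dest

n_rows, n_columns = 2, 5
# Stand-in for file data: each value equals its source column index
source = np.tile(np.arange(n_columns, dtype=np.float32), (n_rows, 1))

# Dict form: columns in reversed order
reversed_dest = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')
apply_column_mapping(source, reversed_dest, {i: n_columns - i - 1 for i in range(n_columns)})
print(reversed_dest[0])  # [4. 3. 2. 1. 0.]

# Pair form: source column 3 written to destination columns 0 and 1
fanout_dest = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')
apply_column_mapping(source, fanout_dest, ((3, 0), (3, 1)))
print(fanout_dest[0])  # [3. 3. 0. 0. 0.]
```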

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = [0]
                        , row_ranges = [slice(0, 1), slice(4, 6)]
                        , column_indices = range(pr.metadata.num_columns))
print(np_array)
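
In the sparse-reading example, row group 0 holds three rows, and the slices slice(0, 1) and slice(4, 6) also cover three rows in total. This suggests that row_ranges names the destination rows of np_array that receive the data. Under that assumption (an inference from the example, not a statement of JollyJack's documented behaviour), the effect can be emulated with NumPy alone. scatter_rows is a hypothetical helper:

```python
import numpy as np

def scatter_rows(rows, dest, row_ranges):
    """Write consecutive source rows into the destination slices, in order."""
    total = sum(len(range(*r.indices(dest.shape[0]))) for r in row_ranges)
    if total != rows.shape[0]:
        raise ValueError(f"row_ranges cover {total} rows, expected {rows.shape[0]}")
    offset = 0
    for r in row_ranges:
        n = len(range(*r.indices(dest.shape[0])))
        dest[r, :] = rows[offset:offset + n, :]
        offset += n
    return dest

n_rows, n_columns = 6, 5
# Stand-in for the three rows of row group 0 (all values non-zero)
rows = np.arange(1, 3 * n_columns + 1, dtype=np.float32).reshape(3, n_columns)
dest = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')

scatter_rows(rows, dest, [slice(0, 1), slice(4, 6)])
print(np.flatnonzero(dest.any(axis=1)))  # [0 4 5]
```

Rows 1 to 3 of the destination stay zero, so only the rows named by the slices are filled.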

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
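
Transposing a row-major (n_columns, n_rows) buffer gives an (n_rows, n_columns) view whose columns are contiguous, the same memory layout that order='F' produces in NumPy. A minimal sketch of that equivalence, written with NumPy so that it runs without torch installed; the analogous check on the tensor above would be tensor.stride(), which counts elements rather than bytes:

```python
import numpy as np

n_rows, n_columns = 6, 5

# Allocate as (n_columns, n_rows) in row-major order, then transpose the view
transposed = np.zeros((n_columns, n_rows), dtype=np.float32).T
# Allocate directly in column-major order
fortran = np.zeros((n_rows, n_columns), dtype=np.float32, order='F')

print(transposed.shape, transposed.strides)  # (6, 5) (4, 24)
print(fortran.shape, fortran.strides)        # (6, 5) (4, 24)
print(transposed.flags['F_CONTIGUOUS'])      # True
```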

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
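
The timings above are easier to compare as ratios. A small sketch that computes the PyArrow-to-JollyJack speedup for a few rows copied from the table; the selection of rows and the speedup helper are illustrative choices:

```python
def speedup(pyarrow_seconds, jollyjack_seconds):
    """Return how many times faster JollyJack was than PyArrow."""
    return pyarrow_seconds / jollyjack_seconds

# (n_threads, use_threads, pre_buffer, dtype, compression, PyArrow s, JollyJack s)
rows = [
    (1, False, False, "float", None, 6.79, 3.55),
    (2, True, True, "float", "snappy", 3.14, 2.55),
    (1, False, False, "halffloat", None, 7.21, 1.23),
    (2, True, False, "halffloat", None, 3.11, 0.57),
]

for n_threads, use_threads, pre_buffer, dtype, compression, pa_s, jj_s in rows:
    print(f"{dtype:9} threads={n_threads} use_threads={use_threads!s:5} "
          f"pre_buffer={pre_buffer!s:5} compression={compression}: "
          f"{speedup(pa_s, jj_s):.2f}x")
```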

Download files

Source Distribution

jollyjack-0.13.1.tar.gz (157.9 kB)

Uploaded: Source

Built Distributions

jollyjack-0.13.1-cp312-cp312-win_amd64.whl (80.6 kB)

Uploaded: CPython 3.12, Windows x86-64

jollyjack-0.13.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

jollyjack-0.13.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

jollyjack-0.13.1-cp311-cp311-win_amd64.whl (80.5 kB)

Uploaded: CPython 3.11, Windows x86-64

jollyjack-0.13.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

jollyjack-0.13.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

jollyjack-0.13.1-cp310-cp310-win_amd64.whl (80.0 kB)

Uploaded: CPython 3.10, Windows x86-64

jollyjack-0.13.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

jollyjack-0.13.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

jollyjack-0.13.1-cp39-cp39-win_amd64.whl (80.0 kB)

Uploaded: CPython 3.9, Windows x86-64

jollyjack-0.13.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB)

Uploaded: CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

jollyjack-0.13.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB)

Uploaded: CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file jollyjack-0.13.1.tar.gz.

File metadata

  • Download URL: jollyjack-0.13.1.tar.gz
  • Upload date:
  • Size: 157.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jollyjack-0.13.1.tar.gz
Algorithm Hash digest
SHA256 588e2764593995b17e91a23f22ad45c36e3919dcd06f583abaccd62ebffd1e03
MD5 629367153f368ef74c6ba1c9e1db8017
BLAKE2b-256 e3aeed37b775b1eeff92fd663aa9c7bbd6d78853e4957292ffe0696bf63bbfb5

Provenance

The following attestation bundles were made for jollyjack-0.13.1.tar.gz:

Publisher: python.yml on marcin-krystianc/JollyJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file jollyjack-0.13.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.13.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 80.6 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jollyjack-0.13.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 7353bd906818fee625ac0e02a0ba6aedfbfbbc9d324511c6a6bd0c5ffdba1a61
MD5 65e7bac2a4129eac9e91ea28da63f51b
BLAKE2b-256 bee812a1666e8dc4e06d0b338255d52458ec743f99f86b9073c21067ad22b7f5

File details

Details for the file jollyjack-0.13.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 57ab47fb949ce63194f2a09bdb17c7b88635bc15c6d27552f719831e81a4a1b1
MD5 54dc6a8d5615984d6916ff50e017c1fb
BLAKE2b-256 62e6c26114d2da5fdc8f3e45dd93f6a04b6efece24c293cf495084c9c5e72c3a

File details

Details for the file jollyjack-0.13.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 50816c704f97a66221e49e5b7d4acabdc73a8fe981a8b28f971b04c90f5454af
MD5 27bae45259ce7171712418879e524eb5
BLAKE2b-256 4d53cdac784aa28211b25e90f50a6c7268542a592574d52a5d6e5b1c50b16a85

File details

Details for the file jollyjack-0.13.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.13.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 80.5 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jollyjack-0.13.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 7d5d79516409c282e1b7317c29d3567daf58058dfb74b83e60d226cd335a9ba4
MD5 9f9fad0a838f40d6d741bcf80e65a1af
BLAKE2b-256 2771565031d9f8296879ab0474d92e22c0d2e514e873a8efde1fec7a88c0b7a0

File details

Details for the file jollyjack-0.13.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 dfc7afabc68db2fc0ac8cec64efe18b20e411fe3b187cca17a34ccb0e36bc7b1
MD5 bb8c8318087e60e8238894c3dd7a4045
BLAKE2b-256 19fbab80df63e7f45572d7f8e46fa1dabf29e2752fcddaa4efb740f3767f25e5

File details

Details for the file jollyjack-0.13.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 72c48a137a4fe7568c8a778fecd5a7dec3692709f538d5ea5c8645f7dee2ed78
MD5 6421b0bd4faa8e915fa01fd4d90d9f90
BLAKE2b-256 abef5dd9a5e7fef9d19625a0c42f1d297f9e91970cc619df764a618fce4ca8f6

File details

Details for the file jollyjack-0.13.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.13.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 80.0 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jollyjack-0.13.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 1f5c624a8a4c9828aeee67caddff37285bfff76c6d7fd080834903982403d4bd
MD5 1f4d3c180b321398e2d667911e2c62d0
BLAKE2b-256 07a8efcc5a5430be0d9507af910655ec32b17dcedfdb03c487f5aac0cd845468

File details

Details for the file jollyjack-0.13.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 2eb57b8d0129486fa94f62619a1af8c9316b996637f15a8420f90d27b57d3046
MD5 04d86e8f32debc1bfc97850ca6c538bf
BLAKE2b-256 fb07fb0aa3e3c747c882d12ce12a2cb2a46d79d1eb67aaf8d05e94ac3ddc094e

File details

Details for the file jollyjack-0.13.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 3ffa5c3de1be164ccae23ef46f9ca7c40288f79902487f9beccc663369e7ceaf
MD5 66ecc67aa3e2f290f17ac86fbb361e39
BLAKE2b-256 572fd94a2230ead2d6b4f7f4783a1464dc08ce73eaf8689ddc1ae2e6643f9ba4

File details

Details for the file jollyjack-0.13.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: jollyjack-0.13.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 80.0 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for jollyjack-0.13.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9150ef14dc26d5b93e5a793fc7945553edf176bcd2731e2f7e97104ed23bedaf
MD5 243d168e3d8d3410f057feb109900841
BLAKE2b-256 833022183f4f9494e517f861df7563523a14aac0275e4f1073ea4fb576dc4e67

File details

Details for the file jollyjack-0.13.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5cdae1b88adbb3105d7f017eecc900926cadff82dfd7569f506cc67545f3f29f
MD5 834e2e548a293c416e80dfac8b8e67a3
BLAKE2b-256 1c6d4ffe57df0817fe779aa2ebe614622e6fc6d32a0302877ac33509bc5271a6

File details

Details for the file jollyjack-0.13.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for jollyjack-0.13.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 d4588f0ab99ca2e0bf7c7edbf0bb9b6628c3ba6949a741a5b6a28abc732d9eda
MD5 31e343f5fc67f7ea88c16749dfb6c70b
BLAKE2b-256 65fe6f66e27203cbb5e9d0c9dce660fa50a91e724f2a0fb36eacea6ac114244a
