Read Parquet data directly into NumPy arrays

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 20.0.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
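
The `~=` in the requirement above is a compatible-release specifier: `~= 20.0.0` is equivalent to `>= 20.0.0, == 20.0.*`. The sketch below illustrates that rule with a hypothetical helper, `satisfies_compatible_release`, which is not part of JollyJack or pyarrow and handles only plain numeric versions:

```python
def satisfies_compatible_release(installed: str, spec: str = "20.0.0") -> bool:
    """Return True if `installed` satisfies `~= spec` (plain numeric versions only)."""
    inst = tuple(int(part) for part in installed.split("."))
    base = tuple(int(part) for part in spec.split("."))
    # Pad a shorter version with zeros so that "20.0" compares equal to "20.0.0".
    inst = inst + (0,) * (len(base) - len(inst))
    # `~= X.Y.Z` means: same X.Y prefix, and at least X.Y.Z.
    prefix = base[:-1]
    return inst[:len(prefix)] == prefix and inst >= base


for version in ("20.0.0", "20.0.3", "20.1.0", "19.0.1"):
    print(version, satisfies_compatible_release(version))
```

For real projects, a packaging tool's own version parser is likely a better choice than hand-rolled parsing like this.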

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
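
With `order='F'`, every column of the target array occupies one contiguous block of memory, which presumably suits a columnar format such as Parquet. A NumPy-only sketch of what that layout looks like (sizes are chosen for illustration):

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')

# Column-major layout: the array is F-contiguous, not C-contiguous.
print(np_array.flags['F_CONTIGUOUS'], np_array.flags['C_CONTIGUOUS'])

# Strides are (bytes to the next row, bytes to the next column).
# float32 takes 4 bytes, so stepping down a column advances 4 bytes,
# while stepping to the next column skips a whole column (n_rows * 4 bytes).
print(np_array.strides)

# Each individual column is a contiguous 1-D block.
column = np_array[:, 2]
print(column.flags['C_CONTIGUOUS'])
```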

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To select the subset of the numpy array to read into,
    # create a view that shares its underlying memory with the target array.
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(source=path,
                       metadata=pr.metadata,
                       np_array=subset_view,
                       row_group_indices=[rg],
                       column_indices=range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=range(pr.metadata.num_columns))
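
The per-row-group loop relies on two things: cumulative row offsets, and the fact that basic slicing of a NumPy array yields a view, not a copy. A NumPy-only sketch of both, where `row_group_sizes` is a made-up stand-in for the `num_rows` values in the Parquet metadata and the assignment stands in for the actual read:

```python
import numpy as np

row_group_sizes = [3, 3]  # hypothetical; stands in for row_group(rg).num_rows
n_columns = 5
np_array = np.zeros((sum(row_group_sizes), n_columns), dtype='f', order='F')

row_end = 0
for rg, num_rows in enumerate(row_group_sizes):
    row_begin, row_end = row_end, row_end + num_rows

    # Basic slicing returns a view: no data is copied.
    subset_view = np_array[row_begin:row_end, :]
    assert np.shares_memory(subset_view, np_array)

    # Stand-in for the read: writing through the view fills the parent array.
    subset_view[:] = rg + 1

print(np_array[:, 0])
```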

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices={i: pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})
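
In the call above, the dictionary appears to map each source column index in the file to a destination column index in the array (this reading is inferred from the example, not taken from an API reference). A NumPy-only emulation of that mapping, with a made-up `source` array standing in for the file contents:

```python
import numpy as np

n_rows, n_columns = 6, 5
# Hypothetical stand-in for the file: column i holds the constant value i.
source = np.tile(np.arange(n_columns, dtype=np.float32), (n_rows, 1))
target = np.zeros((n_rows, n_columns), dtype='f', order='F')

# Same mapping as in the example: source column i -> destination column n - i - 1.
column_indices = {i: n_columns - i - 1 for i in range(n_columns)}

for src_col, dst_col in column_indices.items():
    target[:, dst_col] = source[:, src_col]

print(target[0])
```

The plain column copy here only emulates the effect of the mapping; it says nothing about how JollyJack performs the read internally.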

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=((3, 0), (3, 1)))
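
A sequence of pairs is presumably used here because a dictionary cannot hold the same key twice, so it cannot express one source column feeding several destination columns. A pure-Python illustration of the difference:

```python
# A dict literal with a repeated key silently keeps only the last value.
as_dict = {3: 0, 3: 1}
print(as_dict)

# A tuple of (source, destination) pairs keeps both entries.
as_pairs = ((3, 0), (3, 1))
destinations = [dst for src, dst in as_pairs if src == 3]
print(destinations)
```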

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=[0],
                       row_ranges=[slice(0, 1), slice(4, 6)],
                       column_indices=range(pr.metadata.num_columns))
print(np_array)
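
In this example, row group 0 holds `chunk_size` = 3 rows and the two slices also cover 1 + 2 = 3 rows, which suggests that `row_ranges` selects the destination rows that receive the data (an inference from the example, not a documented guarantee). A NumPy-only emulation under that assumption, with made-up values standing in for the row group:

```python
import numpy as np

n_rows, n_columns = 6, 5
target = np.zeros((n_rows, n_columns), dtype='f', order='F')

# Hypothetical stand-in for row group 0: three rows filled with 1, 2 and 3.
row_group = np.array([[1.0] * n_columns,
                      [2.0] * n_columns,
                      [3.0] * n_columns], dtype='f')

row_ranges = [slice(0, 1), slice(4, 6)]

# Under the stated assumption, the slices cover as many rows as the row group provides.
covered = sum(len(range(*s.indices(n_rows))) for s in row_ranges)
assert covered == row_group.shape[0]

# Scatter consecutive source rows into the selected destination rows.
src_begin = 0
for s in row_ranges:
    length = len(range(*s.indices(n_rows)))
    target[s, :] = row_group[src_begin:src_begin + length, :]
    src_begin += length

print(target[:, 0])
```

Rows 1 to 3 of `target` are not covered by any slice and keep their initial zeros.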

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style (column-major) order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
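
Transposing swaps strides without moving data, which is how a row-major allocation of shape (n_columns, n_rows) becomes a column-major view of shape (n_rows, n_columns). The same effect can be checked in NumPy, used here only as a stand-in so that the sketch does not depend on torch:

```python
import numpy as np

n_rows, n_columns = 6, 5

# Allocate row-major as (n_columns, n_rows), then transpose.
base = np.zeros((n_columns, n_rows), dtype=np.float32)
view = base.T

print(view.shape)
print(view.flags['F_CONTIGUOUS'])
print(np.shares_memory(view, base))
```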

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(source=path,
                   metadata=pr.metadata,
                   tensor=tensor,
                   row_group_indices=range(pr.metadata.num_row_groups),
                   column_indices=range(pr.metadata.num_columns),
                   pre_buffer=True,
                   use_threads=True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
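
The speed-up in each row is the PyArrow time divided by the JollyJack time. A short calculation for the first float row and the first halffloat row, with the timings copied from the table:

```python
# (dtype, PyArrow seconds, JollyJack seconds) for n_threads=1, use_threads=False,
# pre_buffer=False, compression=None.
rows = [
    ("float", 6.79, 3.55),
    ("halffloat", 7.21, 1.23),
]

speedups = {dtype: round(pyarrow_s / jollyjack_s, 2)
            for dtype, pyarrow_s, jollyjack_s in rows}
print(speedups)
```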
