Read parquet data directly into numpy array

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
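
The `~= 17.0` specifier is a "compatible release" constraint, which is equivalent to `>= 17.0, == 17.*`. As a rough illustration, the sketch below checks a version string against that rule using only the standard library. Both function names are hypothetical, and the check is simplified: it compares the major version only and ignores pre-release and epoch edge cases.

```python
from importlib import metadata


def satisfies_compatible_release(version, major=17):
    """Simplified check for `~= <major>.0`: the major version must match."""
    parts = version.split(".")
    return parts[0].isdigit() and int(parts[0]) == major


def installed_pyarrow_is_compatible(major=17):
    """Check the installed pyarrow; returns None if pyarrow is not installed."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None
    return satisfies_compatible_release(installed, major)


print(satisfies_compatible_release("17.0.0"))  # True
print(satisfies_compatible_release("18.0.0"))  # False
```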

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a Fortran-ordered (column-major) float32 array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
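
The target array above is created with `order='F'`. A plausible reason (an assumption, not stated in the project description) is that Parquet stores data column by column, and in Fortran order each NumPy column is one contiguous block of memory. The NumPy-only sketch below shows the difference in memory layout and confirms that a row-range slice is a view onto the same memory.

```python
import numpy as np

n_rows, n_columns = 6, 5

c_array = np.zeros((n_rows, n_columns), dtype=np.float32, order="C")
f_array = np.zeros((n_rows, n_columns), dtype=np.float32, order="F")

# Strides are the byte steps along (rows, columns); float32 is 4 bytes.
print(c_array.strides)  # (20, 4): moving down a column skips a whole row
print(f_array.strides)  # (4, 24): moving down a column moves to the next element

# In Fortran order each column is a contiguous block of memory.
print(f_array[:, 2].flags.c_contiguous)  # True
print(c_array[:, 2].flags.c_contiguous)  # False

# A row-range slice is a view: it shares memory with the full array,
# so writing into the view fills the corresponding rows of f_array.
view = f_array[0:3, :]
view[:, 0] = 1.0
print(np.shares_memory(view, f_array))  # True
print(f_array[:, 0])  # [1. 1. 1. 0. 0. 0.]
```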

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To select the subset of the numpy array to read into,
    # create a view that shares its underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file object
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
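
The per-row-group loop above tracks `row_begin` and `row_end` by hand. As a small illustration, the same offsets can be derived from the list of row-group sizes. `row_group_bounds` is a hypothetical helper; in practice the sizes would come from `pr.metadata.row_group(rg).num_rows`.

```python
from itertools import accumulate


def row_group_bounds(rows_per_group):
    """Return (row_begin, row_end) for each row group, end exclusive."""
    ends = list(accumulate(rows_per_group))
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))


# Two row groups of three rows each, matching the sample file above.
print(row_group_bounds([3, 3]))  # [(0, 3), (3, 6)]
```

Each pair can be used directly as the slice `np_array[row_begin:row_end, :]` for the matching row group.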

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
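
The two examples above suggest that `column_indices` can map source columns to destination columns, either as a dict or as a sequence of pairs. The NumPy-only sketch below emulates that mapping on in-memory arrays to make the semantics concrete. `emulate_column_mapping` is a hypothetical helper, and the (source, destination) reading of each pair is inferred from the examples rather than taken from JollyJack's implementation.

```python
import numpy as np


def emulate_column_mapping(source, destination, column_indices):
    """Copy source columns into destination columns (illustration only).

    Assumes each entry is a (source_column, destination_column) pair,
    given either as dict items or as a sequence of 2-tuples.
    """
    if isinstance(column_indices, dict):
        pairs = list(column_indices.items())
    else:
        pairs = [tuple(pair) for pair in column_indices]
    for src, dst in pairs:
        destination[:, dst] = source[:, src]
    return destination


source = np.arange(12, dtype=np.float32).reshape(3, 4)
n = source.shape[1]

# Reversed column order, expressed as a dict.
reversed_cols = emulate_column_mapping(
    source, np.zeros_like(source), {i: n - i - 1 for i in range(n)}
)

# Source column 3 copied into destination columns 0 and 1.
duplicated = emulate_column_mapping(
    source, np.zeros((3, 2), dtype=np.float32), ((3, 0), (3, 1))
)
```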

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
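
To read the table as relative speedups, divide the PyArrow time by the JollyJack time. The short sketch below does this for three rows copied from the table above; the row labels and the `speedup` helper are only illustrative.

```python
def speedup(pyarrow_seconds, jollyjack_seconds):
    """How many times faster JollyJack was than PyArrow for one row."""
    return pyarrow_seconds / jollyjack_seconds


# (PyArrow seconds, JollyJack seconds) copied from three rows of the table.
rows = {
    "float, 1 thread, no use_threads, no pre_buffer, no compression": (6.79, 3.55),
    "halffloat, 1 thread, no use_threads, no pre_buffer, no compression": (7.21, 1.23),
    "float, 2 threads, use_threads, pre_buffer, snappy": (3.14, 2.55),
}

for label, (pyarrow_s, jollyjack_s) in rows.items():
    print(f"{label}: {speedup(pyarrow_s, jollyjack_s):.2f}x")
# Prints 1.91x, 5.86x and 1.23x for the three rows, in that order.
```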

Source Distribution

jollyjack-0.11.5.tar.gz (150.1 kB)
