Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 19.0.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
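The `~= 19.0.0` specifier above is a compatible-release constraint: it accepts 19.0.0 and later patch releases of 19.0, but not 19.1 or 20.0. A rough sketch of that rule using only the standard library follows; the function names are hypothetical, and only plain numeric versions such as "19.0.1" are handled.

```python
from importlib import metadata


def is_compatible_release(version, base=(19, 0, 0)):
    """Return True if `version` satisfies `~= base` for a three-part base.

    That means: same major and minor number, patch number >= base patch.
    Simplified sketch; pre-release suffixes such as "rc1" are not parsed.
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    parts = parts + (0,) * (3 - len(parts))  # pad "19.0" to (19, 0, 0)
    return parts[:2] == base[:2] and parts[2] >= base[2]


def installed_pyarrow_is_compatible():
    """Check the installed pyarrow; False if missing or not parseable."""
    try:
        return is_compatible_release(metadata.version("pyarrow"))
    except (metadata.PackageNotFoundError, ValueError):
        return False
```

In practice pip enforces this constraint at install time, so a check like this is mainly useful for diagnosing an environment where pyarrow was upgraded afterwards.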

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table =  pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
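The target array above is allocated with `order='F'`. In that layout each column occupies one contiguous run of memory, which plausibly suits a columnar format such as Parquet (that motivation is an assumption, not something the project states here). A quick check of what the allocation gives, using the same shape as the sample:

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

# 'f' is single precision, i.e. float32 (4 bytes per element).
assert np_array.dtype == np.float32

# Column-major layout: stepping down a column moves 4 bytes,
# stepping across a row jumps over a whole column (4 * n_rows bytes).
assert np_array.flags.f_contiguous
assert np_array.strides == (4, 4 * n_rows)

# A single column is therefore one contiguous 1-D block.
column = np_array[:, 2]
assert column.flags.c_contiguous
```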

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into,
    # create a view that shares its underlying memory with the target array
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file handle
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
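The per-row-group loop above depends on `subset_view` being a view rather than a copy, so that whatever is written into it lands in `np_array`. The sketch below illustrates that NumPy behaviour on its own, with a plain assignment standing in for the reader; it does not call JollyJack.

```python
import numpy as np

n_rows, n_columns, chunk_size = 6, 5, 3
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

# Basic slicing returns a view onto the same buffer.
subset_view = np_array[chunk_size:2 * chunk_size, :]
assert np.shares_memory(subset_view, np_array)

# Writing through the view (a stand-in for a reader filling it)
# is visible in the parent array, and only in the selected rows.
subset_view[:] = 1.0
assert np_array[:chunk_size].sum() == 0.0
assert np_array[chunk_size:].sum() == chunk_size * n_columns

# Fancy indexing, by contrast, makes a copy, so writes into it
# would never reach np_array.
copied = np_array[[3, 4, 5], :]
assert not np.shares_memory(copied, np_array)
```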

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
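The examples above pass `column_indices` in three shapes: a plain range, a dict, and a tuple of pairs. The sketch below emulates the resulting column placement in NumPy only, under assumed semantics inferred from the examples rather than from JollyJack's source: a dict maps source column to destination column, a pair is (source, destination), and a plain sequence sends the i-th listed source column to destination column i. Both helper names are hypothetical.

```python
import numpy as np


def normalize_column_indices(column_indices):
    """Convert the three forms into a list of (source, destination) pairs.

    Assumed semantics, inferred from the examples:
      dict            -> {source: destination}
      pairs           -> (source, destination)
      plain sequence  -> i-th listed source goes to destination i
    """
    if isinstance(column_indices, dict):
        return list(column_indices.items())
    items = list(column_indices)
    if items and isinstance(items[0], tuple):
        return items
    return [(src, dst) for dst, src in enumerate(items)]


def copy_columns(src, dst, column_indices):
    """NumPy stand-in for the column placement; reads no Parquet file."""
    for s, d in normalize_column_indices(column_indices):
        dst[:, d] = src[:, s]
    return dst


n_rows, n_columns = 6, 5
src = np.arange(n_rows * n_columns, dtype=np.float32).reshape(n_rows, n_columns)

# Reversed order, as in the dict example.
reversed_map = {i: n_columns - i - 1 for i in range(n_columns)}
out = copy_columns(src, np.zeros_like(src), reversed_map)
assert np.array_equal(out, src[:, ::-1])

# Column 3 copied into destination columns 0 and 1, as in the pairs example.
out = copy_columns(src, np.zeros_like(src), ((3, 0), (3, 1)))
assert np.array_equal(out[:, 0], src[:, 3])
assert np.array_equal(out[:, 1], src[:, 3])
```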

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = [0]
                        , row_ranges = [slice(0, 1), slice(4, 6)]
                        , column_indices = range(pr.metadata.num_columns))
print(np_array)

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
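The transpose trick above allocates the tensor with the dimensions swapped and then transposes it, which yields a (n_rows, n_columns) view whose memory is laid out column by column. The same idea can be checked in NumPy, shown here so that the sketch runs without torch installed; it is an analogy for the layout, not a call into torch or JollyJack.

```python
import numpy as np

n_rows, n_columns = 6, 5

# Allocate (n_columns, n_rows) in the default C order, then transpose.
transposed = np.zeros((n_columns, n_rows), dtype=np.float32).T

# The result has the requested shape but column-major memory layout.
assert transposed.shape == (n_rows, n_columns)
assert transposed.flags.f_contiguous
assert not transposed.flags.c_contiguous

# It matches an array allocated directly in Fortran order.
direct = np.zeros((n_rows, n_columns), dtype=np.float32, order="F")
assert transposed.strides == direct.strides
```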

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
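To compare the two libraries at a glance, the timings can be turned into speedup ratios (PyArrow time divided by JollyJack time). The sketch below does this for a handful of rows copied from the table above; the `speedup` helper and the choice of rows are illustrative only.

```python
# (n_threads, use_threads, pre_buffer, dtype, compression, pyarrow_s, jollyjack_s)
rows = [
    (1, False, False, "float", None, 6.79, 3.55),
    (2, True, True, "float", None, 3.36, 2.39),
    (1, False, False, "halffloat", None, 7.21, 1.23),
    (2, True, False, "halffloat", None, 3.11, 0.57),
]


def speedup(pyarrow_s, jollyjack_s):
    """How many times faster JollyJack was than PyArrow for one row."""
    return pyarrow_s / jollyjack_s


for n_threads, use_threads, pre_buffer, dtype, compression, pa_s, jj_s in rows:
    print(
        f"{dtype:9} threads={n_threads} use_threads={use_threads!s:5} "
        f"pre_buffer={pre_buffer!s:5} speedup={speedup(pa_s, jj_s):.2f}x"
    )
```

In the rows selected here, the ratio is noticeably larger for halffloat than for float.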

Download files

Download the file for your platform.

Source Distribution

  • jollyjack-0.14.2.tar.gz (159.3 kB)

Built Distributions

  • jollyjack-0.14.2-cp312-cp312-win_amd64.whl (81.3 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.14.2-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.12, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.14.2-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux (glibc 2.24+ and 2.28+), ARM64
  • jollyjack-0.14.2-cp311-cp311-win_amd64.whl (81.2 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.14.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.11, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.14.2-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux (glibc 2.24+ and 2.28+), ARM64
  • jollyjack-0.14.2-cp310-cp310-win_amd64.whl (80.7 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.14.2-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.14.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.10, manylinux (glibc 2.24+ and 2.28+), ARM64
  • jollyjack-0.14.2-cp39-cp39-win_amd64.whl (80.4 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.14.2-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux (glibc 2.24+ and 2.28+), x86-64
  • jollyjack-0.14.2-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.9, manylinux (glibc 2.24+ and 2.28+), ARM64

File details

Details for the file jollyjack-0.14.2.tar.gz.

File metadata

  • Download URL: jollyjack-0.14.2.tar.gz
  • Upload date:
  • Size: 159.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for jollyjack-0.14.2.tar.gz
Algorithm Hash digest
SHA256 7d904d8e360d683c2e800e9897977d58a93b56e8a946019cb80b54a6b124903b
MD5 fdb93b3223024e782e1468bbea9baf83
BLAKE2b-256 28d1f00cfc08e87efbe546cb32edf1c446d6828335ee88a6a27f8975ab17bcd0

Provenance

The following attestation bundles were made for jollyjack-0.14.2.tar.gz:

Publisher: python.yml on marcin-krystianc/JollyJack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
