Read parquet data directly into numpy array

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow.
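The `~= 17.0` specifier above is a "compatible release" constraint: it accepts any 17.x release and rejects 16.x and 18.x. The following is a minimal sketch of checking this at runtime. The helper names `is_compatible` and `installed_pyarrow_ok` are illustrative only and are not part of JollyJack or pyarrow.

```python
from importlib import metadata


def is_compatible(version: str, required_major: int = 17) -> bool:
    """Return True if `version` satisfies `~= <required_major>.0`."""
    # For a `~= X.0` constraint only the leading (major) component matters.
    major = version.split(".")[0]
    return major.isdigit() and int(major) == required_major


def installed_pyarrow_ok() -> bool:
    """Check the installed pyarrow, if any, against the requirement."""
    try:
        return is_compatible(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        # pyarrow is not installed at all.
        return False
```

Calling `installed_pyarrow_ok()` before importing `jollyjack` is one way to fail early with a clear message when a binary wheel is paired with a different pyarrow major version.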

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(
    table,
    path,
    row_group_size=chunk_size,
    use_dictionary=False,
    write_statistics=True,
    store_schema=False,
    write_page_index=True,
)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
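The target array is created with `order='F'` (column-major), so each column occupies one contiguous block of memory. That this is what allows a parquet column to be written straight into the array is an assumption here, not something the README states. The numpy-only sketch below just shows the layout properties involved, using the same shape as the sample file.

```python
import numpy as np

n_rows, n_columns = 6, 5
np_array = np.zeros((n_rows, n_columns), dtype="f", order="F")

# In Fortran order, stepping down a column moves by one element
# (4 bytes for float32); stepping across a row moves by a whole column.
print(np_array.strides)

# Each column is therefore a contiguous block of memory.
column = np_array[:, 2]
print(column.flags["C_CONTIGUOUS"])

# A row-range slice is a view: writing to it writes to np_array.
subset_view = np_array[0:3, :]
subset_view[:, 2] = 1.0
print(np.shares_memory(subset_view, np_array))
print(np_array[:, 2])
```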

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into, create a view
    # that shares its underlying memory with the target numpy array.
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(source=path,
                       metadata=pr.metadata,
                       np_array=subset_view,
                       row_group_indices=[rg],
                       column_indices=range(pr.metadata.num_columns))

# Alternatively, read all row groups in a single call from an open file
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=range(pr.metadata.num_columns))
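The loop above tracks `row_begin` and `row_end` by accumulating the row count of each row group. The same bookkeeping can be sketched as a small pure-Python helper. `row_group_ranges` is a hypothetical name, and the input list stands in for the `num_rows` values read from the file metadata.

```python
from itertools import accumulate


def row_group_ranges(rows_per_group):
    """Return (row_begin, row_end) for each row group, in file order."""
    ends = list(accumulate(rows_per_group))
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))


# Two row groups of three rows each, as in the sample file.
print(row_group_ranges([3, 3]))  # [(0, 3), (3, 6)]
```

Each pair can then be used to slice the target array, for example `np_array[row_begin:row_end, :]`.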

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices={i: pr.metadata.num_columns - i - 1
                                       for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=((3, 0), (3, 1)))
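The examples so far pass `column_indices` in three shapes: a plain range, a dict, and a sequence of pairs. The sketch below expands all three into explicit (source, destination) pairs to make the intended mapping easier to see. It rests on two assumptions that the README does not spell out: dict keys are source columns and values are destination columns, and the i-th entry of a plain sequence lands in destination column i. The helper `as_pairs` is hypothetical and is not how JollyJack itself is implemented.

```python
def as_pairs(column_indices):
    """Expand column_indices into explicit (source, destination) pairs."""
    if isinstance(column_indices, dict):
        # Assumed: key = source column, value = destination column.
        return list(column_indices.items())
    items = list(column_indices)
    if items and isinstance(items[0], tuple):
        # Already given as (source, destination) pairs.
        return items
    # Plain sequence: assume the i-th listed column goes to destination i.
    return [(src, dst) for dst, src in enumerate(items)]


n_columns = 5
print(as_pairs(range(n_columns)))
print(as_pairs({i: n_columns - i - 1 for i in range(n_columns)}))
print(as_pairs(((3, 0), (3, 1))))
```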

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)
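Transposing a contiguous `(n_columns, n_rows)` tensor swaps its shape and strides without moving any data, which is why the result behaves like a Fortran-ordered `(n_rows, n_columns)` array. The stride arithmetic can be sketched in pure Python, with strides counted in elements as torch reports them. `contiguous_strides` is an illustrative helper, not a torch function.

```python
def contiguous_strides(shape):
    """Element strides of a row-major (C-contiguous) array of `shape`."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


n_columns, n_rows = 5, 6
shape = (n_columns, n_rows)
strides = contiguous_strides(shape)  # (6, 1)

# transpose(0, 1) swaps both shape and strides; the data stays in place.
t_shape, t_strides = shape[::-1], strides[::-1]
print(t_shape, t_strides)  # (6, 5) (1, 6)
```

A stride of 1 along the row dimension means consecutive rows of one column sit next to each other in memory, which is the column-major layout.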

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(source=path,
                   metadata=pr.metadata,
                   tensor=tensor,
                   row_group_indices=range(pr.metadata.num_row_groups),
                   column_indices=range(pr.metadata.num_columns),
                   pre_buffer=True,
                   use_threads=True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
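To compare configurations, it can help to turn the timings into speedup ratios (PyArrow time divided by JollyJack time). The sketch below does this for four rows copied from the benchmark table. The selection of rows and the `speedup` helper are illustrative choices, not part of the original benchmark.

```python
# (n_threads, use_threads, pre_buffer, dtype, compression)
#     -> (PyArrow seconds, JollyJack seconds), copied from the table
timings = {
    (1, False, False, "float", None): (6.79, 3.55),
    (2, True, True, "float", "snappy"): (3.14, 2.55),
    (1, False, False, "halffloat", None): (7.21, 1.23),
    (2, True, False, "halffloat", None): (3.11, 0.57),
}


def speedup(pyarrow_s, jollyjack_s):
    """How many times faster JollyJack was than PyArrow."""
    return pyarrow_s / jollyjack_s


for config, (pa_s, jj_s) in timings.items():
    print(config, f"{speedup(pa_s, jj_s):.2f}x")
```

In these rows the gap is widest for `halffloat` data and narrowest when both `use_threads` and `pre_buffer` are enabled.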
