Read parquet data directly into numpy array

Project description

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution packages require the latest major version of pyarrow (17.x for this release).
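
The `~= 17.0` specifier above is a PEP 440 compatible-release clause, equivalent to `>= 17.0, == 17.*`. The sketch below spells that rule out using only the standard library. The helper name `is_compatible_release` is hypothetical (it is not part of JollyJack or pyarrow), and it only handles plain numeric versions such as `17.0.0`.

```python
def is_compatible_release(installed: str, base: str = "17.0") -> bool:
    """Return True if `installed` satisfies `~= base` (PEP 440 compatible release).

    Hypothetical helper for illustration; handles only plain numeric
    versions such as "17.0.0", not pre-releases or local versions.
    """
    inst = [int(p) for p in installed.split(".")]
    want = [int(p) for p in base.split(".")]
    # `~= X.Y` pins every component except the last one: == X.*
    prefix = want[:-1]
    if inst[:len(prefix)] != prefix:
        return False
    # ...and requires the installed version to be at least the base: >= X.Y
    # Pad with zeros so "17" compares equal to "17.0".
    width = max(len(inst), len(want))
    inst += [0] * (width - len(inst))
    want += [0] * (width - len(want))
    return inst >= want


print(is_compatible_release("17.0.0"))  # True
print(is_compatible_release("17.2.1"))  # True
print(is_compatible_release("16.1.0"))  # False
print(is_compatible_release("18.0.0"))  # False
```

To check a real environment, the installed version string can be obtained with `importlib.metadata.version("pyarrow")` and passed to the helper.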

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create an array of zeros
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
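
The target array is allocated with `order='F'` (column-major), presumably so that each Parquet column maps onto one contiguous run of memory. The pure-Python sketch below shows how byte strides differ between row-major and column-major layouts for the sample shape. The function `strides_for` is a hypothetical helper written for this illustration, not a NumPy or JollyJack API.

```python
def strides_for(shape, itemsize, order):
    """Byte strides of a dense 2-D array in C (row-major) or F (column-major) order."""
    n_rows, n_columns = shape
    if order == "C":
        # Moving one row down skips a whole row; one column right skips one item.
        return (n_columns * itemsize, itemsize)
    if order == "F":
        # Moving one row down skips one item; one column right skips a whole column.
        return (itemsize, n_rows * itemsize)
    raise ValueError("order must be 'C' or 'F'")


shape = (6, 5)   # (n_rows, n_columns) of the sample file
itemsize = 4     # bytes per float32

c_strides = strides_for(shape, itemsize, "C")
f_strides = strides_for(shape, itemsize, "F")
print(c_strides)  # (20, 4)
print(f_strides)  # (4, 24)
```

In column-major order, consecutive rows of one column are 4 bytes apart, so a whole column is a single contiguous block. In NumPy the same numbers are available as `np_array.strides`, and `np_array.flags['F_CONTIGUOUS']` reports whether the layout is column-major.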

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To define which subset of the numpy array we want read into,
    # we need to create a view which shares underlying memory with the target numpy array
    subset_view = np_array[row_begin:row_end, :] 
    jj.read_into_numpy (source = path
                        , metadata = pr.metadata
                        , np_array = subset_view
                        , row_group_indices = [rg]
                        , column_indices = range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = range(pr.metadata.num_columns))
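
The loop above tracks `row_begin` and `row_end` by hand. As a minimal sketch, the same offsets can be derived as a running sum of the row-group sizes. The helper `row_group_slices` is hypothetical; in practice the sizes would come from `pr.metadata.row_group(rg).num_rows`.

```python
from itertools import accumulate


def row_group_slices(row_counts):
    """Map each row group to the (row_begin, row_end) range it fills in the target array."""
    ends = list(accumulate(row_counts))
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))


# The sample file has 2 row groups of 3 rows each.
slices = row_group_slices([3, 3])
print(slices)  # [(0, 3), (3, 6)]
```

Each pair can then be used to slice the destination, for example `np_array[row_begin:row_end, :]`, which also works when row groups have different sizes.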

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = {i:pr.metadata.num_columns - i - 1 for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy (source = f
                        , metadata = None
                        , np_array = np_array
                        , row_group_indices = range(pr.metadata.num_row_groups)
                        , column_indices = ((3, 0), (3, 1)))
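
The examples above pass `column_indices` in three shapes: a plain sequence, a dict, and a sequence of pairs. The sketch below shows one way to read all three as a list of (source column, destination column) pairs. This interpretation is an assumption inferred from the examples, not taken from JollyJack's implementation, and `normalize_column_indices` is a hypothetical helper.

```python
def normalize_column_indices(column_indices):
    """Return a list of (source_column, destination_column) pairs.

    Hypothetical helper; the (source, destination) reading is an assumption
    inferred from the usage examples, not taken from JollyJack's code.
    """
    if isinstance(column_indices, dict):
        return list(column_indices.items())
    pairs = []
    for position, item in enumerate(column_indices):
        if isinstance(item, (tuple, list)):
            source, destination = item
        else:
            # A plain index: the i-th listed column lands in destination column i.
            source, destination = item, position
        pairs.append((source, destination))
    return pairs


n_columns = 5
print(normalize_column_indices(range(3)))
# [(0, 0), (1, 1), (2, 2)]
print(normalize_column_indices({i: n_columns - i - 1 for i in range(n_columns)}))
# [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
print(normalize_column_indices(((3, 0), (3, 1))))
# [(3, 0), (3, 1)]
```

Under this reading, the pair form is the only one that can send a single source column to several destinations, because a dict cannot repeat a key.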

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype = torch.float32).transpose(0, 1)

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch (source = path
                    , metadata = pr.metadata
                    , tensor = tensor
                    , row_group_indices = range(pr.metadata.num_row_groups)
                    , column_indices = range(pr.metadata.num_columns)
                    , pre_buffer = True
                    , use_threads = True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
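
One way to read the benchmark table is as a speedup factor, the PyArrow time divided by the JollyJack time. The sketch below computes it for three rows copied from the table; the tuple layout and variable names are chosen for this illustration only.

```python
# (n_threads, use_threads, pre_buffer, dtype, compression, pyarrow_s, jollyjack_s)
rows = [
    (1, False, False, "float", None, 6.79, 3.55),
    (1, False, False, "halffloat", None, 7.21, 1.23),
    (2, True, True, "float", "snappy", 3.14, 2.55),
]

# Speedup = PyArrow seconds / JollyJack seconds, rounded to two decimals.
speedups = [round(pyarrow_s / jollyjack_s, 2) for *_, pyarrow_s, jollyjack_s in rows]
for row, speedup in zip(rows, speedups):
    print(row[:5], f"{speedup:.2f}x")
# Speedups: 1.91x, 5.86x, 1.23x
```

For these rows the gap is largest for `halffloat` data and smallest when both `use_threads` and `pre_buffer` are enabled.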

Download files

Source Distribution

  • jollyjack-0.11.1.tar.gz (147.6 kB): Source

Built Distributions

  • jollyjack-0.11.1-cp312-cp312-win_amd64.whl (73.1 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.11.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.1-cp311-cp311-win_amd64.whl (73.0 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.11.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.1-cp310-cp310-win_amd64.whl (72.7 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.11.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.1-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.10, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
  • jollyjack-0.11.1-cp39-cp39-win_amd64.whl (72.7 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.11.1-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64
  • jollyjack-0.11.1-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.0 MB): CPython 3.9, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64
