Read parquet data directly into numpy array

JollyJack

Features

  • Reading parquet files directly into numpy arrays and torch tensors (fp16, fp32, fp64)
  • Faster and requiring less memory than vanilla PyArrow
  • Compatibility with PalletJack

Known limitations

  • Data cannot contain null values

Required

  • pyarrow ~= 17.0

JollyJack is built on top of pyarrow, so pyarrow is required both to build and to use JollyJack. The source package is compatible with recent versions of pyarrow, but the binary distribution package requires the latest major version of pyarrow.
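
The `~= 17.0` requirement is a "compatible release" specifier: it accepts any 17.x version at or above 17.0 and rejects other major versions. As a rough illustration, the check can be sketched in plain Python; the function name is hypothetical, and only plain numeric versions such as "17.0.0" are handled.

```python
def satisfies_compatible_release(installed: str, required: str = "17.0") -> bool:
    """Return True if `installed` satisfies the specifier `~= required`.

    Sketch only: pre-release or development suffixes are not supported.
    """
    inst = [int(part) for part in installed.split(".")]
    req = [int(part) for part in required.split(".")]
    # "~= 17.0" means ">= 17.0" and "== 17.*": everything except the last
    # component of the requirement must match exactly.
    prefix = req[:-1]
    # Pad with zeros so that "17" compares like "17.0.0".
    width = max(len(inst), len(req))
    inst_padded = inst + [0] * (width - len(inst))
    req_padded = req + [0] * (width - len(req))
    return inst_padded[:len(prefix)] == prefix and inst_padded >= req_padded


print(satisfies_compatible_release("17.0.0"))  # True
print(satisfies_compatible_release("16.1.0"))  # False
print(satisfies_compatible_release("18.0.0"))  # False
```

The installed version string could be obtained with `importlib.metadata.version("pyarrow")`.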

Installation

pip install jollyjack

How to use:

Generating a sample parquet file:

import jollyjack as jj
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

from pyarrow import fs

chunk_size = 3
n_row_groups = 2
n_columns = 5
n_rows = n_row_groups * chunk_size
path = "my.parquet"

data = np.random.rand(n_rows, n_columns).astype(np.float32)
pa_arrays = [pa.array(data[:, i]) for i in range(n_columns)]
schema = pa.schema([(f'column_{i}', pa.float32()) for i in range(n_columns)])
table = pa.Table.from_arrays(pa_arrays, schema=schema)
pq.write_table(table, path, row_group_size=chunk_size, use_dictionary=False, write_statistics=True, store_schema=False, write_page_index=True)

Generating a numpy array to read into:

# Create a float32 array of zeros in Fortran (column-major) order
np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
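
The `order='F'` argument matters for the memory layout: in Fortran order every column of the array is one contiguous run of memory, which matches the column-oriented layout of parquet data. The short sketch below only uses NumPy to show the difference; it makes no claim about how JollyJack uses the layout internally.

```python
import numpy as np

n_rows, n_columns = 6, 5

f_array = np.zeros((n_rows, n_columns), dtype="f", order="F")
c_array = np.zeros((n_rows, n_columns), dtype="f", order="C")

# In Fortran order the elements of one column are adjacent in memory ...
f_col_contig = bool(f_array[:, 0].flags["C_CONTIGUOUS"])
# ... whereas in C order they are n_columns items apart.
c_col_contig = bool(c_array[:, 0].flags["C_CONTIGUOUS"])

print(f_array.strides, f_col_contig)  # (4, 24) True
print(c_array.strides, c_col_contig)  # (20, 4) False
```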

Reading entire file into numpy array:

pr = pq.ParquetReader()
pr.open(path)

row_begin = 0
row_end = 0

for rg in range(pr.metadata.num_row_groups):
    row_begin = row_end
    row_end = row_begin + pr.metadata.row_group(rg).num_rows

    # To choose which part of the numpy array to read into,
    # create a view that shares memory with the target array.
    subset_view = np_array[row_begin:row_end, :]
    jj.read_into_numpy(source=path,
                       metadata=pr.metadata,
                       np_array=subset_view,
                       row_group_indices=[rg],
                       column_indices=range(pr.metadata.num_columns))

# Alternatively
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=range(pr.metadata.num_columns))
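
The per-row-group loop above keeps a running `row_begin`/`row_end` pair. The same bookkeeping can be sketched as a small helper that turns a list of row counts into destination row bounds; `row_group_bounds` is a hypothetical name, and the input stands in for the `num_rows` values taken from the file metadata.

```python
from itertools import accumulate


def row_group_bounds(rows_per_group):
    """Return (row_begin, row_end) pairs, one per row group."""
    # The end of each group is the cumulative sum of the row counts.
    ends = list(accumulate(rows_per_group))
    # Each group begins where the previous one ended.
    begins = [0] + ends[:-1]
    return list(zip(begins, ends))


# Two row groups of three rows each, as in the sample file.
bounds = row_group_bounds([3, 3])
print(bounds)  # [(0, 3), (3, 6)]
```

Each pair can then be used to slice the target array, for example `np_array[row_begin:row_end, :]`.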

Reading columns in reversed order:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices={i: pr.metadata.num_columns - i - 1
                                       for i in range(pr.metadata.num_columns)})

Reading column 3 into multiple destination columns:

with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=range(pr.metadata.num_row_groups),
                       column_indices=((3, 0), (3, 1)))
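
The examples above pass `column_indices` in three shapes: a plain sequence of indices, a dict, and a sequence of pairs. The sketch below emulates the mapping with plain NumPy, assuming (from the section headings, not from JollyJack's source) that each pair or dict item means (source column, destination column) and that a plain index at position `i` lands in destination column `i`. The function `copy_columns` is hypothetical.

```python
import numpy as np


def copy_columns(src, dst, column_indices):
    """Copy columns of `src` into `dst` according to `column_indices`."""
    if isinstance(column_indices, dict):
        # Assumed meaning: {source_column: destination_column}
        pairs = list(column_indices.items())
    else:
        # Tuples are taken as (source, destination); a plain index is
        # assumed to go to the destination column at the same position.
        pairs = [item if isinstance(item, tuple) else (item, pos)
                 for pos, item in enumerate(column_indices)]
    for source_col, dest_col in pairs:
        dst[:, dest_col] = src[:, source_col]
    return dst


src = np.arange(12, dtype=np.float32).reshape(3, 4)

# Reversed column order, expressed as a dict.
reversed_dst = copy_columns(src, np.zeros_like(src), {i: 4 - i - 1 for i in range(4)})

# Column 3 copied into destination columns 0 and 1.
fanout_dst = copy_columns(src, np.zeros_like(src), ((3, 0), (3, 1)))
```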

Sparse reading:

np_array = np.zeros((n_rows, n_columns), dtype='f', order='F')
with fs.LocalFileSystem().open_input_file(path) as f:
    jj.read_into_numpy(source=f,
                       metadata=None,
                       np_array=np_array,
                       row_group_indices=[0],
                       row_ranges=[slice(0, 1), slice(4, 6)],
                       column_indices=range(pr.metadata.num_columns))
print(np_array)
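
In the sparse example, the two slices cover three rows in total, which equals the number of rows in row group 0 of the sample file. Assuming (this is not stated above) that `row_ranges` names rows of the destination array and must hold exactly the rows being read, a quick sanity check could look like the sketch below; `rows_covered` is a hypothetical helper.

```python
def rows_covered(row_ranges):
    """Return the total number of rows described by a list of slices."""
    # Only slices with explicit start and stop values are handled here.
    return sum(len(range(s.start, s.stop)) for s in row_ranges)


chunk_size = 3  # rows per row group in the sample file
row_ranges = [slice(0, 1), slice(4, 6)]

print(rows_covered(row_ranges))                # 3
print(rows_covered(row_ranges) == chunk_size)  # True
```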

Generating a torch tensor to read into:

import torch
# Create a tensor and transpose it to get Fortran-style order
tensor = torch.zeros(n_columns, n_rows, dtype=torch.float32).transpose(0, 1)
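
The transposition is needed because `torch.zeros` takes no `order` argument: allocating the tensor as `(n_columns, n_rows)` and swapping the axes yields an `(n_rows, n_columns)` view whose columns are contiguous. The sketch below shows the same trick with NumPy (used here only so the example needs no torch installation) and confirms that the result has the same layout as an array created with `order='F'`.

```python
import numpy as np

n_rows, n_columns = 6, 5

# Allocate row-major as (n_columns, n_rows), then swap the axes.
transposed = np.zeros((n_columns, n_rows), dtype=np.float32).T
# Allocate directly in Fortran order for comparison.
fortran = np.zeros((n_rows, n_columns), dtype=np.float32, order="F")

print(transposed.shape, transposed.strides)  # (6, 5) (4, 24)
print(fortran.shape, fortran.strides)        # (6, 5) (4, 24)
```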

Reading entire file into the tensor:

pr = pq.ParquetReader()
pr.open(path)

jj.read_into_torch(source=path,
                   metadata=pr.metadata,
                   tensor=tensor,
                   row_group_indices=range(pr.metadata.num_row_groups),
                   column_indices=range(pr.metadata.num_columns),
                   pre_buffer=True,
                   use_threads=True)

print(tensor)

Benchmarks:

n_threads  use_threads  pre_buffer  dtype      compression  PyArrow  JollyJack
1          False        False       float      None         6.79s    3.55s
1          True         False       float      None         5.17s    2.32s
1          False        True        float      None         5.54s    2.76s
1          True         True        float      None         3.98s    2.66s
2          False        False       float      None         4.63s    2.33s
2          True         False       float      None         3.89s    2.36s
2          False        True        float      None         4.19s    2.61s
2          True         True        float      None         3.36s    2.39s
1          False        False       float      snappy       7.00s    3.56s
1          True         False       float      snappy       5.21s    2.23s
1          False        True        float      snappy       5.22s    3.30s
1          True         True        float      snappy       3.73s    2.84s
2          False        False       float      snappy       4.43s    2.49s
2          True         False       float      snappy       3.40s    2.42s
2          False        True        float      snappy       4.07s    2.63s
2          True         True        float      snappy       3.14s    2.55s
1          False        False       halffloat  None         7.21s    1.23s
1          True         False       halffloat  None         3.53s    0.71s
1          False        True        halffloat  None         7.43s    1.96s
1          True         True        halffloat  None         4.04s    1.52s
2          False        False       halffloat  None         3.84s    0.64s
2          True         False       halffloat  None         3.11s    0.57s
2          False        True        halffloat  None         4.07s    1.17s
2          True         True        halffloat  None         3.39s    1.14s
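
To read the table as speedups rather than raw timings, divide the PyArrow time by the JollyJack time. The sketch below does this for three rows copied from the table; the tuple layout and the `speedup` helper are illustrative only.

```python
# (dtype, compression, pyarrow_seconds, jollyjack_seconds), copied from the table
rows = [
    ("float", None, 6.79, 3.55),      # n_threads=1, use_threads=False, pre_buffer=False
    ("float", "snappy", 3.14, 2.55),  # n_threads=2, use_threads=True, pre_buffer=True
    ("halffloat", None, 7.21, 1.23),  # n_threads=1, use_threads=False, pre_buffer=False
]


def speedup(pyarrow_seconds, jollyjack_seconds):
    """Return how many times faster JollyJack was than PyArrow."""
    return pyarrow_seconds / jollyjack_seconds


for dtype, compression, pa_s, jj_s in rows:
    print(f"{dtype:<9} {compression!s:<6} {speedup(pa_s, jj_s):.2f}x")
# float     None   1.91x
# float     snappy 1.23x
# halffloat None   5.86x
```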

Download files

Source Distribution

  • jollyjack-0.13.0.tar.gz (157.5 kB)

Built Distributions

  • jollyjack-0.13.0-cp312-cp312-win_amd64.whl (80.1 kB): CPython 3.12, Windows x86-64
  • jollyjack-0.13.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.12, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.13.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.12, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.13.0-cp311-cp311-win_amd64.whl (79.9 kB): CPython 3.11, Windows x86-64
  • jollyjack-0.13.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.2 MB): CPython 3.11, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.13.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.11, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.13.0-cp310-cp310-win_amd64.whl (79.4 kB): CPython 3.10, Windows x86-64
  • jollyjack-0.13.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.10, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.13.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.10, manylinux glibc 2.24+ and 2.28+, ARM64
  • jollyjack-0.13.0-cp39-cp39-win_amd64.whl (79.4 kB): CPython 3.9, Windows x86-64
  • jollyjack-0.13.0-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB): CPython 3.9, manylinux glibc 2.24+ and 2.28+, x86-64
  • jollyjack-0.13.0-cp39-cp39-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.1 MB): CPython 3.9, manylinux glibc 2.24+ and 2.28+, ARM64
