Parallel Offset Virtual-block Shuffle: uniform, zero-copy shuffle for large datasets, GPU-accelerated with CUDA and PyTorch interface.

These details have not been verified by PyPI

Project description

POV-Shuffle

Parallel Offset Virtual-block Shuffle

A fully parallel algorithm for efficiently shuffling large datasets without copy, while sufficiently approximating a uniform shuffle within few iterations.

Usage

import povs

# A numpy array or PyTorch tensor on CPU or CUDA device
dataset = ...

# Shuffle the dataset in place
povs.shuffle(
    dataset,  # A numpy array or PyTorch tensor on CPU or CUDA device
    iterations=3,
    options=povs.Options(virtual_block_size=2),  # Optional: specify full or partial algorithm options.
                                                 # Unspecified parameters are chosen automatically.
)

For more details see help(povs.shuffle) and help(povs.optim_options_for_dataset).

Installation

uv add pov-shuffle

Build Customization

Because of optimizations within the CUDA extension, the possible combinations of certain parameters need to be known at build-time.

See setup.py for the list of environment variables that can be set at build-time in order to add support for different parameter sets.

In order to set environment variables consistently across builds we recommend using UV's extra-build-variables. Example:

# pyproject.toml
[tool.uv.extra-build-variables.pov-shuffle]
POVS_CUDA_INSTANTIATIONS = "4,32,64,float;2,64,16,int"  # Add support for 4x32x64xFloat and 2x64x16xInt (VBlockSize x PBlockSize x InstanceSize x DType)
POVS_CUDA_INSTANCE_SIZES = "32,96,128"  # Add support for these instance sizes, in cartesian combination with supported dtypes and supported values of other parameters.

Build parameters can be queried at runtime via povs.get_build_params()

Performance

Shuffle Time

Shuffle time per deck size with 4-iterations POV-shuffle on instances of shape (128 x float16), using the NVIDIA Ada Lovelace architecture.

For a fair comparison with the algorithm, which offers close-to-uniform, zero-copy, in-place shuffling, the baseline used here is numpy.shuffle, which is also a uniform shuffle (Fisher-Yates), performed in-place and without copy.

Bias Convergence

Positional TVD bias, N-gram TVD biases and LSTM predictability bias measured for the baseline numpy.shuffle and for increasing iterations of the POV-shuffle, on a dataset of 1k distinct instances and estimating the event distributions from the observation of 3k independent shuffling episodes.

Breaking Point

Minimum number of POV-shuffle iterations required for each bias metric to converge to the observed value of the respective metric for the baseline numpy.shuffle, as a function of the deck size.

Algorithm

How it works

Partition the array into physical blocks of a specified size.
For each iteration:
- Pick a random offset, so every block start is shifted from its original position, with the rightmost blocks wrapping around the array.
- Randomly assign each few physical blocks to a virtual block, so every virtual block is contiguous by parts.
- Each worker thread shuffles its assigned virtual block in place, using a standard shuffle algorithm (e.g. Fisher-Yates).

Because there is no overlap between virtual blocks, the algorithm can be fully parallelized without facing race conditions, and doesn't require a temporary copy of the whole dataset to perform the shuffle in place.

Compared to a traditional local-block shuffle, the virtual block assignment significantly reduces positional bias, while the random offset prevents the occurrence of shuffle artifacts from the physical block boundaries.

When applied to higher rank tensors, the shuffle happens along the axis 0, with each indexable multidimensional object along that axis being treated as a flat 1D instance (e.g. for a tensor with shape (I, M, N), we shuffle the I instances, each instance being a (M, N) matrix treated as an array of length M*N).

Trade-offs

Block Size:
- Larger blocks (both physical and virtual) increase the portion of the data that needs to be loaded into each worker at each iteration.
- On the other hand, smaller physical blocks increase the total number of physical blocks, so the host program has to do more non-parallel shuffles when randomly assigning them to virtual blocks.
- Therefore, as a rule of thumb one should use larger blocks to shorten the time per iteration, or smaller blocks if the priority is reducing the data transfer to workers.
- Remarkably, so far we have observed little impact of this parameter on the amount of iterations needed for shuffle convergence.

Diagrams

Algorithm flowchart:

flowchart TD
  A([Array]) --> B["Partition into physical blocks\n(size = physical_block_size)"]
  B --> iter

  subgraph iter ["For each iteration"]
      direction TB                                                                                                                                                                                                         
      C[/"Pick a random offset\n(not a multiple of block size)"/]                                                                                          
      --> D["Randomly assign physical blocks\nto virtual blocks\n(virtual_block_size blocks each)"]
      --> par

      subgraph par ["In parallel — one worker per virtual block"]
          direction LR
          E["Read assigned physical blocks\nat +offset  (wrapping around array)"]
          --> F["Shuffle in local memory"]
          --> G["Write back to\noriginal positions"]
      end
  end
                                                                                                                                                           
  iter --> H([Done])

Sequence diagram with 2 workers shuffling 4 physical blocks (2 virtual blocks) in parallel:

sequenceDiagram
  participant A as Array
  participant W0 as Worker 0                                                                                                                                
  participant W1 as Worker 1

  Note over A,W1: Pick offset · Randomly assign virtual blocks

  par VB0
      W0->>A: Read blocks B1, B3 at +offset
      A-->>W0: local copy
      W0->>W0: shuffle
      W0->>A: Write back to B1, B3
  and VB1
      W1->>A: Read blocks B0, B2 at +offset
      A-->>W1: local copy
      W1->>W1: shuffle
      W1->>A: Write back to B0, B2
  end

  Note over A,W1: Non-overlapping virtual blocks — no race conditions

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.4

Jun 22, 2026

This version

0.1.3

Jun 22, 2026

0.1.2

Jun 22, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pov_shuffle-0.1.3.tar.gz (60.6 kB view details)

Uploaded Jun 22, 2026 Source

File details

Details for the file pov_shuffle-0.1.3.tar.gz.

File metadata

Download URL: pov_shuffle-0.1.3.tar.gz
Upload date: Jun 22, 2026
Size: 60.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pov_shuffle-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`55c01b5d4daece1e097290e7a3360202b61a5ed1d68f102237d053c3c5b4377f`
MD5	`721e785725cbc9a34cc6f7eea36da1af`
BLAKE2b-256	`c3ea125807ea30a94e0477916da80fb8ea205bd18449151e1c1b67fbfb41058f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pov_shuffle-0.1.3.tar.gz:

Publisher: release.yml on lariel-fernandes/pov-shuffle

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pov_shuffle-0.1.3.tar.gz
- Subject digest: 55c01b5d4daece1e097290e7a3360202b61a5ed1d68f102237d053c3c5b4377f
- Sigstore transparency entry: 1917167671
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: lariel-fernandes/pov-shuffle@4eca8cf755000e8dd2cfaa18098aaa646b432424
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/lariel-fernandes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4eca8cf755000e8dd2cfaa18098aaa646b432424
- Trigger Event: push

pov-shuffle 0.1.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

POV-Shuffle

Usage

Installation

Build Customization

Performance

Shuffle Time

Bias Convergence

Breaking Point

Algorithm

How it works

Trade-offs

Diagrams

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

Provenance