Skip to main content

Parallel Offset Virtual-block Shuffle: uniform, zero-copy shuffle for large datasets, GPU-accelerated with CUDA and PyTorch interface.

Project description

POV-Shuffle

Parallel Offset Virtual-block Shuffle

A fully parallel algorithm for efficiently shuffling large datasets without copy, while sufficiently approximating a uniform shuffle within few iterations.

Usage

import povs

# A numpy array or PyTorch tensor on CPU or CUDA device
dataset = ...

# Shuffle the dataset in place
povs.shuffle(
    dataset,  # A numpy array or PyTorch tensor on CPU or CUDA device
    iterations=3,
    options=povs.Options(virtual_block_size=2),  # Optional: specify full or partial algorithm options.
                                                 # Unspecified parameters are chosen automatically.
)
  • For more details see help(povs.shuffle) and help(povs.optim_options_for_dataset).

Installation

uv add pov-shuffle

Build Customization

Because of optimizations within the CUDA extension, the possible combinations of certain parameters need to be known at build-time.

See setup.py for the list of environment variables that can be set at build-time in order to add support for different parameter sets.

In order to set environment variables consistently across builds we recommend using UV's extra-build-variables. Example:

# pyproject.toml
[tool.uv.extra-build-variables.pov-shuffle]
POVS_CUDA_INSTANTIATIONS = "4,32,64,float;2,64,16,int"  # Add support for 4x32x64xFloat and 2x64x16xInt (VBlockSize x PBlockSize x InstanceSize x DType)
POVS_CUDA_INSTANCE_SIZES = "32,96,128"  # Add support for these instance sizes, in cartesian combination with supported dtypes and supported values of other parameters.
  • Build parameters can be queried at runtime via povs.get_build_params()

Performance

Shuffle Time

Shuffle time per deck size with 4-iterations POV-shuffle on instances of shape (128 x float16), using the NVIDIA Ada Lovelace architecture.

For a fair comparison with the algorithm, which offers close-to-uniform, zero-copy, in-place shuffling, the baseline used here is numpy.shuffle, which is also a uniform shuffle (Fisher-Yates), performed in-place and without copy.

alt text

Bias Convergence

Positional TVD bias, N-gram TVD biases and LSTM predictability bias measured for the baseline numpy.shuffle and for increasing iterations of the POV-shuffle, on a dataset of 1k distinct instances and estimating the event distributions from the observation of 3k independent shuffling episodes.

alt text

Breaking Point

Minimum number of POV-shuffle iterations required for each bias metric to converge to the observed value of the respective metric for the baseline numpy.shuffle, as a function of the deck size.

alt text

Algorithm

How it works

  1. Partition the array into physical blocks of a specified size.
  2. For each iteration:
    • Pick a random offset, so every block start is shifted from its original position, with the rightmost blocks wrapping around the array.
    • Randomly assign each few physical blocks to a virtual block, so every virtual block is contiguous by parts.
    • Each worker thread shuffles its assigned virtual block in place, using a standard shuffle algorithm (e.g. Fisher-Yates).

Because there is no overlap between virtual blocks, the algorithm can be fully parallelized without facing race conditions, and doesn't require a temporary copy of the whole dataset to perform the shuffle in place.

Compared to a traditional local-block shuffle, the virtual block assignment significantly reduces positional bias, while the random offset prevents the occurrence of shuffle artifacts from the physical block boundaries.

When applied to higher rank tensors, the shuffle happens along the axis 0, with each indexable multidimensional object along that axis being treated as a flat 1D instance (e.g. for a tensor with shape (I, M, N), we shuffle the I instances, each instance being a (M, N) matrix treated as an array of length M*N).

Trade-offs

  • Block Size:
    • Larger blocks (both physical and virtual) increase the portion of the data that needs to be loaded into each worker at each iteration.
    • On the other hand, smaller physical blocks increase the total number of physical blocks, so the host program has to do more non-parallel shuffles when randomly assigning them to virtual blocks.
    • Therefore, as a rule of thumb one should use larger blocks to shorten the time per iteration, or smaller blocks if the priority is reducing the data transfer to workers.
    • Remarkably, so far we have observed little impact of this parameter on the amount of iterations needed for shuffle convergence.

Diagrams

Algorithm flowchart:

flowchart TD
  A([Array]) --> B["Partition into physical blocks\n(size = physical_block_size)"]
  B --> iter

  subgraph iter ["For each iteration"]
      direction TB                                                                                                                                                                                                         
      C[/"Pick a random offset\n(not a multiple of block size)"/]                                                                                          
      --> D["Randomly assign physical blocks\nto virtual blocks\n(virtual_block_size blocks each)"]
      --> par

      subgraph par ["In parallel — one worker per virtual block"]
          direction LR
          E["Read assigned physical blocks\nat +offset  (wrapping around array)"]
          --> F["Shuffle in local memory"]
          --> G["Write back to\noriginal positions"]
      end
  end
                                                                                                                                                           
  iter --> H([Done])

Sequence diagram with 2 workers shuffling 4 physical blocks (2 virtual blocks) in parallel:

sequenceDiagram
  participant A as Array
  participant W0 as Worker 0                                                                                                                                
  participant W1 as Worker 1

  Note over A,W1: Pick offset · Randomly assign virtual blocks

  par VB0
      W0->>A: Read blocks B1, B3 at +offset
      A-->>W0: local copy
      W0->>W0: shuffle
      W0->>A: Write back to B1, B3
  and VB1
      W1->>A: Read blocks B0, B2 at +offset
      A-->>W1: local copy
      W1->>W1: shuffle
      W1->>A: Write back to B0, B2
  end

  Note over A,W1: Non-overlapping virtual blocks — no race conditions

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pov_shuffle-0.1.3.tar.gz (60.6 kB view details)

Uploaded Source

File details

Details for the file pov_shuffle-0.1.3.tar.gz.

File metadata

  • Download URL: pov_shuffle-0.1.3.tar.gz
  • Upload date:
  • Size: 60.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pov_shuffle-0.1.3.tar.gz
Algorithm Hash digest
SHA256 55c01b5d4daece1e097290e7a3360202b61a5ed1d68f102237d053c3c5b4377f
MD5 721e785725cbc9a34cc6f7eea36da1af
BLAKE2b-256 c3ea125807ea30a94e0477916da80fb8ea205bd18449151e1c1b67fbfb41058f

See more details on using hashes here.

Provenance

The following attestation bundles were made for pov_shuffle-0.1.3.tar.gz:

Publisher: release.yml on lariel-fernandes/pov-shuffle

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page