Skip to main content

Efficient O(1)-space pseudorandom permutation; high speed shuffling for large data

Project description

smallperm

PyPI

Small library to generate permutations of a list of elements using pseudo-random permutations (PRP). Uses O(1) memory and O(1) time to generate the next element of the permutation.

>>> from smallperm import PseudoRandomPermutation
>>> list(PseudoRandomPermutation(42, 0xDEADBEEF))
[30, 11, 23, 21, 39, 9, 26, 5, 27, 38, 15, 37, 31, 35, 6, 13, 34, 10, 7, 0, 12, 22, 33, 17, 41, 29, 18, 20, 3, 40, 25, 4, 19, 24, 32, 16, 36, 14, 1, 28, 2, 8]

Motivation

In ML training, it is common to see things like

# Offline Shuffle
import numpy as np
sample_indices = np.arange(1_000_000)
np.random.shuffle(sample_indices)
for i in sample_indices:
    # do something with i
    ...

Or to do Fisher-Yates online

# Online Shuffle
import numpy as np
N = 1_000_000
sample_indices = np.arange(N)
for i in range(N):
    j = np.random.randint(i, N)
    sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
    # do something with sample_indices[i]
    ...

The problem with either of these approaches is that they require O(n) memory to store the shuffled indices, and offline shuffle has a bad "time-to-first-sample" problem when we approach the scale of one billion data points. This library provides a way to generate a permutation of [0, n) using O(1) memory and O(1) time.

# Of course... first install us
pip install smallperm
import numpy as np
from smallperm import PseudoRandomPermutation as PRP
N = 1_000_000
prp = PRP(N, np.random.randint(0, np.iinfo(np.int64).max+1)) # O(1) time generates the permutation
print(prp[0], prp[50]) # We support O(1) random indexing, just like an array
assert 50 == prp.backward(prp[50]) # We support O(1) backward mapping

for ix in prp:
    # do something with ix
    ...

For most ML use cases this should be Pareto optimal: it is faster than Fisher-Yates, uses much less memory, and has a much better time-to-first-sample than offline shuffle. In other words, we used O(1) time and O(1) space to generate arr = np.arange(N); np.random.shuffle(arr), kind of magical, at the slight cost of some shuffling quality, but hey, in ML training when we constantly have > 1M data points it's not like our PRNG keys can represent the entire space of permutations anyway.

API

  • Initialization: PseudoRandomPermutation(length: int, seed: int)

    • Generates a permutation of [0, length) using seed. We impose no restriction on length (except it fits under an unsigned 128-bit integer).
  • Usage: Iterate over the instance to get the next element of the permutation.

    • Example: list(PseudoRandomPermutation(42, 0xDEADBEEF))
  • O(1) forward/backward mapping:

    • forward(i: int) -> int: Returns the i-th element of the permutation (regardless of the current state of the iterator).
    • backward(el: int) -> int: Returns the index of el in the permutation.

Features

  1. Hard-ware independent (i.e., reproducible across different machines, with the same seed) shuffling. This repo, barring major bugs, will not change the permutation generated by a given seed (in which case we will do major version bump).
  2. Extremely fast. On my MBP iterating through the array is only 2x-3x slower than iterating throw a arange(N) array.

How

We use a (somewhat) weak albeit fast symmetric cipher to generate the permutation. The resulting shuffle quality is not as high as Fisher-Yates shuffle, but it is extremely efficient. Compared to Fisher-Yates, we use O(1) memory (as opposed to O(n), n the length of the shuffle); fix $\sigma$ a permutation (i.e., PseudoRandomPermutation(n, seed)) which maps ${0, 1, \ldots, n-1}$ to itself, we have $O(1)$ $\sigma(x)$ and $\sigma^{-1}(y)$, which can be very desirable properties in distributed ML training.

More examples

Shuffle non [0, n) sequences

PRP is a primitive yet powerful object, and can be composed.

from smallperm import PseudoRandomPermutation as PRP
from itertools import product

deck = list(product(range(4), range(13)))
prp = PRP(len(deck), 0xDEADBEEF)
shuffled_deck = [deck[i] for i in prp]

Replacing two random.shuffle

It is common to have two random.shuffle when there is a global shuffle and there is a "within-shard" shuffle.

from smallperm import PseudoRandomPermutation as PRP

seeds = (42, 0)
N = 1_000_000
local_indices = range(2, N, 8) # e.g., locally yield every 8th element
local_n = len(local_indices)

global_shuffle = PRP(N, seeds[0])
local_shuffle = PRP(len(local_indices), seeds[1])

for i in range(local_n):
  print(global_shuffle[local_indices[local_shuffle[i]]])
  # Compare and contrast with print(global_shuffle[local_indices[i]]),
  #   esp. with multiple epochs.

Infinite PRP with new shuffles per epoch

from itertools import chain, count
import numpy as np

N = 10_000_000
rng = np.random.default_rng(0xDEADBEEF)

uint32_upper_bound = 0x100000000

infinite_prp = chain.from_iterable(PRP(N, rng.integers(0, uint32_upper_bound)) for _ in count())

Acknowledgements

Gratefully modifies and reuses code from https://github.com/asimihsan/permutation-iterator-rs which does most of the heavy lifting. Because the heavy lifting is done in Rust, this library is very efficient.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smallperm-0.1.12.tar.gz (23.0 kB view hashes)

Uploaded Source

Built Distributions

smallperm-0.1.12-cp37-abi3-win_amd64.whl (131.1 kB view hashes)

Uploaded CPython 3.7+ Windows x86-64

smallperm-0.1.12-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (264.0 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ x86-64

smallperm-0.1.12-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (455.9 kB view hashes)

Uploaded CPython 3.7+ macOS 10.12+ universal2 (ARM64, x86-64) macOS 10.12+ x86-64 macOS 11.0+ ARM64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page