Efficient O(1)-space pseudorandom permutation; high speed shuffling for large data
Project description
smallperm
Small library to generate permutations of a list of elements using pseudo-random permutations (PRP). Uses O(1)
memory and O(1)
time to generate the next element of the permutation.
>>> from smallperm import PseudoRandomPermutation
>>> list(PseudoRandomPermutation(42, 0xDEADBEEF))
[30, 11, 23, 21, 39, 9, 26, 5, 27, 38, 15, 37, 31, 35, 6, 13, 34, 10, 7, 0, 12, 22, 33, 17, 41, 29, 18, 20, 3, 40, 25, 4, 19, 24, 32, 16, 36, 14, 1, 28, 2, 8]
Motivation
In ML training, it is common to see things like
# Offline Shuffle
import numpy as np
sample_indices = np.arange(1_000_000)
np.random.shuffle(sample_indices)
for i in sample_indices:
# do something with i
...
Or to do Fisher-Yates online
# Online Shuffle
import numpy as np
N = 1_000_000
sample_indices = np.arange(N)
for i in range(N):
j = np.random.randint(i, N)
sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
# do something with sample_indices[i]
...
The problem with either of these approaches is that they require O(n)
memory to store the shuffled indices, and offline shuffle has a bad "time-to-first-sample" problem when we approach the scale of one billion data points. This library provides a way to generate a permutation of [0, n)
using O(1)
memory and O(1)
time.
# Of course... first install us
pip install smallperm
import numpy as np
from smallperm import PseudoRandomPermutation as PRP
N = 1_000_000
prp = PRP(N, np.random.randint(0, np.iinfo(np.int64).max+1)) # O(1) time generates the permutation
print(prp[0], prp[50]) # We support O(1) random indexing, just like an array
assert 50 == prp.backward(prp[50]) # We support O(1) backward mapping
for ix in prp:
# do something with ix
...
For most ML use cases this should be Pareto optimal: it is faster than Fisher-Yates, uses much less memory, and has a much better time-to-first-sample than offline shuffle. In other words, we used O(1)
time and O(1)
space to generate arr = np.arange(N); np.random.shuffle(arr)
, kind of magical, at the slight cost of some shuffling quality, but hey, in ML training when we constantly have > 1M data points it's not like our PRNG keys can represent the entire space of permutations anyway.
API
-
Initialization:
PseudoRandomPermutation(length: int, seed: int)
- Generates a permutation of
[0, length)
usingseed
. We impose no restriction onlength
(except it fits under an unsigned 128-bit integer).
- Generates a permutation of
-
Usage: Iterate over the instance to get the next element of the permutation.
- Example:
list(PseudoRandomPermutation(42, 0xDEADBEEF))
- Example:
-
O(1) forward/backward mapping:
forward(i: int) -> int
: Returns thei
-th element of the permutation (regardless of the current state of the iterator).backward(el: int) -> int
: Returns the index ofel
in the permutation.
Features
- Hard-ware independent (i.e., reproducible across different machines, with the same seed) shuffling. This repo, barring major bugs, will not change the permutation generated by a given seed (in which case we will do major version bump).
- Extremely fast. On my MBP iterating through the array is only 2x-3x slower than iterating throw a
arange(N)
array.
How
We use a (somewhat) weak albeit fast symmetric cipher to generate the permutation. The resulting shuffle quality is not as high as Fisher-Yates shuffle, but it is extremely efficient. Compared to Fisher-Yates, we use O(1)
memory (as opposed to O(n)
, n
the length of the shuffle); fix $\sigma$ a permutation (i.e., PseudoRandomPermutation(n, seed)
) which maps ${0, 1, \ldots, n-1}$ to itself, we have $O(1)$ $\sigma(x)$ and $\sigma^{-1}(y)$, which can be very desirable properties in distributed ML training.
More examples
Shuffle non [0, n)
sequences
PRP
is a primitive yet powerful object, and can be composed.
from smallperm import PseudoRandomPermutation as PRP
from itertools import product
deck = list(product(range(4), range(13)))
prp = PRP(len(deck), 0xDEADBEEF)
shuffled_deck = [deck[i] for i in prp]
Replacing two random.shuffle
It is common to have two random.shuffle
when there is a global shuffle and there is a "within-shard" shuffle.
from smallperm import PseudoRandomPermutation as PRP
seeds = (42, 0)
N = 1_000_000
local_indices = range(2, N, 8) # e.g., locally yield every 8th element
local_n = len(local_indices)
global_shuffle = PRP(N, seeds[0])
local_shuffle = PRP(len(local_indices), seeds[1])
for i in range(local_n):
print(global_shuffle[local_indices[local_shuffle[i]]])
# Compare and contrast with print(global_shuffle[local_indices[i]]),
# esp. with multiple epochs.
Infinite PRP with new shuffles per epoch
from itertools import chain, count
import numpy as np
N = 10_000_000
rng = np.random.default_rng(0xDEADBEEF)
uint32_upper_bound = 0x100000000
infinite_prp = chain.from_iterable(PRP(N, rng.integers(0, uint32_upper_bound)) for _ in count())
Acknowledgements
Gratefully modifies and reuses code from https://github.com/asimihsan/permutation-iterator-rs which does most of the heavy lifting. Because the heavy lifting is done in Rust, this library is very efficient.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for smallperm-0.1.12-cp37-abi3-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 562b627125556f9f6bc3c5b12de3c0ad34c118f6368245c8eeea109c2eb30eac |
|
MD5 | 75e16afd3c6f125784ade2d40e96b3ab |
|
BLAKE2b-256 | 604e3ada1195c209d8e9301ee2d752194f5b8b7a29de97d641a311efe9e039e5 |
Hashes for smallperm-0.1.12-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 79fa4ad15986a934b2b8eeef39268520b8285e85ef241b66eee670357a4f5d22 |
|
MD5 | 903d9125314f01c9a30af5b74e47ad69 |
|
BLAKE2b-256 | fb6f23075e2e07543493c80d1299c9dd774aaa2546988521261bf5e88d54d03a |
Hashes for smallperm-0.1.12-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 09db2d53a940b6aafefa83accfecbcf0563177ba6d916ea1b597bb2e839a7974 |
|
MD5 | cc7b47ee65f3ce885972df32c0a50820 |
|
BLAKE2b-256 | e8b3ef38bdfaefded2d5899d7a569a994a83507d2d5bc73d60cf07a6dfbbd090 |