Efficient O(1)-space pseudorandom permutation; no more random.shuffle or Fisher-Yates
smallperm
A small library that generates permutations of a list of elements using pseudorandom permutations (PRPs). It uses O(1) memory and O(1) time to generate the next element of the permutation.
>>> from smallperm import PseudoRandomPermutation
>>> list(PseudoRandomPermutation(42, 0xDEADBEEF))
[30, 11, 23, 21, 39, 9, 26, 5, 27, 38, 15, 37, 31, 35, 6, 13, 34, 10, 7, 0, 12, 22, 33, 17, 41, 29, 18, 20, 3, 40, 25, 4, 19, 24, 32, 16, 36, 14, 1, 28, 2, 8]
Motivation
In ML training, it is common to see code like:
# Offline Shuffle
import numpy as np
sample_indices = np.arange(1_000_000)
np.random.shuffle(sample_indices)
for i in sample_indices:
    # do something with i
    ...
Or an online Fisher-Yates shuffle:
# Online Shuffle
import numpy as np
N = 1_000_000
sample_indices = np.arange(N)
for i in range(N):
    j = np.random.randint(i, N)
    sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
    # do something with sample_indices[i]
    ...
The problem with both approaches is that they require O(n) memory to store the shuffled indices, and the offline shuffle also has a bad "time-to-first-sample" problem as we approach the scale of one billion data points. This library provides a way to generate a permutation of [0, n) using O(1) memory and O(1) time per element.
# Of course... first install us
pip install smallperm
import numpy as np
from smallperm import PseudoRandomPermutation as PRP
N = 1_000_000
prp = PRP(N, np.random.randint(0, np.iinfo(np.int64).max+1)) # O(1) time generates the permutation
print(prp[0], prp[50]) # We support O(1) random indexing, just like an array
assert 50 == prp.backward(prp[50]) # We support O(1) backward mapping
for ix in prp:
    # do something with ix
    ...
For most ML use cases this should be Pareto optimal: it is faster than Fisher-Yates, uses much less memory, and has a much better time-to-first-sample than offline shuffle. In other words, we use O(1) time and O(1) space to get the equivalent of arr = np.arange(N); np.random.shuffle(arr). This is kind of magical, and it comes at the slight cost of some shuffling quality. But in ML training, where we routinely have more than 1M data points, our PRNG keys cannot represent the entire space of permutations anyway.
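To put a rough number on that last point, the following sketch computes how many bits would be needed to index every permutation of one million items, i.e. log2(n!). The helper name permutation_bits is hypothetical, and the 64-bit seed size is an assumption based on the int64 seed drawn in the example above.

```python
import math


def permutation_bits(n: int) -> float:
    """Return log2(n!), the number of bits needed to index every permutation of n items."""
    # lgamma(n + 1) equals ln(n!), so dividing by ln(2) converts it to bits.
    return math.lgamma(n + 1) / math.log(2)


bits_needed = permutation_bits(1_000_000)
seed_bits = 64  # assumption: a seed drawn from the int64 range, as in the example above

print(f"log2(1_000_000!) is roughly {bits_needed:,.0f} bits")
print(f"a {seed_bits}-bit seed can select at most 2**{seed_bits} distinct permutations")
```

The result is on the order of 18 million bits, so any seed-driven shuffle (including one backed by a conventional PRNG) can only reach a tiny fraction of all possible orderings at this scale.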
API
- Initialization: PseudoRandomPermutation(length: int, seed: int)
  - Generates a permutation of [0, length) using seed. We impose no restriction on length (except that it fits in an unsigned 128-bit integer).
- Usage: Iterate over the instance to get the next element of the permutation.
  - Example: list(PseudoRandomPermutation(42, 0xDEADBEEF))
- O(1) forward/backward mapping:
  - forward(i: int) -> int: Returns the i-th element of the permutation (regardless of the current state of the iterator).
  - backward(el: int) -> int: Returns the index of el in the permutation.
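Because forward does not depend on iterator state, several workers can each walk a disjoint slice of positions and still cover the whole permutation between them. The sketch below illustrates that pattern. To stay self-contained it uses a hypothetical stand-in class, AffinePermutation, which only mimics the forward/backward interface described above with a much weaker permutation; the shard_indices helper and its rank/world_size parameters are also assumptions, not part of smallperm.

```python
from math import gcd


class AffinePermutation:
    """Hypothetical stand-in that mimics forward/backward; NOT smallperm's algorithm."""

    def __init__(self, length: int, multiplier: int, offset: int):
        if gcd(multiplier, length) != 1:
            raise ValueError("multiplier must be coprime with length")
        self.length = length
        self.a = multiplier % length
        self.b = offset % length
        self.a_inv = pow(self.a, -1, length)  # modular inverse (Python 3.8+)

    def forward(self, i: int) -> int:
        """Return the element at position i of the permutation."""
        return (self.a * i + self.b) % self.length

    def backward(self, el: int) -> int:
        """Return the position at which el appears in the permutation."""
        return ((el - self.b) * self.a_inv) % self.length


def shard_indices(perm, length: int, rank: int, world_size: int):
    """Yield the sample indices for worker `rank`, striding over positions."""
    for position in range(rank, length, world_size):
        yield perm.forward(position)


perm = AffinePermutation(length=10, multiplier=3, offset=7)
shards = [list(shard_indices(perm, 10, rank, 3)) for rank in range(3)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

With the real library, the same striding idea would presumably apply by passing a PseudoRandomPermutation built from a shared length and seed on every worker, so that no index list needs to be exchanged.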
Features
- Hardware-independent shuffling (i.e., reproducible across different machines with the same seed). Barring major bugs, this repo will not change the permutation generated by a given seed (if it ever has to, we will do a major version bump).
- Extremely fast. On my MBP, iterating through the permutation is only 2x-3x slower than iterating through an arange(N) array.
How
We use a (somewhat) weak but fast symmetric cipher to generate the permutation. The resulting shuffle quality is not as high as that of a Fisher-Yates shuffle, but it is extremely efficient. Compared to Fisher-Yates, we use O(1) memory (as opposed to O(n), where n is the length of the shuffle). Moreover, if $\sigma$ is the permutation (i.e., PseudoRandomPermutation(n, seed)) that maps $\{0, 1, \ldots, n-1\}$ to itself, then both $\sigma(x)$ and $\sigma^{-1}(y)$ can be computed in $O(1)$ time, which can be very desirable in distributed ML training.
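One common way to turn a keyed cipher into a permutation of [0, n) is a Feistel network over the next power-of-two domain, combined with cycle-walking to stay inside the range. The pure-Python sketch below illustrates that general idea only. It is an assumption made for illustration, not the library's actual Rust code, and it will not reproduce smallperm's outputs. The class name FeistelPermutation, the BLAKE2b-based round function, and the round count are all hypothetical choices.

```python
import hashlib


class FeistelPermutation:
    """Illustrative permutation of range(n): Feistel network plus cycle-walking."""

    def __init__(self, n: int, seed: int, rounds: int = 4):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.seed = seed
        self.rounds = rounds
        # Use an even bit width so the block splits into two equal halves;
        # the cipher domain 2**bits then covers n and is at most 4 * n.
        bits = max(2, (n - 1).bit_length())
        bits += bits % 2
        self.half = bits // 2
        self.mask = (1 << self.half) - 1

    def _round(self, r: int, value: int) -> int:
        """Keyed round function (hypothetical choice: a truncated BLAKE2b hash)."""
        data = f"{self.seed}:{r}:{value}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") & self.mask

    def _encrypt(self, x: int) -> int:
        left, right = x >> self.half, x & self.mask
        for r in range(self.rounds):
            left, right = right, left ^ self._round(r, right)
        return (left << self.half) | right

    def _decrypt(self, y: int) -> int:
        left, right = y >> self.half, y & self.mask
        for r in reversed(range(self.rounds)):
            left, right = right ^ self._round(r, left), left
        return (left << self.half) | right

    def forward(self, i: int) -> int:
        """Return the i-th element, re-encrypting until the value lands in [0, n)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        y = self._encrypt(i)
        while y >= self.n:  # cycle-walking: a constant number of steps on average
            y = self._encrypt(y)
        return y

    def backward(self, el: int) -> int:
        """Return the index of el by walking the same cycle in reverse."""
        if not 0 <= el < self.n:
            raise IndexError(el)
        x = self._decrypt(el)
        while x >= self.n:
            x = self._decrypt(x)
        return x

    def __len__(self) -> int:
        return self.n

    def __iter__(self):
        # Only the loop counter is kept; no O(n) index array is ever built.
        return (self.forward(i) for i in range(self.n))


sketch = FeistelPermutation(42, 0xDEADBEEF)
shuffled = list(sketch)
print(shuffled)
```

In this sketch the only state is the seed and a few integers, which is where the O(1) memory comes from, and the inverse mapping falls out of running the rounds backwards. Note that cycle-walking gives constant time in expectation rather than in the worst case.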
Acknowledgements
Gratefully modifies and reuses code from https://github.com/asimihsan/permutation-iterator-rs which does most of the heavy lifting. Because the heavy lifting is done in Rust, this library is very efficient.