Efficient O(1)-space pseudorandom permutation for large scale shuffling
smallperm
Small library to generate permutations of a list of elements using pseudo-random permutations (PRP). Uses O(1) memory and O(1) time to generate the next element of the permutation.
>>> from smallperm import PseudoRandomPermutation
>>> list(PseudoRandomPermutation(42, 0xDEADBEEF))
[30, 11, 23, 21, 39, 9, 26, 5, 27, 38, 15, 37, 31, 35, 6, 13, 34, 10, 7, 0, 12, 22, 33, 17, 41, 29, 18, 20, 3, 40, 25, 4, 19, 24, 32, 16, 36, 14, 1, 28, 2, 8]
Motivation
In ML training, it is common to see things like
# Offline Shuffle
import numpy as np
sample_indices = np.arange(1_000_000)
np.random.shuffle(sample_indices)
for i in sample_indices:
    # do something with i
    ...
Or to do Fisher-Yates online
# Online Shuffle
import numpy as np
N = 1_000_000
sample_indices = np.arange(N)
for i in range(N):
    # Pick a random position from the not-yet-visited tail and swap it into place
    j = np.random.randint(i, N)
    sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
    # do something with sample_indices[i]
    ...
The problem with both of these approaches is that they require O(n) memory to store the shuffled indices, and the offline shuffle has a bad "time-to-first-sample" problem as we approach the scale of one billion data points. This library provides a way to generate a permutation of [0, n) using O(1) memory and O(1) time per element.
# Of course... first install us
pip install smallperm
import numpy as np
from smallperm import PseudoRandomPermutation as PRP
N = 1_000_000
prp = PRP(N, np.random.randint(0, np.iinfo(np.int64).max+1)) # Constructing the permutation takes O(1) time
print(prp[0], prp[50]) # We support O(1) random indexing, just like an array
assert 50 == prp.backward(prp[50]) # We support O(1) backward mapping
for ix in prp:
    # do something with ix
    ...
For most ML use cases this should be Pareto optimal: it is faster than Fisher-Yates, uses much less memory, and has a much better time-to-first-sample than an offline shuffle. In other words, we use O(1) time and O(1) space to produce the equivalent of arr = np.arange(N); np.random.shuffle(arr), which is kind of magical. This comes at the slight cost of some shuffling quality, but in ML training, where we routinely have more than 1M data points, our PRNG keys cannot represent the entire space of permutations anyway.
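To put a rough number on that last point, the sketch below compares log2(n!), the number of bits needed to index every permutation of n items, against a seed width. The 64-bit width is an assumption made for illustration (the example above draws its seed from the int64 range); `permutation_bits` and `SEED_BITS` are hypothetical names, not part of this library.

```python
import math


def permutation_bits(n: int) -> float:
    """Return log2(n!), the bits needed to index every permutation of n items."""
    # math.lgamma(n + 1) equals ln(n!); dividing by ln(2) converts nats to bits.
    return math.lgamma(n + 1) / math.log(2)


SEED_BITS = 64  # assumed seed width, for illustration only

for n in (20, 21, 1_000_000):
    bits = permutation_bits(n)
    covered = bits <= SEED_BITS
    print(f"n={n:>9,}: log2(n!) = {bits:,.1f} bits, all permutations reachable: {covered}")
```

Under this assumption, a seed of that width stops being able to reach every permutation as soon as n exceeds 20, and at one million elements the gap is many orders of magnitude. The same counting argument applies to any seeded shuffle, including Fisher-Yates driven by a PRNG with finite state.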
API
- Initialization: PseudoRandomPermutation(length: int, seed: int)
  - Generates a permutation of [0, length) using seed. We impose no restriction on length (except that it fits in an unsigned 128-bit integer).
- Usage: Iterate over the instance to get the next element of the permutation.
  - Example: list(PseudoRandomPermutation(42, 0xDEADBEEF))
- O(1) forward/backward mapping:
  - forward(i: int) -> int: Returns the i-th element of the permutation (regardless of the current state of the iterator).
  - backward(el: int) -> int: Returns the index of el in the permutation.
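Because any position can be looked up without iterating, one way to use the forward mapping is to split an epoch across workers with no shared index array. The following is a minimal sketch under stated assumptions: `shard_indices`, `rank`, and `world_size` are hypothetical names, and `perm` is anything that supports `perm[i]` indexing. A plain list stands in for the permutation here so the snippet runs without smallperm installed; in practice one would presumably pass a PseudoRandomPermutation built with the same length and seed on every worker.

```python
from typing import Iterator


def shard_indices(perm, length: int, rank: int, world_size: int) -> Iterator[int]:
    """Yield the shuffled sample indices assigned to one worker.

    Worker `rank` takes positions rank, rank + world_size, ... of the
    permutation, so workers never overlap and together cover every position.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    for position in range(rank, length, world_size):
        # Random access into the permutation; no worker stores the full order.
        yield perm[position]


# Stand-in permutation of [0, 5), used only to make the sketch runnable.
stand_in = [3, 0, 4, 1, 2]
shards = [list(shard_indices(stand_in, len(stand_in), r, 2)) for r in range(2)]
print(shards)  # [[3, 4, 2], [0, 1]]
```

Each worker only needs the length, the seed, and its own rank, which is why O(1) indexing is convenient in a distributed setting.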
Features
- Hardware-independent shuffling, i.e., reproducible across different machines given the same seed. Barring major bugs, this repo will not change the permutation generated by a given seed; if a fix ever requires changing it, we will do a major version bump.
- Extremely fast. On my MBP, iterating through the permutation is only 2x-3x slower than iterating through an arange(N) array.
How
We use a (somewhat) weak but fast symmetric cipher to generate the permutation. The resulting shuffle quality is not as high as that of a Fisher-Yates shuffle, but it is extremely efficient. Compared to Fisher-Yates, we use O(1) memory (as opposed to O(n), where n is the length of the shuffle). Moreover, if we fix a permutation $\sigma$ (i.e., PseudoRandomPermutation(n, seed)) that maps $\{0, 1, \ldots, n-1\}$ to itself, both $\sigma(x)$ and $\sigma^{-1}(y)$ can be computed in $O(1)$ time, which can be a very desirable property in distributed ML training.
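To make the idea concrete, here is a toy sketch of one common way to turn a keyed cipher into a permutation of [0, n): a balanced Feistel network over a power-of-two domain, combined with cycle walking to stay inside [0, n). This is an illustration only and is not claimed to be the cipher this library uses; the class name `ToyFeistelPermutation`, the BLAKE2b round function, and the round count are all assumptions, and its output will not match smallperm.

```python
import hashlib


class ToyFeistelPermutation:
    """Toy keyed permutation of range(n): Feistel network plus cycle walking."""

    def __init__(self, n: int, seed: int, rounds: int = 4):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        # Each half has half_bits bits, so the cipher domain 2**(2*half_bits)
        # covers [0, n) and is less than 4x larger than n (for n >= 2).
        self.half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
        self.mask = (1 << self.half_bits) - 1
        # Derive one round key per round from the seed (illustrative choice).
        self.round_keys = [
            hashlib.blake2b(f"{seed}:{r}".encode(), digest_size=8).digest()
            for r in range(rounds)
        ]

    def _round(self, half: int, key: bytes) -> int:
        digest = hashlib.blake2b(half.to_bytes(16, "little"), key=key, digest_size=8).digest()
        return int.from_bytes(digest, "little") & self.mask

    def _encrypt(self, x: int) -> int:
        left, right = x >> self.half_bits, x & self.mask
        for key in self.round_keys:
            left, right = right, left ^ self._round(right, key)
        return (left << self.half_bits) | right

    def _decrypt(self, y: int) -> int:
        left, right = y >> self.half_bits, y & self.mask
        for key in reversed(self.round_keys):
            # Undo one round: the old right half is the current left half.
            left, right = right ^ self._round(left, key), left
        return (left << self.half_bits) | right

    def forward(self, i: int) -> int:
        """sigma(i): re-encrypt until the value lands inside [0, n)."""
        x = self._encrypt(i)
        while x >= self.n:
            x = self._encrypt(x)
        return x

    def backward(self, el: int) -> int:
        """sigma^{-1}(el): same cycle walk, using the inverse cipher."""
        y = self._decrypt(el)
        while y >= self.n:
            y = self._decrypt(y)
        return y


perm = ToyFeistelPermutation(42, 0xDEADBEEF)
values = [perm.forward(i) for i in range(42)]
print(sorted(values) == list(range(42)))  # True: every index appears exactly once
```

The only state is the seed-derived round keys, so memory is constant regardless of n. Because the cipher domain is within a small constant factor of n, the cycle walk is expected to take a constant number of steps, which is where the O(1) forward and backward lookups come from in a construction of this kind.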
Acknowledgements
Gratefully modifies and reuses code from https://github.com/asimihsan/permutation-iterator-rs which does most of the heavy lifting. Because the heavy lifting is done in Rust, this library is very efficient.