Skip to main content

Efficient O(1)-space pseudorandom permutation; high speed shuffling for large data

Project description

smallperm

PyPI

Small library to generate permutations of a list of elements using pseudo-random permutations (PRP). Uses O(1) memory and O(1) time to generate the next element of the permutation.

>>> from smallperm import PseudoRandomPermutation
>>> list(PseudoRandomPermutation(42, 0xDEADBEEF))
[30, 11, 23, 21, 39, 9, 26, 5, 27, 38, 15, 37, 31, 35, 6, 13, 34, 10, 7, 0, 12, 22, 33, 17, 41, 29, 18, 20, 3, 40, 25, 4, 19, 24, 32, 16, 36, 14, 1, 28, 2, 8]

Motivation

In ML training, it is common to see things like

# Offline Shuffle
import numpy as np
sample_indices = np.arange(1_000_000)
np.random.shuffle(sample_indices)
for i in sample_indices:
    # do something with i
    ...

Or to do Fisher-Yates online

# Online Shuffle
import numpy as np
N = 1_000_000
sample_indices = np.arange(N)
for i in range(N):
    j = np.random.randint(i, N)
    sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
    # do something with sample_indices[i]
    ...

The problem with either of these approaches is that they require O(n) memory to store the shuffled indices, and offline shuffle has a bad "time-to-first-sample" problem when we approach the scale of one billion data points. This library provides a way to generate a permutation of [0, n) using O(1) memory and O(1) time.

# Of course... first install us
pip install smallperm
import numpy as np
from smallperm import PseudoRandomPermutation as PRP
N = 1_000_000
prp = PRP(N, np.random.randint(0, np.iinfo(np.int64).max+1)) # O(1) time generates the permutation
print(prp[0], prp[50]) # We support O(1) random indexing, just like an array
assert 50 == prp.backward(prp[50]) # We support O(1) backward mapping

for ix in prp:
    # do something with ix
    ...

For most ML use cases this should be Pareto optimal: it is faster than Fisher-Yates, uses much less memory, and has a much better time-to-first-sample than offline shuffle. In other words, we used O(1) time and O(1) space to generate arr = np.arange(N); np.random.shuffle(arr), kind of magical, at the slight cost of some shuffling quality, but hey, in ML training when we constantly have > 1M data points it's not like our PRNG keys can represent the entire space of permutations anyway.

API

  • Initialization: PseudoRandomPermutation(length: int, seed: int)

    • Generates a permutation of [0, length) using seed. We impose no restriction on length (except it fits under an unsigned 128-bit integer).
  • Usage: Iterate over the instance to get the next element of the permutation.

    • Example: list(PseudoRandomPermutation(42, 0xDEADBEEF))
  • O(1) forward/backward mapping:

    • forward(i: int) -> int: Returns the i-th element of the permutation (regardless of the current state of the iterator).
    • backward(el: int) -> int: Returns the index of el in the permutation.

Features

  1. Hard-ware independent (i.e., reproducible across different machines, with the same seed) shuffling. This repo, barring major bugs, will not change the permutation generated by a given seed (in which case we will do major version bump).
  2. Extremely fast. On my MBP iterating through the array is only 2x-3x slower than iterating throw a arange(N) array.

How

We use a (somewhat) weak albeit fast symmetric cipher to generate the permutation. The resulting shuffle quality is not as high as Fisher-Yates shuffle, but it is extremely efficient. Compared to Fisher-Yates, we use O(1) memory (as opposed to O(n), n the length of the shuffle); fix $\sigma$ a permutation (i.e., PseudoRandomPermutation(n, seed)) which maps ${0, 1, \ldots, n-1}$ to itself, we have $O(1)$ $\sigma(x)$ and $\sigma^{-1}(y)$, which can be very desirable properties in distributed ML training.

More examples

Shuffle non [0, n) sequences

PRP is a primitive yet powerful object, and can be composed.

from smallperm import PseudoRandomPermutation as PRP
from itertools import product

deck = list(product(range(4), range(13)))
prp = PRP(len(deck), 0xDEADBEEF)
shuffled_deck = [deck[i] for i in prp]

Replacing two random.shuffle

It is common to have two random.shuffle when there is a global shuffle and there is a "within-shard" shuffle.

from smallperm import PseudoRandomPermutation as PRP

seeds = (42, 0)
N = 1_000_000
local_indices = range(2, N, 8) # e.g., locally yield every 8th element
local_n = len(local_indices)

global_shuffle = PRP(N, seeds[0])
local_shuffle = PRP(len(local_indices), seeds[1])

for i in range(local_n):
  print(global_shuffle[local_indices[local_shuffle[i]]])
  # Compare and contrast with print(global_shuffle[local_indices[i]]),
  #   esp. with multiple epochs.

Infinite PRP with new shuffles per epoch

from itertools import chain, count
import numpy as np

N = 10_000_000
rng = np.random.default_rng(0xDEADBEEF)

uint32_upper_bound = 0x100000000

infinite_prp = chain.from_iterable(PRP(N, rng.integers(0, uint32_upper_bound)) for _ in count())

Acknowledgements

Gratefully modifies and reuses code from https://github.com/asimihsan/permutation-iterator-rs which does most of the heavy lifting. Because the heavy lifting is done in Rust, this library is very efficient.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smallperm-0.1.12.tar.gz (23.0 kB view details)

Uploaded Source

Built Distributions

smallperm-0.1.12-cp37-abi3-win_amd64.whl (131.1 kB view details)

Uploaded CPython 3.7+ Windows x86-64

smallperm-0.1.12-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (264.0 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ x86-64

smallperm-0.1.12-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (455.9 kB view details)

Uploaded CPython 3.7+ macOS 10.12+ universal2 (ARM64, x86-64) macOS 10.12+ x86-64 macOS 11.0+ ARM64

File details

Details for the file smallperm-0.1.12.tar.gz.

File metadata

  • Download URL: smallperm-0.1.12.tar.gz
  • Upload date:
  • Size: 23.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.6.0

File hashes

Hashes for smallperm-0.1.12.tar.gz
Algorithm Hash digest
SHA256 2821da5965517a0b64b574b036da0758cbf4c1f32bd447e8353e76eef2217c71
MD5 080df1145e8d2d450bf468cd15e0180f
BLAKE2b-256 c6f202d680304af06a0194ff3b7c9bb34745f86c68ab138bcaeaebcc73ee7d1f

See more details on using hashes here.

File details

Details for the file smallperm-0.1.12-cp37-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for smallperm-0.1.12-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 562b627125556f9f6bc3c5b12de3c0ad34c118f6368245c8eeea109c2eb30eac
MD5 75e16afd3c6f125784ade2d40e96b3ab
BLAKE2b-256 604e3ada1195c209d8e9301ee2d752194f5b8b7a29de97d641a311efe9e039e5

See more details on using hashes here.

File details

Details for the file smallperm-0.1.12-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for smallperm-0.1.12-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 79fa4ad15986a934b2b8eeef39268520b8285e85ef241b66eee670357a4f5d22
MD5 903d9125314f01c9a30af5b74e47ad69
BLAKE2b-256 fb6f23075e2e07543493c80d1299c9dd774aaa2546988521261bf5e88d54d03a

See more details on using hashes here.

File details

Details for the file smallperm-0.1.12-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

File hashes

Hashes for smallperm-0.1.12-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 09db2d53a940b6aafefa83accfecbcf0563177ba6d916ea1b597bb2e839a7974
MD5 cc7b47ee65f3ce885972df32c0a50820
BLAKE2b-256 e8b3ef38bdfaefded2d5899d7a569a994a83507d2d5bc73d60cf07a6dfbbd090

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page