
Manage a pool of arrays using shared memory


Pyarraypool


Transfer numpy arrays between processes using shared memory.

Why create this project?

This library aims to speed up parallel data processing with CPython and numpy ndarrays.

What is the issue with regular Python multitasking?

The Python GIL does not allow multithreading to be used for parallel data processing. The GIL is indeed released while C code, Cython (nogil) sections, or IO tasks run, but it stays locked during pure-Python computation. This is why subprocesses are often used to run multiple processing tasks in parallel.
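As a quick illustration (a standalone sketch, not using this library), a CPU-bound function gains nothing from threads but does scale with processes:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python loop: the GIL is held the whole time.
    return sum(i * i for i in range(n))

def bench(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as executor:
        list(executor.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", bench(ThreadPoolExecutor))   # no speedup: GIL serializes the loops
    print("processes:", bench(ProcessPoolExecutor))  # near-linear speedup: one GIL per process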

Alternatives to subprocess workers exist, but they are not always possible to use. To list a few of them:

  • using numba with prange
  • switching from CPython to PyPy
  • rewriting code in C / Cython / Rust
  • using ray

Why not use numpy's builtin mmap?

Numpy's builtin memory mapping is designed to manage a single numpy array. It is not designed to manage many "small" arrays that are frequently created and destroyed.
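For reference, here is what the builtin approach looks like (a minimal sketch, with a throwaway temporary file):

import os
import tempfile
import numpy as np

# One memmap = one file-backed array, with dtype / shape fixed up front.
path = os.path.join(tempfile.mkdtemp(), "single_array.dat")
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(100, 200))
arr[:] = 1.0
arr.flush()

# Reopening requires knowing dtype / shape again; managing many short-lived
# arrays this way means one file and one mmap per array.
arr2 = np.memmap(path, dtype=np.float64, mode="r", shape=(100, 200))
assert arr2[0, 0] == 1.0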

Why not use a big framework like ray?

The Ray API is great: super simple and efficient.

Unfortunately, it is slow for this use case because under the hood it still uses pickle to transfer data between processes. It is not optimized for simple, local, CPU-bound computation.


A few design choices

The Python standard library already contains a module (multiprocessing.shared_memory) to create and manage shared memory.

However, it does not permit managing that memory as a raw block safely and easily, so performance drops because several system calls must be made on each block creation / deletion.
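For comparison, here is what the standard-library approach looks like (a minimal sketch): each array needs its own named segment, with dedicated create / attach / unlink system calls.

import numpy as np
from multiprocessing import shared_memory

arr = np.random.random((100, 200))

# One named segment per array: each create / close / unlink is a syscall.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr  # copy the data into shared memory

# ... share shm.name, arr.shape and arr.dtype with another process,
# which rebuilds the view with shared_memory.SharedMemory(name=...) ...

del view      # drop the buffer reference before closing
shm.close()
shm.unlink()  # destroy the segment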

In this library:

  • shared memory is managed as a "pool" using Rust and the low-level CPython API.
  • arrays can be attached and are released when their refcount reaches 0 in every process.
  • a spinlock is used to synchronize processes when blocks are added / removed (this can be improved).

API usage

Here is a simple example of how to use the library.

import pyarraypool
import multiprocessing
import numpy as np

def task(x, i, value):
    # Dummy task that reads from and writes to the shared numpy array
    x[i, :, :] = value

def main():
    arr = np.random.random((100, 200, 500))
    I, J, K = arr.shape

    with multiprocessing.Pool(processes=8) as pool:
        # Transfer the array to shared memory.
        #
        # Segment will be created automatically on first `make_transferable` call.
        shmarr = pyarraypool.make_transferable(arr)

        # Apply task to array
        pool.starmap(task, [
            (shmarr, i, i) for i in range(I)
        ])

if __name__ == "__main__":
    main()

You can have a look at the notebook / example folders for more details.

Fighting the Python GC

There are a few things to know about array releasing.

When arrays are created in a subprocess, the GC can collect them before they are transferred to the main process.

Example sequence:

  1. The main process triggers a subprocess job.
  2. The subprocess has something to return, and creates a shared array as its return value.
  3. The subprocess GC is triggered and cleans up everything, including the return value, so the refcount reaches 0.
  4. The main process wakes up and tries to fetch the return value. The value has already been collected by the GC and released from the pool => CRASH!

To fix this issue, a flag is associated with each transferable array. This flag is set when the object has been transferred between two processes. If this flag is not set, the object will not be released.
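For example, returning a shared array from a worker follows the same `make_transferable` pattern as above (a sketch of the scenario described, assuming `make_transferable` can wrap arrays created inside a worker; shapes and values are hypothetical):

import multiprocessing
import numpy as np
import pyarraypool

def worker(n):
    # Array created in the subprocess: the transfer flag keeps it in the
    # pool until the main process has actually received it (step 4 above).
    out = np.full((n, n), float(n))
    return pyarraypool.make_transferable(out)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(worker, [2, 3])
    print([r.shape for r in results])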

If you want to create a transferable object but do not want to transfer it between processes, you can override this flag:

shmarr = pyarraypool.make_transferable(arr, transfer_required=False)

The object will then be released from the array pool as soon as its refcount reaches 0.

Developer guide

To build:

pip install maturin
maturin develop --extras test

To test:

# Run rust tests
cargo test
cargo clippy

# Run python tests
pytest -vv
flake8
autopep8 --diff -r python/
mypy .

To format code:

autopep8 -ir python/
isort .

Project status

The project is currently a "POC" and not fully ready for production.

A few benchmarks are still missing. The API can be improved and may change in the near future.

See TODO.md for more details.

Any help / feedback is welcome 😊!


