Skip to main content

Shared-memory streams for NumPy and CUDA-backed PyTorch pipelines.

Project description

pyshmem

PyPI | Documentation | Source | Issues

pyshmem provides named shared-memory streams for NumPy arrays and optional CUDA-backed PyTorch pipelines.

It is designed for applications that need a small, predictable API for moving numeric payloads between processes without rebuilding the same locking, metadata, and lifecycle rules around raw shared memory.

Why pyshmem

  • one API for CPU NumPy buffers and CUDA-backed tensors
  • cross-process write locking with explicit lock ownership
  • safe snapshot reads for CPU streams
  • explicit GPU performance mode or CPU-mirrored compatibility mode
  • tested lifecycle and recovery behavior across supported platforms

Installation

Install from PyPI:

pip install pyshmem

Optional extras:

pip install pyshmem[test]
pip install pyshmem[gpu]
pip install pyshmem[docs]

For local development from a checkout:

pip install -e .[test]

Quick Start

CPU stream

import numpy as np
import pyshmem

writer = pyshmem.create("demo_frame", shape=(4, 4), dtype=np.float32)
reader = pyshmem.open("demo_frame")

writer.write(np.ones((4, 4), dtype=np.float32))
frame = reader.read()
next_frame = reader.read_new(timeout=1.0)

GPU stream

import numpy as np
import pyshmem

writer = pyshmem.create(
    "demo_cuda",
    shape=(4, 4),
    dtype=np.float32,
    gpu_device="cuda:0",
)
writer.write(np.ones((4, 4), dtype=np.float32))

reader = pyshmem.open("demo_cuda", gpu_device="cuda:0")
frame = reader.read()

Public API

  • pyshmem.SharedMemory
  • pyshmem.create(name, *, shape, dtype=np.float32, size=None, gpu_device=None, cpu_mirror=None)
  • pyshmem.open(name, *, gpu_device=None)
  • pyshmem.unlink(name)
  • pyshmem.gpu_available()

SharedMemory instances expose metadata, locking, lifecycle, and IO methods:

  • name, shape, dtype, size, gpu_device, cpu_mirror, owner
  • count, write_time, write_sequence
  • acquire(timeout=None, poll_interval=1e-3)
  • release()
  • locked(timeout=None, poll_interval=1e-3)
  • write(value)
  • read(safe=True, poll_interval=1e-6)
  • read_new(timeout=None, safe=True, poll_interval=1e-5)
  • clear()
  • close()
  • unlink()
  • delete()

Behavior Notes

Writes are serialized with a cross-platform file lock backend.

  • read(safe=True) returns a consistent snapshot of the most recent completed write
  • read(safe=False) exposes the live backing storage and therefore requires with shm.locked():
  • close() releases only the local handle
  • unlink() destroys the underlying shared-memory stream

Closed handles are guarded explicitly. After close(), methods such as read, write, acquire, clear, and metadata access raise a RuntimeError that instructs the caller to reopen the stream.

Missing segments raise FileNotFoundError with a pyshmem-specific message that points the caller toward pyshmem.create(...).

GPU Modes

GPU-backed streams have two deliberately different operating modes.

Performance mode:

  • pyshmem.create(..., gpu_device="cuda:N") defaults to cpu_mirror=False
  • avoids CPU mirror maintenance on every write
  • optimized for GPU-heavy pipelines where throughput matters most

Compatibility mode:

  • pyshmem.create(..., gpu_device="cuda:N", cpu_mirror=True) keeps the CPU mirror updated
  • allows CPU-side payload reads and stronger safe-snapshot semantics under concurrent writes

Important attachment rule:

  • pass gpu_device="cuda:N" to pyshmem.open(...) whenever the caller needs a CUDA torch.Tensor view
  • opening a GPU stream without gpu_device still allows metadata inspection and lock management, but payload reads require either a GPU attachment or cpu_mirror=True

Platform Notes

Windows limitation

Windows inherits a hard limitation from multiprocessing.shared_memory: the operating system deletes a shared-memory block as soon as the last handle to it is closed.

That means the following behaviors are unsupported on Windows:

  • a segment outliving its creator when no other process still has it open
  • close() followed by pyshmem.open(...) when that close() dropped the final live handle

Those behaviors remain supported on POSIX platforms.

Testing

Install test dependencies and run the CPU suite:

pip install -e .[test]
pytest -m cpu

Run the CUDA suite on a GPU machine:

pip install -e .[test,gpu]
pytest -m gpu

The repository also includes benchmark-marked tests:

pytest -m "cpu and benchmark" -q -s
pytest tests/test_benchmark.py -m "gpu and benchmark" -q -s

GitHub-hosted runners do not provide CUDA by default, so the CUDA workflow is manual and targets either a self-hosted GPU runner or a larger GitHub runner with CUDA support.

Performance

The benchmark suite measures both raw shared-memory IO and matrix-vector multiply pipelines that keep the matrix in shared memory.

Two GPU MVM shapes are covered:

  • host-upload pipeline: the vector payload is created in NumPy and uploaded each iteration
  • device-resident pipeline: the vector payload is produced directly on GPU each iteration

The CPU benchmark target remains 50 kHz for a 128x128 round trip. Hard enforcement is opt-in because hosted CI is not a reliable performance lab.

Measured Results

The following numbers were measured on this machine:

  • OS: Linux 6.17.0-14-generic x86_64
  • Python: 3.12.0
  • NumPy: 2.2.6
  • PyTorch: 2.10.0+cu128
  • GPU: NVIDIA GeForce RTX 5090

Methodology:

  • float32 payloads throughout
  • each benchmark case used warmup iterations before timing
  • each timed case ran for at least 1.5 seconds to reduce one-off noise
  • IO throughput is computed from write plus read bytes per iteration
  • MVM throughput is reported both as pipeline rate and estimated GFLOP/s using $2n^2$ floating-point operations per matrix-vector multiply

Important interpretation note:

  • GPU-backed segments now default to cpu_mirror=False
  • the fast GPU path avoids CPU mirror maintenance unless the creator explicitly asks for it with cpu_mirror=True
  • the stronger concurrent-read consistency contract is provided by the mirrored mode; the default no-mirror mode is optimized for throughput first
  • the GPU numbers below therefore reflect the optimized no-mirror path, which is the intended performance configuration

IO vs Image Size

Image size Payload (MiB) CPU roundtrip Hz CPU IO (GB/s) GPU roundtrip Hz GPU IO (GB/s)
100x100 0.038 180311.2 14.42 36214.1 2.90
1000x1000 3.815 9922.1 79.38 5027.4 40.22
10000x10000 381.470 20.36 16.29 49.96 39.97

Shared-Memory MVM Pipeline

Host-upload GPU pipeline:

Matrix size Matrix payload (MiB) CPU pipeline Hz CPU GFLOP/s GPU pipeline Hz GPU GFLOP/s
100x100 0.038 109844.4 2.20 26465.8 0.53
1000x1000 3.815 11124.9 22.25 22485.3 44.97
10000x10000 381.470 26.21 5.24 1299.3 259.86

Fully device-resident GPU pipeline:

Matrix size Matrix payload (MiB) GPU pipeline Hz GPU GFLOP/s
100x100 0.038 30240.6 0.60
1000x1000 3.815 26733.6 53.47
10000x10000 381.470 1321.6 264.33

The updated results show the intended behavior for real GPU workloads:

  • tiny matrices like 100x100 are still dominated by launch and synchronization overhead, so CPU remains faster there
  • once the workload is large enough to matter, the no-mirror GPU path pulls ahead decisively
  • the 1000x1000 and 10000x10000 MVM cases now outperform the CPU equivalents by a wide margin on this machine
  • keeping the vector generation on GPU improves the pipeline further, especially once the matrix is large enough for the math to dominate

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyshmem-1.0.2.tar.gz (34.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyshmem-1.0.2-py3-none-any.whl (25.2 kB view details)

Uploaded Python 3

File details

Details for the file pyshmem-1.0.2.tar.gz.

File metadata

  • Download URL: pyshmem-1.0.2.tar.gz
  • Upload date:
  • Size: 34.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyshmem-1.0.2.tar.gz
Algorithm Hash digest
SHA256 fba40fcd4ffdb95025b40124d02daeb8a98effabdbe2b49bedd8f4b0e549bd32
MD5 c23b43da2694fde2f35fc8ef18809f58
BLAKE2b-256 665d73a0cfa82a711adafe491854cc1b7365e80522d335a2aa18faba78f186f0

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyshmem-1.0.2.tar.gz:

Publisher: pypi.yml on jacotay7/pyshmem

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pyshmem-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: pyshmem-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 25.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyshmem-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 e0901c2b328b35e33d45a2be1c75d2555e6fa8b1bcb018ce4da920eab9caa0ef
MD5 ebd3b9aff81f755022e24caa8831ba8a
BLAKE2b-256 de38c3aae4466df14a47929b91c13e4c8ae32c5c1df72936a39f6ccad8ac168e

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyshmem-1.0.2-py3-none-any.whl:

Publisher: pypi.yml on jacotay7/pyshmem

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page