Triton Shared Memory Client
A high-performance Python client for Triton Inference Server that simplifies the use of Shared Memory (SHM) for zero-copy inference.
This library is designed for scenarios where the client and the Triton server are colocated on the same machine. By using system shared memory, it avoids the overhead of serializing and deserializing tensors over gRPC/HTTP, resulting in massive throughput improvements for large data transfers.
Installation
```bash
pip install triton-shm-client
```
How to Use
The library exposes a TritonSHMClient that wraps the standard gRPC client but adds shared memory management capabilities.
- Initialize the client.
- Register your model: Define inputs, outputs, and allocate a shared memory pool.
- Run inference: Pass your numpy arrays directly.
```python
import numpy as np
from triton_shm_client import TritonSHMClient

# 1. Connect to Triton
client = TritonSHMClient(url="localhost:8001")

# 2. Register the model and allocate shared memory.
# This pre-allocates a pool of 4 slots, each capable of handling a batch of 8.
client.register_shm_model(
    model_name="my_model",
    inputs=[
        ("INPUT0", (3, 224, 224), np.float32),
        ("INPUT1", (10,), np.int32),
    ],
    outputs=[
        ("OUTPUT0", (512,), np.float32),
        ("OUTPUT1", (5,), np.int32),
    ],
    max_batch_size=8,
    pool_size=4,
)

# 3. Run inference.
# The client automatically handles copying data to SHM and retrieving results.
# If the input size exceeds max_batch_size, it will be automatically chunked.
# If the pool is full, this call will block until a slot becomes available.
results = client.infer_shm(
    model_name="my_model",
    inputs={
        "INPUT0": np.random.randn(100, 3, 224, 224).astype(np.float32),
        "INPUT1": np.random.randint(0, 100, size=(100, 10)).astype(np.int32),
    },
)

# results is a Dict[str, np.ndarray]
print("Output shape:", results["OUTPUT0"].shape)
```
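The transparent chunking of the 100-sample input above can be sketched in plain NumPy. This is an illustration of the idea only, not the library's internal code; `chunked` is a hypothetical helper:

```python
import numpy as np

def chunked(batch: np.ndarray, max_batch_size: int):
    # Split an oversized batch into server-sized chunks along axis 0.
    for start in range(0, len(batch), max_batch_size):
        yield batch[start:start + max_batch_size]

x = np.zeros((100, 10), dtype=np.int32)
sizes = [len(c) for c in chunked(x, 8)]
# 100 samples with max_batch_size=8 -> 12 full chunks of 8 plus one of 4.
assert sizes == [8] * 12 + [4]
```

After each chunk's inference completes, the per-chunk outputs would be concatenated back along axis 0 (e.g. with `np.concatenate`) so the caller sees a single result.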
Features
- Zero-Copy Inference: Uses `multiprocessing.shared_memory` to pass data to Triton without network overhead.
- Automatic Pool Management: Handles the complexity of allocating, registering, and cleaning up shared memory regions.
- Transparent Batching: Seamlessly handles inputs larger than the model's `max_batch_size` by chunking requests into smaller batches and reassembling the results.
- Blocking Flow Control: If the shared memory pool is full, inference requests automatically block until a slot is free, providing simple backpressure.
- NumPy Integration: Native support for NumPy arrays for both inputs and outputs.
- Drop-in Replacement: Extends the standard `InferenceServerClient`, so you can still use standard gRPC methods if needed.
- Automatic Cleanup: Registers `atexit` handlers to ensure shared memory regions are unlinked even if the script exits unexpectedly.
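The pool-management, blocking flow-control, and `atexit` cleanup features above can be combined into a minimal sketch. This is not the library's actual implementation; `SlotPool` and its methods are hypothetical names, assuming one `SharedMemory` region per slot and a semaphore for backpressure:

```python
import atexit
import threading
from multiprocessing import shared_memory

import numpy as np

class SlotPool:
    """Minimal sketch of a blocking shared-memory slot pool."""

    def __init__(self, slot_bytes: int, pool_size: int):
        # A bounded semaphore provides backpressure: acquire() blocks
        # when all slots are in use.
        self._free = threading.BoundedSemaphore(pool_size)
        self._slots = [
            shared_memory.SharedMemory(create=True, size=slot_bytes)
            for _ in range(pool_size)
        ]
        self._available = list(range(pool_size))
        self._lock = threading.Lock()
        atexit.register(self.close)  # unlink regions even on abnormal exit

    def acquire(self):
        self._free.acquire()  # blocks until a slot is free
        with self._lock:
            idx = self._available.pop()
        return idx, self._slots[idx]

    def release(self, idx: int):
        with self._lock:
            self._available.append(idx)
        self._free.release()

    def close(self):
        # Idempotent: safe to call both explicitly and from atexit.
        for shm in self._slots:
            shm.close()
            try:
                shm.unlink()
            except FileNotFoundError:
                pass

pool = SlotPool(slot_bytes=1024, pool_size=2)
idx, shm = pool.acquire()
# Write a tensor directly into the slot's buffer, no serialization.
view = np.ndarray((256,), dtype=np.float32, buffer=shm.buf)
view[:] = 1.0
pool.release(idx)
del view  # drop the buffer view before the region is closed
pool.close()
```

A real client would also register each region with Triton (`RegisterSystemSharedMemory`) and track per-slot input/output offsets, but the blocking acquire/release cycle is the core of the backpressure behavior.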
Limitations
- Local Only: The Triton Inference Server must be running on the same machine as the client, as they share system memory.
- Linux Only: Currently tested and supported primarily on Linux.
- Fixed Pool Size: The shared memory pool size is fixed at registration time.
Benchmarks
Using shared memory significantly outperforms standard gRPC for medium to large payloads.
| Model Type | Batch Size | Standard gRPC (MB/s) | SHM Client (MB/s) | Speedup |
|---|---|---|---|---|
| Large | 2 | 89.86 | 3025.43 | ~33x |
| Normal | 8 | 560.87 | 1282.82 | ~2.3x |
| Multi-IO | 8 | 592.53 | 1354.45 | ~2.3x |
| Identity | 8 | 18.59 | 16.41 | ~0.9x |
> [!TIP]
> For very small payloads (like the Identity model), the overhead of managing shared memory might slightly outweigh the benefits.
How it Works
Standard Triton clients send data over the network (even localhost). This involves:
- Serializing numpy arrays to bytes.
- Sending bytes over a socket.
- Triton deserializing bytes.
- (And the reverse for outputs).
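The cost of the standard path can be illustrated with a toy round trip: serializing forces a full copy to bytes and another copy back, while a shared buffer only needs a single write. The shapes here are arbitrary and this is not a benchmark:

```python
import numpy as np

x = np.random.randn(4, 3, 224, 224).astype(np.float32)

# Standard path: tensor -> bytes -> tensor, two full copies of the data.
wire = x.tobytes()
y = np.frombuffer(wire, dtype=np.float32).reshape(x.shape)

# Shared-memory path: a NumPy view over a pre-allocated buffer; the only
# data movement is one write into the shared region.
buf = bytearray(x.nbytes)
view = np.frombuffer(buf, dtype=np.float32).reshape(x.shape)
view[:] = x

assert np.array_equal(x, y)
assert np.array_equal(x, view)
```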
Triton SHM Client optimizes this:
- Pre-allocation: On startup (
register_shm_model), it allocates a large block of System Shared Memory. - Slotting: This block is divided into "slots". Each slot is a pre-calculated memory region big enough to hold one full batch of inputs and outputs.
- Direct Access: When you call
infer_shm, the client writes your numpy data directly into a free slot's memory address. - Pointer Passing: It sends a tiny gRPC message to Triton saying "Read inputs from memory address X, write outputs to address Y".
- Zero-Copy Read: Triton reads directly from RAM, processes, and writes back to RAM.
- Result: The client returns a numpy view of the output memory region.
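The "pre-calculated memory region" in the slotting step comes down to simple arithmetic over the registered I/O spec. Sketching it with the specs from the usage example above (`region_bytes` is a hypothetical helper, not part of the library's API):

```python
import numpy as np

# I/O specs mirroring the README usage example.
inputs = [("INPUT0", (3, 224, 224), np.float32), ("INPUT1", (10,), np.int32)]
outputs = [("OUTPUT0", (512,), np.float32), ("OUTPUT1", (5,), np.int32)]
max_batch_size = 8

def region_bytes(tensors, batch):
    # Bytes needed to hold one full batch of every tensor in the spec.
    return sum(batch * int(np.prod(shape)) * np.dtype(dt).itemsize
               for _, shape, dt in tensors)

slot_input_bytes = region_bytes(inputs, max_batch_size)
slot_output_bytes = region_bytes(outputs, max_batch_size)
print(slot_input_bytes, slot_output_bytes)  # → 4817216 16544
```

Each slot must hold both regions, so a pool of 4 slots would reserve roughly 4 × (4817216 + 16544) bytes, about 19 MB, up front.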