Triton Shared Memory Client
A high-performance Python client for Triton Inference Server that simplifies the use of Shared Memory (SHM) for zero-copy inference.
This library is designed for scenarios where the client and the Triton server are colocated on the same machine. By using system shared memory, it avoids the overhead of serializing and deserializing tensors over gRPC/HTTP, resulting in massive throughput improvements for large data transfers.
Installation
```bash
pip install triton-shm-client
```
How to Use
The library exposes a TritonSHMClient that wraps the standard gRPC client but adds shared memory management capabilities.
- Initialize the client.
- Register your model: Define inputs, outputs, and allocate a shared memory pool.
- Run inference: Pass your numpy arrays directly.
```python
import numpy as np
from triton_shm_client import TritonSHMClient

# 1. Connect to Triton
client = TritonSHMClient(url="localhost:8001")

# 2. Register the model and allocate shared memory.
# This pre-allocates a pool of 4 slots, each capable of handling a batch of 8.
client.register_shm_model(
    model_name="my_model",
    inputs=[
        ("INPUT0", (3, 224, 224), np.float32),
        ("INPUT1", (10,), np.int32),
    ],
    outputs=[
        ("OUTPUT0", (512,), np.float32),
        ("OUTPUT1", (5,), np.int32),
    ],
    max_batch_size=8,
    pool_size=4,
)

# 3. Run inference.
# The client automatically handles copying data to SHM and retrieving results.
# If the input size exceeds max_batch_size, it will be automatically chunked.
# If the pool is full, this call will block until a slot becomes available.
results = client.infer_shm(
    model_name="my_model",
    inputs={
        "INPUT0": np.random.randn(100, 3, 224, 224).astype(np.float32),
        "INPUT1": np.random.randint(0, 100, size=(100, 10)).astype(np.int32),
    },
)

# results is a Dict[str, np.ndarray]
print("Output shape:", results["OUTPUT0"].shape)
```
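The transparent chunking of the 100-sample input above can be sketched in plain NumPy. This is an illustration of the idea only, not the library's internal code; `chunked` is a hypothetical helper:

```python
import numpy as np

def chunked(batch: np.ndarray, max_batch_size: int):
    # Split an oversized batch into server-sized chunks along axis 0.
    for start in range(0, len(batch), max_batch_size):
        yield batch[start:start + max_batch_size]

x = np.zeros((100, 10), dtype=np.int32)
sizes = [len(c) for c in chunked(x, 8)]
# 100 samples with max_batch_size=8 -> 12 full chunks of 8 plus one of 4.
assert sizes == [8] * 12 + [4]
```

After each chunk's inference completes, the per-chunk outputs would be concatenated back along axis 0 (e.g. with `np.concatenate`) so the caller sees a single result.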
Features
- Zero-Copy Inference: Uses `multiprocessing.shared_memory` to pass data to Triton without network overhead.
- Automatic Pool Management: Handles the complexity of allocating, registering, and cleaning up shared memory regions.
- Transparent Batching: Seamlessly handles inputs larger than the model's `max_batch_size` by chunking requests into smaller batches and reassembling the results.
- Blocking Flow Control: If the shared memory pool is full, inference requests automatically block until a slot is free, providing simple backpressure.
- NumPy Integration: Native support for NumPy arrays for both inputs and outputs.
- Drop-in Replacement: Extends the standard `InferenceServerClient`, so you can still use standard gRPC methods if needed.
- Automatic Cleanup: Registers `atexit` handlers to ensure shared memory regions are unlinked even if the script exits unexpectedly.
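The pool-management, blocking flow-control, and `atexit` cleanup features above can be combined into a minimal sketch. This is not the library's actual implementation; `SlotPool` and its methods are hypothetical names, assuming one `SharedMemory` region per slot and a semaphore for backpressure:

```python
import atexit
import threading
from multiprocessing import shared_memory

import numpy as np

class SlotPool:
    """Minimal sketch of a blocking shared-memory slot pool."""

    def __init__(self, slot_bytes: int, pool_size: int):
        # A bounded semaphore provides backpressure: acquire() blocks
        # when all slots are in use.
        self._free = threading.BoundedSemaphore(pool_size)
        self._slots = [
            shared_memory.SharedMemory(create=True, size=slot_bytes)
            for _ in range(pool_size)
        ]
        self._available = list(range(pool_size))
        self._lock = threading.Lock()
        atexit.register(self.close)  # unlink regions even on abnormal exit

    def acquire(self):
        self._free.acquire()  # blocks until a slot is free
        with self._lock:
            idx = self._available.pop()
        return idx, self._slots[idx]

    def release(self, idx: int):
        with self._lock:
            self._available.append(idx)
        self._free.release()

    def close(self):
        # Idempotent: safe to call both explicitly and from atexit.
        for shm in self._slots:
            shm.close()
            try:
                shm.unlink()
            except FileNotFoundError:
                pass

pool = SlotPool(slot_bytes=1024, pool_size=2)
idx, shm = pool.acquire()
# Write a tensor directly into the slot's buffer, no serialization.
view = np.ndarray((256,), dtype=np.float32, buffer=shm.buf)
view[:] = 1.0
pool.release(idx)
del view  # drop the buffer view before the region is closed
pool.close()
```

A real client would also register each region with Triton (`RegisterSystemSharedMemory`) and track per-slot input/output offsets, but the blocking acquire/release cycle is the core of the backpressure behavior.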
Limitations
- Local Only: The Triton Inference Server must be running on the same machine as the client, as they share system memory.
- Linux Only: Currently tested and supported primarily on Linux.
- Fixed Pool Size: The shared memory pool size is fixed at registration time.
Benchmarks
Using shared memory significantly outperforms standard gRPC for medium to large payloads.
| Model Type | Batch Size | Standard gRPC (MB/s) | SHM Client (MB/s) | Speedup |
|---|---|---|---|---|
| Large | 2 | 89.86 | 3025.43 | ~33x |
| Normal | 8 | 560.87 | 1282.82 | ~2.3x |
| Multi-IO | 8 | 592.53 | 1354.45 | ~2.3x |
| Identity | 8 | 18.59 | 16.41 | ~0.9x |
> [!TIP]
> For very small payloads (like the Identity model), the overhead of managing shared memory might slightly outweigh the benefits.
How it Works
Standard Triton clients send data over the network (even localhost). This involves:
- Serializing numpy arrays to bytes.
- Sending bytes over a socket.
- Triton deserializing bytes.
- (And the reverse for outputs).
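The cost of the standard path can be illustrated with a toy round trip: serializing forces a full copy to bytes and another copy back, while a shared buffer only needs a single write. The shapes here are arbitrary and this is not a benchmark:

```python
import numpy as np

x = np.random.randn(4, 3, 224, 224).astype(np.float32)

# Standard path: tensor -> bytes -> tensor, two full copies of the data.
wire = x.tobytes()
y = np.frombuffer(wire, dtype=np.float32).reshape(x.shape)

# Shared-memory path: a NumPy view over a pre-allocated buffer; the only
# data movement is one write into the shared region.
buf = bytearray(x.nbytes)
view = np.frombuffer(buf, dtype=np.float32).reshape(x.shape)
view[:] = x

assert np.array_equal(x, y)
assert np.array_equal(x, view)
```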
Triton SHM Client optimizes this:
- Pre-allocation: On startup (
register_shm_model), it allocates a large block of System Shared Memory. - Slotting: This block is divided into "slots". Each slot is a pre-calculated memory region big enough to hold one full batch of inputs and outputs.
- Direct Access: When you call
infer_shm, the client writes your numpy data directly into a free slot's memory address. - Pointer Passing: It sends a tiny gRPC message to Triton saying "Read inputs from memory address X, write outputs to address Y".
- Zero-Copy Read: Triton reads directly from RAM, processes, and writes back to RAM.
- Result: The client returns a numpy view of the output memory region.
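The "pre-calculated memory region" in the slotting step comes down to simple arithmetic over the registered I/O spec. Sketching it with the specs from the usage example above (`region_bytes` is a hypothetical helper, not part of the library's API):

```python
import numpy as np

# I/O specs mirroring the README usage example.
inputs = [("INPUT0", (3, 224, 224), np.float32), ("INPUT1", (10,), np.int32)]
outputs = [("OUTPUT0", (512,), np.float32), ("OUTPUT1", (5,), np.int32)]
max_batch_size = 8

def region_bytes(tensors, batch):
    # Bytes needed to hold one full batch of every tensor in the spec.
    return sum(batch * int(np.prod(shape)) * np.dtype(dt).itemsize
               for _, shape, dt in tensors)

slot_input_bytes = region_bytes(inputs, max_batch_size)
slot_output_bytes = region_bytes(outputs, max_batch_size)
print(slot_input_bytes, slot_output_bytes)  # → 4817216 16544
```

Each slot must hold both regions, so a pool of 4 slots would reserve roughly 4 × (4817216 + 16544) bytes, about 19 MB, up front.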