Easier Shared Memory management for Triton Inference Client

Triton Shared Memory Client

License: MIT

A high-performance Python client for Triton Inference Server that simplifies the use of Shared Memory (SHM) for zero-copy inference.

This library is designed for scenarios where the client and the Triton server are colocated on the same machine. By using system shared memory, it avoids the overhead of serializing and deserializing tensors over gRPC/HTTP. In the benchmarks below, this yields roughly 2x to 33x higher throughput for medium to large payloads.

Installation

pip install triton-shm-client

How to Use

The library exposes a TritonSHMClient that wraps the standard gRPC client but adds shared memory management capabilities.

  1. Initialize the client.
  2. Register your model: Define inputs, outputs, and allocate a shared memory pool.
  3. Run inference: Pass your NumPy arrays directly.

import numpy as np
from triton_shm_client import TritonSHMClient

# 1. Connect to Triton
client = TritonSHMClient(url="localhost:8001")

# 2. Register the model and allocate shared memory
#    This pre-allocates a pool of 4 slots, each capable of handling a batch of 8.
client.register_shm_model(
    model_name="my_model",
    inputs=[
        ("INPUT0", (3, 224, 224), np.float32),
        ("INPUT1", (10,), np.int32)
    ],
    outputs=[
        ("OUTPUT0", (512,), np.float32),
        ("OUTPUT1", (5,), np.int32)
    ],
    max_batch_size=8,
    pool_size=4
)

# 3. Run inference
#    The client automatically handles copying data to SHM and retrieving results.
#    If the input size exceeds max_batch_size, it will be automatically chunked.
#    If the pool is full, this call will block until a slot becomes available.
results = client.infer_shm(
    model_name="my_model",
    inputs={
        "INPUT0": np.random.randn(100, 3, 224, 224).astype(np.float32),
        "INPUT1": np.random.randint(0, 100, size=(100, 10)).astype(np.int32)
    }
)

# results is a Dict[str, np.ndarray]
print("Output shape:", results["OUTPUT0"].shape)
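
The chunking mentioned in the comments above can be pictured with a small sketch. The helper `chunk_bounds` is hypothetical and is not part of the library's API; it only shows, under the assumption that chunks are taken in order along the batch dimension, how 100 items and a max_batch_size of 8 would split into requests.

```python
def chunk_bounds(total_items, max_batch_size):
    """Return (start, stop) index pairs that cover total_items in order.

    Hypothetical helper for illustration; the library's real chunking
    logic may differ.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [
        (start, min(start + max_batch_size, total_items))
        for start in range(0, total_items, max_batch_size)
    ]


# 100 items with a batch limit of 8: twelve full chunks plus one partial chunk.
bounds = chunk_bounds(100, 8)
print(len(bounds), bounds[0], bounds[-1])
```

Each pair would select one slice of every input array, and the per-chunk outputs would be concatenated in the same order to rebuild the full result.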

Features

  • Zero-Copy Inference: Uses multiprocessing.shared_memory to pass data to Triton without network overhead.
  • Automatic Pool Management: Handles the complexity of allocating, registering, and cleaning up shared memory regions.
  • Transparent Batching: Seamlessly handles inputs larger than the model's max_batch_size by chunking requests into smaller batches and reassembling the results.
  • Blocking Flow Control: If the shared memory pool is full, inference requests automatically block until a slot is free, providing simple backpressure.
  • NumPy Integration: Native support for NumPy arrays for both inputs and outputs.
  • Drop-in Replacement: Extends the standard InferenceServerClient, so you can still use standard gRPC methods if needed.
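  • Automatic Cleanup: Registers atexit handlers so that shared memory regions are unlinked when the interpreter shuts down, including after an unhandled exception. Note that atexit handlers do not run if the process is killed by a signal that Python does not handle or exits through os._exit().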
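
The blocking flow control described above can be sketched with a standard-library queue. `SlotPool` is a hypothetical class written for this illustration, not the library's internal implementation; it assumes each slot is identified by an integer and that a request holds exactly one slot for its duration.

```python
import queue
from contextlib import contextmanager


class SlotPool:
    """Hypothetical fixed-size slot pool; not the library's own class."""

    def __init__(self, pool_size):
        # The queue holds the ids of the slots that are currently free.
        self._free = queue.Queue(maxsize=pool_size)
        for slot_id in range(pool_size):
            self._free.put(slot_id)

    @contextmanager
    def slot(self, timeout=None):
        # Blocks while every slot is in use. With a timeout, queue.Empty
        # is raised if no slot becomes free in time.
        slot_id = self._free.get(timeout=timeout)
        try:
            yield slot_id
        finally:
            # Always hand the slot back, even if the request failed.
            self._free.put(slot_id)


pool = SlotPool(pool_size=4)
with pool.slot() as slot_id:
    print("running a request in slot", slot_id)
```

A queue gives backpressure without extra bookkeeping: callers wait in `get` until another request returns its slot, which matches the fixed pool size noted under Limitations.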

Limitations

  • Local Only: The Triton Inference Server must be running on the same machine as the client, as they share system memory.
  • Linux Only: Currently tested and supported primarily on Linux.
  • Fixed Pool Size: The shared memory pool size is fixed at registration time.

Benchmarks

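In the measurements below, shared memory outperforms standard gRPC for medium to large payloads: about 2.3x at batch size 8 for the Normal and Multi-IO models, and about 33x at batch size 2 for the Large model.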

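| Model Type | Batch Size | Standard gRPC (MB/s) | SHM Client (MB/s) | Speedup |
|------------|------------|----------------------|-------------------|---------|
| Large      | 2          | 89.86                | 3025.43           | ~33x    |
| Normal     | 8          | 560.87               | 1282.82           | ~2.3x   |
| Multi-IO   | 8          | 592.53               | 1354.45           | ~2.3x   |
| Identity   | 8          | 18.59                | 16.41             | ~0.9x   |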

Tip: For very small payloads (like the Identity model), the overhead of managing shared memory can outweigh the benefits. In the table above, the SHM client reaches about 0.9x of the standard gRPC throughput for that model.
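
The Speedup column is the ratio of the two throughput columns. A short check, using only the numbers from the table above, makes the arithmetic explicit:

```python
# Throughput in MB/s copied from the benchmark table: (standard gRPC, SHM client).
benchmarks = {
    "Large": (89.86, 3025.43),
    "Normal": (560.87, 1282.82),
    "Multi-IO": (592.53, 1354.45),
    "Identity": (18.59, 16.41),
}

# Speedup = SHM throughput divided by gRPC throughput.
speedups = {name: shm / grpc for name, (grpc, shm) in benchmarks.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio below 1, as for the Identity model, means the shared memory path was slower than plain gRPC in that run.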

How it Works

Standard Triton clients send data through the network stack, even when the server is on localhost. This involves:

  1. Serializing NumPy arrays to bytes.
  2. Sending the bytes over a socket.
  3. Triton deserializing the bytes.
  4. Repeating the same steps in reverse for the outputs.
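
The copies in steps 1 and 3 can be seen without a server. This is a minimal sketch that uses the standard-library `array` module as a stand-in for a NumPy array and skips the socket transfer entirely; it only illustrates that each direction produces a fresh copy of the data.

```python
from array import array

# Stand-in for a small float32 tensor; a real client would start from NumPy.
tensor = array("f", [0.5, 1.5, 2.5, 3.5])

# Step 1: serialize to bytes. This creates a first copy of the data.
payload = tensor.tobytes()

# Step 2 (the socket transfer) is omitted; len(payload) bytes would cross it.

# Step 3: the receiver rebuilds a tensor from the bytes, a second copy.
received = array("f")
received.frombytes(payload)

print(len(payload), "bytes sent,", len(received), "values rebuilt")
```

For a batch of large tensors, these copies and the socket transfer scale with the payload size, which is the cost the shared memory path is meant to avoid.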

Triton SHM Client optimizes this:

  1. Pre-allocation: On startup (register_shm_model), it allocates a large block of system shared memory.
  2. Slotting: This block is divided into "slots". Each slot is a pre-calculated memory region big enough to hold one full batch of inputs and outputs.
  3. Direct Access: When you call infer_shm, the client writes your NumPy data directly into a free slot's memory.
  4. Pointer Passing: It sends a tiny gRPC message to Triton saying "Read inputs from shared memory region X, write outputs to region Y".
  5. Zero-Copy Read: Triton reads the inputs directly from shared memory, runs the model, and writes the outputs back to shared memory.
  6. Result: The client returns a NumPy view of the output memory region.
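
The steps above can be sketched with the standard library alone. The first part computes how large a slot and the whole pool would be for the registration example earlier in this document, assuming 4 bytes per element (float32 and int32) and a contiguous slot layout, which is an assumption rather than the library's documented layout. The second part writes into a small shared memory block and reads it back through a second attachment, standing in for the client and for Triton. No Triton call is made, and `struct` is used here in place of NumPy.

```python
import math
import struct
from multiprocessing import shared_memory

# Specs copied from the registration example: (name, per-item shape, bytes per element).
SPECS = [
    ("INPUT0", (3, 224, 224), 4),  # float32
    ("INPUT1", (10,), 4),          # int32
    ("OUTPUT0", (512,), 4),        # float32
    ("OUTPUT1", (5,), 4),          # int32
]
MAX_BATCH_SIZE = 8
POOL_SIZE = 4


def slot_bytes(specs, max_batch_size):
    """Bytes needed to hold one full batch of every tensor in specs."""
    return sum(
        max_batch_size * math.prod(shape) * itemsize
        for _, shape, itemsize in specs
    )


slot_size = slot_bytes(SPECS, MAX_BATCH_SIZE)
pool_bytes = POOL_SIZE * slot_size
# Assumed layout: slots packed back to back inside one block.
slot_offsets = [index * slot_size for index in range(POOL_SIZE)]
print(f"slot: {slot_size} bytes, pool: {pool_bytes} bytes")

# Tiny demo block so the example runs quickly; a real pool would use pool_bytes.
block = shared_memory.SharedMemory(create=True, size=64)
try:
    # "Client" side: write four float32 values at byte offset 16.
    struct.pack_into("4f", block.buf, 16, 1.0, 2.0, 3.0, 4.0)

    # "Server" side: attach to the same region by name and read in place.
    peer = shared_memory.SharedMemory(name=block.name)
    try:
        values = struct.unpack_from("4f", peer.buf, 16)
    finally:
        peer.close()
finally:
    # Release the mapping, then remove the region from the system.
    block.close()
    block.unlink()

print(values)
```

Only the region name, offset, and byte size need to travel between the two sides, which is why the gRPC message in step 4 stays small regardless of tensor size.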

Download files

Source Distribution

triton_shm_client-0.1.0.tar.gz (12.5 kB)

Uploaded Source

Built Distribution

triton_shm_client-0.1.0-py3-none-any.whl (11.3 kB)

Uploaded Python 3

File details

Details for the file triton_shm_client-0.1.0.tar.gz.

File metadata

  • Download URL: triton_shm_client-0.1.0.tar.gz
  • Size: 12.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for triton_shm_client-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2c23754a452404ec1be7b106505d8c8709eaf64edb4c4ad65d8ddc58deb05edb
MD5 a4863fe7d04db01c70599a5a24f23412
BLAKE2b-256 f8b62eafce4565b4fc79b49b25be0d0d7da87be6bc45aa763acd94933b131984

Provenance

The following attestation bundles were made for triton_shm_client-0.1.0.tar.gz:

Publisher: publish-to-pypy.yml on Armaggheddon/triton_shm_client

File details

Details for the file triton_shm_client-0.1.0-py3-none-any.whl.

File hashes

Hashes for triton_shm_client-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 050d11b79ff461b48686b8403e4719dd0384588f5cdd1fe438c05613e37f19cf
MD5 8952d6d983bb13810143384bba5bbbea
BLAKE2b-256 23424f9f90b8be94db50794a98dfe4eac9bd56161ab81f477fd17b187e059e2e

Provenance

The following attestation bundles were made for triton_shm_client-0.1.0-py3-none-any.whl:

Publisher: publish-to-pypy.yml on Armaggheddon/triton_shm_client
