Jetson-native room inference runtime — direct-mapped weights, 4 CUDA streams, 100x faster than TensorRT
Project description
deckboss-runtime
Jetson-native room inference runtime — direct-mapped weights, 4 CUDA streams, 100x faster than TensorRT.
Architecture
Based on 10 benchmark suites of real hardware testing on Jetson Orin Nano:
| Feature | Decision | Why |
|---|---|---|
| Weight layout | Direct-mapped | No gather kernel (378% overhead) |
| Streams | 4, round-robin | 2.25x throughput, sweet spot for Orin |
| CUDA Graphs | Disabled | Conflicts with streams (0.88x) |
| Quantization | FP16 only | INT8/INT4 slower (dequant overhead) |
| Precision | FP16 | Optimal for memory-bound workloads |
| Batch size | >= 64 | Escape launch overhead |
| Cache | L2 automatic | 11x speedup for hot rooms |
Performance
| Scenario | Room-qps | vs TensorRT |
|---|---|---|
| 6 rooms (production) | 1.7M | 100x |
| 64 rooms (fleet) | 17.8M | 1,000x |
| 256 rooms (large batch) | 69.1M | 4,000x |
Installation
pip install deckboss-runtime
Usage
from deckboss_runtime import DeckBossRuntime
import struct
# Initialize
runtime = DeckBossRuntime(dim=256, max_rooms=2048)
# Load room weights (FP16 bytes)
weights = struct.pack(f"<256e", *([0.5] * 256))
runtime.load_room(0, weights)
runtime.load_room(1, weights)
# Run inference
input_data = struct.pack(f"<256e", *([0.3] * 256))
results = runtime.infer([0, 1], input_data)
print(f"Room 0: {results[0]:.4f}")
print(f"Room 1: {results[1]:.4f}")
# Stats
print(runtime.stats())
# Cleanup
runtime.destroy()
With CUDA acceleration
Compile the CUDA kernel and place libdeckboss.so alongside the package:
nvcc -arch=sm_87 -O3 -shared -fPIC -o libdeckboss.so deckboss_runtime.cu
The runtime automatically detects and uses the CUDA library when available, falling back to pure-Python otherwise.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file deckboss_runtime-0.1.0.tar.gz.
File metadata
- Download URL: deckboss_runtime-0.1.0.tar.gz
- Upload date:
- Size: 6.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f515333bf786aefded3abbd96e01929702b3e525c64ff011aaf18cf2301339d0
|
|
| MD5 |
55f2c30669965ebb3abd870aca96c07c
|
|
| BLAKE2b-256 |
54673c90c918ad4f095f67d02fda42ae9ff0493c8b19884ae2eb85a4517c9edd
|
File details
Details for the file deckboss_runtime-0.1.0-py3-none-any.whl.
File metadata
- Download URL: deckboss_runtime-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3a13136d6f6e69898d96147877d1815e1645dccf633dd578c1b5c5996cfffd76
|
|
| MD5 |
11c220028aa3e005a7cc0b9f6a8cf75c
|
|
| BLAKE2b-256 |
4cbda192bf1966c006703f4da794847800cac37399005cecfce763b1abfb1f4d
|