Python port of DotCompute's Ring Kernel System - GPU-native actor model with persistent kernels and message passing
Project description
PyDotCompute
A Python port of DotCompute's Ring Kernel System - a GPU-native actor model with persistent kernels and message passing.
Overview
PyDotCompute brings GPU-native actor model capabilities to Python, enabling developers to create persistent GPU kernels that communicate through message queues. This approach is ideal for:
- Real-time GPU compute pipelines
- Streaming data processing on GPU
- Actor-based GPU programming
- High-throughput message-driven architectures
Performance Highlights
| Metric | Value |
|---|---|
| Message latency (p50) | 21μs |
| Message latency (p99) | 131μs |
| GPU graph processing | 1.7M edges/sec |
| Actor throughput | 76K msg/sec |
| Cython queue ops | 0.33μs |
Benchmarked with uvloop on Linux. See Benchmarks for details.
Features
- Ring Kernel System: Persistent GPU kernels with infinite loops and message queues
- High Performance: uvloop auto-installation for 21μs message latency
- Message Passing: Type-safe message serialization with msgpack
- Unified Memory: Transparent host-device memory management with lazy synchronization
- Lifecycle Management: Two-phase launch (launch -> activate) with graceful shutdown
- Telemetry: Real-time GPU monitoring and kernel performance metrics
- Backend Support: CPU simulation and CUDA acceleration via Numba/CuPy
- Performance Tiers: From uvloop (default) to Cython extensions
Installation
# Basic installation (CPU only)
pip install pydotcompute
# With CUDA support
pip install pydotcompute[cuda]
# With performance optimizations (uvloop - Linux/macOS)
pip install pydotcompute[fast]
# With Cython extensions (maximum performance)
pip install pydotcompute[cython]
python setup_cython.py build_ext --inplace
# Development installation
pip install -e ".[dev]"
Quick Start
import asyncio
from pydotcompute import RingKernelRuntime, ring_kernel, message
# Define message types
@message
class ComputeRequest:
values: list[float]
@message
class ComputeResponse:
result: float
# Define a ring kernel actor
@ring_kernel(
kernel_id="compute",
input_type=ComputeRequest,
output_type=ComputeResponse,
)
async def compute_actor(ctx):
while not ctx.should_terminate:
msg = await ctx.receive()
result = sum(msg.values)
await ctx.send(ComputeResponse(result=result))
# Use the runtime (automatically uses uvloop for best performance)
async def main():
async with RingKernelRuntime() as runtime:
await runtime.launch("compute")
await runtime.activate("compute")
await runtime.send("compute", ComputeRequest(values=[1.0, 2.0, 3.0]))
response = await runtime.receive("compute")
print(f"Result: {response.result}") # 6.0
asyncio.run(main())
Performance Tiers
PyDotCompute offers three performance tiers to match your use case:
| Tier | Implementation | Latency (p50) | Use Case |
|---|---|---|---|
| 1 (Default) | uvloop + FastMessageQueue | 21μs | Async Python code |
| 2 | ThreadedRingKernel | ~100μs | Blocking I/O, C extensions |
| 3 | CythonRingKernel | 0.33μs queue ops | Multi-process IPC |
Tier 1: Async (Default)
Automatically enabled when you import pydotcompute. Uses uvloop on Linux/macOS.
async with RingKernelRuntime() as runtime:
# uvloop is auto-installed for 21μs latency
await runtime.launch("my_kernel")
await runtime.activate("my_kernel")
Tier 2: Threaded
For blocking operations or GIL-releasing C extensions:
from pydotcompute.ring_kernels import ThreadedRingKernel, ThreadedKernelContext
def blocking_kernel(ctx: ThreadedKernelContext):
while not ctx.should_terminate:
msg = ctx.receive(timeout=0.1)
if msg:
ctx.send(process(msg))
with ThreadedRingKernel("worker", blocking_kernel) as kernel:
kernel.send(request)
response = kernel.receive()
Tier 3: Cython (Maximum Performance)
For multi-process scenarios or Cython extensions:
from pydotcompute.ring_kernels import CythonRingKernel, is_cython_kernel_available
if is_cython_kernel_available():
# 0.33μs queue operations
with CythonRingKernel("fast_worker", my_kernel) as kernel:
kernel.send(request)
Benchmarks
Message Latency
GPU Actors (1000 samples):
p50: 63μs
p95: 103μs
p99: 131μs
mean: 70μs
Graph Processing (PageRank)
| Graph Size | CPU Sparse | GPU Batch | Speedup |
|---|---|---|---|
| 1K nodes | 6.8ms | 64ms | CPU wins |
| 5K nodes (dense) | 256ms | 200ms | GPU 1.28x |
| 1M nodes | 39.6s | 4.25s | GPU 9.3x |
Crossover: GPU wins at 50K+ nodes
Streaming Throughput
| Scenario | GPU Actors | Advantage |
|---|---|---|
| Persistent state | Yes | No repeated GPU transfers |
| Transfer overhead | 0% | vs 16-28% for batch |
| Best for | Long-running pipelines | Context preservation |
Architecture
PyDotCompute Ring Kernel System
├── Ring Kernels │ Performance Tiers │ CUDA Backend
│ • RingKernelRuntime │ • uvloop (21μs) │ • Numba JIT
│ • FastMessageQueue │ • ThreadedRingKernel │ • CuPy arrays
│ • @ring_kernel │ • CythonRingKernel │ • Zero-copy DMA
│ • @message │ • FastSPSCQueue │ • PTX caching
├─────────────────────────┴────────────────────────┴─────────────────
│ Memory: UnifiedBuffer, MemoryPool, Accelerator
Core Components
UnifiedBuffer
Transparent host-device memory management:
from pydotcompute import UnifiedBuffer
import numpy as np
buffer = UnifiedBuffer((1000,), dtype=np.float32)
buffer.allocate()
# Write on host
buffer.host[:] = np.random.randn(1000)
buffer.mark_host_dirty()
# Access on device (auto-syncs)
await buffer.ensure_on_device()
device_data = buffer.device
Ring Kernels
Persistent actors with message queues:
@ring_kernel(kernel_id="processor", queue_size=4096)
async def processor(ctx):
while not ctx.should_terminate:
msg = await ctx.receive(timeout=0.1)
# Process message
await ctx.send(response)
Lifecycle Management
async with RingKernelRuntime() as runtime:
# Phase 1: Launch (allocate resources)
await runtime.launch("my_kernel")
# Phase 2: Activate (start processing)
await runtime.activate("my_kernel")
# Use the kernel...
# Deactivate (pause) or Terminate (cleanup)
await runtime.deactivate("my_kernel")
await runtime.reactivate("my_kernel")
Project Structure
pydotcompute/
├── core/
│ ├── accelerator.py # GPU device abstraction
│ ├── unified_buffer.py # Host-device memory
│ ├── memory_pool.py # Memory pooling
│ └── orchestrator.py # Compute coordination
├── ring_kernels/
│ ├── runtime.py # Main runtime (uvloop)
│ ├── message.py # Message serialization
│ ├── queue.py # Async message queues
│ ├── fast_queue.py # O(1) priority queue
│ ├── lifecycle.py # Kernel lifecycle
│ ├── telemetry.py # Performance monitoring
│ ├── _loop.py # uvloop auto-install
│ ├── sync_queue.py # Threading queues
│ ├── threaded_kernel.py # Tier 2 kernel
│ ├── cython_kernel.py # Tier 3 kernel
│ └── _cython/ # Cython extensions
│ └── fast_spsc.pyx # 0.33μs queue
├── backends/
│ ├── cpu.py # CPU simulation
│ └── cuda.py # CUDA via Numba/CuPy
├── compilation/
│ ├── compiler.py # Kernel compilation
│ └── cache.py # PTX caching
└── decorators/
├── kernel.py # @kernel decorator
├── ring_kernel.py # @ring_kernel decorator
└── validators.py # Runtime validation
Testing
# Run all tests (398 passing)
pytest
# Run with coverage
pytest --cov=pydotcompute
# Run only unit tests
pytest tests/unit/
# Skip CUDA tests (if no GPU)
pytest -m "not cuda"
# Run benchmarks
python benchmarks/extended_benchmark.py
python benchmarks/pagerank_benchmark.py
python benchmarks/realtime_anomaly_benchmark.py
Requirements
- Python >= 3.11
- numpy >= 1.26.0
- msgpack >= 1.0.0
Optional Dependencies
| Package | Purpose |
|---|---|
| uvloop | 20-40% faster event loop (Linux/macOS) |
| cupy-cuda12x | CUDA array operations |
| numba | GPU kernel JIT compilation |
| pynvml | GPU monitoring |
| cython | Maximum performance queues |
Disabling uvloop
If you need to disable uvloop auto-installation:
PYDOTCOMPUTE_NO_UVLOOP=1 python my_script.py
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines and docs/IMPLEMENTATION_PLAN.md for the project roadmap.
License
Apache License 2.0 - see LICENSE file for details.
Related
- DotCompute - Original .NET implementation
- Numba CUDA - Python CUDA JIT
- CuPy - NumPy-compatible GPU arrays
- uvloop - Fast asyncio event loop
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pydotcompute-0.1.0.tar.gz.
File metadata
- Download URL: pydotcompute-0.1.0.tar.gz
- Upload date:
- Size: 289.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
040f9b8980b8085042146de62f5d31ab64a5d41438ff32d26bae4f23a8d9150b
|
|
| MD5 |
1f290ebfa9002f6c782603e6f51844ac
|
|
| BLAKE2b-256 |
11741d356f5458fe8b9c218338d74253d55601b3c5d5bb3e04cdb2b3aab22bf4
|
File details
Details for the file pydotcompute-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pydotcompute-0.1.0-py3-none-any.whl
- Upload date:
- Size: 167.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
909de25b7e5b8e0da51c8938f93e28c06c65f7ddb2fbdc15485cb727884b624b
|
|
| MD5 |
118709990a0119c40e8be39883f5afe1
|
|
| BLAKE2b-256 |
bf03261a4f600379c560b1deb7e64a657691ca06e41c75d8a90e645db46ec167
|