PyDotCompute

A Python port of DotCompute's Ring Kernel System - a GPU-native actor model with persistent kernels and message passing.

Overview

PyDotCompute brings GPU-native actor model capabilities to Python, enabling developers to create persistent GPU kernels that communicate through message queues. This approach is ideal for:

Real-time GPU compute pipelines
Streaming data processing on GPU
Actor-based GPU programming
High-throughput message-driven architectures

Performance Highlights

Metric	Value
Message latency (p50, Tier 1 uvloop)	21μs
Message latency (p99, GPU actors benchmark)	131μs
GPU graph processing	1.7M edges/sec
Actor throughput	76K msg/sec
Cython queue ops	0.33μs

Benchmarked with uvloop on Linux. See Benchmarks for details.

Features

Ring Kernel System: Persistent GPU kernels with infinite loops and message queues
High Performance: uvloop auto-installation for 21μs message latency
Message Passing: Type-safe message serialization with msgpack
Unified Memory: Transparent host-device memory management with lazy synchronization
Lifecycle Management: Two-phase launch (launch -> activate) with graceful shutdown
Telemetry: Real-time GPU monitoring and kernel performance metrics
Backend Support: CPU simulation and CUDA acceleration via Numba/CuPy
Performance Tiers: From uvloop (default) to Cython extensions

Installation

# Basic installation (CPU only)
pip install pydotcompute

# With CUDA support (quotes keep shells such as zsh from expanding the brackets)
pip install "pydotcompute[cuda]"

# With performance optimizations (uvloop - Linux/macOS)
pip install "pydotcompute[fast]"

# With Cython extensions (maximum performance)
pip install "pydotcompute[cython]"
python setup_cython.py build_ext --inplace

# Development installation
pip install -e ".[dev]"

Quick Start

import asyncio
from pydotcompute import RingKernelRuntime, ring_kernel, message

# Define message types
@message
class ComputeRequest:
    values: list[float]

@message
class ComputeResponse:
    result: float

# Define a ring kernel actor
@ring_kernel(
    kernel_id="compute",
    input_type=ComputeRequest,
    output_type=ComputeResponse,
)
async def compute_actor(ctx):
    while not ctx.should_terminate:
        msg = await ctx.receive()
        result = sum(msg.values)
        await ctx.send(ComputeResponse(result=result))

# Use the runtime (automatically uses uvloop for best performance)
async def main():
    async with RingKernelRuntime() as runtime:
        await runtime.launch("compute")
        await runtime.activate("compute")

        await runtime.send("compute", ComputeRequest(values=[1.0, 2.0, 3.0]))
        response = await runtime.receive("compute")

        print(f"Result: {response.result}")  # 6.0

asyncio.run(main())

Performance Tiers

PyDotCompute offers three performance tiers to match your use case:

Tier	Implementation	Latency (p50)	Use Case
1 (Default)	uvloop + FastMessageQueue	21μs	Async Python code
2	ThreadedRingKernel	~100μs	Blocking I/O, C extensions
3	CythonRingKernel	0.33μs queue ops	Multi-process IPC

Tier 1: Async (Default)

Automatically enabled when you import pydotcompute. Uses uvloop on Linux/macOS when it is installed (for example via the fast extra).

async with RingKernelRuntime() as runtime:
    # uvloop is auto-installed for 21μs latency
    await runtime.launch("my_kernel")
    await runtime.activate("my_kernel")

Tier 2: Threaded

For blocking operations or GIL-releasing C extensions:

from pydotcompute.ring_kernels import ThreadedRingKernel, ThreadedKernelContext

# process() and request are placeholders for your own handler and message
def blocking_kernel(ctx: ThreadedKernelContext):
    while not ctx.should_terminate:
        msg = ctx.receive(timeout=0.1)
        if msg is not None:  # assumes a timed-out receive returns None
            ctx.send(process(msg))

with ThreadedRingKernel("worker", blocking_kernel) as kernel:
    kernel.send(request)
    response = kernel.receive()
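
The blocking receive/send loop above can be approximated with only the standard library. The sketch below is not PyDotCompute's implementation: run_worker, double, and the queue names are hypothetical, and it assumes that a receive which times out simply produces nothing.

```python
import queue
import threading


def run_worker(inbox, outbox, stop, handler):
    """Poll the inbox until asked to stop, sending one reply per message."""
    while not stop.is_set():
        try:
            # A short timeout keeps the loop responsive to the stop flag.
            msg = inbox.get(timeout=0.05)
        except queue.Empty:
            continue  # nothing arrived; re-check the stop flag
        outbox.put(handler(msg))


def double(x):
    """Stand-in for real processing (hypothetical)."""
    return 2 * x


inbox, outbox = queue.Queue(), queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=run_worker, args=(inbox, outbox, stop, double))
worker.start()

inbox.put(21)
reply = outbox.get(timeout=1.0)  # blocks until the worker answers

stop.set()     # request shutdown
worker.join()  # wait for the loop to exit
print(reply)
```

The timeout on the inbox is what makes graceful shutdown possible: without it, a worker blocked on an empty queue would never notice the stop flag.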

Tier 3: Cython (Maximum Performance)

For multi-process scenarios or Cython extensions:

from pydotcompute.ring_kernels import CythonRingKernel, is_cython_kernel_available

if is_cython_kernel_available():
    # 0.33μs queue operations
    with CythonRingKernel("fast_worker", my_kernel) as kernel:
        kernel.send(request)
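
To illustrate the data structure behind a single-producer/single-consumer queue such as the FastSPSCQueue named in the architecture, here is a minimal pure-Python ring buffer. It is only a sketch of the general technique, not the Cython extension: the SpscRing class and its method names are hypothetical, and it makes no claim about the 0.33μs figure.

```python
class SpscRing:
    """Fixed-capacity single-producer/single-consumer ring buffer (sketch)."""

    def __init__(self, capacity):
        # One slot stays unused so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def try_push(self, item):
        """Store item and return True, or return False if the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = nxt
        return True

    def try_pop(self):
        """Return the oldest item, or None if the ring is empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None  # drop the reference
        self._head = (self._head + 1) % len(self._slots)
        return item


ring = SpscRing(capacity=2)
pushed = [ring.try_push(x) for x in ("a", "b", "c")]  # third push hits a full ring
popped = [ring.try_pop() for _ in range(3)]           # third pop hits an empty ring
```

Because each index is written by exactly one side, this layout is the usual starting point for lock-free queues in lower-level languages. In this sketch None doubles as the "empty" signal, so None itself cannot be sent as a message.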

Benchmarks

Message Latency

GPU Actors (1000 samples):
  p50:  63μs
  p95:  103μs
  p99:  131μs
  mean: 70μs
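
For readers who want to produce the same kind of summary from their own measurements, the sketch below computes p50, p95, p99, and the mean from a list of latency samples using the standard library. The summarize helper and the synthetic samples are hypothetical; the actual benchmark scripts may compute percentiles differently.

```python
import statistics


def summarize(samples_us):
    """Return p50/p95/p99 and the mean for latency samples in microseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_us, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(samples_us),
    }


# Synthetic data for illustration only: 1.0, 2.0, ..., 101.0 microseconds.
samples = [float(v) for v in range(1, 102)]
summary = summarize(samples)
print(summary)
```

Reporting p99 alongside p50 matters for message-driven systems because the tail, not the median, usually determines whether a real-time deadline is met.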

Graph Processing (PageRank)

Graph Size	CPU Sparse	GPU Batch	Speedup
1K nodes	6.8ms	64ms	CPU wins
5K nodes (dense)	256ms	200ms	GPU 1.28x
1M nodes	39.6s	4.25s	GPU 9.3x

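Crossover: GPU wins at 50K+ nodes; the dense 5K-node graph is the exception in the table above, where the GPU is already 1.28x faster.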
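
The Speedup column is the CPU time divided by the GPU time. The short sketch below recomputes it from the timings in the table; the timings_ms dictionary and speedup helper are names introduced here for illustration.

```python
# (CPU Sparse, GPU Batch) timings from the table, converted to milliseconds.
timings_ms = {
    "1K nodes": (6.8, 64.0),
    "5K nodes (dense)": (256.0, 200.0),
    "1M nodes": (39600.0, 4250.0),
}


def speedup(cpu_ms, gpu_ms):
    """Values above 1 mean the GPU is faster; below 1 mean the CPU is faster."""
    return cpu_ms / gpu_ms


speedups = {name: round(speedup(cpu, gpu), 2) for name, (cpu, gpu) in timings_ms.items()}
for name, value in speedups.items():
    winner = "GPU" if value > 1 else "CPU"
    print(f"{name}: {value}x ({winner} wins)")
```

At 1K nodes the ratio is well below 1, which is why the table reports that row as a CPU win rather than a speedup.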

Streaming Throughput

Aspect	GPU Actors	Notes
Persistent state	Yes	No repeated GPU transfers
Transfer overhead	0%	vs 16-28% for batch
Best for	Long-running pipelines	Context preservation

Architecture

PyDotCompute Ring Kernel System
├── Ring Kernels          │ Performance Tiers      │ CUDA Backend
│   • RingKernelRuntime   │ • uvloop (21μs)        │ • Numba JIT
│   • FastMessageQueue    │ • ThreadedRingKernel   │ • CuPy arrays
│   • @ring_kernel        │ • CythonRingKernel     │ • Zero-copy DMA
│   • @message            │ • FastSPSCQueue        │ • PTX caching
├─────────────────────────┴────────────────────────┴─────────────────
│ Memory: UnifiedBuffer, MemoryPool, Accelerator

Core Components

UnifiedBuffer

Transparent host-device memory management:

from pydotcompute import UnifiedBuffer
import numpy as np

buffer = UnifiedBuffer((1000,), dtype=np.float32)
buffer.allocate()

# Write on host
buffer.host[:] = np.random.randn(1000)
buffer.mark_host_dirty()

# Access on device (auto-syncs); run this inside an async function
await buffer.ensure_on_device()
device_data = buffer.device
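
The idea of lazy synchronization with a dirty flag can be shown without a GPU. In the sketch below the "device" is just a second bytearray, and LazyMirror is a hypothetical class written for this illustration; it mirrors the mark_host_dirty / ensure_on_device naming above but is not UnifiedBuffer's implementation.

```python
class LazyMirror:
    """Host/device pair that copies host data only when it has changed."""

    def __init__(self, size):
        self.host = bytearray(size)
        self._device = bytearray(size)  # stand-in for device memory
        self._host_dirty = False
        self.copies = 0  # number of simulated host-to-device transfers

    def mark_host_dirty(self):
        """Record that the host copy was modified since the last sync."""
        self._host_dirty = True

    def ensure_on_device(self):
        """Copy host data to the device only if the host copy is dirty."""
        if self._host_dirty:
            self._device[:] = self.host
            self._host_dirty = False
            self.copies += 1
        return self._device


buf = LazyMirror(4)
buf.host[:] = b"\x01\x02\x03\x04"
buf.mark_host_dirty()

first = bytes(buf.ensure_on_device())   # triggers one copy
second = bytes(buf.ensure_on_device())  # host unchanged, so no copy
print(buf.copies)
```

Skipping the transfer when nothing changed is the point of the design: repeated device accesses cost one copy per host modification rather than one copy per access.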

Ring Kernels

Persistent actors with message queues:

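@ring_kernel(kernel_id="processor", queue_size=4096)
async def processor(ctx):
    while not ctx.should_terminate:
        msg = await ctx.receive(timeout=0.1)
        if msg is None:  # assumption: a timed-out receive returns None
            continue
        response = handle(msg)  # handle() is a placeholder for your processing
        await ctx.send(response)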
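
The same receive-with-timeout loop can be tried on plain asyncio, with no GPU or PyDotCompute install. This is a CPU-only analogue under stated assumptions: the actor and demo functions, the queues, and the stop event are all hypothetical stand-ins for the kernel context.

```python
import asyncio


async def actor(inbox, outbox, stop):
    """Sum each incoming list of numbers until asked to stop."""
    while not stop.is_set():
        try:
            # The timeout lets the loop re-check the stop flag periodically.
            msg = await asyncio.wait_for(inbox.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue
        await outbox.put(sum(msg))


async def demo():
    inbox, outbox, stop = asyncio.Queue(), asyncio.Queue(), asyncio.Event()
    task = asyncio.create_task(actor(inbox, outbox, stop))

    await inbox.put([1.0, 2.0, 3.0])
    result = await asyncio.wait_for(outbox.get(), timeout=1.0)

    stop.set()  # request shutdown
    await task  # wait for the actor loop to exit
    return result


result = asyncio.run(demo())
print(result)
```

The actor keeps running between messages, which is the property the ring kernel model relies on: state lives inside the loop instead of being rebuilt for every call.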

Lifecycle Management

async with RingKernelRuntime() as runtime:
    # Phase 1: Launch (allocate resources)
    await runtime.launch("my_kernel")

    # Phase 2: Activate (start processing)
    await runtime.activate("my_kernel")

    # Use the kernel...

    # Deactivate (pause) or Terminate (cleanup)
    await runtime.deactivate("my_kernel")
    await runtime.reactivate("my_kernel")
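
One way to reason about the two-phase lifecycle is as a small state machine. The sketch below encodes the launch, activate, deactivate, and reactivate calls shown above as transitions. The state names, the terminate action, and the ALLOWED table are assumptions made for illustration; the runtime's real states and rules may differ.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()
    LAUNCHED = auto()    # resources allocated, not yet processing
    ACTIVE = auto()      # processing messages
    PAUSED = auto()      # deactivated, can be reactivated
    TERMINATED = auto()


# (action, current state) -> next state; anything else is rejected.
ALLOWED = {
    ("launch", State.CREATED): State.LAUNCHED,
    ("activate", State.LAUNCHED): State.ACTIVE,
    ("deactivate", State.ACTIVE): State.PAUSED,
    ("reactivate", State.PAUSED): State.ACTIVE,
    ("terminate", State.LAUNCHED): State.TERMINATED,
    ("terminate", State.ACTIVE): State.TERMINATED,
    ("terminate", State.PAUSED): State.TERMINATED,
}


def step(state, action):
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return ALLOWED[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} from {state.name}") from None


final = State.CREATED
for action in ("launch", "activate", "deactivate", "reactivate"):
    final = step(final, action)
print(final.name)
```

Separating launch from activate means resource allocation failures surface before any message is processed, and a paused kernel can resume without reallocating.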

Project Structure

pydotcompute/
├── core/
│   ├── accelerator.py      # GPU device abstraction
│   ├── unified_buffer.py   # Host-device memory
│   ├── memory_pool.py      # Memory pooling
│   └── orchestrator.py     # Compute coordination
├── ring_kernels/
│   ├── runtime.py          # Main runtime (uvloop)
│   ├── message.py          # Message serialization
│   ├── queue.py            # Async message queues
│   ├── fast_queue.py       # O(1) priority queue
│   ├── lifecycle.py        # Kernel lifecycle
│   ├── telemetry.py        # Performance monitoring
│   ├── _loop.py            # uvloop auto-install
│   ├── sync_queue.py       # Threading queues
│   ├── threaded_kernel.py  # Tier 2 kernel
│   ├── cython_kernel.py    # Tier 3 kernel
│   └── _cython/            # Cython extensions
│       └── fast_spsc.pyx   # 0.33μs queue
├── backends/
│   ├── cpu.py              # CPU simulation
│   └── cuda.py             # CUDA via Numba/CuPy
├── compilation/
│   ├── compiler.py         # Kernel compilation
│   └── cache.py            # PTX caching
└── decorators/
    ├── kernel.py           # @kernel decorator
    ├── ring_kernel.py      # @ring_kernel decorator
    └── validators.py       # Runtime validation

Testing

# Run all tests (398 passing)
pytest

# Run with coverage
pytest --cov=pydotcompute

# Run only unit tests
pytest tests/unit/

# Skip CUDA tests (if no GPU)
pytest -m "not cuda"

# Run benchmarks
python benchmarks/extended_benchmark.py
python benchmarks/pagerank_benchmark.py
python benchmarks/realtime_anomaly_benchmark.py

Requirements

Python >= 3.11
numpy >= 1.26.0
msgpack >= 1.0.0

Optional Dependencies

Package	Purpose
uvloop	20-40% faster event loop (Linux/macOS)
cupy-cuda12x	CUDA array operations
numba	GPU kernel JIT compilation
pynvml	GPU monitoring
cython	Maximum performance queues

Disabling uvloop

If you need to disable uvloop auto-installation:

PYDOTCOMPUTE_NO_UVLOOP=1 python my_script.py
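
An environment-variable opt-out like this is typically checked once at import time. The sketch below shows one way such a check could look. It is a guess at the pattern, not the contents of _loop.py: maybe_install_uvloop is a hypothetical helper, and treating only the value "1" as an opt-out is an assumption.

```python
import os


def maybe_install_uvloop(environ=os.environ):
    """Install uvloop as the event loop policy unless opted out or unavailable.

    Returns True if uvloop was installed, False otherwise.
    """
    if environ.get("PYDOTCOMPUTE_NO_UVLOOP") == "1":
        return False  # user opted out
    try:
        import uvloop
    except ImportError:
        return False  # uvloop is optional; keep the default asyncio loop
    uvloop.install()
    return True


# With the opt-out set, nothing is installed regardless of platform.
opted_out = maybe_install_uvloop({"PYDOTCOMPUTE_NO_UVLOOP": "1"})
print(opted_out)
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.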

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines and docs/IMPLEMENTATION_PLAN.md for the project roadmap.

License

Apache License 2.0 - see LICENSE file for details.

Related

DotCompute - Original .NET implementation
Numba CUDA - Python CUDA JIT
CuPy - NumPy-compatible GPU arrays
uvloop - Fast asyncio event loop

Release history

Version	Released
0.2.0	Nov 28, 2025
0.1.0 (this version)	Nov 25, 2025

Download files

Source Distribution

pydotcompute-0.1.0.tar.gz (289.2 kB)

Upload date: Nov 25, 2025
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

Algorithm	Hash digest
SHA256	`040f9b8980b8085042146de62f5d31ab64a5d41438ff32d26bae4f23a8d9150b`
MD5	`1f290ebfa9002f6c782603e6f51844ac`
BLAKE2b-256	`11741d356f5458fe8b9c218338d74253d55601b3c5d5bb3e04cdb2b3aab22bf4`

Built Distribution

pydotcompute-0.1.0-py3-none-any.whl (167.1 kB)

Upload date: Nov 25, 2025
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

Algorithm	Hash digest
SHA256	`909de25b7e5b8e0da51c8938f93e28c06c65f7ddb2fbdc15485cb727884b624b`
MD5	`118709990a0119c40e8be39883f5afe1`
BLAKE2b-256	`bf03261a4f600379c560b1deb7e64a657691ca06e41c75d8a90e645db46ec167`
