High-performance Python acceleration engine — CPU, threads, virtual threads, multi-GPU, NPU, ARM/Android/Termux, IoT/SBC and virtualization.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

GuilhermeP96

These details have not been verified by PyPI

Project description

PyAccelerate

High-performance Python acceleration engine — CPU, threads, virtual threads, multi-GPU, NPU, ARM/Android/Termux, IoT/SBC, auto-tuning, Prometheus metrics, HTTP/gRPC server, Kubernetes auto-scaling, and maximum optimization mode.

Features

Module	Description
`cpu`	CPU detection, topology, NUMA, affinity, ISA flags, ARM big.LITTLE/DynamIQ, dynamic worker recommendations
`threads`	Persistent virtual-thread pool, sliding-window executor, async bridge, process pool
`work_stealing`	Work-stealing scheduler (Tokio / Go runtime / ForkJoinPool style) — Chase-Lev deques, random victim selection
`lockfree_queue`	Lock-free MPMC queue & per-worker Chase-Lev deques for minimal contention
`adaptive`	Adaptive scheduler — auto-scales workers based on latency, CPU pressure & memory pressure
`_native`	Optional Cython / Rust (PyO3) accelerators for hot-path data structures
`gpu`	Multi-vendor GPU detection (NVIDIA/CUDA, AMD/OpenCL, Intel oneAPI, ARM Adreno/Mali/Immortalis), architecture classification (Kepler→Blackwell, GCN→RDNA4/CDNA3), CUDA/Tensor/RT core counts, NVENC/NVDEC & VCN encode/decode, clocks, memory type & bandwidth, PCIe info, driver version, shared VRAM for all vendors, Vulkan probing, ranking, multi-GPU dispatch
`npu`	NPU detection & inference (OpenVINO, ONNX Runtime, DirectML, CoreML, ARM Hexagon/Samsung NPU/Tensor TPU/MediaTek APU)
`virt`	Virtualization detection (Hyper-V, VT-x/AMD-V, KVM, WSL2, Docker, container detection)
`memory`	Memory pressure monitoring, automatic worker clamping, reusable buffer pool, GPU memory stats (dedicated + shared VRAM, CUDA/Tensor/RT cores, bandwidth, features)
`profiler`	`@timed`, `@profile_memory` decorators, `Timer` context manager, `Tracker` statistics
`benchmark`	Built-in micro-benchmarks (CPU, threads, memory bandwidth, GPU compute)
`priority`	OS-level task priority (IDLE → REALTIME) & energy profiles (POWER_SAVER → ULTRA_PERFORMANCE)
`max_mode`	Maximum optimization mode — activates ALL resources simultaneously with OS tuning
`android`	Android/Termux platform detection, ARM SoC database (25+ chipsets), big.LITTLE, thermal & battery
`iot`	IoT / SBC detection (Raspberry Pi, Jetson, BeagleBone, Coral, Hailo, 30+ SoCs)
`autotune`	Auto-tuning feedback loop — benchmark → config → re-tune, persistent profiles, hardware-safe limits & config clamping, GPU architecture/features/driver profiling
`metrics`	Prometheus metrics exporter (`/metrics` HTTP endpoint, all subsystems)
`server`	JSON HTTP & gRPC server for multi-language integration (Node.js, Go, Java, etc.)
`k8s`	Kubernetes operator — pod info, GPU node capacity, auto-scaling, manifest generation
`engine`	Unified orchestrator — auto-detects everything and provides a single API

Quick Start

pip install pyaccelerate

from pyaccelerate import Engine

engine = Engine()
print(engine.summary())

# Submit I/O-bound tasks to the virtual thread pool
future = engine.submit(my_io_func, arg1, arg2)

# Run many tasks with auto-tuned concurrency
engine.run_parallel(process_file, [(f,) for f in files])

# GPU dispatch (auto-fallback to CPU)
results = engine.gpu_dispatch(my_kernel, data_chunks)

Benchmark

"Ok... but how much faster is it?" — Here are real numbers.

All benchmarks run on Python 3.12 / Windows with python -m benchmarks.run. IO/CPU/Mixed on 48-core Xeon; NPU on Intel Core Ultra 7 265U (Arrow Lake, 13 TOPS NPU). Reproduce: pip install pyaccelerate && python -m benchmarks.run

IO-bound — 200 simulated HTTP calls (20ms each)

Runner	Time	Speedup	Tasks/sec
Sequential	4.079s	1.0×	49
`ThreadPoolExecutor`	0.118s	34.7×	1,701
`pyaccelerate.engine`	0.193s	21.2×	1,038
`pyaccelerate.threads`	0.150s	27.3×	1,336
`pyaccelerate.ws`	0.240s	17.0×	832

IO-bound — 200 variable-latency calls (5–80ms, realistic)

Runner	Time	Speedup	Tasks/sec
Sequential	8.350s	1.0×	24
`ThreadPoolExecutor`	0.245s	34.1×	817
`pyaccelerate.threads`	0.329s	25.4×	608
`pyaccelerate.ws`	0.369s	22.6×	542
`pyaccelerate.adaptive`	0.656s	12.7×	305

CPU-bound — zlib compress 200KB × 100 (GIL released by C extension)

Runner	Time	Speedup	Tasks/sec
Sequential	0.153s	1.0×	654
`ThreadPoolExecutor`	0.049s	3.1×	2,026
`pyaccelerate.threads`	0.049s	3.1×	2,033
`pyaccelerate.ws`	0.079s	1.9×	1,264

Mixed — IO (10ms) + CPU (SHA-256 × 400) × 200

Runner	Time	Speedup	Tasks/sec
Sequential	2.216s	1.0×	90
`ThreadPoolExecutor`	0.197s	11.3×	1,017
`pyaccelerate.threads`	0.162s	13.7×	1,233
`pyaccelerate.engine`	0.206s	10.8×	972
`pyaccelerate.adaptive`	0.634s	3.5×	315

Key takeaway: pyaccelerate.threads beats ThreadPoolExecutor by 21% on mixed workloads. The work-stealing scheduler (ws) excels at variable-latency IO where load balancing matters most. Run your own benchmarks: python -m benchmarks.run (full) or python -m benchmarks.run --quick (CI).

NPU Inference — ONNX model, NPU (OpenVINO) vs CPU

Tested on Intel Core Ultra 7 265U (Arrow Lake) with Intel AI Boost NPU (13 TOPS) via OpenVINO. Models: multi-layer MatMul→ReLU MLP (1024→1024→512), single-stream inference.

Model	Runner	Avg latency	Speedup	Infer/sec
MLP-8L (shallow)	CPU (ONNX Runtime)	0.75 ms	1.0× (baseline)	1,341
MLP-8L (shallow)	NPU (OpenVINO)	1.66 ms	0.5×	603
MLP-16L	CPU (ONNX Runtime)	1.59 ms	1.0× (baseline)	630
MLP-16L	NPU (OpenVINO)	1.98 ms	0.8×	505
MLP-24L (deep)	CPU (ONNX Runtime)	16.24 ms	1.0× (baseline)	61
MLP-24L (deep)	NPU (OpenVINO)	4.69 ms	3.5×	213
MLP-32L (very deep)	CPU (ONNX Runtime)	8.70 ms	1.0× (baseline)	115
MLP-32L (very deep)	NPU (OpenVINO)	4.74 ms	1.8×	211

Key takeaway: The NPU excels at deep model inference — the deeper the model, the greater the NPU advantage. Shallow models (8–16 layers) see CPU win due to NPU dispatch overhead, but at 24+ layers the NPU delivers up to 3.5× speedup while freeing the CPU for other work. Run your own: python -m benchmarks.run --npu

Maximum Optimization Mode

Activates all available hardware resources in parallel with OS-level tuning:

from pyaccelerate.max_mode import MaxMode

with MaxMode() as m:
    print(m.summary())  # hardware manifest

    # Run CPU + I/O simultaneously
    results = m.run_all(
        cpu_fn=cpu_heavy_task, cpu_items=cpu_data,
        io_fn=io_heavy_task, io_items=io_data,
    )

    # I/O only (thread pool)
    downloaded = m.run_io(download, [(url,) for url in urls])

    # CPU only (process pool)
    computed = m.run_cpu(crunch, [(n,) for n in numbers])

    # Multi-stage pipeline
    results = m.run_pipeline([
        ("download", download_fn, urls),
        ("transform", transform_fn, data),
        ("save", save_fn, output),
    ])

Or via the Engine:

engine = Engine()
with engine.max_mode() as m:
    results = m.run_all(...)

OS Priority & Energy Management

Control process scheduling and power profiles across Windows, Linux & macOS:

from pyaccelerate.priority import (
    TaskPriority, EnergyProfile,
    set_task_priority, set_energy_profile,
    max_performance, balanced, power_saver,
)

# Quick presets
max_performance()   # HIGH priority + ULTRA_PERFORMANCE energy
balanced()          # Restore defaults
power_saver()       # BELOW_NORMAL + POWER_SAVER

# Fine-grained control
set_task_priority(TaskPriority.ABOVE_NORMAL)
set_energy_profile(EnergyProfile.PERFORMANCE)

CLI

pyaccelerate info          # Full hardware report (+ offers to install missing deps)
pyaccelerate benchmark     # Run micro-benchmarks
pyaccelerate gpu           # GPU details (architecture, cores, VRAM, clocks, features, driver, PCIe)
pyaccelerate cpu           # CPU details
pyaccelerate npu           # NPU details
pyaccelerate android       # ARM/Android device details (SoC, clusters, thermal)
pyaccelerate virt          # Virtualization info
pyaccelerate memory        # Memory stats (system + GPU dedicated/shared)
pyaccelerate limits        # Hardware-safe batch/worker limits
pyaccelerate limits --json # Limits as JSON
pyaccelerate status        # One-liner
pyaccelerate priority      # Show current priority/energy
pyaccelerate priority --preset max     # Apply max performance preset
pyaccelerate priority --set high       # Set task priority
pyaccelerate priority --energy performance  # Set energy profile
pyaccelerate max-mode      # Show max-mode hardware manifest
pyaccelerate tune          # Auto-tune: benchmark → optimise → save
pyaccelerate tune --apply  # Tune and apply to current process
pyaccelerate tune --show   # Show current tune profile
pyaccelerate metrics       # Start Prometheus /metrics server (:9090)
pyaccelerate metrics --once# Print metrics and exit
pyaccelerate serve         # Start HTTP/gRPC API server (:8420)
pyaccelerate k8s           # Kubernetes pod & GPU info
pyaccelerate k8s --manifest# Generate K8s Deployment YAML
pyaccelerate iot           # IoT / SBC board details
pyaccelerate version       # Print version

ARM / Android / Termux Support

Full hardware detection for ARM devices — phones (Termux, Pydroid), tablets, Raspberry Pi, ARM laptops (Snapdragon X Elite), and ARM servers:

from pyaccelerate.android import (
    is_android, is_termux, is_arm,
    get_device_info, get_soc_info,
    detect_big_little, get_arm_features,
    get_thermal_zones, get_battery_info,
)

if is_arm():
    soc = get_soc_info()
    if soc:
        print(f"{soc.name} ({soc.vendor})")   # Snapdragon 8 Gen 3 (Qualcomm)
        print(f"GPU: {soc.gpu_name}")           # Adreno 750
        print(f"NPU: {soc.npu_name} ({soc.npu_tops} TOPS)")  # Hexagon NPU (73.0 TOPS)

    clusters = detect_big_little()
    # {"Cortex-X4": [0], "Cortex-A720": [1,2,3], "Cortex-A520": [4,5,6,7]}

    features = get_arm_features()
    # ["aes", "asimd", "bf16", "crc32", "neon", "sve", "sve2", ...]

Supported SoC families (25+ chipsets in database):

Qualcomm — Snapdragon 8 Elite, 8/7/6 Gen 1-3, 888, 865, X Elite
Samsung — Exynos 2500, 2200, 2100, 1380, 990
Google — Tensor G1–G4
MediaTek — Dimensity 9300, 9200, 9000, 8300, 1200, 1100, 900
HiSilicon — Kirin 9010, 9000
Unisoc — T616

ARM GPU detection — Adreno, Mali, Immortalis, Xclipse, PowerVR, Maleoon (via SoC DB, sysfs, Vulkan, OpenCL)

ARM NPU detection — Hexagon, Samsung NPU, Google TPU, MediaTek APU, Da Vinci NPU (via SoC DB, NNAPI, TFLite)

Modules in Depth

Virtual Thread Pool

Inspired by Java's virtual threads — a persistent ThreadPoolExecutor sized for I/O (cores × 3, cap 32). All I/O-bound work shares this pool instead of creating/destroying threads per operation.

from pyaccelerate.threads import get_pool, run_parallel, submit

# Single task
fut = submit(download_file, url)

# Bounded concurrency (sliding window)
run_parallel(process, [(item,) for item in items], max_concurrent=8)

Work-Stealing Scheduler

High-performance scheduler inspired by Tokio (Rust), Go runtime and Java ForkJoinPool. Each worker owns a local Chase-Lev deque — pops LIFO (cache-friendly), steals FIFO (fair). When idle, workers steal from random victims with exponential back-off parking.

from pyaccelerate.work_stealing import WorkStealingScheduler, ws_submit, ws_map

# Module-level convenience
fut = ws_submit(my_func, arg1, arg2)
results = ws_map(fn, [(a,), (b,), (c,)])

# Full control
with WorkStealingScheduler(num_workers=8, steal_batch_size=4) as sched:
    futures = [sched.submit(process, item) for item in items]
    results = [f.result() for f in futures]
    print(sched.stats())  # completed, stolen, avg_latency_us

Or via the Engine:

engine = Engine()
fut = engine.ws_submit(my_func, arg1)
results = engine.ws_map(fn, [(a,), (b,)])

Lock-Free Task Queue

The lockfree_queue module provides two data structures underlying the work-stealing scheduler:

WorkDeque: Per-worker Chase-Lev deque — owner push/pop lock-free (GIL + collections.deque), stealers use a lightweight spinlock.
MPMCQueue: Multi-Producer Multi-Consumer global injection queue with efficient Event-based parking.

from pyaccelerate.lockfree_queue import WorkDeque, MPMCQueue

# Per-worker deque
d = WorkDeque()
d.push(task)
task = d.pop()        # LIFO (owner)
task = d.steal()      # FIFO (other workers)
batch = d.steal_batch(4)

# Global queue
q = MPMCQueue()
q.put(task)
q.put_batch([t1, t2, t3])
task = q.get()
q.wait(timeout=1.0)   # block until items arrive

Adaptive Scheduler

Wraps the work-stealing scheduler and dynamically tunes worker count based on real-time metrics:

Signal	Action
P95 latency > threshold + CPU < 70%	Scale up workers
CPU utilisation > 90%	Scale down workers
Memory pressure HIGH/CRITICAL	Shed workers immediately
P95 latency very low (idle)	Scale down to save resources
CPU load changes	Auto-tune steal batch size

from pyaccelerate.adaptive import AdaptiveScheduler, AdaptiveConfig

cfg = AdaptiveConfig(
    min_workers=2,
    max_workers=16,
    cooldown_seconds=2.0,
)

with AdaptiveScheduler(config=cfg) as sched:
    results = sched.map(process, [(item,) for item in data])
    print(sched.snapshot())  # workers, p95, cpu%, mem_pressure, adjustments

Or via the Engine:

engine = Engine()
with engine.adaptive_scheduler() as sched:
    results = sched.map(heavy_fn, items)

Native Accelerators (Optional)

For maximum throughput, compile the hot-path data structures to native code:

Cython (C extension):

pip install cython
cd src/pyaccelerate/_native
python setup_cython.py build_ext --inplace

Rust (PyO3 + crossbeam-deque — same algorithm as Tokio):

cd bindings/rust/pyaccelerate_native
pip install maturin
maturin develop --release

When a native extension is installed, it's used automatically — no code changes needed. The pure-Python fallback is always available.

Multi-GPU Dispatch

Auto-detects GPUs across CUDA, OpenCL and Intel oneAPI. Distributes workloads with configurable strategies.

from pyaccelerate.gpu import detect_all, dispatch

gpus = detect_all()
results = dispatch(my_kernel, data_chunks, strategy="score-weighted")

Profiling

Zero-config decorators for timing and memory tracking:

from pyaccelerate.profiler import timed, profile_memory, Tracker

@timed(level=logging.INFO)
def heavy_computation():
    ...

tracker = Tracker("db_queries")
for batch in batches:
    with tracker.measure():
        run_query(batch)
print(tracker.summary())

Auto-Tuning Feedback Loop

Benchmark your hardware, persist the optimal configuration, and auto-apply it:

from pyaccelerate.autotune import auto_tune, get_or_tune, apply_profile

# Run a full tune cycle (benchmark → save to ~/.pyaccelerate/)
profile = auto_tune()
print(f"Overall score: {profile.overall_score}/100")
print(f"Optimal IO workers: {profile.optimal_io_workers}")
print(f"Optimal CPU workers: {profile.optimal_cpu_workers}")

# Load existing or re-tune if hardware changed / profile stale
profile = get_or_tune()

# Apply to running process (sets workers, priority, energy)
apply_profile()

Hardware-Safe Configuration

Prevent runaway concurrency on constrained hardware (iGPUs, LLM inference, Ollama, etc.):

from pyaccelerate.autotune import hardware_safe_limits, clamp_config

# Get hardware-derived upper bounds
limits = hardware_safe_limits()
# Discrete GPU:  {"batch_size": 32, "max_workers": N, "max_concurrent": 4}
# Integrated GPU: {"batch_size": 8, "max_workers": 1, "max_concurrent": 2}
# CPU-only:      {"batch_size": 5, "max_workers": 1, "max_concurrent": 2}

# Clamp user/config values to safe maximums
user_config = {"batch_size": 50, "max_workers": 4}
safe = clamp_config(user_config)
# iGPU → {"batch_size": 8, "max_workers": 1}

GPU Memory Stats

Query dedicated + shared VRAM and hardware details for the best GPU:

from pyaccelerate.memory import get_gpu_memory_stats

stats = get_gpu_memory_stats()
# {"gpu_available": 1.0, "gpu_dedicated_gb": 6.0,
#  "gpu_shared_gb": 28.0, "gpu_total_gb": 34.0,
#  "gpu_is_discrete": 1.0, "gpu_vulkan": 1.0,
#  "gpu_cuda_cores": 1408, "gpu_tensor_cores": 0,
#  "gpu_has_hw_encode": 1.0, "gpu_has_hw_decode": 1.0,
#  "gpu_memory_bandwidth_gbps": 336.0, "gpu_boost_clock_mhz": 2130}

GPU Shared VRAM & Vulkan Detection

The GPUDevice dataclass exposes comprehensive GPU hardware details — architecture, cores, features, clocks, memory type, driver, PCIe, and shared VRAM:

from pyaccelerate.gpu import detect_all

for g in detect_all():
    print(f"{g.name}: {g.memory_gb:.1f} GB dedicated")
    print(f"  Architecture:    {g.architecture} | CC {g.cuda_capability}")
    print(f"  Cores:           CUDA={g.cuda_cores} Tensor={g.tensor_cores} RT={g.rt_cores}")
    print(f"  Shared VRAM:     {g.shared_memory_gb:.1f} GB")
    print(f"  Total:           {g.total_memory_gb:.1f} GB")
    print(f"  Memory:          {g.memory_type} | {g.memory_bus_width}-bit | {g.memory_bandwidth_gbps:.0f} GB/s")
    print(f"  Clock:           {g.boost_clock_mhz} MHz boost")
    print(f"  Features:        {', '.join(g.features)}")
    print(f"  HW Encode/Dec:   NVENC={g.has_nvenc} NVDEC={g.has_nvdec}")
    print(f"  PCIe:            Gen{g.pcie_gen} x{g.pcie_width}")
    print(f"  Driver:          {g.driver_version} | CUDA: {g.cuda_driver_version}")
    print(f"  Vulkan:          {g.vulkan_version or 'not detected'}")

Supported GPU Attributes

Field	Type	Description
`architecture`	`str`	GPU architecture (Turing, Ampere, Ada Lovelace, RDNA 3, etc.)
`cuda_capability`	`str`	NVIDIA Compute Capability (e.g. "7.5")
`cuda_cores`	`int`	Total CUDA cores (NVIDIA) or Stream Processors (AMD)
`tensor_cores`	`int`	Tensor core count (RTX / data-center only)
`rt_cores`	`int`	RT core count (RTX only)
`has_tensor`	`bool`	Tensor / mixed-precision acceleration available
`has_raytracing`	`bool`	Hardware ray tracing available
`has_nvenc`	`bool`	Hardware video encoder (NVENC / VCN)
`has_nvdec`	`bool`	Hardware video decoder (NVDEC / VCN)
`clock_mhz`	`int`	Base clock (MHz)
`boost_clock_mhz`	`int`	Boost clock (MHz)
`memory_clock_mhz`	`int`	Memory clock (MHz)
`memory_type`	`str`	Memory type (GDDR5, GDDR6, GDDR6X, HBM2e, HBM3)
`memory_bus_width`	`int`	Memory bus width (bits)
`memory_bandwidth_gbps`	`float`	Effective memory bandwidth (GB/s)
`pcie_gen`	`int`	PCIe generation (3, 4, 5)
`pcie_width`	`int`	PCIe link width (16, 8, etc.)
`driver_version`	`str`	GPU driver version
`cuda_driver_version`	`str`	CUDA runtime version
`power_limit_w`	`int`	TDP / power limit (watts)
`copy_engines`	`int`	Async DMA copy engines
`features`	`list[str]`	Capability flags (compute, tensor, hw_encode, cuda_7.5, turing, etc.)
`shared_memory_bytes`	`int`	Shared system memory (all vendors on Windows)
`vulkan_version`	`str`	Vulkan API version

NVIDIA Architecture Coverage

Kepler (3.0) → Maxwell (5.x) → Pascal (6.x) → Volta (7.0) → Turing (7.5) → Ampere (8.x) → Ada Lovelace (8.9) → Hopper (9.0) → Blackwell (10.x)

GTX 16xx (Turing): CUDA cores + NVENC/NVDEC, no Tensor/RT cores
RTX 20xx (Turing): Full Tensor + RT cores
RTX 30xx (Ampere): Enhanced Tensor + RT cores, 2nd gen
RTX 40xx (Ada Lovelace): 4th gen Tensor, 3rd gen RT, DLSS 3
RTX 50xx (Blackwell): 5th gen Tensor, 4th gen RT
Data center (A100/H100/B100): Tensor cores, no RT cores

AMD Architecture Coverage

GCN 4 (RX 400/500) → GCN 5 (Vega) → RDNA (RX 5000) → RDNA 2 (RX 6000) → RDNA 3 (RX 7000) → RDNA 4 (RX 9000) → CDNA/CDNA 2/CDNA 3 (Instinct MI series)

Prometheus Metrics

Expose CPU/GPU/NPU/memory/pool metrics in Prometheus format:

from pyaccelerate.metrics import start_metrics_server, get_metrics_text

# Start /metrics endpoint on port 9090
start_metrics_server(port=9090)

# Or get text for your own framework
text = get_metrics_text()

pyaccelerate metrics --port 9090     # Start server
pyaccelerate metrics --once          # Print and exit
curl http://localhost:9090/metrics    # Scrape

HTTP / gRPC Server

Multi-language access to all PyAccelerate features:

from pyaccelerate.server import PyAccelerateServer

with PyAccelerateServer(http_port=8420, grpc_port=50051) as srv:
    print(f"HTTP: {srv.http_url}/api/v1")
    # Block until Ctrl+C
    srv.start(block=True)

pyaccelerate serve --http-port 8420 --grpc-port 50051
curl http://localhost:8420/api/v1/info    # JSON
curl http://localhost:8420/api/v1/cpu
curl http://localhost:8420/api/v1/gpu
curl http://localhost:8420/api/v1/metrics  # Prometheus text

Kubernetes Integration

Pod detection, GPU node capacity, auto-scaling recommendations & manifest generation:

from pyaccelerate.k8s import (
    is_kubernetes, get_pod_info,
    get_scaling_recommendation, generate_resource_manifest,
)

if is_kubernetes():
    pod = get_pod_info()
    print(f"Pod: {pod.name} | GPU: {pod.gpu_limit}")

rec = get_scaling_recommendation()
print(f"Replicas: {rec.recommended_replicas} ({rec.reason})")

yaml = generate_resource_manifest(name="ml-worker", gpu_per_replica=1)

pyaccelerate k8s                # Show pod & GPU info
pyaccelerate k8s --manifest     # Generate Deployment YAML
pyaccelerate k8s --json         # Machine-readable

Node.js / npm Client

A zero-dependency Node.js client is included in bindings/nodejs/:

const { PyAccelerate } = require('pyaccelerate');

const client = new PyAccelerate('http://localhost:8420');
const info = await client.getInfo();
const metrics = await client.getMetrics();
const bench = await client.runBenchmark();

Installation Options

# Core (CPU + threads + memory + virt)
pip install pyaccelerate

# With NVIDIA GPU support
pip install pyaccelerate[cuda]

# With OpenCL support (AMD/Intel/NVIDIA)
pip install pyaccelerate[opencl]

# With Intel oneAPI support
pip install pyaccelerate[intel]

# All GPU backends
pip install pyaccelerate[all-gpu]

# gRPC server mode
pip install pyaccelerate[grpc]

# Kubernetes integration
pip install pyaccelerate[k8s]

# Development
pip install pyaccelerate[dev]

Docker

# CPU-only
docker build -t pyaccelerate .
docker run --rm pyaccelerate info

# With NVIDIA GPU
docker build -f Dockerfile.gpu -t pyaccelerate:gpu .
docker run --rm --gpus all pyaccelerate:gpu info

# Docker Compose
docker compose up pyaccelerate    # CPU
docker compose up gpu             # GPU

Development

git clone https://github.com/GuilhermeP96/pyaccelerate.git
cd pyaccelerate
pip install -e ".[dev]"

# Run tests
pytest -v

# Run benchmarks
python -m benchmarks.run            # full suite
python -m benchmarks.run --quick    # CI-friendly (fewer tasks)
python -m benchmarks.run --io       # IO-bound only
python -m benchmarks.run --cpu      # CPU-bound only
python -m benchmarks.run --mixed    # mixed workloads only
python -m benchmarks.run --npu      # NPU inference only

# Lint + format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

# Build wheel
python -m build

Architecture

pyaccelerate/
├── cpu.py          # CPU detection & topology
├── threads.py      # Virtual thread pool & executors
├── gpu/
│   ├── detector.py # Multi-vendor GPU enumeration
│   ├── cuda.py     # CUDA/CuPy helpers
│   ├── opencl.py   # PyOpenCL helpers
│   ├── intel.py    # Intel oneAPI helpers
│   └── dispatch.py # Multi-GPU load balancer
├── npu/
│   ├── detector.py # NPU detection (Intel, Qualcomm, Apple)
│   ├── onnx_rt.py  # ONNX Runtime inference
│   ├── openvino.py # OpenVINO inference
│   └── inference.py# Unified inference API
├── virt.py         # Virtualization detection
├── memory.py       # Memory monitoring & buffer pool
├── profiler.py     # Timing & profiling utilities
├── benchmark.py    # Built-in micro-benchmarks
├── priority.py     # OS task priority & energy profiles
├── max_mode.py     # Maximum optimization mode
├── iot.py          # IoT / SBC hardware detection
├── autotune.py     # Auto-tuning feedback loop
├── metrics.py      # Prometheus metrics exporter
├── server.py       # HTTP + gRPC multi-language API
├── k8s.py          # Kubernetes pod & GPU integration
├── lockfree_queue.py # Lock-free MPMC & Chase-Lev deques
├── work_stealing.py  # Work-stealing scheduler (Tokio/Go/FJP)
├── adaptive.py       # Adaptive pressure-driven scheduler
├── engine.py         # Unified orchestrator
├── cli.py            # Command-line interface
├── _native/          # Optional Cython accelerators
│   ├── _fast_deque.pyx
│   └── setup_cython.py
└── bindings/
    ├── nodejs/       # npm client for Node.js / TypeScript
    └── rust/         # PyO3 Rust native extension
        └── pyaccelerate_native/

Examples

The examples/ directory contains runnable scripts demonstrating all features:

Example	Description
`example_basic.py`	Engine creation, summary, submit, run_parallel, batch
`example_parallel_io.py`	Parallel download/process/write with public UCI ML datasets
`example_cpu_bound.py`	Sequential vs thread pool vs process pool comparison
`example_max_mode.py`	MaxMode context manager, run_all, run_io, run_cpu, pipeline
`example_pipeline.py`	Multi-stage data pipeline (download → analyze → report)
`example_priority.py`	TaskPriority levels, EnergyProfile, presets, benchmarking

cd examples
python example_basic.py
python example_max_mode.py
python example_priority.py

Roadmap

IoT / SBC detection (Raspberry Pi, Jetson, Coral, Hailo)
Auto-tuning feedback loop (benchmark → config → re-tune)
Prometheus metrics exporter
gRPC server mode for multi-language integration
Kubernetes operator for auto-scaling GPU workloads
npm package (Node.js bindings via HTTP API)
Work-stealing scheduler (Tokio / Go / ForkJoinPool style)
Lock-free task queues (Chase-Lev deques, MPMC)
Adaptive scheduler (latency, CPU & memory pressure)
Optional native accelerators (Cython + Rust/PyO3)
Benchmark suite with IO/CPU/mixed workload comparisons
Shared VRAM detection for integrated GPUs (Intel Iris Xe, etc.)
Vulkan version probing per GPU
Hardware-safe configuration limits (hardware_safe_limits, clamp_config)
GPU memory stats API (get_gpu_memory_stats)
NPU detection for Intel Core Ultra (Arrow Lake / Meteor Lake) with TOPS estimation
NPU inference benchmarks (NPU vs CPU, ONNX models)
Interactive dependency installer in pyaccelerate info CLI

Origin

Evolved from the acceleration & virtual-thread systems built for:

adb-toolkit — multi-GPU acceleration, virtual thread pool
python-gpu-statistical-analysis — GPU compute foundations

License

MIT — see LICENSE.

📊 PyPI Download Statistics

Last updated: 2026-03-27 15:46 UTC (refreshed every Monday via GitHub Actions — run manually)

Period	Downloads
Last day	23
Last week	278
Last month	1,361
Total	4,309

By Python version (top 5)

Version	Downloads
Python 3	188

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

GuilhermeP96

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.11.0

Apr 3, 2026

This version

0.10.0

Mar 27, 2026

0.9.0

Mar 25, 2026

0.8.1

Mar 25, 2026

0.8.0

Mar 25, 2026

0.7.1

Mar 18, 2026

0.7.0

Mar 14, 2026

0.6.2

Mar 9, 2026

0.6.1

Feb 28, 2026

0.6.0

Feb 28, 2026

0.5.0

Feb 28, 2026

0.4.0

Feb 28, 2026

0.3.0

Feb 28, 2026

0.2.0

Feb 28, 2026

0.1.0

Feb 28, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyaccelerate-0.10.0.tar.gz (162.1 kB view details)

Uploaded Mar 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyaccelerate-0.10.0-py3-none-any.whl (140.0 kB view details)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file pyaccelerate-0.10.0.tar.gz.

File metadata

Download URL: pyaccelerate-0.10.0.tar.gz
Upload date: Mar 27, 2026
Size: 162.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyaccelerate-0.10.0.tar.gz
Algorithm	Hash digest
SHA256	`8cb64259f0a4c08d58ed4cb9b42acfbf53edbb6184a76eef7170af7a6af02451`
MD5	`030e846d7ac37dd571d39478b5212d4d`
BLAKE2b-256	`d6c008d0939fa1888c1c68735046594a0f13fe27e6c9ced837784377421b18b1`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyaccelerate-0.10.0.tar.gz:

Publisher: publish.yml on GuilhermeP96/pyaccelerate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyaccelerate-0.10.0.tar.gz
- Subject digest: 8cb64259f0a4c08d58ed4cb9b42acfbf53edbb6184a76eef7170af7a6af02451
- Sigstore transparency entry: 1188969915
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: GuilhermeP96/pyaccelerate@b4e872451b939d53d569564e68e9bee32b9a0d2e
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/GuilhermeP96
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b4e872451b939d53d569564e68e9bee32b9a0d2e
- Trigger Event: release

File details

Details for the file pyaccelerate-0.10.0-py3-none-any.whl.

File metadata

Download URL: pyaccelerate-0.10.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 140.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyaccelerate-0.10.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0b7367e40b7957bb9cea20e33c38950c15ccfd38633c00f88c98996822f74ea7`
MD5	`9de338b33d8dee906c062fd42bb808d7`
BLAKE2b-256	`1e1c94c755636260764c545a672050a2588cec20d746b99dc9e9da41573483bb`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyaccelerate-0.10.0-py3-none-any.whl:

Publisher: publish.yml on GuilhermeP96/pyaccelerate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyaccelerate-0.10.0-py3-none-any.whl
- Subject digest: 0b7367e40b7957bb9cea20e33c38950c15ccfd38633c00f88c98996822f74ea7
- Sigstore transparency entry: 1188969916
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: GuilhermeP96/pyaccelerate@b4e872451b939d53d569564e68e9bee32b9a0d2e
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/GuilhermeP96
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b4e872451b939d53d569564e68e9bee32b9a0d2e
- Trigger Event: release

pyaccelerate 0.10.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

PyAccelerate

Features

Quick Start

Benchmark

IO-bound — 200 simulated HTTP calls (20ms each)

IO-bound — 200 variable-latency calls (5–80ms, realistic)

CPU-bound — zlib compress 200KB × 100 (GIL released by C extension)

Mixed — IO (10ms) + CPU (SHA-256 × 400) × 200

NPU Inference — ONNX model, NPU (OpenVINO) vs CPU

Maximum Optimization Mode

OS Priority & Energy Management

CLI

ARM / Android / Termux Support

Modules in Depth

Virtual Thread Pool

Work-Stealing Scheduler

Lock-Free Task Queue

Adaptive Scheduler

Native Accelerators (Optional)

Multi-GPU Dispatch

Profiling

Auto-Tuning Feedback Loop

Hardware-Safe Configuration

GPU Memory Stats

GPU Shared VRAM & Vulkan Detection

Supported GPU Attributes

NVIDIA Architecture Coverage

AMD Architecture Coverage

Prometheus Metrics

HTTP / gRPC Server

Kubernetes Integration

Node.js / npm Client

Installation Options

Docker

Development

Architecture

Examples

Roadmap

Origin

License

📊 PyPI Download Statistics

By Python version (top 5)

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance