PyAccelerate

High-performance Python acceleration engine — CPU, threads, virtual threads, multi-GPU, NPU, ARM/Android/Termux, IoT/SBC, auto-tuning, Prometheus metrics, HTTP/gRPC server, Kubernetes auto-scaling, and maximum optimization mode.

Requires Python 3.10+. License: MIT.


Features

Module          Description
cpu             CPU detection, topology, NUMA, affinity, ISA flags, ARM big.LITTLE/DynamIQ, dynamic worker recommendations
threads         Persistent virtual-thread pool, sliding-window executor, async bridge, process pool
work_stealing   Work-stealing scheduler (Tokio / Go runtime / ForkJoinPool style): Chase-Lev deques, random victim selection
lockfree_queue  Lock-free MPMC queue & per-worker Chase-Lev deques for minimal contention
adaptive        Adaptive scheduler: auto-scales workers based on latency, CPU pressure & memory pressure
_native         Optional Cython / Rust (PyO3) accelerators for hot-path data structures
gpu             Multi-vendor GPU detection (NVIDIA/CUDA, AMD/OpenCL, Intel oneAPI, ARM Adreno/Mali/Immortalis), ranking, multi-GPU dispatch
npu             NPU detection & inference (OpenVINO, ONNX Runtime, DirectML, CoreML, ARM Hexagon/Samsung NPU/Tensor TPU/MediaTek APU)
virt            Virtualization detection (Hyper-V, VT-x/AMD-V, KVM, WSL2, Docker, container detection)
memory          Memory pressure monitoring, automatic worker clamping, reusable buffer pool
profiler        @timed, @profile_memory decorators, Timer context manager, Tracker statistics
benchmark       Built-in micro-benchmarks (CPU, threads, memory bandwidth, GPU compute)
priority        OS-level task priority (IDLE → REALTIME) & energy profiles (POWER_SAVER → ULTRA_PERFORMANCE)
max_mode        Maximum optimization mode: activates ALL resources simultaneously with OS tuning
android         Android/Termux platform detection, ARM SoC database (25+ chipsets), big.LITTLE, thermal & battery
iot             IoT / SBC detection (Raspberry Pi, Jetson, BeagleBone, Coral, Hailo, 30+ SoCs)
autotune        Auto-tuning feedback loop: benchmark → config → re-tune, persistent profiles
metrics         Prometheus metrics exporter (/metrics HTTP endpoint, all subsystems)
server          JSON HTTP & gRPC server for multi-language integration (Node.js, Go, Java, etc.)
k8s             Kubernetes operator: pod info, GPU node capacity, auto-scaling, manifest generation
engine          Unified orchestrator: auto-detects everything and provides a single API

Quick Start

pip install pyaccelerate

from pyaccelerate import Engine

engine = Engine()
print(engine.summary())

# Submit I/O-bound tasks to the virtual thread pool
future = engine.submit(my_io_func, arg1, arg2)

# Run many tasks with auto-tuned concurrency
engine.run_parallel(process_file, [(f,) for f in files])

# GPU dispatch (auto-fallback to CPU)
results = engine.gpu_dispatch(my_kernel, data_chunks)

Benchmark

"Ok... but how much faster is it?" — Here are real numbers.

All benchmarks were run on Python 3.11 on Windows with a 48-core Xeon, using python -m benchmarks.run. To reproduce: pip install pyaccelerate && python -m benchmarks.run

IO-bound — 200 simulated HTTP calls (20ms each)

Runner                  Time     Speedup  Tasks/sec
Sequential              4.079s   1.0×     49
ThreadPoolExecutor      0.118s   34.7×    1,701
pyaccelerate.engine     0.193s   21.2×    1,038
pyaccelerate.threads    0.150s   27.3×    1,336
pyaccelerate.ws         0.240s   17.0×    832

IO-bound — 200 variable-latency calls (5–80ms, realistic)

Runner                  Time     Speedup  Tasks/sec
Sequential              8.350s   1.0×     24
ThreadPoolExecutor      0.245s   34.1×    817
pyaccelerate.threads    0.329s   25.4×    608
pyaccelerate.ws         0.369s   22.6×    542
pyaccelerate.adaptive   0.656s   12.7×    305

CPU-bound — zlib compress 200KB × 100 (GIL released by C extension)

Runner                  Time     Speedup  Tasks/sec
Sequential              0.153s   1.0×     654
ThreadPoolExecutor      0.049s   3.1×     2,026
pyaccelerate.threads    0.049s   3.1×     2,033
pyaccelerate.ws         0.079s   1.9×     1,264

Mixed — IO (10ms) + CPU (SHA-256 × 400) × 200

Runner                  Time     Speedup  Tasks/sec
Sequential              2.216s   1.0×     90
ThreadPoolExecutor      0.197s   11.3×    1,017
pyaccelerate.threads    0.162s   13.7×    1,233
pyaccelerate.engine     0.206s   10.8×    972
pyaccelerate.adaptive   0.634s   3.5×     315

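Key takeaway: pyaccelerate.threads delivers about 21% higher throughput than ThreadPoolExecutor on the mixed workload (1,233 vs 1,017 tasks/sec) and matches it on the zlib workload. On the two pure IO workloads above, ThreadPoolExecutor was the fastest runner, and the work-stealing scheduler (ws) trailed pyaccelerate.threads in both, so its load-balancing benefit for variable-latency IO does not show up in these particular runs. Run your own benchmarks: python -m benchmarks.run (full) or python -m benchmarks.run --quick (CI).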
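To sanity-check numbers like these without the benchmark suite, the fixed-latency IO scenario can be approximated with the standard library alone. The sketch below is not the project's benchmarks.run; fake_http_call, bench_sequential and bench_threaded are hypothetical names, and the task count is scaled down so it finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_http_call(latency_s: float) -> float:
    """Simulate an I/O-bound call by sleeping (sleeping releases the GIL)."""
    time.sleep(latency_s)
    return latency_s


def bench_sequential(n_tasks: int, latency_s: float) -> float:
    """Run the calls one after another and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_tasks):
        fake_http_call(latency_s)
    return time.perf_counter() - start


def bench_threaded(n_tasks: int, latency_s: float, workers: int) -> float:
    """Run the same calls on a thread pool and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_http_call, [latency_s] * n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    n, latency = 40, 0.02  # scaled down from 200 tasks at 20 ms
    seq = bench_sequential(n, latency)
    par = bench_threaded(n, latency, workers=20)
    print(f"sequential {seq:.3f}s  threaded {par:.3f}s  speedup {seq / par:.1f}x")
```

Absolute times depend on the machine and OS timer resolution, so compare ratios rather than raw seconds.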


Maximum Optimization Mode

Activates all available hardware resources in parallel with OS-level tuning:

from pyaccelerate.max_mode import MaxMode

with MaxMode() as m:
    print(m.summary())  # hardware manifest

    # Run CPU + I/O simultaneously
    results = m.run_all(
        cpu_fn=cpu_heavy_task, cpu_items=cpu_data,
        io_fn=io_heavy_task, io_items=io_data,
    )

    # I/O only (thread pool)
    downloaded = m.run_io(download, [(url,) for url in urls])

    # CPU only (process pool)
    computed = m.run_cpu(crunch, [(n,) for n in numbers])

    # Multi-stage pipeline
    results = m.run_pipeline([
        ("download", download_fn, urls),
        ("transform", transform_fn, data),
        ("save", save_fn, output),
    ])

Or via the Engine:

engine = Engine()
with engine.max_mode() as m:
    results = m.run_all(...)

OS Priority & Energy Management

Control process scheduling and power profiles across Windows, Linux & macOS:

from pyaccelerate.priority import (
    TaskPriority, EnergyProfile,
    set_task_priority, set_energy_profile,
    max_performance, balanced, power_saver,
)

# Quick presets
max_performance()   # HIGH priority + ULTRA_PERFORMANCE energy
balanced()          # Restore defaults
power_saver()       # BELOW_NORMAL + POWER_SAVER

# Fine-grained control
set_task_priority(TaskPriority.ABOVE_NORMAL)
set_energy_profile(EnergyProfile.PERFORMANCE)

CLI

pyaccelerate info          # Full hardware report
pyaccelerate benchmark     # Run micro-benchmarks
pyaccelerate gpu           # GPU details
pyaccelerate cpu           # CPU details
pyaccelerate npu           # NPU details
pyaccelerate android       # ARM/Android device details (SoC, clusters, thermal)
pyaccelerate virt          # Virtualization info
pyaccelerate memory        # Memory stats
pyaccelerate status        # One-liner
pyaccelerate priority      # Show current priority/energy
pyaccelerate priority --preset max     # Apply max performance preset
pyaccelerate priority --set high       # Set task priority
pyaccelerate priority --energy performance  # Set energy profile
pyaccelerate max-mode      # Show max-mode hardware manifest
pyaccelerate tune          # Auto-tune: benchmark → optimise → save
pyaccelerate tune --apply  # Tune and apply to current process
pyaccelerate tune --show   # Show current tune profile
pyaccelerate metrics       # Start Prometheus /metrics server (:9090)
pyaccelerate metrics --once # Print metrics and exit
pyaccelerate serve         # Start HTTP/gRPC API server (:8420)
pyaccelerate k8s           # Kubernetes pod & GPU info
pyaccelerate k8s --manifest # Generate K8s Deployment YAML
pyaccelerate iot           # IoT / SBC board details
pyaccelerate version       # Print version

ARM / Android / Termux Support

Full hardware detection for ARM devices — phones (Termux, Pydroid), tablets, Raspberry Pi, ARM laptops (Snapdragon X Elite), and ARM servers:

from pyaccelerate.android import (
    is_android, is_termux, is_arm,
    get_device_info, get_soc_info,
    detect_big_little, get_arm_features,
    get_thermal_zones, get_battery_info,
)

if is_arm():
    soc = get_soc_info()
    if soc:
        print(f"{soc.name} ({soc.vendor})")   # Snapdragon 8 Gen 3 (Qualcomm)
        print(f"GPU: {soc.gpu_name}")           # Adreno 750
        print(f"NPU: {soc.npu_name} ({soc.npu_tops} TOPS)")  # Hexagon NPU (73.0 TOPS)

    clusters = detect_big_little()
    # {"Cortex-X4": [0], "Cortex-A720": [1,2,3], "Cortex-A520": [4,5,6,7]}

    features = get_arm_features()
    # ["aes", "asimd", "bf16", "crc32", "neon", "sve", "sve2", ...]
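On Linux-based ARM systems, including Android, flags like these are exposed on the Features line of /proc/cpuinfo. The following is a minimal sketch of parsing that line, not the library's implementation; parse_arm_features and the abbreviated SAMPLE text are hypothetical.

```python
def parse_arm_features(cpuinfo_text: str) -> list[str]:
    """Return the sorted, de-duplicated flags from every 'Features' line."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            flags.update(value.split())
    return sorted(flags)


# Abbreviated example of what /proc/cpuinfo can look like on an AArch64 core.
SAMPLE = """\
processor       : 0
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x41
"""

if __name__ == "__main__":
    print(parse_arm_features(SAMPLE))
```

On a real device the text would come from reading /proc/cpuinfo; each core has its own Features line, which is why the sketch merges them into a set.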

Supported SoC families (25+ chipsets in database):

  • Qualcomm — Snapdragon 8 Elite, 8/7/6 Gen 1-3, 888, 865, X Elite
  • Samsung — Exynos 2500, 2200, 2100, 1380, 990
  • Google — Tensor G1–G4
  • MediaTek — Dimensity 9300, 9200, 9000, 8300, 1200, 1100, 900
  • HiSilicon — Kirin 9010, 9000
  • Unisoc — T616

ARM GPU detection — Adreno, Mali, Immortalis, Xclipse, PowerVR, Maleoon (via SoC DB, sysfs, Vulkan, OpenCL)

ARM NPU detection — Hexagon, Samsung NPU, Google TPU, MediaTek APU, Da Vinci NPU (via SoC DB, NNAPI, TFLite)

Modules in Depth

Virtual Thread Pool

Inspired by Java's virtual threads — a persistent ThreadPoolExecutor sized for I/O (cores × 3, cap 32). All I/O-bound work shares this pool instead of creating/destroying threads per operation.

from pyaccelerate.threads import get_pool, run_parallel, submit

# Single task
fut = submit(download_file, url)

# Bounded concurrency (sliding window)
run_parallel(process, [(item,) for item in items], max_concurrent=8)
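The idea of one persistent pool plus a sliding window can be sketched with the standard library. This is an illustration under stated assumptions, not the code in pyaccelerate.threads: io_pool_size, get_shared_pool and run_bounded are hypothetical names, and only the sizing rule (cores × 3, cap 32) comes from the description above.

```python
import os
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional

_POOL: Optional[ThreadPoolExecutor] = None
_POOL_LOCK = threading.Lock()


def io_pool_size(cores: Optional[int] = None) -> int:
    """Size for I/O work: cores x 3, capped at 32."""
    cores = cores or os.cpu_count() or 1
    return min(32, cores * 3)


def get_shared_pool() -> ThreadPoolExecutor:
    """Create the pool once and reuse it for every later call."""
    global _POOL
    with _POOL_LOCK:
        if _POOL is None:
            _POOL = ThreadPoolExecutor(max_workers=io_pool_size())
        return _POOL


def run_bounded(fn, arg_tuples, max_concurrent: int) -> list:
    """Sliding window: keep at most max_concurrent tasks in flight."""
    window = threading.Semaphore(max_concurrent)
    pool = get_shared_pool()
    futures: list[Future] = []
    for args in arg_tuples:
        window.acquire()  # blocks while the window is full
        fut = pool.submit(fn, *args)
        # Free one slot as soon as this task finishes (success or failure).
        fut.add_done_callback(lambda _f: window.release())
        futures.append(fut)
    return [f.result() for f in futures]  # results in submission order
```

The semaphore bounds in-flight work independently of the pool size, so one caller cannot monopolise a pool that other callers share.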

Work-Stealing Scheduler

High-performance scheduler inspired by Tokio (Rust), Go runtime and Java ForkJoinPool. Each worker owns a local Chase-Lev deque — pops LIFO (cache-friendly), steals FIFO (fair). When idle, workers steal from random victims with exponential back-off parking.

from pyaccelerate.work_stealing import WorkStealingScheduler, ws_submit, ws_map

# Module-level convenience
fut = ws_submit(my_func, arg1, arg2)
results = ws_map(fn, [(a,), (b,), (c,)])

# Full control
with WorkStealingScheduler(num_workers=8, steal_batch_size=4) as sched:
    futures = [sched.submit(process, item) for item in items]
    results = [f.result() for f in futures]
    print(sched.stats())  # completed, stolen, avg_latency_us

Or via the Engine:

engine = Engine()
fut = engine.ws_submit(my_func, arg1)
results = engine.ws_map(fn, [(a,), (b,)])
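The stealing policy described above (owner pops LIFO, idle workers steal FIFO from a random victim) can be modelled in a few lines. The sketch below is a single-threaded, round-based simulation for illustration only; simulate_work_stealing is a hypothetical name, and back-off parking is not modelled.

```python
import random
from collections import deque


def simulate_work_stealing(tasks, num_workers: int, seed: int = 0):
    """Round-based, single-threaded model of work stealing.

    All tasks start on worker 0 to force an imbalance. Each round, every
    worker runs one task: its own newest task (LIFO) if it has one, otherwise
    the oldest task (FIFO) of a randomly chosen non-empty victim.
    Returns (results_by_worker, stolen_count).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].extend(tasks)
    results = [[] for _ in range(num_workers)]
    stolen = 0

    while any(deques):
        for wid in range(num_workers):
            if deques[wid]:
                task = deques[wid].pop()  # owner end: LIFO
            else:
                victims = [v for v in range(num_workers) if v != wid and deques[v]]
                if not victims:
                    continue  # nothing to steal this round
                task = deques[rng.choice(victims)].popleft()  # thief end: FIFO
                stolen += 1
            results[wid].append(task())
    return results, stolen
```

With eight tasks and four workers, worker 0 keeps only the two newest tasks and the other three workers steal the remaining six, oldest first.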

Lock-Free Task Queue

The lockfree_queue module provides two data structures underlying the work-stealing scheduler:

  • WorkDeque: Per-worker Chase-Lev deque — owner push/pop lock-free (GIL + collections.deque), stealers use a lightweight spinlock.
  • MPMCQueue: Multi-Producer Multi-Consumer global injection queue with efficient Event-based parking.

from pyaccelerate.lockfree_queue import WorkDeque, MPMCQueue

# Per-worker deque
d = WorkDeque()
d.push(task)
task = d.pop()        # LIFO (owner)
task = d.steal()      # FIFO (other workers)
batch = d.steal_batch(4)

# Global queue
q = MPMCQueue()
q.put(task)
q.put_batch([t1, t2, t3])
task = q.get()
q.wait(timeout=1.0)   # block until items arrive
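The two access orders of a per-worker deque can be shown with collections.deque. This toy class is a sketch, not the module's WorkDeque: ToyWorkDeque is a hypothetical name, and for simplicity it takes an ordinary lock on every operation instead of keeping the owner path lock-free.

```python
import threading
from collections import deque


class ToyWorkDeque:
    """Owner uses the right end (LIFO); thieves use the left end (FIFO)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()  # a plain lock stands in for the spinlock

    def push(self, task):
        """Owner side: add a task at the right end."""
        with self._lock:
            self._items.append(task)

    def pop(self):
        """Owner side: newest task first, or None when empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def steal(self):
        """Thief side: oldest task first, or None when empty."""
        with self._lock:
            return self._items.popleft() if self._items else None

    def steal_batch(self, n):
        """Thief side: up to n of the oldest tasks, in FIFO order."""
        with self._lock:
            count = min(n, len(self._items))
            return [self._items.popleft() for _ in range(count)]
```

Because owner and thieves work at opposite ends, they only compete for the same task when the deque is nearly empty.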

Adaptive Scheduler

Wraps the work-stealing scheduler and dynamically tunes worker count based on real-time metrics:

Signal                                 Action
P95 latency > threshold + CPU < 70%    Scale up workers
CPU utilisation > 90%                  Scale down workers
Memory pressure HIGH/CRITICAL          Shed workers immediately
P95 latency very low (idle)            Scale down to save resources
CPU load changes                       Auto-tune steal batch size

from pyaccelerate.adaptive import AdaptiveScheduler, AdaptiveConfig

cfg = AdaptiveConfig(
    min_workers=2,
    max_workers=16,
    cooldown_seconds=2.0,
)

with AdaptiveScheduler(config=cfg) as sched:
    results = sched.map(process, [(item,) for item in data])
    print(sched.snapshot())  # workers, p95, cpu%, mem_pressure, adjustments

Or via the Engine:

engine = Engine()
with engine.adaptive_scheduler() as sched:
    results = sched.map(heavy_fn, items)
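The scaling rules listed in this section can be read as a small decision function. The sketch below is one possible encoding, not the scheduler's actual logic: Action, decide and apply_action are hypothetical, the 70% and 90% CPU limits come from the signal list, and the latency thresholds, step size and halving on memory pressure are assumptions.

```python
from enum import Enum


class Action(Enum):
    SCALE_UP = "scale_up"
    SCALE_DOWN = "scale_down"
    SHED = "shed"
    HOLD = "hold"


def decide(p95_ms, cpu_pct, mem_pressure, *, high_ms=50.0, idle_ms=1.0):
    """Map one metrics sample to an action, most urgent signal first.

    high_ms and idle_ms are assumed thresholds chosen for illustration.
    """
    if mem_pressure in ("HIGH", "CRITICAL"):
        return Action.SHED
    if cpu_pct > 90.0:
        return Action.SCALE_DOWN
    if p95_ms > high_ms and cpu_pct < 70.0:
        return Action.SCALE_UP
    if p95_ms < idle_ms:
        return Action.SCALE_DOWN
    return Action.HOLD


def apply_action(workers, action, min_workers=2, max_workers=16):
    """Adjust the worker count and clamp it to the configured bounds."""
    if action is Action.SCALE_UP:
        workers += 1
    elif action is Action.SCALE_DOWN:
        workers -= 1
    elif action is Action.SHED:
        workers //= 2  # assumption: shedding halves the pool
    return max(min_workers, min(max_workers, workers))
```

Checking memory pressure first reflects the "immediately" wording in the list; a real controller would also honour the cooldown between adjustments.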

Native Accelerators (Optional)

For maximum throughput, compile the hot-path data structures to native code:

Cython (C extension):

pip install cython
cd src/pyaccelerate/_native
python setup_cython.py build_ext --inplace

Rust (PyO3 + crossbeam-deque — same algorithm as Tokio):

cd bindings/rust/pyaccelerate_native
pip install maturin
maturin develop --release

When a native extension is installed, it's used automatically — no code changes needed. The pure-Python fallback is always available.

Multi-GPU Dispatch

Auto-detects GPUs across CUDA, OpenCL and Intel oneAPI. Distributes workloads with configurable strategies.

from pyaccelerate.gpu import detect_all, dispatch

gpus = detect_all()
results = dispatch(my_kernel, data_chunks, strategy="score-weighted")
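A "score-weighted" strategy presumably gives faster devices a larger share of the work. How the library computes scores and shares is not stated here, so the sketch below is only one plausible reading: split_by_score is a hypothetical helper that divides chunks in proportion to per-device scores using the largest-remainder method.

```python
def split_by_score(chunks, scores):
    """Assign chunks to devices in proportion to their scores.

    Uses the largest-remainder method so every chunk is assigned exactly once.
    Returns one list of chunks per device, in device order.
    """
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be a non-empty list of positive numbers")
    total = sum(scores)
    n = len(chunks)
    exact = [n * s / total for s in scores]
    counts = [int(x) for x in exact]
    # Hand the leftover chunks to the devices with the largest fractional parts.
    leftovers = n - sum(counts)
    by_fraction = sorted(
        range(len(scores)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftovers]:
        counts[i] += 1
    out, start = [], 0
    for c in counts:
        out.append(chunks[start:start + c])
        start += c
    return out
```

For example, ten chunks split across scores of 3, 1 and 1 yield shares of six, two and two chunks.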

Profiling

Zero-config decorators for timing and memory tracking:

import logging

from pyaccelerate.profiler import timed, profile_memory, Tracker

@timed(level=logging.INFO)
def heavy_computation():
    ...

tracker = Tracker("db_queries")
for batch in batches:
    with tracker.measure():
        run_query(batch)
print(tracker.summary())
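For readers curious how such helpers are typically built, here is a minimal standard-library sketch of a timing decorator and a duration tracker. It is not the profiler module's code: timed_sketch and TrackerSketch are hypothetical, and the dictionary returned by summary() is an assumed shape.

```python
import contextlib
import functools
import logging
import time

logger = logging.getLogger("timing_sketch")


def timed_sketch(level=logging.DEBUG):
    """Decorator factory: log how long each call to the wrapped function takes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.log(level, "%s took %.2f ms", fn.__name__, elapsed_ms)
        return wrapper
    return decorator


class TrackerSketch:
    """Collect durations for a named operation and summarise them."""

    def __init__(self, name):
        self.name = name
        self.samples = []

    @contextlib.contextmanager
    def measure(self):
        """Time the body of a with-block and record the duration in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples.append(time.perf_counter() - start)

    def summary(self):
        n = len(self.samples)
        total = sum(self.samples)
        return {"name": self.name, "count": n, "total_s": total,
                "avg_s": total / n if n else 0.0}
```

The try/finally blocks ensure a duration is recorded even when the measured code raises.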

Auto-Tuning Feedback Loop

Benchmark your hardware, persist the optimal configuration, and auto-apply it:

from pyaccelerate.autotune import auto_tune, get_or_tune, apply_profile

# Run a full tune cycle (benchmark → save to ~/.pyaccelerate/)
profile = auto_tune()
print(f"Overall score: {profile.overall_score}/100")
print(f"Optimal IO workers: {profile.optimal_io_workers}")
print(f"Optimal CPU workers: {profile.optimal_cpu_workers}")

# Load existing or re-tune if hardware changed / profile stale
profile = get_or_tune()

# Apply to running process (sets workers, priority, energy)
apply_profile()
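The benchmark, persist and reload cycle can be illustrated with a toy tuner. This sketch makes several assumptions and does not mirror pyaccelerate.autotune: measure_io, tune_io_workers and get_or_tune_sketch are hypothetical, the workload is a simulated sleep, and staleness is judged by age only (the real module also mentions hardware changes).

```python
import json
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def measure_io(workers: int, n_tasks: int = 16, latency_s: float = 0.005) -> float:
    """Time n_tasks simulated I/O calls on a pool of the given size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(time.sleep, [latency_s] * n_tasks))
    return time.perf_counter() - start


def tune_io_workers(candidates=(1, 4, 16)) -> dict:
    """Benchmark each candidate and return a profile naming the fastest one."""
    timings = {w: measure_io(w) for w in candidates}
    best = min(timings, key=timings.get)
    return {"optimal_io_workers": best, "created_at": time.time()}


def get_or_tune_sketch(path: Path, max_age_s: float = 7 * 24 * 3600) -> dict:
    """Load a saved profile if it is fresh enough; otherwise re-tune and save."""
    if path.exists():
        profile = json.loads(path.read_text())
        if time.time() - profile["created_at"] < max_age_s:
            return profile
    profile = tune_io_workers()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile))
    return profile


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        profile_path = Path(tmp) / "profile.json"
        print(get_or_tune_sketch(profile_path))  # first call benchmarks
        print(get_or_tune_sketch(profile_path))  # second call loads from disk
```

Persisting the result means the cost of benchmarking is paid once per machine rather than on every start.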

Prometheus Metrics

Expose CPU/GPU/NPU/memory/pool metrics in Prometheus format:

from pyaccelerate.metrics import start_metrics_server, get_metrics_text

# Start /metrics endpoint on port 9090
start_metrics_server(port=9090)

# Or get text for your own framework
text = get_metrics_text()

pyaccelerate metrics --port 9090     # Start server
pyaccelerate metrics --once          # Print and exit
curl http://localhost:9090/metrics    # Scrape

HTTP / gRPC Server

Multi-language access to all PyAccelerate features:

from pyaccelerate.server import PyAccelerateServer

with PyAccelerateServer(http_port=8420, grpc_port=50051) as srv:
    print(f"HTTP: {srv.http_url}/api/v1")
    # Block until Ctrl+C
    srv.start(block=True)

pyaccelerate serve --http-port 8420 --grpc-port 50051
curl http://localhost:8420/api/v1/info    # JSON
curl http://localhost:8420/api/v1/cpu
curl http://localhost:8420/api/v1/gpu
curl http://localhost:8420/api/v1/metrics  # Prometheus text

Kubernetes Integration

Pod detection, GPU node capacity, auto-scaling recommendations & manifest generation:

from pyaccelerate.k8s import (
    is_kubernetes, get_pod_info,
    get_scaling_recommendation, generate_resource_manifest,
)

if is_kubernetes():
    pod = get_pod_info()
    print(f"Pod: {pod.name} | GPU: {pod.gpu_limit}")

rec = get_scaling_recommendation()
print(f"Replicas: {rec.recommended_replicas} ({rec.reason})")

manifest_yaml = generate_resource_manifest(name="ml-worker", gpu_per_replica=1)

pyaccelerate k8s                # Show pod & GPU info
pyaccelerate k8s --manifest     # Generate Deployment YAML
pyaccelerate k8s --json         # Machine-readable
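Detecting that a process runs inside a pod usually relies on what Kubernetes injects into the container. The sketch below shows a common heuristic and is not necessarily how is_kubernetes() works: looks_like_kubernetes and pod_name are hypothetical helpers built on the KUBERNETES_SERVICE_HOST variable and the default service-account token path.

```python
import os
from pathlib import Path

# Default mount point of the pod's service-account token.
SA_TOKEN = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def looks_like_kubernetes(env=None, token_path=SA_TOKEN) -> bool:
    """Heuristic: the API-server env var is set or a token file is mounted."""
    env = os.environ if env is None else env
    return bool(env.get("KUBERNETES_SERVICE_HOST")) or Path(token_path).exists()


def pod_name(env=None):
    """A pod's hostname defaults to its name, so HOSTNAME usually carries it."""
    env = os.environ if env is None else env
    return env.get("HOSTNAME")


if __name__ == "__main__":
    print(looks_like_kubernetes(), pod_name())
```

Passing env and token_path as parameters keeps the check testable outside a cluster.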

Node.js / npm Client

A zero-dependency Node.js client is included in bindings/nodejs/:

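const { PyAccelerate } = require('pyaccelerate');

// Top-level await is not available in CommonJS, so wrap the calls in an async function.
async function main() {
  const client = new PyAccelerate('http://localhost:8420');
  const info = await client.getInfo();
  const metrics = await client.getMetrics();
  const bench = await client.runBenchmark();
}

main().catch(console.error);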

Installation Options

# Core (CPU + threads + memory + virt)
pip install pyaccelerate

# Extras are quoted so that shells such as zsh do not expand the brackets.

# With NVIDIA GPU support
pip install "pyaccelerate[cuda]"

# With OpenCL support (AMD/Intel/NVIDIA)
pip install "pyaccelerate[opencl]"

# With Intel oneAPI support
pip install "pyaccelerate[intel]"

# All GPU backends
pip install "pyaccelerate[all-gpu]"

# gRPC server mode
pip install "pyaccelerate[grpc]"

# Kubernetes integration
pip install "pyaccelerate[k8s]"

# Development
pip install "pyaccelerate[dev]"

Docker

# CPU-only
docker build -t pyaccelerate .
docker run --rm pyaccelerate info

# With NVIDIA GPU
docker build -f Dockerfile.gpu -t pyaccelerate:gpu .
docker run --rm --gpus all pyaccelerate:gpu info

# Docker Compose
docker compose up pyaccelerate    # CPU
docker compose up gpu             # GPU

Development

git clone https://github.com/GuilhermeP96/pyaccelerate.git
cd pyaccelerate
pip install -e ".[dev]"

# Run tests
pytest -v

# Run benchmarks
python -m benchmarks.run            # full suite
python -m benchmarks.run --quick    # CI-friendly (fewer tasks)
python -m benchmarks.run --io       # IO-bound only
python -m benchmarks.run --cpu      # CPU-bound only
python -m benchmarks.run --mixed    # mixed workloads only

# Lint + format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

# Build wheel
python -m build

Architecture

pyaccelerate/
├── cpu.py          # CPU detection & topology
├── threads.py      # Virtual thread pool & executors
├── gpu/
│   ├── detector.py # Multi-vendor GPU enumeration
│   ├── cuda.py     # CUDA/CuPy helpers
│   ├── opencl.py   # PyOpenCL helpers
│   ├── intel.py    # Intel oneAPI helpers
│   └── dispatch.py # Multi-GPU load balancer
├── npu/
│   ├── detector.py # NPU detection (Intel, Qualcomm, Apple)
│   ├── onnx_rt.py  # ONNX Runtime inference
│   ├── openvino.py # OpenVINO inference
│   └── inference.py # Unified inference API
├── virt.py         # Virtualization detection
├── memory.py       # Memory monitoring & buffer pool
├── profiler.py     # Timing & profiling utilities
├── benchmark.py    # Built-in micro-benchmarks
├── priority.py     # OS task priority & energy profiles
├── max_mode.py     # Maximum optimization mode
├── iot.py          # IoT / SBC hardware detection
├── autotune.py     # Auto-tuning feedback loop
├── metrics.py      # Prometheus metrics exporter
├── server.py       # HTTP + gRPC multi-language API
├── k8s.py          # Kubernetes pod & GPU integration
├── lockfree_queue.py # Lock-free MPMC & Chase-Lev deques
├── work_stealing.py  # Work-stealing scheduler (Tokio/Go/FJP)
├── adaptive.py       # Adaptive pressure-driven scheduler
├── engine.py         # Unified orchestrator
├── cli.py            # Command-line interface
├── _native/          # Optional Cython accelerators
│   ├── _fast_deque.pyx
│   └── setup_cython.py
└── bindings/
    ├── nodejs/       # npm client for Node.js / TypeScript
    └── rust/         # PyO3 Rust native extension
        └── pyaccelerate_native/

Examples

The examples/ directory contains runnable scripts demonstrating all features:

Example                  Description
example_basic.py         Engine creation, summary, submit, run_parallel, batch
example_parallel_io.py   Parallel download/process/write with public UCI ML datasets
example_cpu_bound.py     Sequential vs thread pool vs process pool comparison
example_max_mode.py      MaxMode context manager, run_all, run_io, run_cpu, pipeline
example_pipeline.py      Multi-stage data pipeline (download → analyze → report)
example_priority.py      TaskPriority levels, EnergyProfile, presets, benchmarking

cd examples
python example_basic.py
python example_max_mode.py
python example_priority.py

Roadmap

  • IoT / SBC detection (Raspberry Pi, Jetson, Coral, Hailo)
  • Auto-tuning feedback loop (benchmark → config → re-tune)
  • Prometheus metrics exporter
  • gRPC server mode for multi-language integration
  • Kubernetes operator for auto-scaling GPU workloads
  • npm package (Node.js bindings via HTTP API)
  • Work-stealing scheduler (Tokio / Go / ForkJoinPool style)
  • Lock-free task queues (Chase-Lev deques, MPMC)
  • Adaptive scheduler (latency, CPU & memory pressure)
  • Optional native accelerators (Cython + Rust/PyO3)
  • Benchmark suite with IO/CPU/mixed workload comparisons

Origin

Evolved from the acceleration & virtual-thread systems built for:

License

MIT — see LICENSE.
