Skip to main content

ROCm-native local LLM inference engine with a torch-free runtime hot path

Project description

hipEngine

hipEngine is a ROCm-native local LLM inference engine designed from the ground up for AMD RDNA GPUs (starting with gfx1100, gfx1151). It pairs a small purpose-built Python host with a complete suite of custom-tuned HIP kernels developed through 100+ iterations of profiling and tuning.

hipEngine has lightweight dependencies with no PyTorch required for fully supported GPUs and models.

Core principles

  • HIP-first, not CUDA-ported. Kernels directly target AMD hardware like gfx1100/RDNA3 with wave32, vec8 FMA, and the actual cache hierarchy.
  • Torch-free runtime. import torch is not on the hot path. The runtime owns a thin hipengine.Tensor over raw HIP/CUDA device pointers and drives hipblasLt, hipGraph, AOTriton, and JIT builds through ctypes. Torch appears only as an optional dlpack bridge behind the hipengine[torch] extra (~125 MiB install including the vendored AOTriton subset vs ~2 GiB with torch).
  • Multi-backend from day one. Kernels live under kernels/hip_gfx1100/, kernels/hip_gfx1151/, kernels/cuda_sm86/, kernels/cpu_reference/ as peer trees.
  • Four-axis plugin registry. Kernels are keyed by (backend, layer, quant, variant). Models, quant schemes, and layers are plugins. No if backend == "..." or if quant == "..." branches in dispatch / engine / model code.
  • Fused + unfused coexist. Every fused composite (rmsnorm+rotate, gate_combine_residual, …) has a numerically-equivalent unfused chain registered under its primitives, used as both fallback and correctness baseline.
  • Evidence-backed performance. Every performance claim ships with model + quant + workload shape + hardware + exact command + correctness gate (KL ≤ 0.05, top-1 ≥ 90% vs kernels/cpu_reference/). See docs/BENCHMARK.md and benchmarks/README.md.

Status

v0.1.x. The runtime hot path is torch-free by construction; kernel families and registry plumbing are landing under hipengine/kernels/hip_gfx1100/. Current single-model tuning targets shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed (19.07 GiB, 4.68 bpw) in packed ParoQuant format.

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

While we are far from gfx1100 roofline, the current gfx1100 implementation does well compared to Q4_K_M quants of recent llama.cpp builds (b9042) on the same model family. The latest W7900 packed rows use the default prefill policy: 512-token prompts stay unchunked and prompts above 1K use 1024/1024/4096/1024/1024 chunks.

Prefill tok/s

Workload hipEngine shisa Qwen3.6 packed PARO llama.cpp HIP llama.cpp Vulkan
512/128 2500.565 2436.049 1816.927
4K/128 2899.685 2176.905 1705.093
32K/128 2115.050 1496.409 1128.554
128K/128 1054.291 710.213 480.539

Decode tok/s

Workload hipEngine shisa Qwen3.6 packed PARO llama.cpp HIP llama.cpp Vulkan
512/128 111.516 85.487 127.515
4K/128 113.094 87.375 120.163
32K/128 97.594 76.994 98.073
128K/128 62.027 57.341 64.478

Peak GiB

Workload hipEngine shisa Qwen3.6 packed PARO llama.cpp HIP llama.cpp Vulkan
512/128 18.123 21.125 20.844
4K/128 19.455 21.197 20.969
32K/128 20.267 21.738 21.533
128K/128 23.235 23.605 23.596

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

The gfx1151 backend is a native --offload-arch=gfx1151 peer backend using the same registry-keyed kernel surface. The Strix Halo snapshot below uses 256-row prefill chunks, which removed the 4K prefill gap without hurting long-context decode.

Prefill tok/s

Workload hipEngine shisa Qwen3.6 packed PARO llama.cpp HIP llama.cpp Vulkan
512/128 983.206 1058.738 638.008
4K/128 1029.402 1004.220 595.400
32K/128 792.296 735.534 407.984
128K/128 413.489 376.070 181.453

Decode tok/s

Workload hipEngine shisa Qwen3.6 packed PARO llama.cpp HIP llama.cpp Vulkan
512/128 62.060 50.537 57.615
4K/128 63.605 49.379 55.027
32K/128 50.629 43.435 44.576
128K/128 30.245 31.286 26.935

On Strix Halo, rocm-smi / sysfs expose only a 512 MiB VRAM aperture, so cross-engine memory comparisons are omitted here. The hipEngine allocator high-water mark for the chunk256 sweep was 17.997 GiB (512/128), 18.097 GiB (4K/128), 18.909 GiB (32K/128), and 21.877 GiB (128K/128).

See benchmarks/README.md for full protocol details, correctness status, source-lineage targets, and external comparison baselines.

Hardware targets

Backend Hardware Status
hip_gfx1100 AMD Radeon Pro W7900 / RX 7900 XTX (RDNA3) Primary, in active bring-up
hip_gfx1151 AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, RDNA3.5) Active backend
cuda_sm86 NVIDIA Ampere consumer (3090-class) Planned peer backend
cpu_reference Any CPU, numpy Correctness oracle; CI without GPU

backend="auto" is the public API/server default. It maps exact gfx1100 and gfx1151 detections to the matching HIP backend; unknown ROCm targets warn and select cpu_reference where a CPU implementation exists. Users on nearby targets such as gfx1101/gfx1102 can force a backend with backend="hip_gfx1100", --backend hip_gfx1100, or HIPENGINE_BACKEND=hip_gfx1100 after validating correctness/performance.

Wave32 is the default for hip_gfx1100 device code; wave64 is treated as an isolated experiment with its own gates (see docs/PLAN.md).

Architecture at a glance

┌─────────────────────────────────────────────────────────────────┐
│  USER API                                                       │
│  hipengine.LLM.generate()           library API                 │
│  hipengine.server                   optional [server] extra     │
├─────────────────────────────────────────────────────────────────┤
│  LOADING (torch-free)                                           │
│  safetensors mmap + hipMemcpyAsync / HF config / jinja2 chat    │
│  templates / HF tokenizers (Rust)                               │
├─────────────────────────────────────────────────────────────────┤
│  DISPATCH                                                       │
│  Scheduler / Block Manager (KVPolicy) / Prefix Cache            │
│  Fusion Planner (chain → kernel plan, fused preferred)          │
│  Model / Quant / Layer plugins / Engine loop (hipGraph replay)  │
├─────────────────────────────────────────────────────────────────┤
│  CORE (torch-free primitives)                                   │
│  hipengine.Tensor / device / memory / stream / graph / blas     │
│  build (hipcc subprocess + ctypes.CDLL + .so cache)             │
├─────────────────────────────────────────────────────────────────┤
│  KERNELS (backend-keyed, 120 __global__ in the Qwen/PARO port)  │
│  kernels/hip_gfx1100/  attention / linear_attn / moe / quant    │
│                        wmma / norm / rotary / fused             │
│  kernels/hip_gfx1151/  native target-arch peer backend           │
│  kernels/cuda_sm86/    (future)                                 │
│  kernels/cpu_reference/ correctness oracle, no GPU required     │
└─────────────────────────────────────────────────────────────────┘

Full layer diagram, plugin axes, KV cache ABI, and roadmap are in docs/PLAN.md.

Installation

# one-time: fetch Git LFS payloads, including the vendored AOTriton runtime/images
git lfs install
git lfs pull

# core runtime (torch-free)
pip install -e .

# with the OpenAI-compatible server
pip install -e ".[server]"

# with the optional dlpack torch bridge for user-boundary interop
pip install -e ".[torch]"

# dev / test
pip install -e ".[dev]"

Python 3.11+. A working ROCm install with libamdhip64.so on the loader path is required for any GPU run; CPU-reference correctness tests run without a GPU.

Quickstart (Phase 0 — bring-up only)

The public API surface is stable:

from hipengine import LLM, SamplingParams

llm = LLM("/path/to/model", quant="w4_paro")  # backend="auto" by default
outputs = llm.generate(
    ["Hello, hipEngine."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0])

Today LLM.generate() only resolves to narrow Qwen3.5 / PARO bring-up paths registered in hipengine.generation; unsupported (model, backend, quant) combinations fail loudly rather than falling back to a generic torch path. See docs/PLAN.md for the model / quant roadmap.

OpenAI-compatible server

Install the optional server extra and run the FastAPI layer:

pip install -e ".[server]"
python -m hipengine.server \
  --model /path/to/model \
  --quant w4_paro \
  --served-model-name qwen-paro

Supported v0.1 endpoints: GET /v1/models, POST /v1/completions, and POST /v1/chat/completions (including one-chunk SSE for stream=true). See docs/API.md for request examples, bearer-token auth, and current limitations.

Documentation

File Purpose
docs/PLAN.md Architecture, plugin axes, phase roadmap, LoC budgets
docs/BENCHMARK.md Benchmark protocols, baselines, correctness gate, artifact format
docs/TESTING.md RED/GREEN workflow, correctness oracles, fixture policy
docs/KERNELS.md Kernel catalog, source-lineage drift workflow, JIT cache gotchas, build profiles
docs/ROOFLINE.md RDNA3 / W7900 performance model and decision tree
docs/IMPLEMENTATION.md Implementation status and concrete milestones
docs/API.md OpenAI-compatible server usage and endpoint support
docs/PREFILL.md Native prefill implementation spec
docs/MTP.md Multi-token prediction plan
docs/DFLASH.md DFlash draft-model speculative decode plan
benchmarks/README.md Current-fastest rollup and external comparison baselines
AGENTS.md Ground rules for every coding / review / benchmarking task
WORKLOG.md Append-only cross-session journal of decisions and measurements

Development

# narrowest test suite (CPU-only paths run without a GPU)
pytest -q

# kernel source-lineage drift check before any port
python3 scripts/check_lineage.py --kind kernel --diff stat

See AGENTS.md for the full workflow: when to run the CPU-reference correctness gate, when to add a rocprofv3 --kernel-trace smoke, and what a retained benchmark row requires.

References & lineage

hipEngine is not a fork of any project; it is a brand new codebase with from-scratch code and kernels. Of course it builds on the work of many others:

  • ROCm - of course this all sits on AMD's open-source compute stack, notably on HIP.
  • Nano-vLLM - most of the original kernel tuning iteration loops used this as a host-layer. Some of the performance limitations of the architecture motivated the hipEngine rewrite, but we remain greatful and deeply appreciative of nano-vllm as a great research platform.
  • ParoQuant - after reviewing the current SOTA on model quantization, we chose ParoQuant as the first target due to both its excellent accuracy and its efficiency (QTIP/YAQA is very cool but proved challenging to implement performant RDNA3 kernels)
  • FastDMS - our KVCache ABI is shaped by the lessons learned from building our DMS reference implementation.

Greetz: hipfire, Lucebox, DS4, ExLlamaV3 and ofc the og llama.cpp

See also: Marlin, kernel-anvil, wmma_ops, tilelang, fsr4-rdna3-optimization, ROCm examples

License

hipEngine source code is licensed under AGPL-3.0-or-later. It is built and distributed for anyone who has an AMD card that hasn't been living up to its compute potential.

Model weights, checkpoints, and external datasets remain under their own licenses.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hipengine-0.1.0.tar.gz (12.9 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl (16.9 MB view details)

Uploaded Python 3manylinux: glibc 2.39+ x86-64

File details

Details for the file hipengine-0.1.0.tar.gz.

File metadata

  • Download URL: hipengine-0.1.0.tar.gz
  • Upload date:
  • Size: 12.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.9

File hashes

Hashes for hipengine-0.1.0.tar.gz
Algorithm Hash digest
SHA256 600c88880aaddc1a2a2d6f17d80cd8fd1197e6dafb74662dc8150e9cd2d1d030
MD5 bf68e8ac0f74080f3746fb9a35dccad3
BLAKE2b-256 1785202f8b1450062dfcb65a04f9e13fbe3d08b9ea7fc365d93f508148d83fa5

See more details on using hashes here.

File details

Details for the file hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl.

File metadata

File hashes

Hashes for hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 9c7d612941575504114037da466fc0acb5c8bec1fd81d945bee89390c1c786fa
MD5 fb2660587021c5cf2996f0e6f06c9280
BLAKE2b-256 6328dca68ba70a73da5d56c7500e544ba45742d8f25abfb52a98bf4a82e42d32

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page