ROCm-native local LLM inference engine with a torch-free runtime hot path
Project description
hipEngine
hipEngine is a ROCm-native local LLM inference engine designed from the ground up for AMD RDNA GPUs (starting with gfx1100, gfx1151). It pairs a small purpose-built Python host with a complete suite of custom-tuned HIP kernels developed through 100+ iterations of profiling and tuning.
hipEngine has lightweight dependencies with no PyTorch required for fully supported GPUs and models.
Core principles
- HIP-first, not CUDA-ported. Kernels directly target AMD hardware like gfx1100/RDNA3 with wave32, vec8 FMA, and the actual cache hierarchy.
- Torch-free runtime.
import torchis not on the hot path. The runtime owns a thinhipengine.Tensorover raw HIP/CUDA device pointers and driveshipblasLt,hipGraph, AOTriton, and JIT builds throughctypes. Torch appears only as an optional dlpack bridge behind thehipengine[torch]extra (~125 MiB install including the vendored AOTriton subset vs ~2 GiB with torch). - Multi-backend from day one. Kernels live under
kernels/hip_gfx1100/,kernels/hip_gfx1151/,kernels/cuda_sm86/,kernels/cpu_reference/as peer trees. - Four-axis plugin registry. Kernels are keyed by
(backend, layer, quant, variant). Models, quant schemes, and layers are plugins. Noif backend == "..."orif quant == "..."branches in dispatch / engine / model code. - Fused + unfused coexist. Every fused composite
(
rmsnorm+rotate,gate_combine_residual, …) has a numerically-equivalent unfused chain registered under its primitives, used as both fallback and correctness baseline. - Evidence-backed performance. Every performance claim ships with
model + quant + workload shape + hardware + exact command + correctness gate
(KL ≤ 0.05, top-1 ≥ 90% vs
kernels/cpu_reference/). Seedocs/BENCHMARK.mdandbenchmarks/README.md.
Status
v0.2.0 alpha. The runtime hot path is torch-free by construction, and the
first two 35B-class model-loading surfaces are now available on gfx1100:
shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed
(19.07 GiB, 4.68 bpw) in packed
ParoQuant format, plus Qwen3.6 GGUF
Q4_K_M / Q4_K_S files through the new resident GGUF path.
- INT8 KV cache support has been added for PARO. Qwen 3 MoE's full 256K context window can fit in <24GB tracked memory; see Memory Usage.
- Qwen 3.6 Q4_K_M and Q4_K_S GGUF support has landed (W7900 Q4_K_S sweep is in Performance alongside packed PARO and llama.cpp Q4_K_M HIP/Vulkan baselines). GGUF uses a substantial GGUF-specific runtime path with bulk prefill, graph decode, and on-load decode-repack into T16 tile layouts. Q4_K_S is recommended on 24 GiB cards because Q4_K_M is bigger; on the 48 GiB W7900 Q4_K_S fits all the way to 128K context, while on 24 GiB cards expect roughly 64K. GGUF also has a higher per-session load cost (~60 s vs ~24 s for PARO packed on the same hardware) for the same decode-repack reason.
- Current gfx1100 performance snapshots are summarized in Performance and compared against recent llama.cpp Q4_K_M baselines.
Hardware targets
| Backend | Hardware | Status |
|---|---|---|
cpu_reference |
Any CPU, numpy | Correctness oracle; CI without GPU |
hip_gfx1100 |
AMD Radeon Pro W7900 / RX 7900 XTX (RDNA3) | Active backend |
hip_gfx1151 |
AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, RDNA3.5) | Active backend |
cuda_sm86 |
NVIDIA Ampere consumer (3090-class) | Planned peer backend |
backend="auto" is the public API/server default. It maps exact gfx1100 and
gfx1151 detections to the matching HIP backend; unknown ROCm targets warn and
select cpu_reference where a CPU implementation exists. Users on nearby targets
such as gfx1101/gfx1102 can force a backend with backend="hip_gfx1100",
--backend hip_gfx1100, or HIPENGINE_BACKEND=hip_gfx1100 after validating
correctness/performance.
Wave32 is the default for hip_gfx1100 device code; wave64 is treated as an
isolated experiment with its own gates (see
docs/PLAN.md).
Memory Usage
With BF16 KV cache, hipEngine running the packed Qwen 3.6 PARO model fits a
128K context window in a 24GB-class memory budget. The INT8 KV cache option
(with FP16 per-token/per-head scales) uses the
--kv-storage int8_per_token_head flag and lets the full 256K context fit
under 24 GiB tracked allocator peak.
The numbers below are for
shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed on W7900/gfx1100 with q3072
full-attention prefill chunks:
| Model | Context | KV cache | Sampled peak | Allocator peak | Retained KV | Prefill | Decode |
|---|---|---|---|---|---|---|---|
| Qwen3.6 35B-A3B PARO | 128K | BF16 | 21.04 GiB | 21.88 GiB | 2.69 GiB | 1091.9 tok/s | 62.2 tok/s |
| Qwen3.6 35B-A3B PARO | 128K | INT8 | 19.80 GiB | 20.89 GiB | 1.36 GiB | 1076.5 tok/s | 60.0 tok/s |
| Qwen3.6 35B-A3B PARO | 256K | INT8 | 21.96 GiB | 23.71 GiB | 2.71 GiB | 670.2 tok/s | 40.3 tok/s |
Regardless of the difference in PARO weight storage (legacy or packed), loaded-weight memory is about the same — approximately 16.4 GiB in VRAM.
The INT8 KV correctness gate is currently the deterministic Qwen3.5 PARO
fixture fixtures/qwen35_paro/parent_512_32_seed1234.json (512-token prompt,
32 greedy decode tokens): max_kl=0.015328, mean_kl=0.001639, top-1 agreement
100%, and generated IDs match BF16 KV exactly. Layer attention probes at context
64 and 520 also had top-1 agreement 100% with max quantized-vs-BF16 KL
2.34e-7. This is a fixture/regression gate, not a long-rollout perplexity
study, so long context generations may have unmeasured compounding errors.
The same 128K/128 Qwen3.5 BF16-vs-INT8 run measured -0.99% prefill tok/s and -3.20% decode tok/s for INT8 KV, so speed loss is also very small.
See
benchmarks/results/2026-05-19-hipengine-qwen36-packed-int8-kv-readme-memory-diagnostic.json,
benchmarks/README.md,
and docs/KVCACHE.md for commands, artifacts, and the full
no-shadow memory audit.
llama.cpp
When run with q8_0 kvcache, llama.cpp can also fit in 24GB:
--flash-attn on -ctk q8_0 -ctv q8_0 -c 262144 -b 128 -ub 128
Results:
| Model | llama.cpp model buffer | KV cache | Compute buffer | rocm-smi VRAM used | Free VRAM |
|---|---|---|---|---|---|
| Q4_K_M | 20583 MiB | 2720 MiB | 203 MiB | 24017 MiB / 23.45 GiB | ~543 MiB |
| Q4_K_S | 19399 MiB | 2720 MiB | 203 MiB | 22832 MiB / 22.30 GiB | ~1728 MiB |
With -ub 512:
| Model | Compute buffer | rocm-smi VRAM used | Free VRAM |
|---|---|---|---|
| Q4_K_M | 812 MiB | 24540 MiB | ~20 MiB |
| Q4_K_S | 812 MiB | 23443 MiB | ~1117 MiB |
- Note Q4_K_M is incredibly tight with only 20 MiB of headroom and you may either need to resize down or set
-b 512 -ub 128. - Q4_K_S does not need small
-b/-ub;-ub 512fits fine, and can even increase to-b 2048(but-ubis the more important VRAM knob that controls the physical microbatch / compute buffer size for llama.cpp).
Performance
gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)
While we are far from gfx1100 roofline, the current gfx1100 implementation does well compared to Q4_K_M quants of recent llama.cpp builds (b9042) on the same model family. The latest W7900 hipEngine rows use TheRock ROCm 7.13 and load each resident model once for 1 warmup + 5 measured in-session repetitions per shape. PARO uses the default prefill policy: 512-token prompts stay unchunked and prompts above 1K use 1024/1024/4096/1024/1024 chunks. The hipEngine GGUF Q4_K_S column uses the same chunked-prefill policy plus the WMMA prefill + GEMV decode fast paths and the persistent on-load decode-repack into T16 tile layouts.
Prefill tok/s
| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 2718.497 | 2258.847 | 2436.049 | 1816.927 |
| 4K/128 | 2838.773 | 2576.673 | 2176.905 | 1705.093 |
| 32K/128 | 2074.699 | 1893.967 | 1496.409 | 1128.554 |
| 128K/128 | 1055.454 | 998.143 | 710.213 | 480.539 |
Decode tok/s
| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 103.460 | 109.152 | 85.487 | 127.515 |
| 4K/128 | 101.964 | 100.048 | 87.375 | 120.163 |
| 32K/128 | 90.438 | 86.774 | 76.994 | 98.073 |
| 128K/128 | 59.598 | 57.954 | 57.341 | 64.478 |
Peak GiB
| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 20.962 | 25.108 | 21.125 | 20.844 |
| 4K/128 | 21.906 | 25.108 | 21.197 | 20.969 |
| 32K/128 | 22.016 | 25.108 | 21.738 | 21.533 |
| 128K/128 | 22.122 | 25.108 | 23.605 | 23.596 |
hipEngine W7900 row source: benchmarks/results/2026-05-25-w7900-hipengine-readme-persistent-5run-diagnostic.json. Both hipEngine columns are 5-run medians from one resident session allocated for the maximum requested context (128K/128), so the peak-memory column is a max-context persistent-session high-water mark rather than each shape's minimum allocation. Existing W7900 llama.cpp HIP/Vulkan Q4_K_M rows are reused unchanged. The hipEngine GGUF Q4_K_S column is compared against the existing llama.cpp Q4_K_M baselines because that is the lineage of measured baselines we have on this host; cross-quant comparisons should be read as approximate.
gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)
The gfx1151 backend is a native --offload-arch=gfx1151 peer backend using the same registry-keyed kernel surface. The Strix Halo snapshot below uses 256-row prefill chunks, which removed the 4K prefill gap without hurting long-context decode.
Prefill tok/s
| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 983.206 | 1058.738 | 638.008 |
| 4K/128 | 1029.402 | 1004.220 | 595.400 |
| 32K/128 | 792.296 | 735.534 | 407.984 |
| 128K/128 | 413.489 | 376.070 | 181.453 |
Decode tok/s
| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 62.060 | 50.537 | 57.615 |
| 4K/128 | 63.605 | 49.379 | 55.027 |
| 32K/128 | 50.629 | 43.435 | 44.576 |
| 128K/128 | 30.245 | 31.286 | 26.935 |
On Strix Halo, rocm-smi / sysfs expose only a 512 MiB VRAM aperture, so cross-engine memory comparisons are omitted here. The hipEngine allocator high-water mark for the chunk256 sweep was 17.997 GiB (512/128), 18.097 GiB (4K/128), 18.909 GiB (32K/128), and 21.877 GiB (128K/128).
See benchmarks/README.md for full protocol details,
correctness status, source-lineage targets, and external comparison baselines.
GGUF Support
As of v0.2.0, hipEngine includes resident Qwen3.6 GGUF support for Q4_K_M and
Q4_K_S model files (with more formats planned). This is a major runtime path,
not just a loader shim: GGUF has its own quant readers, bulk-prefill path,
decode-repacked T16 layouts, and fast-path controls.
Current caveats:
- PARO models take ~24s to load on the W7900 test host; GGUF currently takes about 60s because decode-repack happens on load. On-disk caching could reduce startup time later, but would require additional storage for repacked layouts.
- GGUF has higher resident memory than packed PARO. In the current W7900 README sweep, the max-context Q4_K_S session peaks at ~25.1 GiB tracked, so 128K is W7900/48 GiB territory; on 24 GiB cards, expect roughly 64K context with Q4_K_S.
- GGUF is close enough to PARO to share some high-level scheduling ideas, but in practice it needs substantial GGUF-only kernels and dispatch. The goal for future releases is to keep closing the remaining PARO/GGUF speed gap.
Architecture at a glance
┌─────────────────────────────────────────────────────────────────┐
│ USER API │
│ hipengine.LLM.generate() library API │
│ hipengine.server optional [server] extra │
├─────────────────────────────────────────────────────────────────┤
│ LOADING (torch-free) │
│ safetensors mmap + hipMemcpyAsync / HF config / jinja2 chat │
│ templates / HF tokenizers (Rust) │
├─────────────────────────────────────────────────────────────────┤
│ DISPATCH │
│ Scheduler / Block Manager (KVPolicy) / Prefix Cache │
│ Fusion Planner (chain → kernel plan, fused preferred) │
│ Model / Quant / Layer plugins / Engine loop (hipGraph replay) │
├─────────────────────────────────────────────────────────────────┤
│ CORE (torch-free primitives) │
│ hipengine.Tensor / device / memory / stream / graph / blas │
│ build (hipcc subprocess + ctypes.CDLL + .so cache) │
├─────────────────────────────────────────────────────────────────┤
│ KERNELS (backend-keyed, 120 __global__ in the Qwen/PARO port) │
│ kernels/hip_gfx1100/ attention / linear_attn / moe / quant │
│ wmma / norm / rotary / fused │
│ kernels/hip_gfx1151/ native target-arch peer backend │
│ kernels/cuda_sm86/ (future) │
│ kernels/cpu_reference/ correctness oracle, no GPU required │
└─────────────────────────────────────────────────────────────────┘
Full layer diagram, plugin axes, KV cache ABI, and roadmap are in
docs/PLAN.md.
Installation
# one-time: fetch Git LFS payloads, including the vendored AOTriton runtime/images
git lfs install
git lfs pull
# core runtime (torch-free)
pip install -e .
# with the OpenAI-compatible server
pip install -e ".[server]"
# with the optional dlpack torch bridge for user-boundary interop
pip install -e ".[torch]"
# dev / test
pip install -e ".[dev]"
Python 3.11+. A working ROCm install with libamdhip64.so on the loader path
is required for any GPU run; CPU-reference correctness tests run without a GPU.
Quickstart (Phase 0 — bring-up only)
The public API surface is stable:
from hipengine import LLM, SamplingParams
llm = LLM("/path/to/model", quant="w4_paro") # backend="auto" by default
outputs = llm.generate(
["Hello, hipEngine."],
SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0])
Today LLM.generate() only resolves to narrow Qwen3.5 / PARO bring-up paths
registered in hipengine.generation; unsupported (model, backend, quant)
combinations fail loudly rather than falling back to a generic torch path. See
docs/PLAN.md for the model / quant roadmap.
OpenAI-compatible server
Install the optional server extra and run the FastAPI layer:
pip install -e ".[server]"
python -m hipengine.server \
--model /path/to/model \
--quant w4_paro \
--served-model-name qwen-paro
Supported v0.1 endpoints: GET /v1/models, POST /v1/completions, and
POST /v1/chat/completions (including one-chunk SSE for stream=true). See
docs/API.md for request examples, bearer-token auth, and
current limitations.
Documentation
| File | Purpose |
|---|---|
docs/PLAN.md |
Architecture, plugin axes, phase roadmap, LoC budgets |
docs/BENCHMARK.md |
Benchmark protocols, baselines, correctness gate, artifact format |
docs/TESTING.md |
RED/GREEN workflow, correctness oracles, fixture policy |
docs/KERNELS.md |
Kernel catalog, source-lineage drift workflow, JIT cache gotchas, build profiles |
docs/ENVS.md |
Environment variables, TheRock setup, benchmark/profiling profiles |
docs/ROOFLINE.md |
RDNA3 / W7900 performance model and decision tree |
docs/IMPLEMENTATION.md |
Implementation status and concrete milestones |
docs/API.md |
OpenAI-compatible server usage and endpoint support |
docs/PREFILL.md |
Native prefill implementation spec |
docs/MTP.md |
Multi-token prediction plan |
docs/DFLASH.md |
DFlash draft-model speculative decode plan |
benchmarks/README.md |
Current-fastest rollup and external comparison baselines |
AGENTS.md |
Ground rules for every coding / review / benchmarking task |
WORKLOG.md |
Append-only cross-session journal of decisions and measurements |
Development
# narrowest test suite (CPU-only paths run without a GPU)
pytest -q
# kernel source-lineage drift check before any port
python3 scripts/check_lineage.py --kind kernel --diff stat
See AGENTS.md for the full workflow: when to run the
CPU-reference correctness gate, when to add a rocprofv3 --kernel-trace smoke,
and what a retained benchmark row requires.
References & lineage
hipEngine is not a fork of any project; it is a brand new codebase with from-scratch code and kernels. Of course it builds on the work of many others:
- ROCm - of course this all sits on AMD's open-source compute stack, notably on HIP.
- Nano-vLLM - most of the original kernel tuning iteration loops used this as a host-layer. Some of the performance limitations of the architecture motivated the hipEngine rewrite, but we remain greatful and deeply appreciative of nano-vllm as a great research platform.
- ParoQuant - after reviewing the current SOTA on model quantization, we chose ParoQuant as the first target due to both its excellent accuracy and its efficiency (QTIP/YAQA is very cool but proved challenging to implement performant RDNA3 kernels)
- FastDMS - our KVCache ABI is shaped by the lessons learned from building our DMS reference implementation.
Greetz: hipfire, Lucebox, DS4, ExLlamaV3 and ofc the og llama.cpp
See also: Marlin, kernel-anvil, wmma_ops, tilelang, fsr4-rdna3-optimization, ROCm examples
License
hipEngine source code is licensed under AGPL-3.0-or-later. It is built and distributed for anyone who has an AMD card that hasn't been living up to its compute potential.
Model weights, checkpoints, and external datasets remain under their own licenses.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file hipengine-0.2.0.tar.gz.
File metadata
- Download URL: hipengine-0.2.0.tar.gz
- Upload date:
- Size: 13.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
314431c671e297f9dc2da1d592bb2a17631d7c0a8a1164b14bcaa22854fea639
|
|
| MD5 |
8c21fd61d7483233302fc84401230f7b
|
|
| BLAKE2b-256 |
22692de2e27de3fcaf3e3899d442527871dc7591516fcf098abc854a2ff6fb53
|
Provenance
The following attestation bundles were made for hipengine-0.2.0.tar.gz:
Publisher:
publish.yml on shisa-ai/hipEngine
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
hipengine-0.2.0.tar.gz -
Subject digest:
314431c671e297f9dc2da1d592bb2a17631d7c0a8a1164b14bcaa22854fea639 - Sigstore transparency entry: 1624902189
- Sigstore integration time:
-
Permalink:
shisa-ai/hipEngine@36e351607955158e821b51f443bac042b140624f -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/shisa-ai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@36e351607955158e821b51f443bac042b140624f -
Trigger Event:
push
-
Statement type:
File details
Details for the file hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 17.2 MB
- Tags: Python 3, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
84783a8f5069c0cf16a163b255b0d34f44501bf93d8bb2327718ec749a1294d9
|
|
| MD5 |
6452d4f983c66d59ec1ae1bfee9eaa9e
|
|
| BLAKE2b-256 |
d252b181bac6783db2394c74c39156fe4557c2e626ef6eb9c01b71212347739b
|
Provenance
The following attestation bundles were made for hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl:
Publisher:
publish.yml on shisa-ai/hipEngine
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl -
Subject digest:
84783a8f5069c0cf16a163b255b0d34f44501bf93d8bb2327718ec749a1294d9 - Sigstore transparency entry: 1624902230
- Sigstore integration time:
-
Permalink:
shisa-ai/hipEngine@36e351607955158e821b51f443bac042b140624f -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/shisa-ai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@36e351607955158e821b51f443bac042b140624f -
Trigger Event:
push
-
Statement type: