Skip to main content

ROCm-native local LLM inference engine with a torch-free runtime hot path

Project description

hipEngine

hipEngine is a ROCm-native local LLM inference engine designed from the ground up for AMD RDNA GPUs (starting with gfx1100, gfx1151). It pairs a small purpose-built Python host with a complete suite of custom-tuned HIP kernels developed through 100+ iterations of profiling and tuning.

hipEngine has lightweight dependencies with no PyTorch required for fully supported GPUs and models.

Core principles

  • HIP-first, not CUDA-ported. Kernels directly target AMD hardware like gfx1100/RDNA3 with wave32, vec8 FMA, and the actual cache hierarchy.
  • Torch-free runtime. import torch is not on the hot path. The runtime owns a thin hipengine.Tensor over raw HIP/CUDA device pointers and drives hipblasLt, hipGraph, AOTriton, and JIT builds through ctypes. Torch appears only as an optional dlpack bridge behind the hipengine[torch] extra (~125 MiB install including the vendored AOTriton subset vs ~2 GiB with torch).
  • Multi-backend from day one. Kernels live under kernels/hip_gfx1100/, kernels/hip_gfx1151/, kernels/cuda_sm86/, kernels/cpu_reference/ as peer trees.
  • Four-axis plugin registry. Kernels are keyed by (backend, layer, quant, variant). Models, quant schemes, and layers are plugins. No if backend == "..." or if quant == "..." branches in dispatch / engine / model code.
  • Fused + unfused coexist. Every fused composite (rmsnorm+rotate, gate_combine_residual, …) has a numerically-equivalent unfused chain registered under its primitives, used as both fallback and correctness baseline.
  • Evidence-backed performance. Every performance claim ships with model + quant + workload shape + hardware + exact command + correctness gate (KL ≤ 0.05, top-1 ≥ 90% vs kernels/cpu_reference/). See docs/BENCHMARK.md and benchmarks/README.md.

Status

v0.2.0 alpha. The runtime hot path is torch-free by construction, and the first two 35B-class model-loading surfaces are now available on gfx1100: shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed (19.07 GiB, 4.68 bpw) in packed ParoQuant format, plus Qwen3.6 GGUF Q4_K_M / Q4_K_S files through the new resident GGUF path.

  • INT8 KV cache support has been added for PARO. Qwen 3 MoE's full 256K context window can fit in <24GB tracked memory; see Memory Usage.
  • Qwen 3.6 Q4_K_M and Q4_K_S GGUF support has landed (W7900 Q4_K_S sweep is in Performance alongside packed PARO and llama.cpp Q4_K_M HIP/Vulkan baselines). GGUF uses a substantial GGUF-specific runtime path with bulk prefill, graph decode, and on-load decode-repack into T16 tile layouts. Q4_K_S is recommended on 24 GiB cards because Q4_K_M is bigger; on the 48 GiB W7900 Q4_K_S fits all the way to 128K context, while on 24 GiB cards expect roughly 64K. GGUF also has a higher per-session load cost (~60 s vs ~24 s for PARO packed on the same hardware) for the same decode-repack reason.
  • Current gfx1100 performance snapshots are summarized in Performance and compared against recent llama.cpp Q4_K_M baselines.

Hardware targets

Backend Hardware Status
cpu_reference Any CPU, numpy Correctness oracle; CI without GPU
hip_gfx1100 AMD Radeon Pro W7900 / RX 7900 XTX (RDNA3) Active backend
hip_gfx1151 AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, RDNA3.5) Active backend
cuda_sm86 NVIDIA Ampere consumer (3090-class) Planned peer backend

backend="auto" is the public API/server default. It maps exact gfx1100 and gfx1151 detections to the matching HIP backend; unknown ROCm targets warn and select cpu_reference where a CPU implementation exists. Users on nearby targets such as gfx1101/gfx1102 can force a backend with backend="hip_gfx1100", --backend hip_gfx1100, or HIPENGINE_BACKEND=hip_gfx1100 after validating correctness/performance.

Wave32 is the default for hip_gfx1100 device code; wave64 is treated as an isolated experiment with its own gates (see docs/PLAN.md).

Memory Usage

With BF16 KV cache, hipEngine running the packed Qwen 3.6 PARO model fits a 128K context window in a 24GB-class memory budget. The INT8 KV cache option (with FP16 per-token/per-head scales) uses the --kv-storage int8_per_token_head flag and lets the full 256K context fit under 24 GiB tracked allocator peak.

The numbers below are for shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed on W7900/gfx1100 with q3072 full-attention prefill chunks:

Model Context KV cache Sampled peak Allocator peak Retained KV Prefill Decode
Qwen3.6 35B-A3B PARO 128K BF16 21.04 GiB 21.88 GiB 2.69 GiB 1091.9 tok/s 62.2 tok/s
Qwen3.6 35B-A3B PARO 128K INT8 19.80 GiB 20.89 GiB 1.36 GiB 1076.5 tok/s 60.0 tok/s
Qwen3.6 35B-A3B PARO 256K INT8 21.96 GiB 23.71 GiB 2.71 GiB 670.2 tok/s 40.3 tok/s

Regardless of the difference in PARO weight storage (legacy or packed), loaded-weight memory is about the same — approximately 16.4 GiB in VRAM.

The INT8 KV correctness gate is currently the deterministic Qwen3.5 PARO fixture fixtures/qwen35_paro/parent_512_32_seed1234.json (512-token prompt, 32 greedy decode tokens): max_kl=0.015328, mean_kl=0.001639, top-1 agreement 100%, and generated IDs match BF16 KV exactly. Layer attention probes at context 64 and 520 also had top-1 agreement 100% with max quantized-vs-BF16 KL 2.34e-7. This is a fixture/regression gate, not a long-rollout perplexity study, so long context generations may have unmeasured compounding errors.

The same 128K/128 Qwen3.5 BF16-vs-INT8 run measured -0.99% prefill tok/s and -3.20% decode tok/s for INT8 KV, so speed loss is also very small.

See benchmarks/results/2026-05-19-hipengine-qwen36-packed-int8-kv-readme-memory-diagnostic.json, benchmarks/README.md, and docs/KVCACHE.md for commands, artifacts, and the full no-shadow memory audit.

llama.cpp

When run with q8_0 kvcache, llama.cpp can also fit in 24GB:

--flash-attn on -ctk q8_0 -ctv q8_0 -c 262144 -b 128 -ub 128

Results:

Model llama.cpp model buffer KV cache Compute buffer rocm-smi VRAM used Free VRAM
Q4_K_M 20583 MiB 2720 MiB 203 MiB 24017 MiB / 23.45 GiB ~543 MiB
Q4_K_S 19399 MiB 2720 MiB 203 MiB 22832 MiB / 22.30 GiB ~1728 MiB

With -ub 512:

Model Compute buffer rocm-smi VRAM used Free VRAM
Q4_K_M 812 MiB 24540 MiB ~20 MiB
Q4_K_S 812 MiB 23443 MiB ~1117 MiB
  • Note Q4_K_M is incredibly tight with only 20 MiB of headroom and you may either need to resize down or set -b 512 -ub 128.
  • Q4_K_S does not need small -b/-ub; -ub 512 fits fine, and can even increase to -b 2048 (but -ub is the more important VRAM knob that controls the physical microbatch / compute buffer size for llama.cpp).

Performance

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

While we are far from gfx1100 roofline, the current gfx1100 implementation does well compared to Q4_K_M quants of recent llama.cpp builds (b9042) on the same model family. The latest W7900 hipEngine rows use TheRock ROCm 7.13 and load each resident model once for 1 warmup + 5 measured in-session repetitions per shape. PARO uses the default prefill policy: 512-token prompts stay unchunked and prompts above 1K use 1024/1024/4096/1024/1024 chunks. The hipEngine GGUF Q4_K_S column uses the same chunked-prefill policy plus the WMMA prefill + GEMV decode fast paths and the persistent on-load decode-repack into T16 tile layouts.

Prefill tok/s

Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan
512/128 2718.497 2258.847 2436.049 1816.927
4K/128 2838.773 2576.673 2176.905 1705.093
32K/128 2074.699 1893.967 1496.409 1128.554
128K/128 1055.454 998.143 710.213 480.539

Decode tok/s

Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan
512/128 103.460 109.152 85.487 127.515
4K/128 101.964 100.048 87.375 120.163
32K/128 90.438 86.774 76.994 98.073
128K/128 59.598 57.954 57.341 64.478

Peak GiB

Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan
512/128 20.962 25.108 21.125 20.844
4K/128 21.906 25.108 21.197 20.969
32K/128 22.016 25.108 21.738 21.533
128K/128 22.122 25.108 23.605 23.596

hipEngine W7900 row source: benchmarks/results/2026-05-25-w7900-hipengine-readme-persistent-5run-diagnostic.json. Both hipEngine columns are 5-run medians from one resident session allocated for the maximum requested context (128K/128), so the peak-memory column is a max-context persistent-session high-water mark rather than each shape's minimum allocation. Existing W7900 llama.cpp HIP/Vulkan Q4_K_M rows are reused unchanged. The hipEngine GGUF Q4_K_S column is compared against the existing llama.cpp Q4_K_M baselines because that is the lineage of measured baselines we have on this host; cross-quant comparisons should be read as approximate.

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

The gfx1151 backend is a native --offload-arch=gfx1151 peer backend using the same registry-keyed kernel surface. The Strix Halo snapshot below uses 256-row prefill chunks, which removed the 4K prefill gap without hurting long-context decode.

Prefill tok/s

Workload hipEngine PARO llama.cpp HIP llama.cpp Vulkan
512/128 983.206 1058.738 638.008
4K/128 1029.402 1004.220 595.400
32K/128 792.296 735.534 407.984
128K/128 413.489 376.070 181.453

Decode tok/s

Workload hipEngine PARO llama.cpp HIP llama.cpp Vulkan
512/128 62.060 50.537 57.615
4K/128 63.605 49.379 55.027
32K/128 50.629 43.435 44.576
128K/128 30.245 31.286 26.935

On Strix Halo, rocm-smi / sysfs expose only a 512 MiB VRAM aperture, so cross-engine memory comparisons are omitted here. The hipEngine allocator high-water mark for the chunk256 sweep was 17.997 GiB (512/128), 18.097 GiB (4K/128), 18.909 GiB (32K/128), and 21.877 GiB (128K/128).

See benchmarks/README.md for full protocol details, correctness status, source-lineage targets, and external comparison baselines.

GGUF Support

As of v0.2.0, hipEngine includes resident Qwen3.6 GGUF support for Q4_K_M and Q4_K_S model files (with more formats planned). This is a major runtime path, not just a loader shim: GGUF has its own quant readers, bulk-prefill path, decode-repacked T16 layouts, and fast-path controls.

Current caveats:

  • PARO models take ~24s to load on the W7900 test host; GGUF currently takes about 60s because decode-repack happens on load. On-disk caching could reduce startup time later, but would require additional storage for repacked layouts.
  • GGUF has higher resident memory than packed PARO. In the current W7900 README sweep, the max-context Q4_K_S session peaks at ~25.1 GiB tracked, so 128K is W7900/48 GiB territory; on 24 GiB cards, expect roughly 64K context with Q4_K_S.
  • GGUF is close enough to PARO to share some high-level scheduling ideas, but in practice it needs substantial GGUF-only kernels and dispatch. The goal for future releases is to keep closing the remaining PARO/GGUF speed gap.

Architecture at a glance

┌─────────────────────────────────────────────────────────────────┐
│  USER API                                                       │
│  hipengine.LLM.generate()           library API                 │
│  hipengine.server                   optional [server] extra     │
├─────────────────────────────────────────────────────────────────┤
│  LOADING (torch-free)                                           │
│  safetensors mmap + hipMemcpyAsync / HF config / jinja2 chat    │
│  templates / HF tokenizers (Rust)                               │
├─────────────────────────────────────────────────────────────────┤
│  DISPATCH                                                       │
│  Scheduler / Block Manager (KVPolicy) / Prefix Cache            │
│  Fusion Planner (chain → kernel plan, fused preferred)          │
│  Model / Quant / Layer plugins / Engine loop (hipGraph replay)  │
├─────────────────────────────────────────────────────────────────┤
│  CORE (torch-free primitives)                                   │
│  hipengine.Tensor / device / memory / stream / graph / blas     │
│  build (hipcc subprocess + ctypes.CDLL + .so cache)             │
├─────────────────────────────────────────────────────────────────┤
│  KERNELS (backend-keyed, 120 __global__ in the Qwen/PARO port)  │
│  kernels/hip_gfx1100/  attention / linear_attn / moe / quant    │
│                        wmma / norm / rotary / fused             │
│  kernels/hip_gfx1151/  native target-arch peer backend          │
│  kernels/cuda_sm86/    (future)                                 │
│  kernels/cpu_reference/ correctness oracle, no GPU required     │
└─────────────────────────────────────────────────────────────────┘

Full layer diagram, plugin axes, KV cache ABI, and roadmap are in docs/PLAN.md.

Installation

# one-time: fetch Git LFS payloads, including the vendored AOTriton runtime/images
git lfs install
git lfs pull

# core runtime (torch-free)
pip install -e .

# with the OpenAI-compatible server
pip install -e ".[server]"

# with the optional dlpack torch bridge for user-boundary interop
pip install -e ".[torch]"

# dev / test
pip install -e ".[dev]"

Python 3.11+. A working ROCm install with libamdhip64.so on the loader path is required for any GPU run; CPU-reference correctness tests run without a GPU.

Quickstart (Phase 0 — bring-up only)

The public API surface is stable:

from hipengine import LLM, SamplingParams

llm = LLM("/path/to/model", quant="w4_paro")  # backend="auto" by default
outputs = llm.generate(
    ["Hello, hipEngine."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0])

Today LLM.generate() only resolves to narrow Qwen3.5 / PARO bring-up paths registered in hipengine.generation; unsupported (model, backend, quant) combinations fail loudly rather than falling back to a generic torch path. See docs/PLAN.md for the model / quant roadmap.

OpenAI-compatible server

Install the optional server extra and run the FastAPI layer:

pip install -e ".[server]"
python -m hipengine.server \
  --model /path/to/model \
  --quant w4_paro \
  --served-model-name qwen-paro

Supported v0.1 endpoints: GET /v1/models, POST /v1/completions, and POST /v1/chat/completions (including one-chunk SSE for stream=true). See docs/API.md for request examples, bearer-token auth, and current limitations.

Documentation

File Purpose
docs/PLAN.md Architecture, plugin axes, phase roadmap, LoC budgets
docs/BENCHMARK.md Benchmark protocols, baselines, correctness gate, artifact format
docs/TESTING.md RED/GREEN workflow, correctness oracles, fixture policy
docs/KERNELS.md Kernel catalog, source-lineage drift workflow, JIT cache gotchas, build profiles
docs/ENVS.md Environment variables, TheRock setup, benchmark/profiling profiles
docs/ROOFLINE.md RDNA3 / W7900 performance model and decision tree
docs/IMPLEMENTATION.md Implementation status and concrete milestones
docs/API.md OpenAI-compatible server usage and endpoint support
docs/PREFILL.md Native prefill implementation spec
docs/MTP.md Multi-token prediction plan
docs/DFLASH.md DFlash draft-model speculative decode plan
benchmarks/README.md Current-fastest rollup and external comparison baselines
AGENTS.md Ground rules for every coding / review / benchmarking task
WORKLOG.md Append-only cross-session journal of decisions and measurements

Development

# narrowest test suite (CPU-only paths run without a GPU)
pytest -q

# kernel source-lineage drift check before any port
python3 scripts/check_lineage.py --kind kernel --diff stat

See AGENTS.md for the full workflow: when to run the CPU-reference correctness gate, when to add a rocprofv3 --kernel-trace smoke, and what a retained benchmark row requires.

References & lineage

hipEngine is not a fork of any project; it is a brand new codebase with from-scratch code and kernels. Of course it builds on the work of many others:

  • ROCm - of course this all sits on AMD's open-source compute stack, notably on HIP.
  • Nano-vLLM - most of the original kernel tuning iteration loops used this as a host-layer. Some of the performance limitations of the architecture motivated the hipEngine rewrite, but we remain greatful and deeply appreciative of nano-vllm as a great research platform.
  • ParoQuant - after reviewing the current SOTA on model quantization, we chose ParoQuant as the first target due to both its excellent accuracy and its efficiency (QTIP/YAQA is very cool but proved challenging to implement performant RDNA3 kernels)
  • FastDMS - our KVCache ABI is shaped by the lessons learned from building our DMS reference implementation.

Greetz: hipfire, Lucebox, DS4, ExLlamaV3 and ofc the og llama.cpp

See also: Marlin, kernel-anvil, wmma_ops, tilelang, fsr4-rdna3-optimization, ROCm examples

License

hipEngine source code is licensed under AGPL-3.0-or-later. It is built and distributed for anyone who has an AMD card that hasn't been living up to its compute potential.

Model weights, checkpoints, and external datasets remain under their own licenses.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hipengine-0.2.0.tar.gz (13.9 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl (17.2 MB view details)

Uploaded Python 3manylinux: glibc 2.39+ x86-64

File details

Details for the file hipengine-0.2.0.tar.gz.

File metadata

  • Download URL: hipengine-0.2.0.tar.gz
  • Upload date:
  • Size: 13.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for hipengine-0.2.0.tar.gz
Algorithm Hash digest
SHA256 314431c671e297f9dc2da1d592bb2a17631d7c0a8a1164b14bcaa22854fea639
MD5 8c21fd61d7483233302fc84401230f7b
BLAKE2b-256 22692de2e27de3fcaf3e3899d442527871dc7591516fcf098abc854a2ff6fb53

See more details on using hashes here.

Provenance

The following attestation bundles were made for hipengine-0.2.0.tar.gz:

Publisher: publish.yml on shisa-ai/hipEngine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl.

File metadata

File hashes

Hashes for hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 84783a8f5069c0cf16a163b255b0d34f44501bf93d8bb2327718ec749a1294d9
MD5 6452d4f983c66d59ec1ae1bfee9eaa9e
BLAKE2b-256 d252b181bac6783db2394c74c39156fe4557c2e626ef6eb9c01b71212347739b

See more details on using hashes here.

Provenance

The following attestation bundles were made for hipengine-0.2.0-py3-none-manylinux_2_39_x86_64.whl:

Publisher: publish.yml on shisa-ai/hipEngine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page