ROCm-native local LLM inference engine with a torch-free runtime hot path

These details have not been verified by PyPI

Project links

Project description

hipEngine

hipEngine is a ROCm-native local LLM inference engine designed from the ground up for AMD RDNA GPUs (starting with gfx1100, gfx1151). It pairs a small purpose-built Python host with a complete suite of custom-tuned HIP kernels developed through 100+ iterations of profiling and tuning.

hipEngine has lightweight dependencies with no PyTorch required for fully supported GPUs and models.

Core principles

HIP-first, not CUDA-ported. Kernels directly target AMD hardware like gfx1100/RDNA3 with wave32, vec8 FMA, and the actual cache hierarchy.
Torch-free runtime. import torch is not on the hot path. The runtime owns a thin hipengine.Tensor over raw HIP/CUDA device pointers and drives hipblasLt, hipGraph, AOTriton, and JIT builds through ctypes. Torch appears only as an optional dlpack bridge behind the hipengine[torch] extra (~125 MiB install including the vendored AOTriton subset vs ~2 GiB with torch).
Multi-backend from day one. Kernels live under kernels/hip_gfx1100/, kernels/hip_gfx1151/, kernels/cuda_sm86/, kernels/cpu_reference/ as peer trees.
Four-axis plugin registry. Kernels are keyed by (backend, layer, quant, variant). Models, quant schemes, and layers are plugins. No if backend == "..." or if quant == "..." branches in dispatch / engine / model code.
Fused + unfused coexist. Every fused composite (rmsnorm+rotate, gate_combine_residual, …) has a numerically-equivalent unfused chain registered under its primitives, used as both fallback and correctness baseline.
Evidence-backed performance. Every performance claim ships with model + quant + workload shape + hardware + exact command + correctness gate (KL ≤ 0.05, top-1 ≥ 90% vs kernels/cpu_reference/). See docs/BENCHMARK.md and benchmarks/README.md.

Status

v0.1.x. The runtime hot path is torch-free by construction; kernel families and registry plumbing are landing under hipengine/kernels/hip_gfx1100/. Current single-model tuning targets shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed (19.07 GiB, 4.68 bpw) in packed ParoQuant format.

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

While we are far from gfx1100 roofline, the current gfx1100 implementation does well compared to Q4_K_M quants of recent llama.cpp builds (b9042) on the same model family. The latest W7900 packed rows use the default prefill policy: 512-token prompts stay unchunked and prompts above 1K use 1024/1024/4096/1024/1024 chunks.

Prefill tok/s

Workload	hipEngine shisa Qwen3.6 packed PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	2500.565	2436.049	1816.927
4K/128	2899.685	2176.905	1705.093
32K/128	2115.050	1496.409	1128.554
128K/128	1054.291	710.213	480.539

Decode tok/s

Workload	hipEngine shisa Qwen3.6 packed PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	111.516	85.487	127.515
4K/128	113.094	87.375	120.163
32K/128	97.594	76.994	98.073
128K/128	62.027	57.341	64.478

Peak GiB

Workload	hipEngine shisa Qwen3.6 packed PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	18.123	21.125	20.844
4K/128	19.455	21.197	20.969
32K/128	20.267	21.738	21.533
128K/128	23.235	23.605	23.596

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

The gfx1151 backend is a native --offload-arch=gfx1151 peer backend using the same registry-keyed kernel surface. The Strix Halo snapshot below uses 256-row prefill chunks, which removed the 4K prefill gap without hurting long-context decode.

Prefill tok/s

Workload	hipEngine shisa Qwen3.6 packed PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	983.206	1058.738	638.008
4K/128	1029.402	1004.220	595.400
32K/128	792.296	735.534	407.984
128K/128	413.489	376.070	181.453

Decode tok/s

Workload	hipEngine shisa Qwen3.6 packed PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	62.060	50.537	57.615
4K/128	63.605	49.379	55.027
32K/128	50.629	43.435	44.576
128K/128	30.245	31.286	26.935

On Strix Halo, rocm-smi / sysfs expose only a 512 MiB VRAM aperture, so cross-engine memory comparisons are omitted here. The hipEngine allocator high-water mark for the chunk256 sweep was 17.997 GiB (512/128), 18.097 GiB (4K/128), 18.909 GiB (32K/128), and 21.877 GiB (128K/128).

See benchmarks/README.md for full protocol details, correctness status, source-lineage targets, and external comparison baselines.

Hardware targets

Backend	Hardware	Status
`hip_gfx1100`	AMD Radeon Pro W7900 / RX 7900 XTX (RDNA3)	Primary, in active bring-up
`hip_gfx1151`	AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, RDNA3.5)	Active backend
`cuda_sm86`	NVIDIA Ampere consumer (3090-class)	Planned peer backend
`cpu_reference`	Any CPU, numpy	Correctness oracle; CI without GPU

backend="auto" is the public API/server default. It maps exact gfx1100 and gfx1151 detections to the matching HIP backend; unknown ROCm targets warn and select cpu_reference where a CPU implementation exists. Users on nearby targets such as gfx1101/gfx1102 can force a backend with backend="hip_gfx1100", --backend hip_gfx1100, or HIPENGINE_BACKEND=hip_gfx1100 after validating correctness/performance.

Wave32 is the default for hip_gfx1100 device code; wave64 is treated as an isolated experiment with its own gates (see docs/PLAN.md).

Architecture at a glance

┌─────────────────────────────────────────────────────────────────┐
│  USER API                                                       │
│  hipengine.LLM.generate()           library API                 │
│  hipengine.server                   optional [server] extra     │
├─────────────────────────────────────────────────────────────────┤
│  LOADING (torch-free)                                           │
│  safetensors mmap + hipMemcpyAsync / HF config / jinja2 chat    │
│  templates / HF tokenizers (Rust)                               │
├─────────────────────────────────────────────────────────────────┤
│  DISPATCH                                                       │
│  Scheduler / Block Manager (KVPolicy) / Prefix Cache            │
│  Fusion Planner (chain → kernel plan, fused preferred)          │
│  Model / Quant / Layer plugins / Engine loop (hipGraph replay)  │
├─────────────────────────────────────────────────────────────────┤
│  CORE (torch-free primitives)                                   │
│  hipengine.Tensor / device / memory / stream / graph / blas     │
│  build (hipcc subprocess + ctypes.CDLL + .so cache)             │
├─────────────────────────────────────────────────────────────────┤
│  KERNELS (backend-keyed, 120 __global__ in the Qwen/PARO port)  │
│  kernels/hip_gfx1100/  attention / linear_attn / moe / quant    │
│                        wmma / norm / rotary / fused             │
│  kernels/hip_gfx1151/  native target-arch peer backend           │
│  kernels/cuda_sm86/    (future)                                 │
│  kernels/cpu_reference/ correctness oracle, no GPU required     │
└─────────────────────────────────────────────────────────────────┘

Full layer diagram, plugin axes, KV cache ABI, and roadmap are in docs/PLAN.md.

Installation

# one-time: fetch Git LFS payloads, including the vendored AOTriton runtime/images
git lfs install
git lfs pull

# core runtime (torch-free)
pip install -e .

# with the OpenAI-compatible server
pip install -e ".[server]"

# with the optional dlpack torch bridge for user-boundary interop
pip install -e ".[torch]"

# dev / test
pip install -e ".[dev]"

Python 3.11+. A working ROCm install with libamdhip64.so on the loader path is required for any GPU run; CPU-reference correctness tests run without a GPU.

Quickstart (Phase 0 — bring-up only)

The public API surface is stable:

from hipengine import LLM, SamplingParams

llm = LLM("/path/to/model", quant="w4_paro")  # backend="auto" by default
outputs = llm.generate(
    ["Hello, hipEngine."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0])

Today LLM.generate() only resolves to narrow Qwen3.5 / PARO bring-up paths registered in hipengine.generation; unsupported (model, backend, quant) combinations fail loudly rather than falling back to a generic torch path. See docs/PLAN.md for the model / quant roadmap.

OpenAI-compatible server

Install the optional server extra and run the FastAPI layer:

pip install -e ".[server]"
python -m hipengine.server \
  --model /path/to/model \
  --quant w4_paro \
  --served-model-name qwen-paro

Supported v0.1 endpoints: GET /v1/models, POST /v1/completions, and POST /v1/chat/completions (including one-chunk SSE for stream=true). See docs/API.md for request examples, bearer-token auth, and current limitations.

Documentation

File	Purpose
`docs/PLAN.md`	Architecture, plugin axes, phase roadmap, LoC budgets
`docs/BENCHMARK.md`	Benchmark protocols, baselines, correctness gate, artifact format
`docs/TESTING.md`	RED/GREEN workflow, correctness oracles, fixture policy
`docs/KERNELS.md`	Kernel catalog, source-lineage drift workflow, JIT cache gotchas, build profiles
`docs/ROOFLINE.md`	RDNA3 / W7900 performance model and decision tree
`docs/IMPLEMENTATION.md`	Implementation status and concrete milestones
`docs/API.md`	OpenAI-compatible server usage and endpoint support
`docs/PREFILL.md`	Native prefill implementation spec
`docs/MTP.md`	Multi-token prediction plan
`docs/DFLASH.md`	DFlash draft-model speculative decode plan
`benchmarks/README.md`	Current-fastest rollup and external comparison baselines
`AGENTS.md`	Ground rules for every coding / review / benchmarking task
`WORKLOG.md`	Append-only cross-session journal of decisions and measurements

Development

# narrowest test suite (CPU-only paths run without a GPU)
pytest -q

# kernel source-lineage drift check before any port
python3 scripts/check_lineage.py --kind kernel --diff stat

See AGENTS.md for the full workflow: when to run the CPU-reference correctness gate, when to add a rocprofv3 --kernel-trace smoke, and what a retained benchmark row requires.

References & lineage

hipEngine is not a fork of any project; it is a brand new codebase with from-scratch code and kernels. Of course it builds on the work of many others:

ROCm - of course this all sits on AMD's open-source compute stack, notably on HIP.
Nano-vLLM - most of the original kernel tuning iteration loops used this as a host-layer. Some of the performance limitations of the architecture motivated the hipEngine rewrite, but we remain greatful and deeply appreciative of nano-vllm as a great research platform.
ParoQuant - after reviewing the current SOTA on model quantization, we chose ParoQuant as the first target due to both its excellent accuracy and its efficiency (QTIP/YAQA is very cool but proved challenging to implement performant RDNA3 kernels)
FastDMS - our KVCache ABI is shaped by the lessons learned from building our DMS reference implementation.

Greetz: hipfire, Lucebox, DS4, ExLlamaV3 and ofc the og llama.cpp

License

hipEngine source code is licensed under AGPL-3.0-or-later. It is built and distributed for anyone who has an AMD card that hasn't been living up to its compute potential.

Model weights, checkpoints, and external datasets remain under their own licenses.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.2

May 25, 2026

0.2.1

May 24, 2026

0.2.0

May 24, 2026

0.1.1

May 19, 2026

This version

0.1.0

May 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hipengine-0.1.0.tar.gz (12.9 MB view details)

Uploaded May 17, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl (16.9 MB view details)

Uploaded May 17, 2026 Python 3manylinux: glibc 2.39+ x86-64

File details

Details for the file hipengine-0.1.0.tar.gz.

File metadata

Download URL: hipengine-0.1.0.tar.gz
Upload date: May 17, 2026
Size: 12.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.9

File hashes

Hashes for hipengine-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`600c88880aaddc1a2a2d6f17d80cd8fd1197e6dafb74662dc8150e9cd2d1d030`
MD5	`bf68e8ac0f74080f3746fb9a35dccad3`
BLAKE2b-256	`1785202f8b1450062dfcb65a04f9e13fbe3d08b9ea7fc365d93f508148d83fa5`

See more details on using hashes here.

File details

Details for the file hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl.

File metadata

Download URL: hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl
Upload date: May 17, 2026
Size: 16.9 MB
Tags: Python 3, manylinux: glibc 2.39+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.9

File hashes

Hashes for hipengine-0.1.0-py3-none-manylinux_2_39_x86_64.whl
Algorithm	Hash digest
SHA256	`9c7d612941575504114037da466fc0acb5c8bec1fd81d945bee89390c1c786fa`
MD5	`fb2660587021c5cf2996f0e6f06c9280`
BLAKE2b-256	`6328dca68ba70a73da5d56c7500e544ba45742d8f25abfb52a98bf4a82e42d32`

See more details on using hashes here.

hipengine 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

hipEngine

Core principles

Status

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

Prefill tok/s

Decode tok/s

Peak GiB

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

Prefill tok/s

Decode tok/s

Hardware targets

Architecture at a glance

Installation

Quickstart (Phase 0 — bring-up only)

OpenAI-compatible server

Documentation

Development

References & lineage

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes