PACE (Platform Aware Compute Engine): high-performance LLM inference on AMD CPUs.
AMD PACE
High-performance LLM inference on AMD EPYC CPUs. PACE is a PyTorch C++ extension with custom AVX512 kernels, slab/paged KV cache, fused operators, and a production-ready serving stack. Check out the GitHub repository for more information.
PACE achieves 1.6x higher autoregressive and 3.2x higher speculative-decoding throughput compared to vLLM on 5th Gen AMD EPYC processors. More details and technical results here.
Highlights
- SlabPool attention - CPU-native KV cache and attention backend with O(1) slab allocation, L2-aware block sizing, and a unified dispatcher that picks the optimal kernel path per sequence (GQA decode, multi-token decode, tiled prefill) within one OMP dispatch. Continuous batching, sliding-window, and sink attention go through a single entry point.
- Inference server - pace-server provides a router/engine serving stack with continuous batching, multi-instance NUMA-aware execution, and built-in metrics. The launcher partitions CPU cores across engine instances and binds memory to the local NUMA node.
- Paged attention - vLLM-style paged KV cache on CPU, fully integrated with PACE's serving stack and all supported models.
- Fused AVX512 kernels - fused Add+RMSNorm, Add+LayerNorm, RoPE, QKV projections, and a fused MLP kernel (via TPP/libXSMM). Default for all supported models.
- Broad model support - Llama (up to 3.3), Qwen2/2.5, Phi3/4, Gemma 3, GPT-J, OPT, and GPT-OSS, all running in BF16 under one operator and backend framework. Adding a new architecture is a single-file effort.
- Speculative decoding (PARD) - built-in parallel-draft speculation, up to 5x throughput over standard autoregressive decoding.
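To make the "O(1) slab allocation" idea in the SlabPool highlight concrete, here is a minimal sketch of a fixed-size slab pool backed by a free list. It is an illustration only and not PACE's implementation: the class name `SlabPool`, its methods, and the use of integer slab indices are all assumptions made for this example.

```python
class SlabPool:
    """Toy fixed-size slab allocator (hypothetical, not PACE's code).

    Free slab indices live on a stack, so alloc and free are O(1)
    (set operations are O(1) on average).
    """

    def __init__(self, num_slabs: int) -> None:
        # Reverse order so that the lowest index is handed out first.
        self._free = list(range(num_slabs - 1, -1, -1))
        self._in_use = set()

    def alloc(self) -> int:
        """Pop one free slab index; no searching or compaction needed."""
        if not self._free:
            raise MemoryError("slab pool exhausted")
        slab = self._free.pop()
        self._in_use.add(slab)
        return slab

    def free(self, slab: int) -> None:
        """Return a slab to the pool so a later sequence can reuse it."""
        if slab not in self._in_use:
            raise ValueError(f"slab {slab} is not allocated")
        self._in_use.remove(slab)
        self._free.append(slab)

    @property
    def available(self) -> int:
        return len(self._free)


pool = SlabPool(num_slabs=4)
a = pool.alloc()
b = pool.alloc()
pool.free(a)      # sequence finished, its slab goes back on the stack
c = pool.alloc()  # the most recently freed slab is reused
```

A real KV cache would map each slab index to a fixed-size region of a preallocated tensor; the free-list bookkeeping is the part that keeps allocation constant-time.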
Requirements
- Linux x86_64 with AVX512F + AVX512_BF16 (AMD Zen4 or newer)
- Python 3.10 – 3.13
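The requirements above can be checked before installing. The sketch below is not part of PACE; the helper names are hypothetical, and it assumes the Linux /proc/cpuinfo flag names `avx512f` and `avx512_bf16` for the two instruction-set extensions listed.

```python
import sys

# Flag names as they appear in the "flags" line of /proc/cpuinfo on Linux.
REQUIRED_FLAGS = {"avx512f", "avx512_bf16"}


def missing_cpu_flags(cpuinfo_text: str) -> set:
    """Return the required flags that are absent from the cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return REQUIRED_FLAGS - present
    # No flags line found: treat every required flag as missing.
    return set(REQUIRED_FLAGS)


def python_supported(version=sys.version_info) -> bool:
    """True if the interpreter is within the documented 3.10 to 3.13 range."""
    return (3, 10) <= tuple(version[:2]) <= (3, 13)


# Example with a shortened, made-up flags line.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f\n"
missing = missing_cpu_flags(sample)
```

On a real machine, pass the contents of /proc/cpuinfo (for example `open("/proc/cpuinfo").read()`) instead of the sample string; an empty result means both extensions are reported.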
Install
# 1. CPU PyTorch (the +cpu build is not on PyPI; needs PyTorch's index).
pip install --extra-index-url https://download.pytorch.org/whl/cpu torch==2.12.0+cpu
# 2. amd-pace
pip install amd-pace
Quick example
Inference server (router + engine, OpenAI-compatible endpoint):
pace-server --server_model meta-llama/Llama-3.1-8B --kv_cache_type SLAB_POOL --serve_type continuous_prefill_first
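Once the server is running, a client can talk to it like any other OpenAI-compatible endpoint. The sketch below only builds the HTTP request with the standard library. The base URL `http://localhost:8000` and the `/v1/completions` path are assumptions taken from the usual OpenAI API layout, not from PACE's documentation, so check the address and routes your pace-server instance actually exposes.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-style completions route.

    base_url and the /v1/completions path are assumptions for illustration.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",          # hypothetical host and port
    "meta-llama/Llama-3.1-8B",        # model name from the command above
    "Explain NUMA in one sentence.",
)
# To actually send it: urllib.request.urlopen(req)
```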
For offline programmatic generation (the pace.llm.LLMModel API needs a tokenizer and an OperatorConfig that picks a backend per op), see the runnable scripts in examples/; pace_llm_basic.py is the smallest starting point.
Support
We welcome feedback, suggestions, and bug reports. To share any of these, please file an issue on the PACE GitHub page.
License
AMD PACE is licensed under the MIT License. See the LICENSE file for details. Third-party notices are in NOTICE.
File details
Details for the file amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 22.3 MB
- Tags: Python 3, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 62ebbf0882deb9724cdfc4a35ce8db3f09cf82942e4adeb62f1345b5b0f137bf |
| MD5 | c74c956a46a7e9b3914c7790d526a5c0 |
| BLAKE2b-256 | 1ea5187d1edec9dfdf31137dae67513dd3b3d6cf809425cc21ed90ee7652abe1 |
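The SHA256 digest listed above can be used to verify a downloaded wheel. This is a minimal sketch using only `hashlib`; the helper names are hypothetical, and the path passed in is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest published for amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl
EXPECTED_SHA256 = "62ebbf0882deb9724cdfc4a35ce8db3f09cf82942e4adeb62f1345b5b0f137bf"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large wheel is never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl")` should return True for an intact download. pip can also enforce hashes itself via `--require-hashes` with a requirements file.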