
PACE (Platform Aware Compute Engine): high-performance LLM inference on AMD CPUs.


AMD PACE

High-performance LLM inference on AMD EPYC CPUs. PACE is a PyTorch C++ extension with custom AVX512 kernels, slab/paged KV cache, fused operators, and a production-ready serving stack. Check out the GitHub repository for more information.

PACE achieves 1.6x higher autoregressive and 3.2x higher speculative-decoding throughput compared to vLLM on 5th Gen AMD EPYC processors. More details and technical results here.

Highlights

  • SlabPool attention - CPU-native KV cache and attention backend with O(1) slab allocation, L2-aware block sizing, and a unified dispatcher that picks the optimal kernel path per sequence (GQA decode, multi-token decode, tiled prefill) within one OMP dispatch. Continuous batching, sliding-window, and sink attention go through a single entry point.
  • Inference server - pace-server provides a router/engine serving stack with continuous batching, multi-instance NUMA-aware execution, and built-in metrics. The launcher partitions CPU cores across engine instances and binds memory to the local NUMA node.
  • Paged attention - vLLM-style paged KV cache on CPU, fully integrated with PACE's serving stack and all supported models.
  • Fused AVX512 kernels - fused Add+RMSNorm, Add+LayerNorm, RoPE, QKV projections, and a fused MLP kernel (via TPP/libXSMM). Default for all supported models.
  • Broad model support - Llama (up to 3.3), Qwen2/2.5, Phi3/4, Gemma 3, GPT-J, OPT, and GPT-OSS, all running in BF16 under one operator and backend framework. Adding a new architecture is a single-file effort.
  • Speculative decoding (PARD) - built-in parallel-draft speculation, up to 5x throughput over standard autoregressive decoding.
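The "O(1) slab allocation" mentioned under SlabPool attention can be pictured with a free-list allocator over fixed-size slabs. The sketch below is a toy illustration of that general technique only. The class name SlabPool and its methods are hypothetical and are not PACE's API or implementation.

```python
class SlabPool:
    """Toy fixed-size slab allocator backed by a free list (illustrative only)."""

    def __init__(self, num_slabs: int) -> None:
        # All slab indices start free. A Python list used as a stack gives
        # O(1) push and pop, which is what makes alloc/free constant time.
        self._free = list(range(num_slabs - 1, -1, -1))
        self._in_use: set[int] = set()

    def alloc(self) -> int:
        """Hand out one free slab index in O(1)."""
        if not self._free:
            raise MemoryError("slab pool exhausted")
        slab = self._free.pop()
        self._in_use.add(slab)
        return slab

    def free(self, slab: int) -> None:
        """Return a slab to the pool in O(1)."""
        if slab not in self._in_use:
            raise ValueError(f"slab {slab} is not allocated")
        self._in_use.remove(slab)
        self._free.append(slab)

    @property
    def available(self) -> int:
        return len(self._free)


pool = SlabPool(num_slabs=4)
first = pool.alloc()
second = pool.alloc()
pool.free(first)          # the slab goes back on top of the free list
reused = pool.alloc()     # and is the next one handed out
print(first, second, reused, pool.available)
```

A real KV cache would map each slab index to a fixed-size region of a preallocated buffer; the free list only tracks which regions are available.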

Requirements

  • Linux x86_64 with AVX512F + AVX512_BF16 (AMD Zen4 or newer)
  • Python 3.10 – 3.13
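A host can be checked against these requirements before installing. The following is a minimal sketch, assuming a Linux system that exposes CPU feature flags in /proc/cpuinfo under the names avx512f and avx512_bf16. The helper names are illustrative and are not part of PACE.

```python
import os
import sys

# Flag names as they are assumed to appear on the "flags" line of /proc/cpuinfo.
REQUIRED_FLAGS = frozenset({"avx512f", "avx512_bf16"})


def missing_cpu_flags(cpuinfo_text: str, required=REQUIRED_FLAGS) -> set:
    """Return the required flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return set(required) - present
    # No flags line found: treat every required flag as missing.
    return set(required)


def python_supported(version=sys.version_info) -> bool:
    """True when the interpreter is within the stated 3.10 to 3.13 range."""
    return (3, 10) <= tuple(version[:2]) <= (3, 13)


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as handle:
            missing = missing_cpu_flags(handle.read())
        print("missing CPU flags:", sorted(missing) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
    print("Python version supported:", python_supported())
```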

Install

# 1. CPU PyTorch (the +cpu build is not on PyPI; needs PyTorch's index).
pip install --extra-index-url https://download.pytorch.org/whl/cpu torch==2.12.0+cpu

# 2. amd-pace
pip install amd-pace
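After both steps, it can be useful to confirm what was actually installed. The sketch below uses only the standard library and the distribution names from the commands above (torch and amd-pace). The helper names are illustrative, and the "+cpu" suffix check assumes the CPU wheel reports its local version tag as shown in step 1.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_cpu_torch(version) -> bool:
    """True when a torch version string carries the '+cpu' local tag."""
    return version is not None and version.endswith("+cpu")


for name in ("torch", "amd-pace"):
    print(name, installed_version(name) or "not installed")
print("torch is a +cpu build:", is_cpu_torch(installed_version("torch")))
```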

Quick example

Inference server (router + engine, OpenAI-compatible endpoint):

pace-server --server_model meta-llama/Llama-3.1-8B --kv_cache_type SLAB_POOL --serve_type continuous_prefill_first
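Because the endpoint is described as OpenAI-compatible, a client can be sketched with the standard library alone. Everything specific here is an assumption rather than something documented above: the base URL http://localhost:8000 is a placeholder, and the /v1/completions route and response shape follow the general OpenAI completions convention. Check the PACE documentation for the actual host, port, and supported routes.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build (but do not send) a POST request in OpenAI completions style."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=32, timeout=60.0):
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout: choices[0].text
    return body["choices"][0]["text"]


# Example call against a running server (placeholder address, not run here):
# print(complete("http://localhost:8000", "meta-llama/Llama-3.1-8B", "Hello"))
```

Building the request separately from sending it keeps the payload easy to inspect without a live server.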

For offline programmatic generation, see the runnable scripts in examples/; pace_llm_basic.py is the smallest starting point. The pace.llm.LLMModel API needs a tokenizer and an OperatorConfig that picks a backend per op.

Support

We welcome feedback, suggestions, and bug reports. To share any of these, please file an issue on the PACE GitHub page.

License

AMD PACE is licensed under the MIT License. See the LICENSE file for details. Third-party notices are in NOTICE.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution


amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl (22.3 MB)

Uploaded: Python 3, manylinux: glibc 2.34+, x86-64

File hashes

Hashes for amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl

Algorithm    Hash digest
SHA256       62ebbf0882deb9724cdfc4a35ce8db3f09cf82942e4adeb62f1345b5b0f137bf
MD5          c74c956a46a7e9b3914c7790d526a5c0
BLAKE2b-256  1ea5187d1edec9dfdf31137dae67513dd3b3d6cf809425cc21ed90ee7652abe1
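A downloaded wheel can be checked against the SHA256 digest listed above. This is a minimal sketch using the standard library hashlib module. The function names are illustrative, and the file path in the commented example assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the wheel has been downloaded to the current directory):
# ok = matches_digest(
#     "amd_pace-1.2.0-py3-none-manylinux_2_34_x86_64.whl",
#     "62ebbf0882deb9724cdfc4a35ce8db3f09cf82942e4adeb62f1345b5b0f137bf",
# )
# print("digest matches:", ok)
```

Reading in chunks keeps memory use flat, which matters little for a 22.3 MB wheel but is a reasonable default for larger files.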
