PACE platform plugin for vLLM CPU inference on AMD EPYC processors.
Project description
pace-vllm
vLLM platform plugin for AMD PACE. Once installed, vLLM auto-discovers it and routes its CPU worker through PACE's kernels and KV cache, with no changes to your vLLM scripts. Check out the GitHub repository for more information.
The PACE vLLM plugin brings PACE's CPU optimizations to vLLM with no application code changes, retaining ~95% of standalone PACE efficiency and delivering ~1.3x the performance of native vLLM 0.21 on 5th Gen AMD EPYC processors. More details and technical results here.
What it does
pace-vllm registers PACE as a vLLM CPU platform via the vllm.platform_plugins entry point. The plugin replaces vLLM's stock CPU worker, attention backend, KV cache, and Linear/RMSNorm layers with PACE equivalents; in compile mode it also installs a post-grad pattern matcher that fuses gated/ungated MLP blocks into a single libxsmm call.
Highlights
- Drop-in plugin - no changes to your vLLM serve script; the
vllm.platform_pluginsentry point is discovered automatically. - SlabPool KV cache - one slab per attention layer, owned by PACE, with sliding-window and sink-attention support.
- Fused MLP pass - gated SwiGLU/GeGLU and ungated fc1->act->fc2 MLPs (silu / gelu-tanh / gelu-exact / relu) are rewritten into a single
pace::libxsmm_fused_mlpcall under compile mode.
Requirements
- Linux x86_64 with AVX512F + AVX512_BF16 (AMD Zen4 / EPYC 5th Gen or newer)
- Python 3.10 – 3.13
- vLLM 0.21.x (CPU build)
Install
# 1. vLLM CPU build (pace-vllm is a plugin; it no-ops without vllm).
pip install https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cpu-cp38-abi3-manylinux_2_34_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cpu
# 2. pace-vllm
pip install pace-vllm
Quick example
CLI (vLLM auto-discovers the plugin):
vllm serve meta-llama/Llama-3.1-8B
Python:
from vllm import LLM, SamplingParams
def main() -> None:
llm = LLM(model="meta-llama/Llama-3.1-8B", dtype="bfloat16")
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
if __name__ == "__main__":
main()
The
if __name__ == "__main__":guard is required: vLLM v1's engine spawns a subprocess for the worker, and without the guard the subprocess re-imports the script and recursively spawns until the OS refuses.
Support
We welcome feedback, suggestions, and bug reports. Should you have any of these, please kindly file an issue on the PACE GitHub page here.
License
pace-vllm is licensed under the MIT License. See the LICENSE file for details. Third-party notices are in NOTICE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pace_vllm-1.2.0-py3-none-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pace_vllm-1.2.0-py3-none-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 22.1 MB
- Tags: Python 3, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4181c4c8bb8ce98a0daaf9838f0e4a96ffd33c343870dbc289457675b4561e9f
|
|
| MD5 |
01f8c8dfc35be5cdc84995f8c8dddbfb
|
|
| BLAKE2b-256 |
8bd98b37d9b378d73af75875c4d41afaa993665a523124ff6420e14a93adc128
|