PACE platform plugin for vLLM CPU inference on AMD EPYC processors.

pace-vllm

vLLM platform plugin for AMD PACE. Once installed, vLLM auto-discovers it and routes its CPU worker through PACE's kernels and KV cache, with no changes to your vLLM scripts. Check out the GitHub repository for more information.

The PACE vLLM plugin brings PACE's CPU optimizations to vLLM with no application code changes, retaining ~95% of standalone PACE efficiency and delivering ~1.3x the performance of native vLLM 0.21 on 5th Gen AMD EPYC processors. More details and technical results here.

What it does

pace-vllm registers PACE as a vLLM CPU platform via the vllm.platform_plugins entry point. The plugin replaces vLLM's stock CPU worker, attention backend, KV cache, and Linear/RMSNorm layers with PACE equivalents; in compile mode it also installs a post-grad pattern matcher that fuses gated/ungated MLP blocks into a single libxsmm call.

Highlights

  • Drop-in plugin - no changes to your vLLM serve script; the vllm.platform_plugins entry point is discovered automatically.
  • SlabPool KV cache - one slab per attention layer, owned by PACE, with sliding-window and sink-attention support.
  • Fused MLP pass - gated SwiGLU/GeGLU and ungated fc1->act->fc2 MLPs (silu / gelu-tanh / gelu-exact / relu) are rewritten into a single pace::libxsmm_fused_mlp call under compile mode.
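For orientation, the two MLP shapes named in the last bullet can be written out as plain reference math. This is a pure-Python sketch of what the blocks compute, not of how PACE or libxsmm implement them: the function names are hypothetical, biases are omitted for brevity, and only silu and relu are shown (the other activations would be passed in the same way).

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]  # row-major, shape (out_features, in_features)


def linear(weight: Matrix, x: Vector) -> list[float]:
    """y = W @ x for a bias-free linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def relu(v: float) -> float:
    return max(0.0, v)


def gated_mlp(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix,
              act: Callable[[float], float] = silu) -> list[float]:
    """Gated block (SwiGLU when act is silu): down(act(gate(x)) * up(x))."""
    gate = [act(g) for g in linear(w_gate, x)]
    up = linear(w_up, x)
    return linear(w_down, [g * u for g, u in zip(gate, up)])


def ungated_mlp(x: Vector, w_fc1: Matrix, w_fc2: Matrix,
                act: Callable[[float], float] = relu) -> list[float]:
    """Ungated block: fc2(act(fc1(x)))."""
    return linear(w_fc2, [act(h) for h in linear(w_fc1, x)])
```

Written this way, each block is three or four separate operations with intermediate buffers between them; the fused pass replaces that chain with one call.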

Requirements

  • Linux x86_64 with AVX512F + AVX512_BF16 (AMD Zen 4 / 4th Gen EPYC or newer)
  • Python 3.10 – 3.13
  • vLLM 0.21.x (CPU build)
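A quick pre-flight check for the first two requirements can be sketched with the standard library. It assumes the flag spellings the Linux kernel uses in /proc/cpuinfo (avx512f, avx512_bf16); the helper names are hypothetical and are not part of pace-vllm.

```python
import sys
from pathlib import Path

# Flag names as the Linux kernel reports them in /proc/cpuinfo.
REQUIRED_FLAGS = {"avx512f", "avx512_bf16"}


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str) -> set[str]:
    """Required flags that the CPU does not advertise."""
    return REQUIRED_FLAGS - cpu_flags(cpuinfo_text)


def python_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """True for Python 3.10 through 3.13 inclusive."""
    return (3, 10) <= version <= (3, 13)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_flags(cpuinfo.read_text())
        print("CPU flags OK" if not missing else f"missing: {sorted(missing)}")
    else:
        print("/proc/cpuinfo not found (not Linux?)")
    print("Python OK" if python_supported() else "Python version unsupported")
```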

Install

# 1. vLLM CPU build (pace-vllm is a plugin; it no-ops without vllm).
pip install https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cpu-cp38-abi3-manylinux_2_34_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cpu

# 2. pace-vllm
pip install pace-vllm

Quick example

CLI (vLLM auto-discovers the plugin):

vllm serve meta-llama/Llama-3.1-8B
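Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below assumes the server is listening on localhost port 8000 (vLLM's default when no --host or --port is given) and uses only the standard library; build_completion_request and complete are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

# Assumed address: the vllm serve default. Adjust if --host/--port were set.
BASE_URL = "http://localhost:8000"


def build_completion_request(prompt: str,
                             model: str = "meta-llama/Llama-3.1-8B",
                             max_tokens: int = 8,
                             base_url: str = BASE_URL) -> Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the prompt to the running server and return the generated text."""
    with urlopen(build_completion_request(prompt), timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

With the server running, complete("The capital of France is") returns the continuation as a string.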

Python:

from vllm import LLM, SamplingParams


def main() -> None:
    llm = LLM(model="meta-llama/Llama-3.1-8B", dtype="bfloat16")
    out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
    print(out[0].outputs[0].text)


if __name__ == "__main__":
    main()

The if __name__ == "__main__": guard is required: vLLM v1's engine spawns a subprocess for the worker, and without the guard the subprocess re-imports the script and recursively spawns until the OS refuses.

Support

We welcome feedback, suggestions, and bug reports. To share any of these, please file an issue on the PACE GitHub page.

License

pace-vllm is licensed under the MIT License. See the LICENSE file for details. Third-party notices are in NOTICE.

Built Distribution

pace_vllm-1.2.0-py3-none-manylinux_2_34_x86_64.whl (22.1 MB)

Uploaded for: Python 3, manylinux (glibc 2.34+), x86-64

Hashes for pace_vllm-1.2.0-py3-none-manylinux_2_34_x86_64.whl

Algorithm    Hash digest
SHA256       4181c4c8bb8ce98a0daaf9838f0e4a96ffd33c343870dbc289457675b4561e9f
MD5          01f8c8dfc35be5cdc84995f8c8dddbfb
BLAKE2b-256  8bd98b37d9b378d73af75875c4d41afaa993665a523124ff6420e14a93adc128
