vLLM CPU inference engine (AVX512 + VNNI + BF16 + AMX optimized)

vLLM

Easy, fast, and cheap LLM serving for everyone


Buy Me a Coffee

Your support encourages me to keep creating/supporting my open-source projects. If you found value in this project, you can buy me a coffee to keep me up all the sleepless nights.

Buy Me A Coffee

About

vLLM is a fast and easy-to-use library for LLM inference and serving. This PyPI package supports the state-of-the-art LLM inference instruction sets available on the most advanced CPUs: AVX512+VNNI+AVX512BF16+AMXBF16.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with AVX512+VNNI+AVX512BF16+AMXBF16 on supported CPUs. Use this package ONLY IF your CPU has amxbf16 or newer instruction sets.
  • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
  • Optimized CPU kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for x86_64, PowerPC, and Arm CPUs and Apple Silicon (CPU inference). This package does not support GPU inference. For GPU inference, use the official vLLM package on PyPI
  • Prefix caching support
  • Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Important Notes

Platform Detection Fix (versions 0.8.5 - 0.12.0)

If you encounter RuntimeError: Failed to infer device type or see UnspecifiedPlatform warnings with versions 0.8.5 to 0.12.0, run this one-time fix after installation:

import os, sys, importlib.metadata as m
# Find the version of the installed vllm-cpu* distribution, if any.
v = next((d.metadata['Version'] for d in m.distributions()
          if (d.metadata['Name'] or '').startswith('vllm-cpu')), None)
if v:
    # Use the first site-packages directory found on sys.path.
    p = next((p for p in sys.path if 'site-packages' in p and os.path.isdir(p)), None)
    if p:
        # Write a minimal dist-info so the install is also discoverable as "vllm".
        d = os.path.join(p, 'vllm-0.0.0.dist-info')
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, 'METADATA'), 'w') as f:
            f.write(f'Metadata-Version: 2.1\nName: vllm\nVersion: {v}+cpu\n')
        print(f'Fixed: vllm version set to {v}+cpu')

This creates a package alias so vLLM detects the CPU platform correctly. Only needed once per environment. Versions 0.8.5.post2+ and 0.12.0+ include this fix automatically.
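To confirm that the alias is in place, one option is to ask importlib.metadata for the version of a distribution named vllm. The sketch below is illustrative only: the helper names installed_version and alias_status are hypothetical, and it assumes the alias is visible to importlib.metadata in the active environment.

```python
import importlib.metadata as metadata


def installed_version(name, lookup=metadata.version):
    """Return the installed version string for `name`, or None if it is missing."""
    try:
        return lookup(name)
    except metadata.PackageNotFoundError:
        return None


def alias_status(lookup=metadata.version):
    """Summarize whether a `vllm` distribution with a +cpu suffix is visible."""
    version = installed_version("vllm", lookup)
    if version is None:
        return "vllm alias not found"
    if version.endswith("+cpu"):
        return f"vllm alias present: {version}"
    return f"vllm found without +cpu suffix: {version}"


# Example: print(alias_status())
```

The lookup parameter exists only so the check can be exercised without touching the real environment.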

Getting Started

Install vLLM with a single command:

pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple

This installs vllm-cpu-amxbf16 with CPU-optimized PyTorch (no CUDA dependencies).

Alternative: Using uv (faster)

uv pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple

Install uv on Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Docker Images

Pre-built Docker images are available on Docker Hub and GitHub Container Registry.

# Pull from Docker Hub
docker pull mekayelanik/vllm-cpu:amxbf16-latest

# Or from GitHub Container Registry
docker pull ghcr.io/mekayelanik/vllm-cpu:amxbf16-latest

# Run OpenAI-compatible API server
docker run -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  mekayelanik/vllm-cpu:amxbf16-latest \
  --model facebook/opt-125m

Available tags: amxbf16-latest, amxbf16-<version> (e.g., amxbf16-0.12.0)

Platforms: linux/amd64
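Once the container from the docker run command above is up, the server can be queried over HTTP. The following is a minimal sketch using only the Python standard library; it assumes the server is reachable at http://localhost:8000 and serves facebook/opt-125m as in the command above. The helper names build_completion_request and complete are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="facebook/opt-125m",
                             base_url="http://localhost:8000", max_tokens=32):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (needs the server running): print(complete("Hello, my name is"))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any network traffic happens.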

vllm-cpu

This CPU-specific vLLM build provides five optimized wheel packages built from the upstream vLLM source code:

| Package | Optimizations | Target CPUs |
|---|---|---|
| vllm-cpu | Baseline (no AVX512) | All x86_64 and ARM64 CPUs |
| vllm-cpu-avx512 | AVX512 | Intel Skylake-X and newer |
| vllm-cpu-avx512vnni | AVX512 + VNNI | Intel Cascade Lake and newer |
| vllm-cpu-avx512bf16 | AVX512 + VNNI + BF16 | Intel Cooper Lake and newer |
| vllm-cpu-amxbf16 | AVX512 + VNNI + BF16 + AMX | Intel Sapphire Rapids (4th gen Xeon+) |

Each package is compiled with specific CPU instruction set flags for optimal inference performance.

Check available CPU instruction sets

lscpu | grep -i flags
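The same check can be scripted. The sketch below reads the flags line from /proc/cpuinfo and maps it onto the five packages listed above. It is a rough illustration, not part of this project: it assumes a Linux x86_64 host, assumes the Linux kernel's flag spellings (avx512f, avx512_vnni, avx512_bf16, amx_bf16), and the names read_cpu_flags, PACKAGES, and pick_package are hypothetical.

```python
def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU flags from the first 'flags' line (Linux x86 only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


# Ordered from most to least specialized; each entry lists the flags it needs.
# The flag names are an assumption based on how Linux reports them.
PACKAGES = [
    ("vllm-cpu-amxbf16", {"avx512f", "avx512_vnni", "avx512_bf16", "amx_bf16"}),
    ("vllm-cpu-avx512bf16", {"avx512f", "avx512_vnni", "avx512_bf16"}),
    ("vllm-cpu-avx512vnni", {"avx512f", "avx512_vnni"}),
    ("vllm-cpu-avx512", {"avx512f"}),
]


def pick_package(flags):
    """Return the most specialized package whose required flags are all present."""
    for name, required in PACKAGES:
        if required <= flags:
            return name
    return "vllm-cpu"


# Example: print(pick_package(read_cpu_flags()))
```

If no flags line is found (for example on Arm, where /proc/cpuinfo uses a different field), the function falls back to the baseline vllm-cpu package.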

Example list of CPUs with their supported instruction sets

| CPU Architecture (Intel/AMD) | AVX2 | AVX-512 F (Base) | VNNI (INT8) | BF16 (BFloat16) (via AVX-512) | AMX-BF16 (via Tile Unit) |
|---|---|---|---|---|---|
| Intel 4th Gen / AMD Ryzen Zen2 & Newer | Yes | No | No | No | No |
| Intel Skylake-SP / Skylake-X / AMD Zen 4 & Newer | Yes | Yes | No | No | No |
| Intel Cooper Lake (3rd Gen Xeon) / AMD Zen 4 (EPYC) / Ryzen Zen5 & Newer | Yes | Yes | Yes | Yes | No |
| Intel Sapphire Rapids (4th Gen Xeon) & Newer | Yes | Yes | Yes | Yes | Yes |

*** Currently, no AMD CPU supports AMXBF16. AMD is expected to add AMXBF16 support starting with Zen 7 CPUs.


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • vllm_cpu_amxbf16-0.11.0.post2-cp313-cp313-manylinux_2_17_x86_64.whl (15.3 MB): CPython 3.13, manylinux: glibc 2.17+ x86-64
  • vllm_cpu_amxbf16-0.11.0.post2-cp312-cp312-manylinux_2_17_x86_64.whl (15.3 MB): CPython 3.12, manylinux: glibc 2.17+ x86-64
  • vllm_cpu_amxbf16-0.11.0.post2-cp311-cp311-manylinux_2_17_x86_64.whl (15.3 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • vllm_cpu_amxbf16-0.11.0.post2-cp310-cp310-manylinux_2_17_x86_64.whl (15.3 MB): CPython 3.10, manylinux: glibc 2.17+ x86-64
  • vllm_cpu_amxbf16-0.11.0.post2-cp39-cp39-manylinux_2_17_x86_64.whl (15.3 MB): CPython 3.9, manylinux: glibc 2.17+ x86-64
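The wheel filenames encode the distribution name, version, Python tag, ABI tag, and platform tag, separated by hyphens. As a rough sketch of how to find the file that matches the running interpreter, the code below splits a filename into those five fields. It assumes the optional build tag is absent, as in the files listed here, and the names parse_wheel_name, current_python_tag, and wheel_for_interpreter are hypothetical.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into its five fields (assumes no build tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    distribution, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "distribution": distribution,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def current_python_tag():
    """Return the CPython tag of the running interpreter, such as 'cp312'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def wheel_for_interpreter(filenames, python_tag=None):
    """Return the first filename whose Python tag matches, or None."""
    python_tag = python_tag or current_python_tag()
    for name in filenames:
        if parse_wheel_name(name)["python"] == python_tag:
            return name
    return None
```

In normal use pip performs this selection automatically; the sketch is only meant to make the naming scheme easier to read.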

File details

Details for the file vllm_cpu_amxbf16-0.11.0.post2-cp313-cp313-manylinux_2_17_x86_64.whl.

File hashes

Hashes for vllm_cpu_amxbf16-0.11.0.post2-cp313-cp313-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 5b48308be619d42235603e2836b48b73278f0f937fec767703174393ef2b6912
MD5 ae9998b604edef9c39eaf43c16675636
BLAKE2b-256 3067127e891a9e5b2223bb25f6e298b828a6ed30d5d00577fee50227cd9bd864

See more details on using hashes here.
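A downloaded wheel can be checked against the published SHA256 digest before installing it. The following is a minimal sketch using hashlib from the standard library; the function names sha256_of_file and verify_sha256 are hypothetical, and the local file path is whatever you saved the wheel as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example, assuming the cp313 wheel was downloaded to the current directory:
# verify_sha256(
#     "vllm_cpu_amxbf16-0.11.0.post2-cp313-cp313-manylinux_2_17_x86_64.whl",
#     "5b48308be619d42235603e2836b48b73278f0f937fec767703174393ef2b6912",
# )
```

Reading in chunks keeps memory use flat regardless of file size.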

File details

Details for the file vllm_cpu_amxbf16-0.11.0.post2-cp312-cp312-manylinux_2_17_x86_64.whl.

File hashes

Hashes for vllm_cpu_amxbf16-0.11.0.post2-cp312-cp312-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 d6b152d84eb124bd557bfa068a631e0d30201a70bbafd1f15d31290c1f2926d5
MD5 087f3b8d04befcaf8cbb66075f0b70fe
BLAKE2b-256 52b13f5686f0d399ef0b79f97502ab3885ff561d0c0ffc0c6fab1b017d78732b

File details

Details for the file vllm_cpu_amxbf16-0.11.0.post2-cp311-cp311-manylinux_2_17_x86_64.whl.

File hashes

Hashes for vllm_cpu_amxbf16-0.11.0.post2-cp311-cp311-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 4b3a870efb1230708fe8f55ce63eb04d395494cf2516bf97c51881a0bfe7c825
MD5 e404025f63fbed42fa95559499387393
BLAKE2b-256 777fed02fd3c28d29d52397a2248b39f47ffb3b25f3608330e8cac34d4de7d79

File details

Details for the file vllm_cpu_amxbf16-0.11.0.post2-cp310-cp310-manylinux_2_17_x86_64.whl.

File hashes

Hashes for vllm_cpu_amxbf16-0.11.0.post2-cp310-cp310-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 71fc693f3fa1fce27372af03c72e2442ea80b77118e0b48be9d0ff234a6d6e7d
MD5 06d61e14bc599ff4d0de814c1a2e3d2f
BLAKE2b-256 1f2a5b91a7738c9d898e4ce4d14b9f92202dd6a45103b272cba32f7e8d7b956b

File details

Details for the file vllm_cpu_amxbf16-0.11.0.post2-cp39-cp39-manylinux_2_17_x86_64.whl.

File hashes

Hashes for vllm_cpu_amxbf16-0.11.0.post2-cp39-cp39-manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 016220c81f7760c4af070ca2f216c44df4b07d2aeae98ffdde7ce404463be72a
MD5 edb02171d482f96ebecd45c20018a75b
BLAKE2b-256 364db8607195b611dafba8fcbf3ab81fce60425fc6e4661e5ab16eacf72864c7
