Skip to main content

A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

CPU-Optimized vLLM: Easy, Fast LLM Inference Without a GPU

Unified CPU wheel with automatic ISA detection at runtime (AVX2, AVX-512, VNNI, BF16, AMX, NEON, FP16, DOTPROD)

PyPI Version PyPI Downloads Python Versions

This is an independent, community-maintained package — not affiliated with or funded by the vLLM project, its sister concerns, or any hardware vendors. The first successful unification of different CPU ISAs (AVX2, AVX-512, VNNI, BF16, AMX) into a single wheel was done by Mekayel Anik, for the benefit of the community.

GitHub Stars GitHub Forks GitHub Issues License

Last Commit Contributors


Overview

vllm-cpu provides unified CPU wheels for vLLM on PyPI. One package, one pip install, automatic CPU instruction set detection.

Why CPU inference?

  • No expensive GPU required
  • Run LLMs on any server, laptop, or edge device
  • Lower power consumption and operational costs
  • Ideal for development, testing, and moderate-scale deployments
  • ARM64 support for AWS Graviton 3+, Ampere Altra, and other aarch64 servers (NEON + BF16/DOTPROD)

Key Features:

  • pip3 install vllm-cpu -- no manual URLs or GitHub Release downloads
  • Built with manylinux_2_28 for broad compatibility (Debian 10+, Ubuntu 18.04+)
  • Stable ABI (cp38-abi3) -- one wheel for Python 3.10+
  • Automatic ISA detection at runtime (AVX2/AVX-512/AMX on x86, NEON/BF16 on ARM)

Table of Contents


Quick Start

1. Install

pip3 install vllm-cpu

2. Run your first model

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)

3. Or start an OpenAI-compatible server

vllm serve Qwen/Qwen3-0.6B --dtype auto
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "The future of AI is", "max_tokens": 128}'

Installation

Prerequisites

Requirement Details
Python 3.10+ (stable ABI -- one wheel for all versions)
OS Linux (glibc 2.28+) -- Debian 10+, Ubuntu 18.04+, RHEL 8+, Amazon Linux 2023+
CPU x86_64 with AVX2 (minimum) or AVX-512 (optimal), or aarch64 with NEON (BF16 recommended)
Windows Use WSL2 (Windows Subsystem for Linux)

pip

pip3 install vllm-cpu                # Latest
pip3 install vllm-cpu==0.17.0        # Specific version

uv (faster)

uv pip install vllm-cpu

Virtual environment (recommended)

python -m venv vllm-env && source vllm-env/bin/activate
pip3 install vllm-cpu

Supported CPU Instructions

The unified wheel automatically detects and uses the best available instruction set at import time. No configuration needed.

CPU Feature Benefit
Baseline AVX2 256-bit SIMD -- works on all modern x86_64
Faster AVX512 512-bit vectors -- 2x wider than AVX2
Faster AVX512-VNNI INT8 multiply-accumulate for quantized inference
Faster AVX512-BF16 Native BFloat16 -- half the memory of FP32
Fastest AMX-BF16 Tile-based matrix acceleration (Sapphire Rapids+)
ARM aarch64 NEON ARM SIMD baseline for all aarch64
ARM aarch64 FP16 Half-precision float (always enabled)
ARM aarch64 DOTPROD INT8 dot product acceleration (always enabled)
ARM aarch64 BF16 Native BFloat16 (Graviton 3+, Ampere Altra+)

How it works: The wheel ships _C.so (AVX512+BF16+VNNI+AMX) and _C_AVX2.so (AVX2 fallback). At import vllm, the correct .so is loaded once based on CPU capabilities. Zero runtime overhead.

Check your CPU

# x86_64
lscpu | grep -E "avx512|vnni|bf16|amx"

# aarch64
cat /proc/cpuinfo | grep -i "features" | head -1
# Look for: asimd (NEON), bf16

CPU Compatibility Guide

Intel

Generation Example CPUs ISA Used
Haswell+ (2013) Core i5/i7 4th--11th Gen AVX2
Skylake-X (2017) Core i9-7900X, Xeon W-2195 AVX512
Cascade Lake (2019) Xeon Platinum 8280 AVX512 + VNNI
Cooper Lake (2020) Xeon Platinum 8380H AVX512 + BF16
Sapphire Rapids+ (2023) Xeon w9-3495X, 4th/5th/6th Gen Xeon AVX512 + AMX
Consumer 12th--14th Gen Core i5/i7/i9 (Alder Lake+) AVX2

AMD

Generation Example CPUs ISA Used
Zen 2/3 (2019--2020) Ryzen 3000--5000, EPYC 7002--7003 AVX2
Zen 4+ (2022+) Ryzen 7000+, EPYC 9004+ AVX512 + BF16

ARM

Platform Example ISA Used
AWS Graviton 2/3/4 c7g, m7g instances NEON
Apple Silicon M1--M4 (via Docker/Lima) NEON
Ampere Altra Cloud instances NEON

Usage Examples

Batch Processing

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-1b-it",
    dtype="bfloat16",
    max_model_len=2048
)

prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to reverse a string:",
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")

OpenAI Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)

cURL

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Performance Tips

1. Use TCMalloc (strongly recommended)

Official recommendation: vLLM strongly recommends TCMalloc for high-performance memory allocation and better cache locality.

# Install
sudo apt install libtcmalloc-minimal4        # Debian/Ubuntu
sudo dnf install gperftools-libs              # RHEL/Fedora

# Preload
export LD_PRELOAD=$(find /usr -name "libtcmalloc_minimal.so*" | head -1)
vllm serve your-model --dtype auto

2. Set thread count to physical cores

Tip: Disable hyper-threading on bare-metal for best performance. Reserve 1--2 cores for the HTTP serving framework.

export OMP_NUM_THREADS=16                     # Physical core count
export MKL_NUM_THREADS=16
export VLLM_CPU_OMP_THREADS_BIND=0-13         # Pin inference threads
export VLLM_CPU_NUM_OF_RESERVED_CPU=2          # Reserve for HTTP serving

3. Use BFloat16

Note: Float16 is unstable on CPU. Always use bfloat16.

llm = LLM(model="your-model", dtype="bfloat16")

4. NUMA optimization (multi-socket systems)

# Simple: bind to one NUMA node
numactl --cpunodebind=0 --membind=0 python your_script.py

# Advanced: Tensor Parallel across NUMA nodes
VLLM_CPU_OMP_THREADS_BIND=0-31|32-63 vllm serve your-model \
  --dtype auto --tensor-parallel-size 2

5. Tune KV cache

export VLLM_CPU_KVCACHE_SPACE=40              # 40 GB for KV cache

6. SGL kernels (x86, experimental)

export VLLM_CPU_SGL_KERNEL=1                  # Low-latency online serving

7. Quantized models

llm = LLM(model="Qwen/Qwen3-8B-GPTQ-Int4", quantization="gptq")

Memory Estimation

Model Size bfloat16 GPTQ INT4
1B params ~4 GB ~2 GB
7B params ~16 GB ~6 GB
13B params ~28 GB ~10 GB
70B params ~140 GB ~40 GB

Add 2--8 GB for KV cache depending on VLLM_CPU_KVCACHE_SPACE and context length.


Environment Variables

Variable Description Default
VLLM_CPU_KVCACHE_SPACE KV cache size in GB (larger = more concurrent requests) 0 (auto)
VLLM_CPU_OMP_THREADS_BIND CPU core binding (0-31, auto, or nobind) auto
VLLM_CPU_NUM_OF_RESERVED_CPU Cores reserved for HTTP serving (when bind=auto) 0
VLLM_CPU_SGL_KERNEL Small-batch optimized kernels (x86, experimental) 0
OMP_NUM_THREADS OpenMP thread count All cores
MKL_NUM_THREADS Intel MKL thread count All cores
LD_PRELOAD Preload TCMalloc for better memory performance --
HF_TOKEN Hugging Face access token --
HF_HOME Hugging Face cache directory ~/.cache/huggingface

Supported Models

vLLM supports 100+ model architectures including:

Category Models
LLMs Llama 2/3/3.1/3.2, Mistral, Mixtral, Qwen 2/2.5/3, Phi-2/3/4, Gemma 2/3, DeepSeek V2/V3/R1
Code CodeLlama, DeepSeek-Coder, StarCoder 1/2, CodeGemma, Qwen2.5-Coder
Embedding E5-Mistral, GTE, BGE, Nomic-Embed, Jina
Multimodal LLaVA, Qwen-VL, Qwen2.5-VL, InternVL, Pixtral, MiniCPM-V
MoE Mixtral 8x7B/8x22B, DeepSeek-MoE, Qwen-MoE, DBRX

Full list: vLLM Supported Models


Framework Integrations

vLLM's server is fully OpenAI API-compatible. Any client that supports base_url override works out of the box.

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.invoke("Explain machine learning in simple terms")

LlamaIndex

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.complete("What is the capital of France?")

Also works with: Semantic Kernel, AutoGen, CrewAI, Haystack, and any OpenAI-compatible SDK.


Version Support

Version Range Strategy Status
v0.17.0+ Unified CPU wheel (this package) Active
v0.8.5 -- v0.15.x Legacy 5-variant wheels Archived on PyPI

Legacy packages (vllm-cpu-avx512, vllm-cpu-avx512vnni, vllm-cpu-avx512bf16, vllm-cpu-amxbf16) remain on PyPI for older vLLM versions but are no longer updated.


Troubleshooting

Illegal Instruction Error

The unified wheel auto-detects CPU capabilities. If you still see this:

lscpu | grep -E "avx512|vnni|bf16|amx"    # Check supported features

If no AVX2 flags appear, your CPU is too old for vLLM CPU inference.

Out of Memory (OOM)

llm = LLM(model="your-model", max_model_len=2048, dtype="bfloat16")
export VLLM_CPU_KVCACHE_SPACE=2               # Reduce KV cache

Slow Performance Checklist

Check Command / Fix
TCMalloc loaded? echo $LD_PRELOAD -- should show libtcmalloc
Thread count correct? echo $OMP_NUM_THREADS -- should equal physical cores
Hyper-threading disabled? Recommended for bare-metal
Cross-NUMA access? Use VLLM_CPU_OMP_THREADS_BIND to pin to one node
Using bfloat16? Float16 is unstable on CPU -- always use dtype="bfloat16"

Multiple vLLM Packages Conflict

pip3 uninstall vllm vllm-cpu vllm-cpu-avx512 vllm-cpu-avx512vnni vllm-cpu-avx512bf16 vllm-cpu-amxbf16 -y
pip3 install vllm-cpu

RuntimeError: Failed to infer device type

For legacy versions (v0.8.5--v0.15.x), use .post2 releases:

pip3 install vllm-cpu==0.12.0.post2

Links & Resources

Resource Link
Source GitHub Repository github.com/MekayelAnik/vllm-cpu
Docker Docker Hub Images hub.docker.com/r/mekayelanik/vllm-cpu
GHCR GitHub Container Registry ghcr.io/mekayelanik/vllm-cpu
Docs vLLM Documentation docs.vllm.ai
Upstream vLLM Project github.com/vllm-project/vllm
Issues Report a Bug github.com/MekayelAnik/vllm-cpu/issues
Releases Changelog GitHub Releases

License: GPL-3.0 | Upstream: Apache-2.0

Built from vLLM, originally developed at Sky Computing Lab, UC Berkeley

Buy Me A Coffee

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_x86_64.whl (79.1 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.28+ x86-64

vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_aarch64.whl (37.3 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.28+ ARM64

File details

Details for the file vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c746ce74ab26e439eb1fa4c5a596da3b1eedec88e80c75c8e437b58649facb9e
MD5 286b08fad9ace0c6f5041a3deb2bac0f
BLAKE2b-256 5ea9a650b4b96bbe0696dd19fa841c7aa617e878c4604199090f47a01d762862

See more details on using hashes here.

Provenance

The following attestation bundles were made for vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_x86_64.whl:

Publisher: build-cpu-wheel.yml on MekayelAnik/vllm-cpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 1a6bc049ed09b5e2f8948ec3853ec48fa2aa91a37b54b5a8c51725fb493b7e97
MD5 1fe5b2ca3b8dc4a69c2800cad6cbeb3f
BLAKE2b-256 6184b668f14a93bd5096f65a1f55481ae3333270f12bd6e2681a84596c55c94c

See more details on using hashes here.

Provenance

The following attestation bundles were made for vllm_cpu-0.20.1-cp38-abi3-manylinux_2_28_aarch64.whl:

Publisher: build-cpu-wheel.yml on MekayelAnik/vllm-cpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page