A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM

CPU-Optimized vLLM: Easy, Fast LLM Inference Without a GPU

Unified CPU wheel with automatic ISA detection at runtime (AVX2, AVX-512, VNNI, BF16, AMX, NEON, FP16, DOTPROD)

This is an independent, community-maintained package. It is not affiliated with or funded by the vLLM project, its affiliates, or any hardware vendor. Mekayel Anik produced the first successful unification of different CPU ISAs (AVX2, AVX-512, VNNI, BF16, AMX) into a single wheel, for the benefit of the community.


Overview

vllm-cpu provides unified CPU wheels for vLLM on PyPI. One package, one pip install, automatic CPU instruction set detection.

Why CPU inference?

  • No expensive GPU required
  • Run LLMs on any server, laptop, or edge device
  • Lower power consumption and operational costs
  • Ideal for development, testing, and moderate-scale deployments
  • ARM64 support for AWS Graviton 3+, Ampere Altra, and other aarch64 servers (NEON + BF16/DOTPROD)

Key Features:

  • pip3 install vllm-cpu -- no manual URLs or GitHub Release downloads
  • Built with manylinux_2_28 for broad compatibility (glibc 2.28+: Debian 10+, Ubuntu 20.04 LTS+; Ubuntu 18.04 ships glibc 2.27 and is too old)
  • Stable ABI (cp38-abi3) -- one wheel for Python 3.10+
  • Automatic ISA detection at runtime (AVX2/AVX-512/AMX on x86, NEON/BF16 on ARM)

Quick Start

1. Install

pip3 install vllm-cpu

2. Run your first model

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)

3. Or start an OpenAI-compatible server

# Terminal 1: start the server
vllm serve Qwen/Qwen3-0.6B --dtype auto

# Terminal 2: query it once the model has loaded
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "The future of AI is", "max_tokens": 128}'
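
Because `vllm serve` takes a while to load the model, a script that starts the server and queries it right away may see connection errors. The sketch below polls the server until it answers. It assumes the server exposes the standard OpenAI-style `/v1/models` route on `localhost:8000`; the helper names `server_ready` and `wait_for_server` are illustrative and not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet".
        return False


def wait_for_server(base_url: str, attempts: int = 60, delay: float = 2.0,
                    probe=server_ready) -> bool:
    """Call the probe until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


# Example (requires a running server):
#   wait_for_server("http://localhost:8000")
```

The `probe` parameter exists only so the polling loop can be exercised without a live server.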

Installation

Prerequisites

Requirement | Details
Python | 3.10+ (stable ABI -- one wheel for all versions)
OS | Linux (glibc 2.28+) -- Debian 10+, Ubuntu 20.04 LTS+, RHEL 8+, Amazon Linux 2023+
CPU | x86_64 with AVX2 (minimum) or AVX-512 (optimal), or aarch64 with NEON (BF16 recommended)
Windows | Use WSL2 (Windows Subsystem for Linux)

pip

pip3 install vllm-cpu                # Latest
pip3 install vllm-cpu==0.17.0        # Specific version

uv (faster)

uv pip install vllm-cpu

Virtual environment (recommended)

python3 -m venv vllm-env && source vllm-env/bin/activate
pip3 install vllm-cpu

Supported CPU Instructions

The unified wheel automatically detects and uses the best available instruction set at import time. No configuration needed.

CPU | Feature | Benefit
Baseline | AVX2 | 256-bit SIMD -- works on all modern x86_64
Faster | AVX512 | 512-bit vectors -- 2x wider than AVX2
Faster | AVX512-VNNI | INT8 multiply-accumulate for quantized inference
Faster | AVX512-BF16 | Native BFloat16 -- half the memory of FP32
Fastest | AMX-BF16 | Tile-based matrix acceleration (Sapphire Rapids+)
ARM aarch64 | NEON | ARM SIMD baseline for all aarch64
ARM aarch64 | FP16 | Half-precision float (always enabled)
ARM aarch64 | DOTPROD | INT8 dot product acceleration (always enabled)
ARM aarch64 | BF16 | Native BFloat16 (Graviton 3+, Ampere Altra+)

How it works: The wheel ships _C.so (AVX512+BF16+VNNI+AMX) and _C_AVX2.so (AVX2 fallback). At import vllm, the correct .so is loaded once based on CPU capabilities. Zero runtime overhead.
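
The selection step can be pictured with a small sketch. This is not the package's actual loader: the rule used here (pick `_C` when the Linux `avx512f` flag is present, otherwise fall back to `_C_AVX2` when `avx2` is present) is an assumption made for illustration, and `pick_extension` is a hypothetical name.

```python
def pick_extension(cpu_flags: set[str]) -> str:
    """Return the extension module name this sketch would load on x86_64.

    ASSUMPTION: AVX-512 Foundation ("avx512f") is treated as the deciding
    flag. The real wheel may check a different or larger set of features.
    """
    if "avx512f" in cpu_flags:
        return "_C"          # AVX512 + BF16 + VNNI + AMX build
    if "avx2" in cpu_flags:
        return "_C_AVX2"     # AVX2 fallback build
    raise RuntimeError("CPU lacks AVX2; no compatible x86_64 build")


# The choice is made once, so later calls into the library pay no dispatch cost.
print(pick_extension({"sse4_2", "avx2"}))
print(pick_extension({"avx2", "avx512f", "avx512_vnni"}))
```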

Check your CPU

# x86_64
lscpu | grep -E "avx512|vnni|bf16|amx"

# aarch64
cat /proc/cpuinfo | grep -i "features" | head -1
# Look for: asimd (NEON), bf16
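
The same check can be done from Python, which is handy inside containers that lack `lscpu`. The sketch below parses the text of `/proc/cpuinfo` and reports the flags relevant to this package. The flag spellings are the ones the Linux kernel prints (for example `avx512_vnni`, `amx_bf16`, `asimddp`); the function names are illustrative.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Extract the flag set from /proc/cpuinfo text.

    x86_64 kernels label the line "flags"; aarch64 kernels label it "Features".
    Only the first matching line is used, since all cores normally agree.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()


def summarize(flags: set[str]) -> list[str]:
    """Return the flags of interest that are present, in a fixed order."""
    interesting = ("avx2", "avx512f", "avx512_vnni", "avx512_bf16",
                   "amx_bf16", "asimd", "asimddp", "bf16")
    return [name for name in interesting if name in flags]


# Example on a Linux host:
#   with open("/proc/cpuinfo") as f:
#       print(summarize(parse_cpu_flags(f.read())))
```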

CPU Compatibility Guide

Intel

Generation | Example CPUs | ISA Used
Haswell+ (2013) | Core i5/i7 4th--11th Gen | AVX2
Skylake-X (2017) | Core i9-7900X, Xeon W-2195 | AVX512
Cascade Lake (2019) | Xeon Platinum 8280 | AVX512 + VNNI
Cooper Lake (2020) | Xeon Platinum 8380H | AVX512 + BF16
Sapphire Rapids+ (2023) | Xeon w9-3495X, 4th/5th/6th Gen Xeon | AVX512 + AMX
Consumer 12th--14th Gen | Core i5/i7/i9 (Alder Lake+) | AVX2

AMD

Generation | Example CPUs | ISA Used
Zen 2/3 (2019--2020) | Ryzen 3000--5000, EPYC 7002--7003 | AVX2
Zen 4+ (2022+) | Ryzen 7000+, EPYC 9004+ | AVX512 + BF16

ARM

Platform | Example | ISA Used
AWS Graviton 2/3/4 | c7g, m7g instances | NEON
Apple Silicon | M1--M4 (via Docker/Lima) | NEON
Ampere Altra | Cloud instances | NEON

Usage Examples

Batch Processing

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-1b-it",
    dtype="bfloat16",
    max_model_len=2048
)

prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to reverse a string:",
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")

OpenAI Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)

cURL

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Performance Tips

1. Use TCMalloc (strongly recommended)

Official recommendation: vLLM strongly recommends TCMalloc for high-performance memory allocation and better cache locality.

# Install
sudo apt install libtcmalloc-minimal4        # Debian/Ubuntu
sudo dnf install gperftools-libs              # RHEL/Fedora

# Preload
export LD_PRELOAD=$(find /usr -name "libtcmalloc_minimal.so*" | head -1)
vllm serve your-model --dtype auto

2. Set thread count to physical cores

Tip: Disable hyper-threading on bare-metal for best performance. Reserve 1--2 cores for the HTTP serving framework.

export OMP_NUM_THREADS=16                     # Physical core count
export MKL_NUM_THREADS=16
export VLLM_CPU_OMP_THREADS_BIND=0-13         # Pin inference threads to cores 0-13 (14 and 15 stay free)
export VLLM_CPU_NUM_OF_RESERVED_CPU=2         # Reserve for HTTP serving (only applies when binding is auto)
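
The binding range above follows from simple arithmetic: physical cores minus the cores held back for the HTTP server. A minimal sketch, assuming physical cores are numbered contiguously from 0 (check `lscpu -e` on your machine, since hyper-threaded siblings can interleave); `bind_range` is a hypothetical helper, not a vLLM function.

```python
def bind_range(physical_cores: int, reserved: int = 0) -> str:
    """Build a core range for VLLM_CPU_OMP_THREADS_BIND.

    ASSUMPTION: cores 0..physical_cores-1 are the physical cores and the
    highest-numbered ones are the ones left free for the HTTP server.
    """
    usable = physical_cores - reserved
    if usable < 1:
        raise ValueError("at least one core must remain for inference")
    return f"0-{usable - 1}" if usable > 1 else "0"


# 16 physical cores with 2 held back gives the range used in the example.
print(bind_range(16, reserved=2))  # 0-13
```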

3. Use BFloat16

Note: Float16 is unstable on CPU. Always use bfloat16.

llm = LLM(model="your-model", dtype="bfloat16")

4. NUMA optimization (multi-socket systems)

# Simple: bind to one NUMA node
numactl --cpunodebind=0 --membind=0 python your_script.py

# Advanced: Tensor Parallel across NUMA nodes
VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve your-model \
  --dtype auto --tensor-parallel-size 2
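
The binding value holds one core range per tensor-parallel rank, separated by `|`. When the value is generated in Python and placed in the environment directly, the shell never sees the `|` character. The sketch below assumes every NUMA node has the same number of cores and that core IDs are contiguous per node; `numa_bind_spec` is a hypothetical helper.

```python
def numa_bind_spec(cores_per_node: int, nodes: int) -> str:
    """Return one "first-last" core range per NUMA node, joined by "|".

    ASSUMPTION: node k owns cores k*cores_per_node .. (k+1)*cores_per_node-1.
    Verify the real layout with `numactl --hardware` before relying on this.
    """
    ranges = []
    for node in range(nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        ranges.append(f"{first}-{last}")
    return "|".join(ranges)


# Two 32-core sockets reproduce the value used in the example.
print(numa_bind_spec(32, 2))  # 0-31|32-63
```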

5. Tune KV cache

export VLLM_CPU_KVCACHE_SPACE=40              # 40 GB for KV cache

6. SGL kernels (x86, experimental)

export VLLM_CPU_SGL_KERNEL=1                  # Low-latency online serving

7. Quantized models

llm = LLM(model="Qwen/Qwen3-8B-GPTQ-Int4", quantization="gptq")

Memory Estimation

Model Size | bfloat16 | GPTQ INT4
1B params | ~4 GB | ~2 GB
7B params | ~16 GB | ~6 GB
13B params | ~28 GB | ~10 GB
70B params | ~140 GB | ~40 GB

Add 2--8 GB for KV cache depending on VLLM_CPU_KVCACHE_SPACE and context length.
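
As a rough cross-check, the memory taken by the weights alone is the parameter count times the bytes per parameter: 2 bytes for bfloat16 and about 0.5 bytes for 4-bit quantization. The sketch below computes only that raw figure. The table values are somewhat higher for most rows, presumably because they also account for runtime overhead and quantization metadata; that reading is an assumption, not something the package documents.

```python
# Bytes stored per parameter for the two formats in the table.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int4": 0.5}


def weight_gb(num_params: float, dtype: str) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), excluding any overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for billions in (1, 7, 13, 70):
    n = billions * 1e9
    print(f"{billions}B: bf16 {weight_gb(n, 'bfloat16'):.1f} GB, "
          f"int4 {weight_gb(n, 'int4'):.1f} GB")
```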


Environment Variables

Variable | Description | Default
VLLM_CPU_KVCACHE_SPACE | KV cache size in GB (larger = more concurrent requests) | 0 (auto)
VLLM_CPU_OMP_THREADS_BIND | CPU core binding (0-31, auto, or nobind) | auto
VLLM_CPU_NUM_OF_RESERVED_CPU | Cores reserved for HTTP serving (when bind=auto) | 0
VLLM_CPU_SGL_KERNEL | Small-batch optimized kernels (x86, experimental) | 0
OMP_NUM_THREADS | OpenMP thread count | All cores
MKL_NUM_THREADS | Intel MKL thread count | All cores
LD_PRELOAD | Preload TCMalloc for better memory performance | --
HF_TOKEN | Hugging Face access token | --
HF_HOME | Hugging Face cache directory | ~/.cache/huggingface
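
When the server is launched from a Python supervisor script rather than a shell, these variables can be passed through the child process environment. A minimal sketch under that assumption: `serve_command` is a hypothetical helper that only assembles the argument list and environment, using the `vllm serve <model> --dtype auto` form shown earlier.

```python
import os


def serve_command(model: str, kv_cache_gb: int, threads_bind: str = "auto"):
    """Return (argv, env) for launching `vllm serve` with CPU settings applied."""
    env = dict(os.environ)  # copy, so the parent environment stays untouched
    env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_gb)
    env["VLLM_CPU_OMP_THREADS_BIND"] = threads_bind
    argv = ["vllm", "serve", model, "--dtype", "auto"]
    return argv, env


argv, env = serve_command("Qwen/Qwen3-0.6B", kv_cache_gb=8)
print(" ".join(argv))
# To actually start the server:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```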

Supported Models

vLLM supports 100+ model architectures including:

Category | Models
LLMs | Llama 2/3/3.1/3.2, Mistral, Mixtral, Qwen 2/2.5/3, Phi-2/3/4, Gemma 2/3, DeepSeek V2/V3/R1
Code | CodeLlama, DeepSeek-Coder, StarCoder 1/2, CodeGemma, Qwen2.5-Coder
Embedding | E5-Mistral, GTE, BGE, Nomic-Embed, Jina
Multimodal | LLaVA, Qwen-VL, Qwen2.5-VL, InternVL, Pixtral, MiniCPM-V
MoE | Mixtral 8x7B/8x22B, DeepSeek-MoE, Qwen-MoE, DBRX

Full list: vLLM Supported Models


Framework Integrations

vLLM's server is fully OpenAI API-compatible. Any client that supports base_url override works out of the box.

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.invoke("Explain machine learning in simple terms")

LlamaIndex

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.complete("What is the capital of France?")

Also works with: Semantic Kernel, AutoGen, CrewAI, Haystack, and any OpenAI-compatible SDK.


Version Support

Version Range | Strategy | Status
v0.17.0+ | Unified CPU wheel (this package) | Active
v0.8.5 -- v0.15.x | Legacy 5-variant wheels | Archived on PyPI

Legacy packages (vllm-cpu-avx512, vllm-cpu-avx512vnni, vllm-cpu-avx512bf16, vllm-cpu-amxbf16) remain on PyPI for older vLLM versions but are no longer updated.


Troubleshooting

Illegal Instruction Error

The unified wheel auto-detects CPU capabilities. If you still see this:

lscpu | grep -E "avx2|avx512|vnni|bf16|amx"    # Check supported features

If no AVX2 flags appear, your CPU is too old for vLLM CPU inference.

Out of Memory (OOM)

# Python: cap the context length to shrink per-request memory
llm = LLM(model="your-model", max_model_len=2048, dtype="bfloat16")

# Shell: reduce the KV cache before starting vLLM
export VLLM_CPU_KVCACHE_SPACE=2
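
To judge how small the KV cache can go, it helps to estimate how many tokens a given budget holds. For a standard transformer, each token stores one key and one value vector per layer, so the cost per token is 2 x layers x KV heads x head dimension x bytes per element. The sketch below applies that formula. The layer and head counts are illustrative placeholders, not the configuration of any particular model, and the budget is treated as GiB, which is an assumption about how the variable is interpreted.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = key + value)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(cache_gib: float, bytes_per_token: int) -> int:
    """Whole tokens that fit in a cache of the given size (GiB assumed)."""
    return int(cache_gib * 2**30) // bytes_per_token


# Illustrative shape: 28 layers, 8 KV heads, head_dim 128, bfloat16 cache.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
print(per_token, "bytes per token")
print(tokens_that_fit(2, per_token), "tokens fit in a 2 GiB cache")
```

If the token budget is smaller than `max_model_len`, a single full-length request cannot fit, so lower `max_model_len` or raise the cache size.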

Slow Performance Checklist

Check | Command / Fix
TCMalloc loaded? | echo $LD_PRELOAD -- should show libtcmalloc
Thread count correct? | echo $OMP_NUM_THREADS -- should equal physical cores
Hyper-threading disabled? | Recommended for bare-metal
Cross-NUMA access? | Use VLLM_CPU_OMP_THREADS_BIND to pin to one node
Using bfloat16? | Float16 is unstable on CPU -- always use dtype="bfloat16"

Multiple vLLM Packages Conflict

pip3 uninstall vllm vllm-cpu vllm-cpu-avx512 vllm-cpu-avx512vnni vllm-cpu-avx512bf16 vllm-cpu-amxbf16 -y
pip3 install vllm-cpu
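
Before uninstalling, it can help to see which of these distributions are actually present in the active environment. A minimal sketch using the standard library's `importlib.metadata`; the package list is copied from the command above, and `installed_versions` is an illustrative name.

```python
from importlib import metadata

# Distribution names taken from the uninstall command above.
VLLM_PACKAGES = [
    "vllm", "vllm-cpu", "vllm-cpu-avx512", "vllm-cpu-avx512vnni",
    "vllm-cpu-avx512bf16", "vllm-cpu-amxbf16",
]


def installed_versions(names=VLLM_PACKAGES, lookup=metadata.version) -> dict:
    """Map each installed distribution name to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = lookup(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return found


found = installed_versions()
if len(found) > 1:
    print("Conflicting vLLM packages installed:", found)
else:
    print("Installed:", found)
```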

RuntimeError: Failed to infer device type

For legacy versions (v0.8.5--v0.15.x), use .post2 releases:

pip3 install vllm-cpu==0.12.0.post2

Links & Resources

Resource Link
Source GitHub Repository github.com/MekayelAnik/vllm-cpu
Docker Docker Hub Images hub.docker.com/r/mekayelanik/vllm-cpu
GHCR GitHub Container Registry ghcr.io/mekayelanik/vllm-cpu
Docs vLLM Documentation docs.vllm.ai
Upstream vLLM Project github.com/vllm-project/vllm
Issues Report a Bug github.com/MekayelAnik/vllm-cpu/issues
Releases Changelog GitHub Releases

License: GPL-3.0 | Upstream: Apache-2.0

Built from vLLM, originally developed at Sky Computing Lab, UC Berkeley


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_x86_64.whl (79.1 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.28+ x86-64

vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_aarch64.whl (37.3 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.28+ ARM64

File details

Details for the file vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 8b554cec41298dd7c8ba8a2e5be4b5e97846d50a2357ff16b79d723d3b65c059
MD5 39db4a57788342586cc03a8645c18884
BLAKE2b-256 2c5f23901b7e1de7d650166fe929d48770233e6156686b15293b5f0054e40e98
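
A downloaded wheel can be checked against the SHA256 digest listed above with a few lines of Python. This is a minimal sketch: the local path is an assumption about where the file was saved, and `sha256_of` is an illustrative helper.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes the wheel sits in the current directory):
#   expected = "8b554cec41298dd7c8ba8a2e5be4b5e97846d50a2357ff16b79d723d3b65c059"
#   actual = sha256_of("vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_x86_64.whl")
#   print("OK" if actual == expected else "MISMATCH")
```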

Provenance

The following attestation bundles were made for vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_x86_64.whl:

Publisher: build-cpu-wheel.yml on MekayelAnik/vllm-cpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 6ef6096711d1df3acda40aba83355eb274546c4bedc6e04d1ff057d87cc1ae4e
MD5 157c6949eeda1cb7466823e66a5c0950
BLAKE2b-256 d4e686ad9fabeda29e4a5fe7a79c4b437cc9b10e227501904ac890c37f1003ed

Provenance

The following attestation bundles were made for vllm_cpu-0.20.0-cp38-abi3-manylinux_2_28_aarch64.whl:

Publisher: build-cpu-wheel.yml on MekayelAnik/vllm-cpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
