vllm-cpu

A high-throughput and memory-efficient inference and serving engine for LLMs

CPU-Optimized vLLM: Easy, Fast LLM Inference Without a GPU

Unified CPU wheel with automatic ISA detection at runtime (AVX2, AVX-512, VNNI, BF16, AMX, NEON)

Overview

vllm-cpu provides unified CPU wheels for vLLM on PyPI. One package, one pip install, automatic CPU instruction set detection.

Why CPU inference?

No expensive GPU required
Run LLMs on any server, laptop, or edge device
Lower power consumption and operational costs
Ideal for development, testing, and moderate-scale deployments
ARM64 support for AWS Graviton 3+, Apple Silicon, and Ampere (BF16-capable ARM)

Key Features:

pip install vllm-cpu -- no manual URLs or GitHub Release downloads
Built with manylinux_2_28 (glibc 2.28+) for broad compatibility (Debian 10+, Ubuntu 18.10+)
Stable ABI (cp38-abi3) -- one wheel for Python 3.10+
Automatic AVX2 / AVX512 / AMX detection at runtime

Table of Contents

Quick Start
Installation
Supported CPU Instructions
CPU Compatibility Guide
Usage Examples
Performance Tips
Environment Variables
Supported Models
Framework Integrations
Version Support
Troubleshooting
Links & Resources

Quick Start

1. Install

pip install vllm-cpu

2. Run your first model

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)

3. Or start an OpenAI-compatible server

vllm serve Qwen/Qwen3-0.6B --dtype auto

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "The future of AI is", "max_tokens": 128}'

Installation

Prerequisites

Requirement	Details
Python	3.10+ (stable ABI -- one wheel for all versions)
OS	Linux (glibc 2.28+) -- Debian 10+, Ubuntu 18.10+, RHEL 8+, Amazon Linux 2023+
CPU	x86_64 with AVX2 (minimum) or AVX512 (optimal), or aarch64
Windows	Use WSL2 (Windows Subsystem for Linux)

pip

pip install vllm-cpu                # Latest
pip install vllm-cpu==0.17.0        # Specific version

uv (faster)

uv pip install vllm-cpu

Virtual environment (recommended)

python -m venv vllm-env && source vllm-env/bin/activate
pip install vllm-cpu

Supported CPU Instructions

The unified wheel automatically detects and uses the best available instruction set at import time. No configuration needed.

	CPU Feature	Benefit
Baseline	AVX2	256-bit SIMD -- works on all modern x86_64
Faster	AVX512	512-bit vectors -- 2x wider than AVX2
Faster	AVX512-VNNI	INT8 multiply-accumulate for quantized inference
Faster	AVX512-BF16	Native BFloat16 -- half the memory of FP32
Fastest	AMX-BF16	Tile-based matrix acceleration (Sapphire Rapids+)
ARM	aarch64 NEON	ARM SIMD for Graviton, Apple Silicon, Ampere

How it works: The wheel ships two extension modules, _C.so (AVX512+BF16+VNNI+AMX) and _C_AVX2.so (AVX2 fallback). When vllm is imported, it checks the CPU's capabilities once and loads the matching .so, so there is no per-call dispatch overhead afterwards.

Check your CPU

lscpu | grep -oE "avx2|avx512[a-z0-9_]*|amx_[a-z0-9]+" | sort -u
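
The selection described under "How it works" can be pictured with a short Python sketch. This is not the package's actual loader: the function names are hypothetical, the tier order simply follows the table above, and the flag names are the ones the Linux kernel reports in /proc/cpuinfo.

```python
from pathlib import Path


def read_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the list "flags"; arm64 kernels label it "Features".
        if key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags


def pick_isa_tier(flags: set[str]) -> str:
    """Map a flag set onto the tiers from the table above (illustrative only)."""
    if "amx_bf16" in flags:
        return "AMX-BF16"
    if "avx512_bf16" in flags:
        return "AVX512-BF16"
    if "avx512_vnni" in flags:
        return "AVX512-VNNI"
    if "avx512f" in flags:
        return "AVX512"
    if "avx2" in flags:
        return "AVX2"
    if "asimd" in flags:  # Linux name for NEON on aarch64
        return "aarch64 NEON"
    return "unsupported"


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print(pick_isa_tier(read_cpu_flags(text)))
```

Running the script on a Linux host prints the highest tier the CPU advertises. Which .so the wheel actually loads is decided inside vllm itself and may use different criteria.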

CPU Compatibility Guide

Intel

Generation	Example CPUs	ISA Used
Haswell+ (2013)	Core i5/i7 4th--10th Gen (most models)	AVX2
Skylake-X (2017)	Core i9-7900X, Xeon W-2195	AVX512
Cascade Lake (2019)	Xeon Platinum 8280	AVX512 + VNNI
Cooper Lake (2020)	Xeon Platinum 8380H	AVX512 + BF16
Sapphire Rapids+ (2023)	Xeon w9-3495X, 4th/5th/6th Gen Xeon	AVX512 + AMX
Consumer 12th--14th Gen	Core i5/i7/i9 (Alder Lake+)	AVX2

AMD

Generation	Example CPUs	ISA Used
Zen 2/3 (2019--2020)	Ryzen 3000--5000, EPYC 7002--7003	AVX2
Zen 4+ (2022+)	Ryzen 7000+, EPYC 9004+	AVX512 + BF16

ARM

Platform	Example	ISA Used
AWS Graviton 2/3/4	c7g, m7g instances	NEON
Apple Silicon	M1--M4 (via Docker/Lima)	NEON
Ampere Altra	Cloud instances	NEON

Usage Examples

Batch Processing

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-1b-it",
    dtype="bfloat16",
    max_model_len=2048
)

prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to reverse a string:",
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")

OpenAI Python Client

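from openai import OpenAI

# Assumes a server is already running, e.g.: vllm serve Qwen/Qwen3-4B
# The api_key value is a placeholder; the client requires one to be set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",  # Must match the model the server is serving
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)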
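
If you would rather not install the openai package, the same chat request can be sketched with the Python standard library alone. The helper names build_chat_request and send are hypothetical. The endpoint path and JSON shape follow the chat completions call used in this section.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With a server running locally, send(build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-4B", "Hello!")) returns the reply as a string.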

cURL

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Performance Tips

1. Use TCMalloc (strongly recommended)

Official recommendation: vLLM strongly recommends TCMalloc for high-performance memory allocation and better cache locality.

# Install
sudo apt install libtcmalloc-minimal4        # Debian/Ubuntu
sudo dnf install gperftools-libs              # RHEL/Fedora

# Preload
export LD_PRELOAD=$(find /usr -name "libtcmalloc_minimal.so*" | head -1)
vllm serve your-model --dtype auto

2. Set thread count to physical cores

Tip: Disable hyper-threading on bare-metal for best performance. Reserve 1--2 cores for the HTTP serving framework.

export OMP_NUM_THREADS=16                     # Physical core count
export MKL_NUM_THREADS=16
export VLLM_CPU_OMP_THREADS_BIND=0-13         # Pin inference threads to cores 0-13
export VLLM_CPU_NUM_OF_RESERVED_CPU=2          # Reserve for HTTP serving (applies when binding is auto)
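
To find the physical core count and derive a binding range like the one above, something along these lines may help. It is a sketch under stated assumptions: it parses the x86 Linux /proc/cpuinfo layout ("physical id" and "core id" fields), and bind_range assumes hyper-threading is disabled so that logical CPU numbers 0..N-1 map to distinct physical cores. Both function names are hypothetical.

```python
def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (socket, core) pairs in x86 /proc/cpuinfo-style text."""
    cores: set[tuple[str, str]] = set()
    socket = core = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            socket = value
        elif key == "core id":
            core = value
        # Once both fields of a processor entry are seen, record the pair.
        if socket is not None and core is not None:
            cores.add((socket, core))
            socket = core = None
    return len(cores)


def bind_range(physical_cores: int, reserved: int) -> str:
    """Core range for inference threads, leaving `reserved` cores free."""
    usable = physical_cores - reserved
    if usable < 1:
        raise ValueError("no cores left for inference after reserving")
    return f"0-{usable - 1}"
```

For a 16-core machine with 2 cores reserved, bind_range(16, 2) gives "0-13", which matches the VLLM_CPU_OMP_THREADS_BIND value shown above. On aarch64 the cpuinfo fields differ and count_physical_cores returns 0, so use lscpu there instead.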

3. Use BFloat16

Note: Float16 is unstable on CPU. Always use bfloat16.

llm = LLM(model="your-model", dtype="bfloat16")

4. NUMA optimization (multi-socket systems)

# Simple: bind to one NUMA node
numactl --cpunodebind=0 --membind=0 python your_script.py

# Advanced: Tensor Parallel across NUMA nodes
VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve your-model \
  --dtype auto --tensor-parallel-size 2
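
The per-node binding value can be generated rather than typed by hand. The sketch below assumes every NUMA node has the same number of cores and that core IDs are numbered contiguously node by node; check lscpu or numactl --hardware before relying on that. The function name is hypothetical.

```python
def numa_bind_string(cores_per_node: int, nodes: int) -> str:
    """Build a per-node core binding such as '0-31|32-63'."""
    ranges = []
    for node in range(nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        ranges.append(f"{first}-{last}")
    # One range per tensor-parallel rank, separated by '|'.
    return "|".join(ranges)
```

numa_bind_string(32, 2) produces "0-31|32-63". When passing the value on a shell command line, wrap it in quotes, because an unquoted | is interpreted as a pipe.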

5. Tune KV cache

export VLLM_CPU_KVCACHE_SPACE=40              # 40 GB for KV cache
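
To get a feel for how many tokens a given cache size holds, the standard per-token KV cache formula can be applied. This is a rough sketch: the model dimensions in the usage note are hypothetical, the GB value is treated as GiB, and block-level allocation overhead is ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (bytes_per_value=2 for bfloat16)."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(cache_gib: int, per_token_bytes: int) -> int:
    """Number of whole tokens that fit in a cache of `cache_gib` GiB."""
    return (cache_gib * 2**30) // per_token_bytes
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, each token takes 131072 bytes in bfloat16, so a 40 GiB cache holds 327680 tokens shared across all concurrent requests.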

6. SGL kernels (x86, experimental)

export VLLM_CPU_SGL_KERNEL=1                  # Low-latency online serving

7. Quantized models

llm = LLM(model="Qwen/Qwen3-8B-GPTQ-Int4", quantization="gptq")

Memory Estimation

Model Size	bfloat16	GPTQ INT4
1B params	~4 GB	~2 GB
7B params	~16 GB	~6 GB
13B params	~28 GB	~10 GB
70B params	~140 GB	~40 GB

Add 2--8 GB for KV cache depending on VLLM_CPU_KVCACHE_SPACE and context length.
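
As a cross-check on the table, the memory taken by the weights alone follows from parameter count times bits per parameter. The sketch below computes only that lower bound. The table's figures are mostly somewhat higher, presumably because they include runtime overhead, which this helper does not model.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_param / 8
```

For example, weight_memory_gb(7, 16) is 14.0 for a 7B model in bfloat16, and weight_memory_gb(7, 4) is 3.5 for the same model at 4 bits per weight.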

Environment Variables

Variable	Description	Default
`VLLM_CPU_KVCACHE_SPACE`	KV cache size in GB (larger = more concurrent requests)	0 (auto)
`VLLM_CPU_OMP_THREADS_BIND`	CPU core binding (`0-31`, `auto`, or `nobind`)	auto
`VLLM_CPU_NUM_OF_RESERVED_CPU`	Cores reserved for HTTP serving (when bind=auto)	0
`VLLM_CPU_SGL_KERNEL`	Small-batch optimized kernels (x86, experimental)	0
`OMP_NUM_THREADS`	OpenMP thread count	All cores
`MKL_NUM_THREADS`	Intel MKL thread count	All cores
`LD_PRELOAD`	Preload TCMalloc for better memory performance	--
`HF_TOKEN`	Hugging Face access token	--
`HF_HOME`	Hugging Face cache directory	~/.cache/huggingface

Supported Models

vLLM supports 100+ model architectures including:

Category	Models
LLMs	Llama 2/3/3.1/3.2, Mistral, Mixtral, Qwen 2/2.5/3, Phi-2/3/4, Gemma 2/3, DeepSeek V2/V3/R1
Code	CodeLlama, DeepSeek-Coder, StarCoder 1/2, CodeGemma, Qwen2.5-Coder
Embedding	E5-Mistral, GTE, BGE, Nomic-Embed, Jina
Multimodal	LLaVA, Qwen-VL, Qwen2.5-VL, InternVL, Pixtral, MiniCPM-V
MoE	Mixtral 8x7B/8x22B, DeepSeek-MoE, Qwen-MoE, DBRX

Full list: vLLM Supported Models

Framework Integrations

vLLM's server exposes an OpenAI-compatible API. Any client that lets you override base_url can talk to it without further changes.

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.invoke("Explain machine learning in simple terms")

LlamaIndex

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen3-4B"
)
response = llm.complete("What is the capital of France?")

Also works with: Semantic Kernel, AutoGen, CrewAI, Haystack, and any OpenAI-compatible SDK.

Version Support

Version Range	Strategy	Status
v0.17.0+	Unified CPU wheel (this package)	Active
v0.8.5 -- v0.15.x	Legacy 5-variant wheels	Archived on PyPI

Legacy packages (vllm-cpu-avx512, vllm-cpu-avx512vnni, vllm-cpu-avx512bf16, vllm-cpu-amxbf16) remain on PyPI for older vLLM versions but are no longer updated.

Troubleshooting

Illegal Instruction Error

The unified wheel auto-detects CPU capabilities. If you still see this:

lscpu | grep -oE "avx2|avx512[a-z0-9_]*|amx_[a-z0-9]+" | sort -u    # List supported SIMD features

If avx2 does not appear in the output, your x86_64 CPU is too old for vLLM CPU inference.

Out of Memory (OOM)

llm = LLM(model="your-model", max_model_len=2048, dtype="bfloat16")

export VLLM_CPU_KVCACHE_SPACE=2               # Reduce KV cache

Slow Performance Checklist

Check	Command / Fix
TCMalloc loaded?	`echo $LD_PRELOAD` -- should show libtcmalloc
Thread count correct?	`echo $OMP_NUM_THREADS` -- should equal physical cores
Hyper-threading disabled?	Recommended for bare-metal
Cross-NUMA access?	Use `VLLM_CPU_OMP_THREADS_BIND` to pin to one node
Using bfloat16?	Float16 is unstable on CPU -- always use `dtype="bfloat16"`
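
The first two checklist items can be automated with a small script. This is an illustrative sketch with a hypothetical function name. It only inspects environment variables and cannot tell whether TCMalloc was actually loaded into the process.

```python
import os
from collections.abc import Mapping


def check_env(env: Mapping[str, str], physical_cores: int) -> list[str]:
    """Return a list of findings; an empty list means both checks passed."""
    findings: list[str] = []
    if "tcmalloc" not in env.get("LD_PRELOAD", ""):
        findings.append("LD_PRELOAD does not reference libtcmalloc")
    threads = env.get("OMP_NUM_THREADS", "")
    if not threads.isdigit():
        findings.append("OMP_NUM_THREADS is unset or not a plain integer")
    elif int(threads) != physical_cores:
        findings.append(f"OMP_NUM_THREADS={threads}, expected {physical_cores}")
    return findings


if __name__ == "__main__":
    # 16 is a placeholder; substitute your machine's physical core count.
    for finding in check_env(os.environ, physical_cores=16):
        print("WARN:", finding)
```

Run it in the same shell session that launches vllm so that it sees the same environment.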

Multiple vLLM Packages Conflict

pip uninstall vllm vllm-cpu vllm-cpu-avx512 vllm-cpu-avx512vnni vllm-cpu-avx512bf16 vllm-cpu-amxbf16 -y
pip install vllm-cpu
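
Before uninstalling, it can help to see which vLLM distributions are present in the current environment. The sketch below uses the standard library's importlib.metadata. The helper names are hypothetical, and the name filter (vllm plus anything starting with vllm-cpu) is an assumption based on the package names listed above.

```python
from importlib.metadata import distributions


def vllm_packages(names: list[str]) -> list[str]:
    """Return the normalized names that belong to the vLLM package family."""
    found = set()
    for name in names:
        normalized = name.lower().replace("_", "-")
        if normalized == "vllm" or normalized.startswith("vllm-cpu"):
            found.add(normalized)
    return sorted(found)


def installed_names() -> list[str]:
    """Names of all distributions installed in the current environment."""
    names = []
    for dist in distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    present = vllm_packages(installed_names())
    if len(present) > 1:
        print("Conflicting vLLM packages:", ", ".join(present))
    else:
        print("Installed vLLM packages:", present)
```

If more than one name is reported, the uninstall and reinstall steps above apply.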

RuntimeError: Failed to infer device type

For legacy versions (v0.8.5--v0.15.x), use .post2 releases:

pip install vllm-cpu==0.12.0.post2

Links & Resources

	Resource	Link
Source	GitHub Repository	github.com/MekayelAnik/vllm-cpu
Docker	Docker Hub Images	hub.docker.com/r/mekayelanik/vllm-cpu
GHCR	GitHub Container Registry	ghcr.io/mekayelanik/vllm-cpu
Docs	vLLM Documentation	docs.vllm.ai
Upstream	vLLM Project	github.com/vllm-project/vllm
Issues	Report a Bug	github.com/MekayelAnik/vllm-cpu/issues
Releases	Changelog	GitHub Releases

License: GPL-3.0 | Upstream: Apache-2.0

Built from vLLM, originally developed at Sky Computing Lab, UC Berkeley

Release history

Version	Release date
0.21.0	May 15, 2026
0.20.2	May 10, 2026
0.20.1	May 3, 2026
0.20.0	Apr 28, 2026
0.19.1	Apr 18, 2026
0.19.0	Apr 5, 2026
0.18.1	Apr 6, 2026
0.18.0	Apr 6, 2026
0.17.1 (this version)	Apr 6, 2026
0.17.0	Apr 4, 2026
0.16.0	Apr 5, 2026
0.15.1	Apr 5, 2026
0.15.0	Jan 29, 2026
0.14.1	Jan 24, 2026
0.14.0	Jan 20, 2026
0.13.0	Dec 19, 2025
0.12.0.post2	Dec 7, 2025
0.12.0	Dec 3, 2025
0.11.2.post2	Dec 7, 2025
0.11.2	Nov 27, 2025
0.11.1.post2	Dec 7, 2025
0.11.1	Nov 27, 2025
0.11.0.post2	Dec 7, 2025
0.11.0	Nov 27, 2025
0.10.2.post2	Dec 7, 2025
0.10.2	Nov 27, 2025
0.10.1.1.post2	Dec 7, 2025
0.10.1.1	Dec 3, 2025
0.10.1.post2	Dec 7, 2025
0.10.1	Nov 27, 2025
0.10.0.post2	Dec 7, 2025
0.10.0	Nov 27, 2025
0.9.2.post2	Dec 7, 2025
0.9.2	Dec 3, 2025
0.9.1.post2	Dec 7, 2025
0.9.1	Dec 3, 2025
0.9.0.1.post2	Dec 7, 2025
0.9.0.1	Dec 3, 2025
0.9.0.post2	Dec 7, 2025
0.9.0	Dec 3, 2025
0.8.5.post2	Dec 6, 2025
0.8.5	Dec 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_x86_64.whl (57.4 MB)

Uploaded Apr 6, 2026; CPython 3.8+; manylinux: glibc 2.28+ x86-64

vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_aarch64.whl (34.2 MB)

Uploaded Apr 6, 2026; CPython 3.8+; manylinux: glibc 2.28+ ARM64

File details

Details for the file vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_x86_64.whl.

File metadata

Download URL: vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_x86_64.whl
Upload date: Apr 6, 2026
Size: 57.4 MB
Tags: CPython 3.8+, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`b0c113121f3981bd1c20c30ae7058b504f94a97b5e62cea900f9d9baeb484733`
MD5	`bf5c29c0ca618ca516f16ba90015f8b6`
BLAKE2b-256	`927cbb327a6f2c641314425b2a1cdc9ce7e170517908ba9f4c0f659c8445d051`

File details

Details for the file vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_aarch64.whl.

File metadata

Download URL: vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_aarch64.whl
Upload date: Apr 6, 2026
Size: 34.2 MB
Tags: CPython 3.8+, manylinux: glibc 2.28+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllm_cpu-0.17.1-cp38-abi3-manylinux_2_28_aarch64.whl
Algorithm	Hash digest
SHA256	`0ab962b7d7e8026b0a8da021f8424ee5914c91bae8d15428cf5a865c8e108ee1`
MD5	`b6b9ddc38d5e0a47beec44c6f42faea2`
BLAKE2b-256	`e58aa15a6665bca6923e8bee9d53ac912e39abe4fdfaf7c7843966a55e034b37`
