From-scratch paged-attention inference engine: paged KV cache, continuous batching, preemption
smol-vllm
A paged-attention inference engine with a paged KV cache, continuous batching, and preemption. Built for learning, not for production use.
Install
pip install smol-vllm
Real models (TinyLlama, Qwen2, etc.):
pip install smol-vllm[tinyllama-1.1b]
# or
pip install smol-vllm[qwen2-0.5b]
Quick Start
FakeModel (no extras):
from smol_vllm import LLMEngine
engine = LLMEngine()
for token in engine.generate([1, 2, 3, 4, 5], max_tokens=20):
    print(token, end=" ")
CausalLM (needs [tinyllama-1.1b] or [qwen2-0.5b]):
from smol_vllm import LLMEngine

engine = LLMEngine(use_real_model=True)
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
Models
| Model | model_name |
|---|---|
| TinyLlama 1.1B | TinyLlama/TinyLlama-1.1B-Chat-v1.0 (default) |
| Qwen2 0.5B | Qwen/Qwen2-0.5B-Instruct |
| Phi-2 | microsoft/phi-2 |
| Llama 3.2 | meta-llama/Llama-3.2-1B-Instruct |
| Gemma 2 | google/gemma-2-2b-it |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 |
Gated models (Llama, Gemma, etc.) need a Hugging Face access token. Two ways to provide it:
1. Env var (recommended):
export HF_TOKEN=hf_xxxxxxxxxxxx
2. In code:
LLMEngine(use_real_model=True, model_name="meta-llama/Llama-3.2-1B-Instruct", hf_token="hf_xxxx")
Get a token: huggingface.co/settings/tokens. Accept the model's license on its HF page first.
Demo
smol-vllm-demo
What It Teaches
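- PagedAttention: block-based KV cache with reference counting
- Continuous batching: waiting requests fill batch slots as soon as running ones finish
- Preemption & swapping: what the engine does when KV cache memory runs low
- Prefill vs decode: prefill is typically compute-bound, decode is typically memory-bound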
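To make the first bullet concrete, here is a minimal sketch of a block allocator with reference counting. It is not taken from smol-vllm's source: `BlockAllocator`, `blocks_needed`, and every method name are hypothetical, chosen only to illustrate the idea that sequences hold lists of fixed-size block ids and that a block is freed once no sequence references it.

```python
class BlockAllocator:
    """Toy allocator for fixed-size KV cache blocks (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.ref_count = {}                  # block id -> number of sequences using it

    def allocate(self) -> int:
        """Hand out a free block with a reference count of one."""
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let another sequence reuse an existing block, e.g. a common prefix."""
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; the block goes back to the free list at zero."""
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size blocks required to hold num_tokens KV entries."""
    return -(-num_tokens // block_size)  # ceiling division


alloc = BlockAllocator(num_blocks=4)
# Sequence A: 5 tokens with block_size=4 needs 2 blocks.
table_a = [alloc.allocate() for _ in range(blocks_needed(5, block_size=4))]
# Sequence B shares A's blocks instead of copying them.
table_b = [alloc.share(b) for b in table_a]
for b in table_a:
    alloc.release(b)  # A finishes; blocks stay allocated because B still uses them
for b in table_b:
    alloc.release(b)  # B finishes; now the blocks are really free
```

Each sequence keeps its own block table (a list of block ids), so the cache does not have to be contiguous in memory, and sharing a prefix costs one counter increment per block rather than a copy.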
Workflow: run with FakeModel first (zero deps), then switch to CausalLM to compare.
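The continuous batching and preemption ideas listed above can also be sketched without any model at all. The loop below is a toy under stated assumptions, not smol-vllm's scheduler: `Request`, `run`, and all parameters are hypothetical, each step "decodes" one token per running request, and preemption is recompute-style (the victim restarts from its prompt) rather than swapping to CPU memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int
    max_tokens: int
    generated: int = 0

    def blocks(self, block_size: int) -> int:
        # Blocks needed to hold the KV entries once the next token is appended.
        total = self.prompt_len + self.generated + 1
        return -(-total // block_size)  # ceiling division


def run(requests, max_batch=2, num_blocks=4, block_size=4):
    """Return (finish order, number of preemptions) for a toy schedule."""
    waiting, running = deque(requests), []
    finished, preemptions = [], 0

    def used():
        return sum(r.blocks(block_size) for r in running)

    while waiting or running:
        # Continuous batching: admit waiting requests whenever a slot and memory allow.
        while waiting and len(running) < max_batch:
            if used() + waiting[0].blocks(block_size) > num_blocks:
                break
            running.append(waiting.popleft())
        # Preemption: evict the newest request until the batch fits in the cache.
        while running and used() > num_blocks:
            victim = running.pop()
            victim.generated = 0  # recompute-style: its progress is discarded
            waiting.appendleft(victim)
            preemptions += 1
        if not running:
            raise RuntimeError("a request does not fit in the KV cache")
        # One decode step: every running request produces one token.
        for r in running:
            r.generated += 1
        for r in [r for r in running if r.generated >= r.max_tokens]:
            running.remove(r)
            finished.append(r.name)
    return finished, preemptions


order, _ = run([Request("A", 2, 2), Request("B", 2, 6), Request("C", 2, 2)])
print(order)  # ['A', 'C', 'B']
```

Request C takes over A's slot as soon as A finishes, so it completes before the longer request B instead of waiting for the whole batch to drain.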
Metrics
Each step records prefill/decode latency, throughput (tok/s), and KV cache utilization. A summary and CSV logs are written to logs/.
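For readers who want to see how such numbers are derived, the sketch below computes throughput and KV utilization for one step and renders rows as CSV. The `StepMetrics` class, its fields, and the column names are assumptions made for illustration; the actual log format in logs/ may differ.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class StepMetrics:
    """One engine step (hypothetical record layout)."""
    step: int
    phase: str          # "prefill" or "decode"
    latency_s: float    # wall-clock time of the step
    num_tokens: int     # tokens processed in the step
    used_blocks: int    # KV cache blocks currently allocated
    total_blocks: int   # KV cache capacity in blocks

    @property
    def tokens_per_s(self) -> float:
        return self.num_tokens / self.latency_s if self.latency_s > 0 else 0.0

    @property
    def kv_util(self) -> float:
        return self.used_blocks / self.total_blocks


def to_csv(rows) -> str:
    """Render step metrics as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["step", "phase", "latency_s", "tok_per_s", "kv_util"])
    for m in rows:
        writer.writerow([
            m.step,
            m.phase,
            f"{m.latency_s:.4f}",
            f"{m.tokens_per_s:.1f}",
            f"{m.kv_util:.2f}",
        ])
    return buf.getvalue()


rows = [
    StepMetrics(0, "prefill", 0.5, 100, 2, 8),  # many prompt tokens in one step
    StepMetrics(1, "decode", 0.02, 1, 2, 8),    # one new token per step
]
print(to_csv(rows))
```

Prefill processes the whole prompt in one step, while decode emits a single token per step, which is why the two phases are worth logging separately.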
License
MIT