Simple paged attention with KV cache for RL scenarios
bllm
Fast batched LLM inference with native PyTorch. Designed for RL training scenarios where you need:
- Direct weight sharing between inference and training in the same process
- Efficient batching without external servers
- Simple setup (no Ray, no vLLM server)
Installation
pip install -e .
# With API comparison support (for benchmarking against vLLM/Ollama)
pip install -e ".[api]"
Usage
Generate text
# Single prompt
bllm generate Qwen/Qwen2.5-0.5B-Instruct "tell me about yourself"
# Multiple prompts (batched)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["prompt1", "prompt2", "prompt3"]'
# Shorthand for repeated prompts
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["tell me about yourself"] * 20'
# Variable length prompts for testing
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]'
# Quiet mode (stats only)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["hello"] * 10' -q
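The prompt argument in the examples above is either a plain string or a Python list expression. As a rough sketch of how such an argument could be expanded into a batch (not bllm's actual parser; `expand_prompts`, the word list, and this `lorem` are hypothetical stand-ins written for illustration):

```python
import random

# Hypothetical stand-in for the `lorem` helper used in the examples above;
# the real helper in bllm may choose its filler words differently.
_WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()


def lorem(n_words: int) -> str:
    """Return n_words of filler text separated by single spaces."""
    return " ".join(random.choice(_WORDS) for _ in range(n_words))


def expand_prompts(arg: str) -> list[str]:
    """Turn a CLI prompt argument into a list of prompts.

    Assumption: an argument starting with '[' is a Python list expression;
    anything else is treated as one literal prompt.
    """
    if not arg.lstrip().startswith("["):
        return [arg]
    # Expose only the names the examples use. eval is still unsafe on
    # untrusted input, so this is only suitable for local experiments.
    namespace = {
        "__builtins__": {"range": range},
        "lorem": lorem,
        "randint": random.randint,
    }
    prompts = eval(arg, namespace)
    if not (isinstance(prompts, list) and all(isinstance(p, str) for p in prompts)):
        raise ValueError("prompt expression must evaluate to a list of strings")
    return prompts
```

With this sketch, `expand_prompts('["hello"] * 10')` yields ten identical prompts, and the `lorem(randint(5,50))` form yields prompts of 5 to 50 words each, which is what makes it useful for exercising variable-length batching.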
Interactive chat
bllm chat Qwen/Qwen2.5-0.5B-Instruct
bllm chat Qwen/Qwen2.5-0.5B-Instruct --stream
Benchmarking against vLLM or Ollama
Both vLLM and Ollama expose OpenAI-compatible APIs. Use --api to benchmark against either.
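To make "OpenAI-compatible" concrete, here is a minimal sketch of a request against such a server using only the standard library. It assumes the `/completions` route under the base URL passed to `--api`; which route bllm itself calls is not stated here, and `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 128) -> str:
    """Send the request and return the text of the first completion.

    Requires a server that is already running at base_url.
    """
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `complete("http://localhost:8000/v1", "Qwen/Qwen2.5-0.5B-Instruct", "hello")` would query a local vLLM server, and the same call with port 11434 and the Ollama model tag would query Ollama.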
vLLM (CUDA)
# Install
pip install -e ".[cuda,api]"
pip install vllm
# Start vLLM server (by default it may claim up to 90% of GPU memory for weights, activations, and KV cache)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
# Or limit memory usage (useful when sharing GPU)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
--gpu-memory-utilization 0.3 --max-model-len 2048
# Compare
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
--api http://localhost:8000/v1
Ollama (Mac/Linux)
# Install
pip install -e ".[api]"
# Start Ollama (leave it running in a separate terminal), then pull the model
ollama serve
ollama pull qwen2.5:0.5b-instruct
# Compare (Ollama uses port 11434 by default)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate qwen2.5:0.5b-instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
--api http://localhost:11434/v1
Packed vs padded prefill (CUDA only)
# With Flash Attention (packed, efficient for variable-length prompts)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
# Without packing (padded to max length)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q --no-pack
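The difference between the two modes comes down to how many token positions the prefill has to process. The sketch below is illustrative only: it counts positions for both layouts and builds the cumulative sequence-length offsets that variable-length Flash Attention kernels typically take. Whether bllm builds its offsets exactly this way is an assumption, and the function names are hypothetical.

```python
from itertools import accumulate


def padded_token_count(lengths: list[int]) -> int:
    """Positions processed when every prompt is padded to the longest one."""
    return len(lengths) * max(lengths)


def packed_token_count(lengths: list[int]) -> int:
    """Positions processed when prompts are concatenated with no padding."""
    return sum(lengths)


def cu_seqlens(lengths: list[int]) -> list[int]:
    """Cumulative offsets marking where each prompt starts in the packed buffer.

    Prompt i occupies positions cu[i] .. cu[i + 1] - 1.
    """
    return [0, *accumulate(lengths)]


lengths = [5, 50, 12, 33]
padded = padded_token_count(lengths)
packed = packed_token_count(lengths)
offsets = cu_seqlens(lengths)
```

For these four prompt lengths the padded layout processes 200 positions and the packed layout 100, with offsets `[0, 5, 55, 67, 100]`. The gap grows as prompt lengths become more uneven.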
RL Integration
The main advantage over vLLM for RL is in-process weight updates:
from bllm.engine.inference_engine import InferenceEngine
# Create engine
engine = InferenceEngine("Qwen/Qwen2.5-0.5B-Instruct", device="cuda")
# Generate rollouts
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)
# Update weights after training step (no serialization, no IPC)
engine.update_weights(new_state_dict)
# Generate with updated weights
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)
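The calls above slot into a rollout/train/update loop. The sketch below shows only the shape of that loop and runs without a GPU: `StubEngine` is a hypothetical stand-in that mirrors the two methods used above, and `train_step` is a placeholder for whatever produces the new state dict in a real trainer.

```python
from typing import Callable


class StubEngine:
    """Hypothetical stand-in exposing the same two methods as the engine above."""

    def __init__(self) -> None:
        self.state_dict = {"version": 0}

    def generate(self, prompts: list[str], max_new_tokens: int = 128) -> list[str]:
        # A real engine would decode tokens; here we only tag the weights version.
        version = self.state_dict["version"]
        return [f"{p} -> completion from weights v{version}" for p in prompts]

    def update_weights(self, new_state_dict: dict) -> None:
        # In-process hand-off: the engine just takes the new object.
        self.state_dict = new_state_dict


def rl_loop(engine, prompts: list[str],
            train_step: Callable[[list[str]], dict],
            num_iterations: int) -> list[list[str]]:
    """Alternate rollouts and weight updates; return the rollouts per iteration."""
    history = []
    for _ in range(num_iterations):
        rollouts = engine.generate(prompts, max_new_tokens=128)
        new_state_dict = train_step(rollouts)  # placeholder for the optimizer step
        engine.update_weights(new_state_dict)
        history.append(rollouts)
    return history
```

Because generation and training share a process, each iteration's rollouts come from the weights produced by the previous training step, with no checkpoint written in between.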
Performance
On Apple Silicon (M-series):
- True GPU batching via MPS
- 155+ tok/s for 20 batched prompts (vs Ollama's ~143 tok/s sequential)
On CUDA:
- Higher throughput with optimized kernels
- torch.compile support for additional speedup
Key optimizations:
- Native GQA support via enable_gqa=True (PyTorch 2.5+)
- View-based KV cache access (no copies for contiguous batches)
- Batched tensor operations (minimal CPU-GPU sync)
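To illustrate the paged KV cache idea behind the view-based access point, here is a minimal sketch of the bookkeeping only, with no tensors involved. `BlockTable` and its methods are hypothetical names and do not describe bllm's internal data structures; the contiguity check shows one plausible condition under which a slice view of the cache pool could replace a gather.

```python
class BlockTable:
    """Maps each sequence's token positions to fixed-size physical cache blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))      # unused physical block ids
        self.tables: dict[str, list[int]] = {}   # seq_id -> blocks, in order
        self.lengths: dict[str, int] = {}        # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for the next token; return (block id, offset in block)."""
        table = self.tables.setdefault(seq_id, [])
        pos = self.lengths.get(seq_id, 0)
        if pos % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("KV cache is out of free blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = pos + 1
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def is_contiguous(self, seq_id: str) -> bool:
        """True when the sequence's blocks are consecutive in the pool."""
        table = self.tables.get(seq_id, [])
        return all(b - a == 1 for a, b in zip(table, table[1:]))
```

A sequence that grows without interleaving keeps consecutive blocks, so its cache can be read as one slice. Once another sequence claims a block in between, the table becomes non-contiguous and a gather over block ids is needed instead.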