Simple paged attention with KV cache for RL scenarios

Project description

bllm

Fast batched LLM inference with native PyTorch. Designed for RL training scenarios where you need:

Direct weight sharing between inference and training in the same process
Efficient batching without external servers
Simple setup (no Ray, no vLLM server)

Installation

pip install -e .

# With API comparison support (for benchmarking against vLLM/Ollama)
pip install -e ".[api]"

Usage

Generate text

# Single prompt
bllm generate Qwen/Qwen2.5-0.5B-Instruct "tell me about yourself"

# Multiple prompts (batched)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["prompt1", "prompt2", "prompt3"]'

# Shorthand for repeated prompts
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["tell me about yourself"] * 20'

# Variable length prompts for testing
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]'

# Quiet mode (stats only)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["hello"] * 10' -q

Interactive chat

bllm chat Qwen/Qwen2.5-0.5B-Instruct
bllm chat Qwen/Qwen2.5-0.5B-Instruct --stream

Benchmarking against vLLM or Ollama

Both vLLM and Ollama expose OpenAI-compatible APIs. Use --api to benchmark against either.

vLLM (CUDA)

# Install
pip install -e ".[cuda,api]"
pip install vllm

# Start vLLM server (default uses 90% of GPU memory for KV cache)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Or limit memory usage (useful when sharing GPU)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
    --gpu-memory-utilization 0.3 --max-model-len 2048

# Compare
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:8000/v1

Ollama (Mac/Linux)

# Install
pip install -e ".[api]"

# Start Ollama
ollama serve
ollama pull qwen2.5:0.5b-instruct

# Compare (Ollama uses port 11434 by default)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate qwen2.5:0.5b-instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:11434/v1

Packed vs padded prefill (CUDA only)

# With Flash Attention (packed, efficient for variable-length prompts)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q

# Without packing (padded to max length)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q --no-pack

RL Integration

The main advantage over vLLM for RL is in-process weight updates:

from bllm.engine.inference_engine import InferenceEngine

# Create engine
engine = InferenceEngine("Qwen/Qwen2.5-0.5B-Instruct", device="cuda")

# Generate rollouts
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)

# Update weights after training step (no serialization, no IPC)
engine.update_weights(new_state_dict)

# Generate with updated weights
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)

Performance

On Apple Silicon (M-series):

True GPU batching via MPS
155+ tok/s for 20 batched prompts (vs Ollama's ~143 tok/s sequential)

On CUDA:

Higher throughput with optimized kernels
torch.compile support for additional speedup

Key optimizations:

Native GQA support via enable_gqa=True (PyTorch 2.5+)
View-based KV cache access (no copies for contiguous batches)
Batched tensor operations (minimal CPU-GPU sync)

Project details

Release history Release notifications | RSS feed

This version

0.1.0

May 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bllm_inference-0.1.0.tar.gz (41.8 kB view details)

Uploaded May 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

bllm_inference-0.1.0-py3-none-any.whl (51.1 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file bllm_inference-0.1.0.tar.gz.

File metadata

Download URL: bllm_inference-0.1.0.tar.gz
Upload date: May 13, 2026
Size: 41.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bllm_inference-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`83718950328c2f7193dc65bd4d24d4c4472c8af49189478312d4d9dd49b78844`
MD5	`4bac684723faca130246c4f2bf6f0ad5`
BLAKE2b-256	`6fb2fe418910d18366d0b26b5f251c6d11323e8b05ec95604120dcea59181f9f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for bllm_inference-0.1.0.tar.gz:

Publisher: release.yaml on blythed/bllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bllm_inference-0.1.0.tar.gz
- Subject digest: 83718950328c2f7193dc65bd4d24d4c4472c8af49189478312d4d9dd49b78844
- Sigstore transparency entry: 1523817427
- Sigstore integration time: May 13, 2026
Source repository:
- Permalink: blythed/bllm@ff59da4f9c65feb0d39231fac83a47a5ee83804d
- Branch / Tag: refs/heads/main
- Owner: https://github.com/blythed
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@ff59da4f9c65feb0d39231fac83a47a5ee83804d
- Trigger Event: push

File details

Details for the file bllm_inference-0.1.0-py3-none-any.whl.

File metadata

Download URL: bllm_inference-0.1.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 51.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bllm_inference-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e61262a62f65cd1c8ead9b45b25f470efaac6f8db8b046ebbadc00ae4dcd637`
MD5	`7539286b3504f3717711c8e447ef3919`
BLAKE2b-256	`743af5bd470387ae32e66f187f04590a0a72214f2b45ecc9ab015fb674912f91`

See more details on using hashes here.

Provenance

The following attestation bundles were made for bllm_inference-0.1.0-py3-none-any.whl:

Publisher: release.yaml on blythed/bllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bllm_inference-0.1.0-py3-none-any.whl
- Subject digest: 2e61262a62f65cd1c8ead9b45b25f470efaac6f8db8b046ebbadc00ae4dcd637
- Sigstore transparency entry: 1523817435
- Sigstore integration time: May 13, 2026
Source repository:
- Permalink: blythed/bllm@ff59da4f9c65feb0d39231fac83a47a5ee83804d
- Branch / Tag: refs/heads/main
- Owner: https://github.com/blythed
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@ff59da4f9c65feb0d39231fac83a47a5ee83804d
- Trigger Event: push

bllm-inference 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

bllm

Installation

Usage

Generate text

Interactive chat

Benchmarking against vLLM or Ollama

vLLM (CUDA)

Ollama (Mac/Linux)

Packed vs padded prefill (CUDA only)

RL Integration

Performance

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance