Simple paged attention with KV cache for RL scenarios

bllm

Fast batched LLM inference with native PyTorch. Designed for RL training scenarios where you need:

  • Direct weight sharing between inference and training in the same process
  • Efficient batching without external servers
  • Simple setup (no Ray, no vLLM server)

Installation

pip install -e .

# With API comparison support (for benchmarking against vLLM/Ollama)
pip install -e ".[api]"

Usage

Generate text

# Single prompt
bllm generate Qwen/Qwen2.5-0.5B-Instruct "tell me about yourself"

# Multiple prompts (batched)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["prompt1", "prompt2", "prompt3"]'

# Shorthand for repeated prompts
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["tell me about yourself"] * 20'

# Variable length prompts for testing
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]'

# Quiet mode (stats only)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["hello"] * 10' -q
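The prompt argument above accepts either plain text or a Python-like list expression. The sketch below shows one way such an argument could be expanded into a list of prompts. It is an illustration only, not bllm's actual parser: `expand_prompts`, the word list, and this `lorem` implementation are all hypothetical names chosen for the example.

```python
import random

# Hypothetical filler vocabulary; the real `lorem` helper may differ.
_WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()


def lorem(n):
    """Return n filler words joined by single spaces."""
    return " ".join(random.choice(_WORDS) for _ in range(n))


def expand_prompts(arg):
    """Turn a CLI prompt argument into a list of prompt strings (hypothetical)."""
    if not arg.lstrip().startswith("["):
        return [arg]  # plain text is treated as a single prompt
    # Evaluate the expression with builtins removed and only three names exposed.
    # Note: eval is not a security sandbox; only use this on trusted input.
    names = {
        "__builtins__": {},
        "lorem": lorem,
        "randint": random.randint,
        "range": range,
    }
    prompts = eval(arg, names)
    if not (isinstance(prompts, list) and all(isinstance(p, str) for p in prompts)):
        raise ValueError("expected a list of strings")
    return prompts


print(len(expand_prompts('["tell me about yourself"] * 20')))  # 20
```

Variable-length expressions such as `[lorem(randint(5,50)) for _ in range(20)]` go through the same path and yield 20 prompts of between 5 and 50 words each.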

Interactive chat

bllm chat Qwen/Qwen2.5-0.5B-Instruct
bllm chat Qwen/Qwen2.5-0.5B-Instruct --stream

Benchmarking against vLLM or Ollama

Both vLLM and Ollama expose OpenAI-compatible APIs. Use --api to benchmark against either.

vLLM (CUDA)

# Install
pip install -e ".[cuda,api]"
pip install vllm

# Start vLLM server (by default vLLM may claim up to 90% of GPU memory for weights, activations, and KV cache)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Or limit memory usage (useful when sharing GPU)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
    --gpu-memory-utilization 0.3 --max-model-len 2048

# Compare
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:8000/v1

Ollama (Mac/Linux)

# Install
pip install -e ".[api]"

# Start the Ollama server (it keeps running in the foreground; run the pull from a second terminal)
ollama serve
ollama pull qwen2.5:0.5b-instruct

# Compare (Ollama uses port 11434 by default)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate qwen2.5:0.5b-instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:11434/v1
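For reference, a request to an OpenAI-compatible endpoint can be sketched with the standard library alone. This is not how `--api` is implemented in bllm. It is a minimal illustration that assumes the server exposes `/completions` under the base URL and reports `usage.completion_tokens` in its response. The function names are hypothetical.

```python
import json
import time
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=128):
    """Build a POST request for an OpenAI-style /completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url.rstrip("/") + "/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response_json, elapsed_seconds):
    """Throughput from the server-reported completion token count."""
    return response_json["usage"]["completion_tokens"] / elapsed_seconds


def measure(base_url, model, prompt):
    """Send one request and return generated tokens per second (needs a running server)."""
    req = build_completion_request(base_url, model, prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return tokens_per_second(payload, time.perf_counter() - start)
```

With a server running, `measure("http://localhost:8000/v1", "Qwen/Qwen2.5-0.5B-Instruct", "hello")` would time a single request. A fair comparison against batched generation would need to issue the requests concurrently.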

Packed vs padded prefill (CUDA only)

# With Flash Attention (packed, efficient for variable-length prompts)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q

# Without packing (padded to max length)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q --no-pack
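To see why packing helps with variable-length prompts, compare how many token positions each layout processes during prefill. The sketch below is plain arithmetic and does not reflect bllm internals. The cumulative-offset list mirrors the `cu_seqlens` convention used by variable-length Flash Attention kernels, which is assumed here to be the relevant layout.

```python
from itertools import accumulate


def padded_tokens(lengths):
    """Token positions processed when every prompt is padded to the longest one."""
    return len(lengths) * max(lengths)


def packed_layout(lengths):
    """Total tokens and cumulative sequence offsets for a packed batch."""
    cu_seqlens = [0, *accumulate(lengths)]
    return cu_seqlens[-1], cu_seqlens


lengths = [5, 50, 12, 30]
total, cu_seqlens = packed_layout(lengths)
print(padded_tokens(lengths), total, cu_seqlens)
# 200 97 [0, 5, 55, 67, 97]
```

In this example the padded layout processes 200 positions for 97 real tokens, so more than half of the prefill work is spent on padding.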

RL Integration

The main advantage over vLLM for RL is in-process weight updates:

from bllm.engine.inference_engine import InferenceEngine

# Create engine
engine = InferenceEngine("Qwen/Qwen2.5-0.5B-Instruct", device="cuda")

# Generate rollouts
prompts = ["prompt1", "prompt2"]
outputs = engine.generate(prompts, max_new_tokens=128)

# Update weights after training step (no serialization, no IPC)
# new_state_dict is the state dict produced by your training step
engine.update_weights(new_state_dict)

# Generate with updated weights
outputs = engine.generate(prompts, max_new_tokens=128)
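The generate/update cycle above can be wrapped in a small loop. The sketch below relies only on the two engine methods shown in this section, `generate` and `update_weights`. The `train_step` callable and the function name are hypothetical placeholders for your own training code, which is assumed to return a new state dict.

```python
def rollout_train_loop(engine, prompts, train_step, num_steps, max_new_tokens=128):
    """Alternate between generating rollouts and pushing updated weights.

    engine      object with generate(prompts, max_new_tokens=...) and
                update_weights(state_dict), as in the example above
    train_step  hypothetical callable: (prompts, outputs) -> new state dict
    """
    for _ in range(num_steps):
        # Rollouts always come from the most recently pushed weights
        outputs = engine.generate(prompts, max_new_tokens=max_new_tokens)
        new_state_dict = train_step(prompts, outputs)
        # In-process update: no checkpoint is written and no server is restarted
        engine.update_weights(new_state_dict)
    # Final rollouts with the last set of weights
    return engine.generate(prompts, max_new_tokens=max_new_tokens)
```

Because everything runs in one process, the loop needs no synchronization beyond calling `update_weights` before the next `generate`.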

Performance

On Apple Silicon (M-series):

  • True GPU batching via MPS
  • 155+ tok/s for 20 batched prompts (vs Ollama's ~143 tok/s sequential)

On CUDA:

  • Higher throughput with optimized kernels
  • torch.compile support for additional speedup

Key optimizations:

  • Native GQA support via enable_gqa=True in scaled_dot_product_attention (PyTorch 2.5+)
  • View-based KV cache access (no copies for contiguous batches)
  • Batched tensor operations (minimal CPU-GPU sync)
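The view-based cache access mentioned above rests on a general PyTorch property: indexing a tensor with a slice returns a view, while indexing with a list or tensor of indices copies the data. The sketch below shows the contiguity check that makes the slice path possible. It is a generic illustration, not bllm's code, and `as_slice` is a hypothetical helper name.

```python
def as_slice(indices):
    """Return slice(start, stop) if indices are consecutive and ascending, else None."""
    if not indices:
        return None
    start = indices[0]
    if all(idx == start + offset for offset, idx in enumerate(indices)):
        return slice(start, start + len(indices))
    return None


# A plain list stands in for the cache's batch dimension here
cache_rows = list(range(10))

contiguous = as_slice([3, 4, 5])
print(contiguous, cache_rows[contiguous])  # slice(3, 6, None) [3, 4, 5]
print(as_slice([3, 5, 6]))                 # None
```

When the helper returns a slice, a tensor cache could be indexed as `cache[s]` without copying. When it returns None, the caller would fall back to gathering rows by index.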
