A readable LLM inference server implementing paged attention and continuous batching
llm-infer
Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.
Components:
- CLI & Server - Single command to serve models via Ollama, vLLM, or native torch engine
- Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
- Native Engine - Custom torch implementation for learning and experimentation
Quick Start
pip install llm-infer
# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b
# Query
llm-infer query "What is the capital of France?"
Client Package
llm_infer.client is a Python client library for LLM inference with a unified interface across
backends. Built for autonomous agents and production use:
- Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
- Sync, async, streaming - All execution modes supported
- Rate limiting - Per-backend request throttling
- Retry with backoff - Configurable exponential backoff on failures
- Model routing - Route requests to backends by model name
- Extensible - Register custom backends via Factory.register()
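To make the "retry with backoff" bullet concrete, here is a minimal sketch of exponential backoff with jitter in plain Python. It does not use llm_infer; the function name `retry_with_backoff` and all of its parameters are hypothetical, and the library's own retry configuration may look quite different.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Illustrative helper only, not part of llm_infer. The delay before
    retry n is base_delay * 2**n, capped at max_delay, scaled by random
    jitter (0.5x to 1x) so concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

Any zero-argument callable can be wrapped this way, for example `retry_with_backoff(lambda: client.chat(messages))`, provided the exceptions listed in `retry_on` match what the wrapped call actually raises.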
import asyncio

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

# Sync
with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

# Async ("async with" is only valid inside a coroutine)
async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())
Protocol Extensions
The server extends the OpenAI chat completions API:
Request - adds think and adapter fields:
{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}
Response - adds thinking in message and adapter metadata:
{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}
The client library exposes these as keyword arguments:
response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking) # Reasoning content
print(response.content) # Final answer
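For callers that talk to the server without the client library, the extended request and response shapes can be handled with the standard library alone. This is a sketch under assumptions: the helper names `build_chat_request` and `parse_chat_response` are hypothetical, and treating `thinking` and `adapter` as optional in the response is a guess based on the example payloads in this section.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="default", think=False, adapter=None):
    """Build a POST request for the chat completions endpoint.

    Hypothetical helper: adds the `think` and `adapter` extension fields
    only when the caller asks for them.
    """
    payload = {"model": model, "messages": messages}
    if think:
        payload["think"] = True
    if adapter is not None:
        payload["adapter"] = adapter
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_chat_response(body):
    """Pull content, thinking, and adapter metadata out of a response dict."""
    message = body["choices"][0]["message"]
    return {
        "content": message.get("content"),
        "thinking": message.get("thinking"),  # assumed absent when think is off
        "adapter": body.get("adapter"),       # assumed absent when no adapter is used
    }
```

With a server running, the request would be sent with `urllib.request.urlopen(req)` and the body decoded with `json.load` before being passed to `parse_chat_response`. Here `base_url` is expected to include the `/v1` prefix, as in the client examples.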
Multiple Backends
messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (async, so this must run inside a coroutine)
async def ask_anthropic(messages):
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        return await client.chat_async(messages)

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)
Engines
| Engine | Description | Install |
|---|---|---|
| ollama (default) | Wraps Ollama server | ollama.com |
| vllm | vLLM Python API | pip install vllm |
| vllm-server | vLLM HTTP subprocess | pip install vllm |
| native | Custom torch implementation | pip install llm-infer[runtime] |
llm-infer serve --model qwen2.5:7b # Ollama
llm-infer serve --engine vllm --model-path /path/to/model # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native
Native Engine
The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. Useful for learning how LLM inference works or experimenting with custom modifications.
pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
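As a rough illustration of the idea behind PagedAttention, the sketch below shows the bookkeeping side only: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks. This is a toy model written for this README, not the native engine's code; the class `BlockAllocator` and its methods are hypothetical, and no tensors or attention kernels are involved.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator (illustrative only, not the native engine)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # sequence id -> list of physical block ids
        self.lengths = {}                    # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated one at a time as a sequence grows and returned as soon as it finishes, sequences of very different lengths can share one pool without reserving memory for the maximum context length up front.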
Configuration
# etc/llm-infer.yaml
backends:
  engine: ollama
models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5
api:
  host: 0.0.0.0
  port: 8000
Per-model overrides in etc/models.yaml:
models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true
  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
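One plausible reading of the per-model overrides is that top-level keys apply to every engine, while a key named after an engine applies only to that engine. The sketch below encodes that reading over a plain dict mirroring etc/models.yaml. It is an assumption, not the project's documented merge logic: the function `resolve_model`, the `ENGINE_KEYS` tuple, and the `engine_model` result key are all hypothetical.

```python
# Keys assumed (not confirmed by the project) to introduce engine-specific settings.
ENGINE_KEYS = ("ollama", "vllm", "vllm-server", "native")


def resolve_model(models, name, engine):
    """Return the effective settings for one model on one engine.

    Hypothetical merge rule: shared keys first, then the selected
    engine's overrides on top.
    """
    entry = models.get(name, {})
    # Shared settings: everything that is not an engine-specific key.
    settings = {k: v for k, v in entry.items() if k not in ENGINE_KEYS}
    override = entry.get(engine)
    if isinstance(override, dict):
        settings.update(override)            # e.g. vllm: {enforce_eager: true}
    elif isinstance(override, str):
        settings["engine_model"] = override  # e.g. ollama: qwen2.5:7b (name mapping)
    return settings


models = {
    "qwen2.5-7b": {"max_model_len": 8192, "vllm": {"enforce_eager": True}},
    "qwen2.5:7b": {"ollama": "qwen2.5:7b"},
}
```

Under this reading, `resolve_model(models, "qwen2.5-7b", "vllm")` combines `max_model_len` with `enforce_eager`, while the same model on another engine keeps only `max_model_len`.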
API Endpoints
| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | Chat completion (OpenAI-compatible) |
| POST /v1/completions | Text completion (OpenAI-compatible) |
| GET /v1/models | List available models |
| GET /health | Health check |
| GET /metrics | Prometheus metrics |
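A common use of the health and model-listing endpoints is to wait for the server to come up before sending traffic. The following sketch uses only the standard library. The helper names are hypothetical, and it assumes that a healthy server answers GET /health with HTTP 200 and that GET /v1/models returns a JSON body, neither of which is spelled out above.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx status


def wait_until_healthy(base_url, attempts=30, interval=1.0, probe=is_healthy, sleep=time.sleep):
    """Poll the health endpoint; True once it responds, False if attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(interval)
    return False


def list_models(base_url):
    """Fetch GET /v1/models and return the parsed JSON body."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/v1/models") as resp:
        return json.load(resp)
```

Here `base_url` is the server root without the `/v1` prefix, for example `http://localhost:8000`, since `/health` sits outside the versioned API in the table above.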
Installation
pip install llm-infer # Client only
pip install llm-infer[anthropic] # With Anthropic support
pip install llm-infer[saia] # With llm-saia integration
pip install llm-infer[runtime] # With native engine (torch)
License
Apache License 2.0