A readable LLM inference server implementing paged attention and continuous batching

llm-infer

Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.

Components:

  • CLI & Server - Single command to serve models via Ollama, vLLM, or the native torch engine
  • Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
  • Native Engine - Custom torch implementation for learning and experimentation

Quick Start

pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"

Client Package

llm_infer.client is a Python client library for LLM inference with a unified interface across backends. Built for autonomous agents and production use:

  • Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
  • Sync, async, streaming - All execution modes supported
  • Rate limiting - Per-backend request throttling
  • Retry with backoff - Configurable exponential backoff on failures
  • Model routing - Route requests to backends by model name
  • Extensible - Register custom backends via Factory.register()

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

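# Async (async with / await must run inside a coroutine)
import asyncio

async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())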
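
The retry-with-backoff feature listed above follows a common pattern. The sketch below is a generic illustration of exponential backoff with jitter, not llm-infer's own implementation; `backoff_delays`, `retry_with_backoff`, and their parameters are hypothetical names chosen for this example.

```python
import random
import time


def backoff_delays(retries, base=0.5, factor=2.0, max_delay=30.0):
    """Return the un-jittered delay before each retry: base * factor**i, capped."""
    return [min(base * factor**i, max_delay) for i in range(retries)]


def retry_with_backoff(fn, retries=3, base=0.5, factor=2.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call fn(); on an exception, wait and retry up to `retries` more times.

    Hypothetical helper for illustration only.
    """
    delays = backoff_delays(retries, base, factor, max_delay)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random amount up to the computed delay.
            sleep(random.uniform(0, delays[attempt]))
```

With a client from the examples above, a call could be wrapped as `retry_with_backoff(lambda: client.chat(messages))`. The `sleep` parameter exists only so the waiting can be replaced in tests.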

Protocol Extensions

The server extends the OpenAI chat completions API:

Request - adds think and adapter fields:

{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}

Response - adds a thinking field to the message and top-level adapter metadata:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}

The client library exposes these as keyword arguments:

response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
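
For callers that do not use the client library, the same extension fields can be sent over plain HTTP. The following is a minimal sketch using only the standard library. It assumes a server reachable at a base URL such as http://localhost:8000 and a response shaped like the JSON above; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, think=False, adapter=None, model="default"):
    """Build a POST request for /v1/chat/completions with the extension fields."""
    payload = {"model": model, "messages": messages, "think": think}
    if adapter is not None:
        payload["adapter"] = adapter
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_chat_response(body):
    """Extract the answer, reasoning, and adapter metadata from a response dict."""
    message = body["choices"][0]["message"]
    return {
        "content": message["content"],
        # Assumption: "thinking" and "adapter" may be absent, so use .get().
        "thinking": message.get("thinking"),
        "adapter": body.get("adapter"),
    }


def chat(base_url, messages, **kwargs):
    """Send the request and parse the reply (requires a running server)."""
    req = build_chat_request(base_url, messages, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_chat_response(json.load(resp))
```

Checking `adapter["fallback"]` in the parsed result is one way to detect that the server did not use the adapter that was requested.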

Multiple Backends

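import asyncio

messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (async; must run inside a coroutine)
async def ask_anthropic():
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        return await client.chat_async(messages)

response = asyncio.run(ask_anthropic())

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)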
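
Model routing, as listed in the client features, means choosing a backend from the requested model name. The sketch below shows one simple way to do that with a longest-prefix lookup. It is not the library's routing logic; the `route` function, the prefixes, and the backend names are all hypothetical.

```python
def route(model, routes, default="openai"):
    """Return the backend whose prefix is the longest match for `model`."""
    best = None
    for prefix, backend in routes.items():
        if model.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, backend)
    return best[1] if best else default


# Hypothetical routing table: model-name prefix -> backend name.
ROUTES = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "qwen": "local",
}
```

Preferring the longest prefix lets a specific entry override a broader one without depending on dictionary order.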

Engines

Engine            Description                  Install
ollama (default)  Wraps Ollama server          ollama.com
vllm              vLLM Python API              pip install vllm
vllm-server       vLLM HTTP subprocess         pip install vllm
native            Custom torch implementation  pip install llm-infer[runtime]

llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native

Native Engine

The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. It is useful for learning how LLM inference works or for experimenting with custom modifications.

pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
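
For readers new to PagedAttention, the core idea is that the KV cache is split into fixed-size blocks and each sequence keeps a block table mapping its token positions to physical blocks. The toy below illustrates only that bookkeeping in pure Python. It is not the native engine's code, and the class names and block size are made up for this example.

```python
class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache is full")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, position):
        """Map a token position to (physical block id, offset within the block)."""
        block = self.block_table[position // self.allocator.block_size]
        return block, position % self.allocator.block_size

    def free(self):
        # Return all blocks to the pool when the request finishes.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0
```

Because blocks are allocated on demand and returned when a request finishes, memory wasted per sequence is bounded by one partially filled block, which is what makes it practical to keep many requests resident at once.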

Configuration

# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000

Per-model overrides in etc/models.yaml:

models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
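
One plausible reading of the overrides file is that top-level keys under a model apply to every engine, while a nested mapping under an engine name applies only to that engine. The sketch below encodes that reading against a dict that mirrors the YAML above. It is an assumption about the semantics, not the project's actual resolution logic, and `engine_options` is a hypothetical helper.

```python
# Engine names taken from the Engines table.
ENGINES = {"ollama", "vllm", "vllm-server", "native"}

# Mirrors the etc/models.yaml example above, already parsed into a dict.
MODELS = {
    "qwen2.5-7b": {
        "max_model_len": 8192,
        "vllm": {"enforce_eager": True},
    },
    "qwen2.5:7b": {
        "ollama": "qwen2.5:7b",
    },
}


def engine_options(models, name, engine):
    """Combine a model's shared keys with its overrides for one engine."""
    entry = models.get(name, {})
    # Keys that are not engine names are treated as engine-independent.
    options = {k: v for k, v in entry.items() if k not in ENGINES}
    specific = entry.get(engine)
    if isinstance(specific, dict):
        options.update(specific)  # engine-specific values win
    return options
```

Under this reading, `qwen2.5-7b` served with vLLM gets both `max_model_len` and `enforce_eager`, while the same model on another engine gets only `max_model_len`.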

API Endpoints

Endpoint                   Description
POST /v1/chat/completions  Chat completion (OpenAI-compatible)
POST /v1/completions       Text completion (OpenAI-compatible)
GET /v1/models             List available models
GET /health                Health check
GET /metrics               Prometheus metrics
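
The GET endpoints can be probed with the standard library alone. This is a minimal sketch: the helper names are hypothetical, and `model_ids` assumes that /v1/models returns the OpenAI-style list shape (`{"data": [{"id": ...}]}`), which the table above does not state explicitly.

```python
import json
import urllib.request


def endpoint(base_url, path):
    """Join the server base URL and an endpoint path."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get(base_url, path, timeout=5.0):
    """GET an endpoint and return the raw response body as text."""
    with urllib.request.urlopen(endpoint(base_url, path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def model_ids(models_payload):
    """Pull model ids out of an OpenAI-style /v1/models response (assumed shape)."""
    return [m["id"] for m in models_payload.get("data", [])]
```

Against a running server, `model_ids(json.loads(get("http://localhost:8000", "/v1/models")))` would list the served models, and `get("http://localhost:8000", "/metrics")` would return the Prometheus text output.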

Installation

pip install llm-infer              # Client only
pip install llm-infer[anthropic]   # With Anthropic support
pip install llm-infer[saia]        # With llm-saia integration
pip install llm-infer[runtime]     # With native engine (torch)

License

Apache License 2.0
