
A readable LLM inference server implementing paged attention and continuous batching


llm-infer


Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.

Components:

  • CLI & Server - Single command to serve models via Ollama, vLLM, or the native torch engine
  • Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
  • Native Engine - Custom torch implementation for learning and experimentation

Quick Start

pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"

Client Package

llm_infer.client is a Python client library that exposes one interface across LLM backends. It is built for autonomous agents and production use, and provides:

  • Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
  • Sync, async, streaming - All execution modes supported
  • Rate limiting - Per-backend request throttling
  • Retry with backoff - Configurable exponential backoff on failures
  • Model routing - Route requests to backends by model name
  • Extensible - Register custom backends via Factory.register()

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

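# Async (async with must run inside a coroutine)
import asyncio

async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())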
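
The "Retry with backoff" feature listed above refers to a general technique, and a small standalone sketch may make it concrete. This is not the library's implementation: the function names `backoff_delays` and `call_with_retry`, their parameters, and the choice of "full jitter" are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0) -> list[float]:
    """Return the un-jittered delay before each retry: base * factor**n, capped."""
    return [min(max_delay, base * factor ** n) for n in range(retries)]


def call_with_retry(fn: Callable[[], T], retries: int = 3, base: float = 0.5,
                    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on the listed exceptions with exponential backoff."""
    delays = backoff_delays(retries, base=base)
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random amount up to the exponential delay.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Any zero-argument callable can be wrapped this way, for example `call_with_retry(lambda: client.chat(messages))`. Which exceptions the real client raises is not documented here, so `retry_on` would need adjusting.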

Protocol Extensions

The server extends the OpenAI chat completions API:

Request - adds think and adapter fields:

{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}

Response - adds a thinking field to the message and top-level adapter metadata:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}

The client library exposes these as keyword arguments:

response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
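
For callers that talk to the server without the client library, the extension fields can be handled with plain dictionaries. The sketch below builds a request body and reads the extra response fields using only the JSON shapes shown above. The helper names `build_chat_payload`, `parse_chat_response`, and `ChatResult` are hypothetical and not part of llm-infer.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


def build_chat_payload(messages: list[dict[str, str]], model: str = "default",
                       think: Optional[bool] = None,
                       adapter: Optional[str] = None) -> dict[str, Any]:
    """Build a chat completions body; extension fields are added only when set."""
    payload: dict[str, Any] = {"model": model, "messages": messages}
    if think is not None:
        payload["think"] = think
    if adapter is not None:
        payload["adapter"] = adapter
    return payload


@dataclass
class ChatResult:
    content: str
    thinking: Optional[str]
    adapter_fallback: Optional[bool]


def parse_chat_response(body: dict[str, Any]) -> ChatResult:
    """Extract the answer plus the optional thinking and adapter extensions."""
    message = body["choices"][0]["message"]
    adapter = body.get("adapter") or {}
    return ChatResult(
        content=message["content"],
        thinking=message.get("thinking"),  # absent on plain OpenAI responses
        adapter_fallback=adapter.get("fallback"),
    )


raw = (
    '{"choices": [{"message": {"role": "assistant", "content": "345", '
    '"thinking": "Let me calculate step by step..."}}], '
    '"adapter": {"requested": "my-lora-adapter", "actual": "my-lora-adapter", '
    '"fallback": false}}'
)
result = parse_chat_response(json.loads(raw))
```

Using `.get()` for the extension fields keeps the parser usable against a stock OpenAI-compatible server, where `thinking` and `adapter` are simply missing.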

Multiple Backends

messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (async; run with asyncio.run(ask_anthropic()))
async def ask_anthropic():
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        return await client.chat_async(messages)

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)
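
Routing requests to a backend by model name can be pictured as a small pattern table. The sketch below is one possible shape for such a router, not the library's routing code. The `ModelRouter` class, its glob-pattern rules, and the backend labels are assumptions for illustration.

```python
from fnmatch import fnmatchcase


class ModelRouter:
    """Map model-name glob patterns to backend names; first match wins."""

    def __init__(self, default: str) -> None:
        self._default = default
        self._rules: list[tuple[str, str]] = []

    def add(self, pattern: str, backend: str) -> None:
        """Register a rule; rules are checked in insertion order."""
        self._rules.append((pattern, backend))

    def resolve(self, model: str) -> str:
        """Return the backend for a model, or the default if nothing matches."""
        for pattern, backend in self._rules:
            if fnmatchcase(model, pattern):
                return backend
        return self._default


router = ModelRouter(default="local")
router.add("claude-*", "anthropic")
router.add("gpt-*", "openai")
```

With these rules, `router.resolve("claude-sonnet-4-20250514")` selects the Anthropic backend, while a name such as `qwen2.5:7b` falls through to the local default.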

Engines

| Engine | Description | Install |
|---|---|---|
| ollama (default) | Wraps Ollama server | ollama.com |
| vllm | vLLM Python API | pip install vllm |
| vllm-server | vLLM HTTP subprocess | pip install vllm |
| native | Custom torch implementation | pip install llm-infer[runtime] |

llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native

Native Engine

The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. Useful for learning how LLM inference works or experimenting with custom modifications.

pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
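
The core idea behind PagedAttention is that the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks. The following pure-Python sketch shows only that bookkeeping. It is not the native engine's code, and the class `PagedKVAllocator` and its method names are hypothetical.

```python
class PagedKVAllocator:
    """Hand out fixed-size KV-cache blocks and track a block table per sequence."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:             # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated one at a time as a sequence grows, memory wasted per sequence is bounded by one partially filled block, and freed blocks can be reused immediately by other sequences in the batch.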

Configuration

# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000

Per-model overrides in etc/models.yaml:

models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
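
One plausible reading of the overrides above is that top-level keys such as max_model_len apply to every engine, a nested mapping under an engine name applies only to that engine, and a plain string under an engine name is the model name to use on that backend. The sketch below encodes that reading. It is an assumption about the semantics, not the project's loader, and `resolve_model_options` and `backend_model_name` are hypothetical helpers working on an already-parsed dictionary.

```python
from typing import Any

ENGINES = {"ollama", "vllm", "vllm-server", "native"}

# The models.yaml content above, as it would look after parsing.
models: dict[str, dict[str, Any]] = {
    "qwen2.5-7b": {"max_model_len": 8192, "vllm": {"enforce_eager": True}},
    "qwen2.5:7b": {"ollama": "qwen2.5:7b"},
}


def resolve_model_options(models: dict[str, dict[str, Any]],
                          name: str, engine: str) -> dict[str, Any]:
    """Merge shared options with the engine-specific block (engine block wins)."""
    entry = models.get(name, {})
    options = {k: v for k, v in entry.items() if k not in ENGINES}
    specific = entry.get(engine)
    if isinstance(specific, dict):
        options.update(specific)
    return options


def backend_model_name(models: dict[str, dict[str, Any]],
                       name: str, engine: str) -> str:
    """Return the backend's own name for a model, defaulting to the given name."""
    mapped = models.get(name, {}).get(engine)
    return mapped if isinstance(mapped, str) else name
```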

API Endpoints

| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | Chat completion (OpenAI-compatible) |
| POST /v1/completions | Text completion (OpenAI-compatible) |
| GET /v1/models | List available models |
| GET /health | Health check |
| GET /metrics | Prometheus metrics |
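
The endpoints can be exercised with nothing but the Python standard library. The sketch below assumes a server listening on http://localhost:8000, the host and port from the sample configuration. The helpers `make_request` and `send` are hypothetical, and the response body formats of /health and /metrics are not specified here, so the bodies are returned as raw text.

```python
import json
import urllib.request
from typing import Any, Optional


def make_request(base_url: str, method: str, path: str,
                 body: Optional[dict[str, Any]] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the endpoints above."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the decoded response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


health = make_request("http://localhost:8000", "GET", "/health")
chat = make_request(
    "http://localhost:8000", "POST", "/v1/chat/completions",
    {"model": "default", "messages": [{"role": "user", "content": "Hello!"}]},
)
# With a server running: print(send(health)); print(send(chat))
```

Separating request construction from sending keeps the example importable without a live server and makes the request shape easy to inspect.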

Installation

pip install llm-infer              # Client only
pip install llm-infer[anthropic]   # With Anthropic support
pip install llm-infer[saia]        # With llm-saia integration
pip install llm-infer[runtime]     # With native engine (torch)

License

Apache License 2.0
