A readable LLM inference server implementing paged attention and continuous batching

llm-infer

Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.

Components:

  • CLI & Server - Single command to serve models via Ollama, vLLM, or native torch engine
  • Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
  • Native Engine - Custom torch implementation for learning and experimentation

Quick Start

pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"

Client Package

llm_infer.client is a Python client library that exposes one interface across LLM backends. It is built for autonomous agents and production use:

  • Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
  • Sync, async, streaming - All execution modes supported
  • Rate limiting - Per-backend request throttling
  • Retry with backoff - Configurable exponential backoff on failures
  • Model routing - Route requests to backends by model name
  • Extensible - Register custom backends via Factory.register()

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

# Async (async with / await need a coroutine; run it with asyncio.run(main()))
async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
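
The retry-with-backoff feature listed above can be pictured with a small standalone sketch. This is not llm_infer's implementation: backoff_delays and call_with_retry are hypothetical names, and the base delay, growth factor, and cap are assumed values chosen for illustration.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, factor: float = 2.0,
                   cap: float = 8.0) -> Iterator[float]:
    """Yield one delay per retry: base, base*factor, ... never above `cap`."""
    for attempt in range(retries):
        yield min(cap, base * factor ** attempt)


def call_with_retry(fn: Callable[[], T], retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    delays = backoff_delays(retries)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:  # retries exhausted: re-raise the last error
                raise
            sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production version would typically retry only on transient errors (timeouts, HTTP 429/5xx) and add jitter.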

Protocol Extensions

The server extends the OpenAI chat completions API:

Request - adds think and adapter fields:

{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}

Response - adds thinking in message and adapter metadata:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}

The client library exposes these as keyword arguments:

response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
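
For callers that talk to the server without the client library, the extension fields can be read directly from the response JSON. The sketch below parses the sample response shown above; ExtendedReply and parse_extended_response are hypothetical names, and treating thinking and adapter as optional is an assumption, not documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtendedReply:
    content: str
    thinking: Optional[str]
    adapter_fallback: Optional[bool]


def parse_extended_response(body: str) -> ExtendedReply:
    """Extract the answer plus the `thinking` and `adapter` extension fields."""
    data = json.loads(body)
    message = data["choices"][0]["message"]
    adapter = data.get("adapter") or {}  # assumed to be absent on plain responses
    return ExtendedReply(
        content=message["content"],
        thinking=message.get("thinking"),  # assumed optional
        adapter_fallback=adapter.get("fallback"),
    )


sample = """
{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}
"""
reply = parse_extended_response(sample)
print(reply.content, reply.adapter_fallback)  # prints: 345 False
```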

Multiple Backends

# Anthropic (async; call from inside a coroutine or via asyncio.run)
async def ask_anthropic(messages):
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        return await client.chat_async(messages)

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)
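
Model routing, listed among the client features, picks a backend from the model name. How llm_infer matches names is not described here, so the sketch below assumes longest-prefix matching; route_backend and the routing table are hypothetical and only illustrate the idea.

```python
from typing import Dict


def route_backend(model: str, routes: Dict[str, str], default: str) -> str:
    """Return the backend whose prefix is the longest match for `model`."""
    matches = [prefix for prefix in routes if model.startswith(prefix)]
    if not matches:
        return default
    return routes[max(matches, key=len)]


# Hypothetical routing table: model-name prefix -> backend label.
routes = {"claude-": "anthropic", "gpt-": "openai", "qwen": "local"}

print(route_backend("claude-sonnet-4-20250514", routes, default="local"))  # prints: anthropic
```

Longest-prefix matching lets a specific rule (for example a single fine-tuned model) override a broader family rule without depending on dictionary order.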

Engines

Engine            Description                  Install
ollama (default)  Wraps Ollama server          ollama.com
vllm              vLLM Python API              pip install vllm
vllm-server       vLLM HTTP subprocess         pip install vllm
native            Custom torch implementation  pip install llm-infer[runtime]

llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native

Native Engine

The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. Useful for learning how LLM inference works or experimenting with custom modifications.

pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
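
For readers new to PagedAttention, the core bookkeeping idea is that the KV cache is split into fixed-size blocks and each sequence keeps a table of the blocks it owns, so memory is claimed one block at a time instead of being reserved up front. The sketch below shows only that bookkeeping in plain Python. It is not the native engine's code: BlockAllocator and its methods are hypothetical, and no tensors or attention kernels are involved.

```python
from typing import Dict, List


class BlockAllocator:
    """Hands out fixed-size KV-cache blocks and keeps a block table per sequence."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}       # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token; return the physical block holding it."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size]

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks return to the pool as soon as a sequence finishes, a scheduler can admit new requests mid-flight, which is what makes continuous batching practical.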

Configuration

# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000

Per-model overrides in etc/models.yaml:

models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
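
One way to read the per-model file is that top-level keys apply to every engine while a nested engine section applies only when that engine is selected. That reading is an assumption, not documented behavior, and effective_settings below is a hypothetical helper that operates on the already-parsed YAML (plain dicts), not part of llm-infer.

```python
from typing import Any, Dict

# Engine names taken from the Engines table.
ENGINES = {"ollama", "vllm", "vllm-server", "native"}


def effective_settings(model_cfg: Dict[str, Any], engine: str) -> Dict[str, Any]:
    """Merge shared keys with the selected engine's section (engine keys win)."""
    settings = {k: v for k, v in model_cfg.items() if k not in ENGINES}
    section = model_cfg.get(engine)
    if isinstance(section, dict):  # skip scalar entries such as a name mapping
        settings.update(section)
    return settings


# The etc/models.yaml example above, as parsed dicts.
models = {
    "qwen2.5-7b": {"max_model_len": 8192, "vllm": {"enforce_eager": True}},
    "qwen2.5:7b": {"ollama": "qwen2.5:7b"},
}

print(effective_settings(models["qwen2.5-7b"], "vllm"))
# prints: {'max_model_len': 8192, 'enforce_eager': True}
```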

API Endpoints

Endpoint                   Description
POST /v1/chat/completions  Chat completion (OpenAI-compatible)
POST /v1/completions       Text completion (OpenAI-compatible)
GET /v1/models             List available models
GET /health                Health check
GET /metrics               Prometheus metrics
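
The chat endpoint can also be called with nothing but the standard library. The sketch below builds the POST request for /v1/chat/completions, including the optional think and adapter extension fields. build_chat_request is a hypothetical helper, and the base URL assumes the server is listening on port 8000 as in the configuration example.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_chat_request(base_url: str, messages: List[Dict[str, str]],
                       model: str = "default", think: bool = False,
                       adapter: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if think:
        payload["think"] = True
    if adapter is not None:
        payload["adapter"] = adapter
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",
    [{"role": "user", "content": "Hello!"}],
    think=True,
)

# Sending the request needs a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```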

Installation

pip install llm-infer              # Client only
pip install llm-infer[anthropic]   # With Anthropic support
pip install llm-infer[saia]        # With llm-saia integration
pip install llm-infer[runtime]     # With native engine (torch)

License

Apache License 2.0
