A readable LLM inference server implementing paged attention and continuous batching

llm-infer

Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.

Components:

  • CLI & Server - Single command to serve models via Ollama, vLLM, or native torch engine
  • Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
  • Native Engine - Custom torch implementation for learning and experimentation

Quick Start

pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"

Client Package

llm_infer.client is a Python client library for LLM inference with a unified interface across backends. Built for autonomous agents and production use:

  • Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
  • Sync, async, streaming - All execution modes supported
  • Rate limiting - Per-backend request throttling
  • Retry with backoff - Configurable exponential backoff on failures
  • Model routing - Route requests to backends by model name
  • Extensible - Register custom backends via Factory.register()

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

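# Async (async with is only valid inside a coroutine)
import asyncio

async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())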
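
The "Retry with backoff" feature listed above follows a common pattern. The sketch below is a generic illustration of exponential backoff with jitter, not the library's internal code; the function name `retry_with_backoff` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts calls have failed.

    Hypothetical helper for illustration only; llm-infer's own retry
    configuration may look different.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles on every failure (0.5, 1, 2, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so that many
            # clients do not retry at the same instant.
            sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

A call such as `retry_with_backoff(lambda: client.chat(messages))` would wrap the synchronous example above. The `sleep` parameter exists so the delay can be stubbed out in tests.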

Protocol Extensions

The server extends the OpenAI chat completions API:

Request - adds think and adapter fields:

{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}

Response - adds thinking in message and adapter metadata:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}

The client library exposes these as keyword arguments:

response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
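
For callers that talk to the server over plain HTTP instead of the client library, the extension fields can be read straight from the response body. This is a minimal sketch based on the JSON example above; `ExtendedReply` and `parse_extended_reply` are hypothetical names, and treating `thinking` and `adapter` as optional is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtendedReply:
    content: str
    thinking: Optional[str]
    adapter_requested: Optional[str]
    adapter_actual: Optional[str]
    adapter_fallback: bool


def parse_extended_reply(body: str) -> ExtendedReply:
    """Extract the standard answer plus the think/adapter extensions."""
    data = json.loads(body)
    message = data["choices"][0]["message"]
    # Assumption: the adapter block is absent when no adapter was requested.
    adapter = data.get("adapter") or {}
    return ExtendedReply(
        content=message["content"],
        thinking=message.get("thinking"),
        adapter_requested=adapter.get("requested"),
        adapter_actual=adapter.get("actual"),
        adapter_fallback=bool(adapter.get("fallback", False)),
    )
```

Checking `adapter_fallback` (or comparing `adapter_requested` with `adapter_actual`) tells the caller whether the requested adapter was actually the one used.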

Multiple Backends

messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (async; call from a coroutine or via asyncio.run)
async def ask_anthropic():
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        return await client.chat_async(messages)

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)
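
With several backends configured, requests can be routed by model name. The library's own routing rules are not shown here, so the following is only a sketch of one plausible scheme (longest-prefix match); `PrefixRouter` and the prefix table are hypothetical.

```python
from typing import Dict, Optional


class PrefixRouter:
    """Map a model name to a backend label using longest-prefix match."""

    def __init__(self, routes: Dict[str, str], default: Optional[str] = None) -> None:
        self._routes = routes
        self._default = default

    def backend_for(self, model: str) -> str:
        # Collect every configured prefix that the model name starts with.
        matches = [prefix for prefix in self._routes if model.startswith(prefix)]
        if matches:
            # The most specific (longest) prefix wins.
            return self._routes[max(matches, key=len)]
        if self._default is None:
            raise KeyError(f"no backend configured for model {model!r}")
        return self._default


# Illustrative routing table; the labels are placeholders, not llm-infer names.
router = PrefixRouter(
    {"claude-": "anthropic", "gpt-": "openai", "qwen2.5": "local"},
    default="local",
)
```

The returned label could then select which factory method to call, for example `factory.anthropic(...)` or `factory.openai(...)`.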

Engines

Engine            Description                  Install
ollama (default)  Wraps Ollama server          ollama.com
vllm              vLLM Python API              pip install vllm
vllm-server       vLLM HTTP subprocess         pip install vllm
native            Custom torch implementation  pip install llm-infer[runtime]

llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native

Native Engine

The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. Useful for learning how LLM inference works or experimenting with custom modifications.

pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
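
The core idea behind PagedAttention is that the KV cache is split into fixed-size blocks, and each sequence keeps a table mapping its logical positions to physical blocks. The toy allocator below sketches only that bookkeeping, with no tensors or attention math; `BlockAllocator` is a hypothetical name and is not taken from the native engine's source.

```python
from typing import Dict, List, Tuple


class BlockAllocator:
    """Toy paged KV-cache bookkeeping: fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))  # unused physical block ids
        self.tables: Dict[str, List[int]] = {}          # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}               # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset in block)."""
        table = self.tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())
        self.tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, memory is not reserved up front for the maximum sequence length, and freed blocks are immediately available to other sequences.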

Configuration

# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000

Per-model overrides in etc/models.yaml:

models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
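
One way to read the per-model file is "shared keys plus an engine-specific section". The sketch below illustrates that interpretation on an already-parsed dict; the merge rule, the `settings_for` helper, and the `name` key used for the scalar Ollama mapping are all assumptions, not documented llm-infer behavior.

```python
from typing import Any, Dict

# Engine names from the Engines table; treating them as sub-section keys is an assumption.
ENGINES = {"ollama", "vllm", "vllm-server", "native"}

# Stand-in for the parsed contents of etc/models.yaml shown above.
MODELS_CFG: Dict[str, Any] = {
    "models": {
        "qwen2.5-7b": {"max_model_len": 8192, "vllm": {"enforce_eager": True}},
        "qwen2.5:7b": {"ollama": "qwen2.5:7b"},
    }
}


def settings_for(cfg: Dict[str, Any], model: str, engine: str) -> Dict[str, Any]:
    """Resolve the effective settings for one model on one engine (hypothetical rule)."""
    entry = cfg.get("models", {}).get(model, {})
    # Shared keys: everything that is not an engine-named section.
    merged = {k: v for k, v in entry.items() if k not in ENGINES}
    section = entry.get(engine)
    if isinstance(section, dict):
        merged.update(section)      # engine-specific keys override shared ones
    elif section is not None:
        merged["name"] = section    # scalar form, e.g. the Ollama model name mapping
    return merged
```

Under this reading, `max_model_len` applies to every engine while `enforce_eager` only takes effect when the model is served with vLLM.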

API Endpoints

Endpoint                   Description
POST /v1/chat/completions  Chat completion (OpenAI-compatible)
POST /v1/completions       Text completion (OpenAI-compatible)
GET /v1/models             List available models
GET /health                Health check
GET /metrics               Prometheus metrics
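
The chat endpoint can also be called without the client library. The sketch below uses only the Python standard library and assumes the server is listening on localhost port 8000, as in the configuration example; `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:8000"  # assumed; matches api.port in the example config


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "default",
    think: Optional[bool] = None,
    adapter: Optional[str] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST /v1/chat/completions request, including the optional extensions."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if think is not None:
        payload["think"] = think      # llm-infer extension field
    if adapter is not None:
        payload["adapter"] = adapter  # llm-infer extension field
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_chat_request([{"role": "user", "content": "Hello!"}]))` returns the decoded response as a dict.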

Installation

pip install llm-infer              # Client only
pip install llm-infer[anthropic]   # With Anthropic support
pip install llm-infer[saia]        # With llm-saia integration
pip install llm-infer[runtime]     # With native engine (torch)

License

Apache License 2.0
