A readable LLM inference server implementing paged attention and continuous batching

llm-infer

Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine behind a single interface.

Components:

CLI & Server - Single command to serve models via Ollama, vLLM, or native torch engine
Client Package - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
Native Engine - Custom torch implementation for learning and experimentation

Quick Start

pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"

Client Package

llm_infer.client is a Python client library for LLM inference with a unified interface across backends. Built for autonomous agents and production use:

Multiple backends - OpenAI, Anthropic, and any OpenAI-compatible API
Sync, async, streaming - All execution modes supported
Rate limiting - Per-backend request throttling
Retry with backoff - Configurable exponential backoff on failures
Model routing - Route requests to backends by model name
Extensible - Register custom backends via Factory.register()
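
The retry behaviour listed above can be pictured with a small, generic sketch. This is not llm-infer's implementation: `retry_with_backoff`, its parameters, and the `flaky` function are hypothetical names chosen for illustration, and the wait is injected through a `sleep` callable so the example runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential delay, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits: list[float] = []  # records the delays instead of sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

Capping the delay and adding jitter are common choices for this pattern; whether llm-infer's own retry settings expose the same knobs is not stated here.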

from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

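# Async (async with / await must run inside a coroutine)
import asyncio

async def main():
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())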

Protocol Extensions

The server extends the OpenAI chat completions API:

Request - adds think and adapter fields:

{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}

Response - adds thinking in message and adapter metadata:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}

The client library exposes these as keyword arguments:

response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
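
For callers that talk to the server without the client library, the extra fields can be read straight from the JSON body. The sketch below uses only the standard library and the example response shown earlier in this section; `summarize_response` is a hypothetical helper, and treating the extension fields as optional is an assumption about how a plain OpenAI-compatible server would respond.

```python
import json

# Response body copied from the example in this section.
raw = """
{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}
"""


def summarize_response(payload: dict) -> dict:
    """Pull the answer plus the extension fields out of a response body."""
    message = payload["choices"][0]["message"]
    adapter = payload.get("adapter") or {}
    return {
        "content": message["content"],
        # Assumed optional: a server without the extensions would omit these.
        "thinking": message.get("thinking"),
        "adapter": adapter.get("actual"),
        "fallback": adapter.get("fallback", False),
    }


summary = summarize_response(json.loads(raw))
print(summary["content"], summary["adapter"], summary["fallback"])
```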

Multiple Backends

messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (async; run inside a coroutine)
async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
    response = await client.chat_async(messages)

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)

Engines

Engine	Description	Install
`ollama` (default)	Wraps Ollama server	ollama.com
`vllm`	vLLM Python API	`pip install vllm`
`vllm-server`	vLLM HTTP subprocess	`pip install vllm`
`native`	Custom torch implementation	`pip install llm-infer[runtime]`

llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native

Native Engine

The native engine is a from-scratch PyTorch implementation that uses PagedAttention and FlashInfer. It is useful for learning how LLM inference works or for experimenting with custom modifications.

pip install llm-infer[runtime]
llm-infer serve --engine native --model-path /path/to/model
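
To give a feel for the idea behind paged attention, here is a toy sketch of the bookkeeping only: the KV cache is split into fixed-size blocks, and each sequence keeps a table that maps its tokens to physical blocks. It is not taken from the native engine; `BlockTable` and its methods are hypothetical names, and no tensors or attention kernels are involved.

```python
class BlockTable:
    """Toy bookkeeping for a paged KV cache (no tensors involved)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.tables: dict[str, list[int]] = {}      # seq id -> block ids
        self.lengths: dict[str, int] = {}           # seq id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one token; return (physical block, offset)."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # Last block is full (or the sequence is new): take a fresh one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return self.tables[seq_id][-1], length % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = BlockTable(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]  # spans two blocks
slots_b = [cache.append_token("b") for _ in range(1)]  # uses one block
cache.release("a")                                     # blocks return to the pool
print(slots_a, slots_b, cache.free_blocks)
```

Because blocks are allocated on demand and released as soon as a sequence finishes, a scheduler that admits new requests between decoding steps (continuous batching) can reuse the freed memory immediately.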

Configuration

# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000

Per-model overrides in etc/models.yaml:

models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping

API Endpoints

Endpoint	Description
`POST /v1/chat/completions`	Chat completion (OpenAI-compatible)
`POST /v1/completions`	Text completion (OpenAI-compatible)
`GET /v1/models`	List available models
`GET /health`	Health check
`GET /metrics`	Prometheus metrics
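
As a rough illustration of calling the chat endpoint without the client library, the sketch below builds the HTTP request with the standard library. The base URL assumes a server on localhost port 8000, matching the configuration example; `build_chat_request` is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: local server on the configured port


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")
print(request.get_method(), request.full_url)

# With a server running, the request could be sent like this:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```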

Installation

pip install llm-infer              # Client only
pip install llm-infer[anthropic]   # With Anthropic support
pip install llm-infer[saia]        # With llm-saia integration
pip install llm-infer[runtime]     # With native engine (torch)

License

Apache License 2.0

Release history

Version	Released
0.4.0 (this version)	May 10, 2026
0.3.0	Apr 13, 2026
0.2.0	Mar 14, 2026
0.1.1	Feb 26, 2026
0.1.0	Feb 21, 2026

File details

Details for the file llm_infer-0.4.0.tar.gz.

File metadata

Download URL: llm_infer-0.4.0.tar.gz
Upload date: May 10, 2026
Size: 410.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_infer-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`3e82f410ef313832f217884a7edfc5915f99a0dc96b7444c91a3ba6da74fc368`
MD5	`c2cd746c152940e331b1326d33adf44d`
BLAKE2b-256	`ed5871034e65bf370cc5ed3eb65727f3ff4ee1306942056dc18f71719c3f3501`
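
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the demo hashes a temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file. For a real check, point the path at the downloaded
# llm_infer-0.4.0.tar.gz and pass the SHA256 value from the table above.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    demo_ok = matches(sample, hashlib.sha256(b"hello").hexdigest())
    demo_bad = matches(sample, "0" * 64)
print(demo_ok, demo_bad)
```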

Provenance

The following attestation bundles were made for llm_infer-0.4.0.tar.gz:

Publisher: release.yml on llm-works/llm-infer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_infer-0.4.0.tar.gz
- Subject digest: 3e82f410ef313832f217884a7edfc5915f99a0dc96b7444c91a3ba6da74fc368
- Sigstore transparency entry: 1489371341
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: llm-works/llm-infer@d219597b69bde7a547a29eceb992d194e47fd8f0
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/llm-works
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d219597b69bde7a547a29eceb992d194e47fd8f0
- Trigger Event: push

File details

Details for the file llm_infer-0.4.0-py3-none-any.whl.

File metadata

Download URL: llm_infer-0.4.0-py3-none-any.whl
Upload date: May 10, 2026
Size: 283.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_infer-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`79dd058c45fec83339487c115be5724ae8cf5678a64fce771467b59c38a9472c`
MD5	`afe5cfe9115e66b792075d05d9118782`
BLAKE2b-256	`8ffc09f66b9622d7eeb774293d2e7454e7a4b9f5c5b40ed08ca92d4d2317e300`

Provenance

The following attestation bundles were made for llm_infer-0.4.0-py3-none-any.whl:

Publisher: release.yml on llm-works/llm-infer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_infer-0.4.0-py3-none-any.whl
- Subject digest: 79dd058c45fec83339487c115be5724ae8cf5678a64fce771467b59c38a9472c
- Sigstore transparency entry: 1489372072
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: llm-works/llm-infer@d219597b69bde7a547a29eceb992d194e47fd8f0
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/llm-works
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d219597b69bde7a547a29eceb992d194e47fd8f0
- Trigger Event: push
