
minima-llm

Minimal async LLM backend with caching and batch execution.

Features

  • Zero Dependencies: Core package uses only Python stdlib (asyncio, urllib, sqlite3)
  • SQLite Cache: Automatic prompt caching with WAL mode for multi-process safety
  • Batch Execution: Worker pool pattern with heartbeat, failure tracking, and early abort
  • Rate Limiting: RPM pacing with server-learned limits from rate limit headers
  • Retry Logic: Exponential backoff with jitter, cooldown after overload
  • OpenAI Compatible: Works with any OpenAI-compatible endpoint
  • DSPy Integration: Optional adapter for DSPy framework (requires [dspy] extra)
  • Proxy Mode: OpenAI-compatible HTTP proxy server so any application (DSPy, LangChain, curl) gets caching and rate limiting

Installation

# Core only (no dependencies)
pip install minima-llm

# With DSPy support
pip install minima-llm[dspy]

# With YAML config support
pip install minima-llm[yaml]

# Development
pip install minima-llm[dev]

Quick Start

Basic Usage

import asyncio
from minima_llm import MinimaLlmConfig, OpenAIMinimaLlm, MinimaLlmRequest

async def main():
    # Configure from environment or explicit values
    config = MinimaLlmConfig(
        base_url="https://api.openai.com/v1",
        model="gpt-4",
        api_key="sk-...",
        cache_dir="./cache",
    )

    backend = OpenAIMinimaLlm(config)

    # Single request
    request = MinimaLlmRequest(
        request_id="q1",
        messages=[{"role": "user", "content": "What is 2+2?"}],
        temperature=0.0,
    )

    result = await backend.generate(request)
    print(result.text)

    await backend.aclose()

asyncio.run(main())

Batch Execution

import asyncio
from minima_llm import MinimaLlmConfig, OpenAIMinimaLlm, MinimaLlmRequest

async def main():
    config = MinimaLlmConfig.from_env()
    backend = OpenAIMinimaLlm(config)

    requests = [
        MinimaLlmRequest(
            request_id=f"q{i}",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(100)
    ]

    # Run batch with progress heartbeat
    results = await backend.run_batched(requests)

    for r in results:
        # only successful responses expose a .text attribute
        if hasattr(r, 'text'):
            print(f"{r.request_id}: {r.text[:50]}...")

    await backend.aclose()

asyncio.run(main())

With DSPy

import asyncio
import dspy
from minima_llm import MinimaLlmConfig, OpenAIMinimaLlm
from minima_llm.dspy_adapter import MinimaLlmDSPyLM

class QA(dspy.Signature):
    question = dspy.InputField()
    answer = dspy.OutputField()

async def main():
    config = MinimaLlmConfig.from_env()
    backend = OpenAIMinimaLlm(config)
    lm = MinimaLlmDSPyLM(backend)

    dspy.configure(lm=lm)

    predictor = dspy.ChainOfThought(QA)
    result = await predictor.acall(question="What is the capital of France?")
    print(result.answer)

    await backend.aclose()

asyncio.run(main())

Proxy Mode

minimallm-proxy starts a localhost HTTP server with an OpenAI-compatible API. Any application that speaks the OpenAI protocol can point to it and automatically benefit from minima-llm's prompt caching, rate limiting, backpressure, and retry logic.

Start the proxy

# Using environment variables (OPENAI_BASE_URL, OPENAI_MODEL, CACHE_DIR, etc.)
minimallm-proxy --port 8990

# With a YAML config file
minimallm-proxy --port 8990 --config config.yml

# Force all requests to use the configured OPENAI_MODEL (ignore client's model field)
minimallm-proxy --port 8990 --force-model

Send requests

Point any OpenAI-compatible client to http://localhost:8990/v1:

curl -X POST http://localhost:8990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

# With litellm / DSPy
import os
os.environ["OPENAI_API_BASE"] = "http://localhost:8990/v1"

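The official OpenAI Python client can also talk to the proxy directly, as in this sketch (the openai package is assumed to be installed separately; it is not a minima-llm dependency, and the placeholder API key assumes the proxy supplies the upstream credentials):

# Illustrative sketch: point the OpenAI Python client at the local proxy.
# Assumes `pip install openai`; the key below is a placeholder since the proxy
# is configured with the upstream credentials.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8990/v1", api_key="unused")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)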
Options

Flag            Default     Description
--host          127.0.0.1   Bind address
--port          8990        Listen port
--config / -c   (env vars)  YAML config file
--force-model   off         Ignore the client's model field, use OPENAI_MODEL

Supported endpoints

Endpoint                Method   Description
/v1/chat/completions    POST     Chat completions (non-streaming only)
/v1/models              GET      List the configured model

Streaming ("stream": true) is not supported and returns HTTP 400.

Batch Management

For long-running batch jobs using the OpenAI batch API, minima-llm provides batch state management with local state files for resumption after interruption.

Configuration

Enable Parasail batch mode in your config:

parasail:
  llm_batch_prefix: "my-project"  # Prefix for batch state files
  state_dir: "./batch-state"      # Directory for state files (defaults to cache_dir)
  poll_interval_s: 30             # How often to poll for completion
  max_poll_hours: 24              # Maximum time to wait

Batch Management Functions

These functions are available for programmatic batch management:

from minima_llm import (
    batch_status_overview,
    cancel_batch,
    cancel_all_batches,
    cancel_all_local_batches,
    MinimaLlmConfig,
)

config = MinimaLlmConfig.from_yaml("config.yml")

# Show status of all local batch state files
batch_status_overview(config)

# Cancel a specific batch by remote batch ID
cancel_batch("batch_abc123", config)

# Cancel all batches matching a prefix
cancel_all_batches(config, prefix="my-project")

# Cancel ALL local batches
cancel_all_local_batches(config)

Command Line Interface

minima-llm provides a standalone CLI for batch management:

# Show status of all batches (uses CACHE_DIR from environment)
minima-llm batch-status

# With explicit config file
minima-llm batch-status --config config.yml

# Cancel batches matching a prefix
minima-llm batch-status --cancel my-prefix

# Cancel a specific remote batch by ID
minima-llm batch-status --cancel-remote batch_abc123

# Cancel ALL local batches
minima-llm batch-status --cancel-all

When calling from a different directory, use absolute paths or set environment variables:

# Absolute path to config
minima-llm batch-status --config /path/to/project/config.yml

# Or set CACHE_DIR to find batch state files
CACHE_DIR=/path/to/project/cache minima-llm batch-status

Configuration

Environment Variables

Variable              Description                           Default
OPENAI_BASE_URL       API endpoint URL                      (required)
OPENAI_MODEL          Model identifier                      (required)
OPENAI_API_KEY        API key                               None
CACHE_DIR             SQLite cache directory                None (disabled)
BATCH_NUM_WORKERS     Concurrent workers                    64
MAX_OUTSTANDING       Max in-flight HTTP requests           32
RPM                   Requests per minute (0 = unlimited)   600
TIMEOUT_S             Per-request timeout (seconds)         60.0
MAX_ATTEMPTS          Max retry attempts (0 = infinite)     6
CACHE_FORCE_REFRESH   Skip cache reads, still write         0 (disabled)
MINIMA_TRACE_FILE     Cache key debug log (JSONL)           None (disabled)
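For example, a minimal environment-driven setup might look like this sketch (the values are placeholders; in practice you would export the variables in your shell or deployment rather than setting them in code):

# Illustrative sketch: configure minima-llm entirely from environment variables.
import os
from minima_llm import MinimaLlmConfig, OpenAIMinimaLlm

os.environ["OPENAI_BASE_URL"] = "https://api.openai.com/v1"
os.environ["OPENAI_MODEL"] = "gpt-4"
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["CACHE_DIR"] = "./cache"
os.environ["RPM"] = "300"  # throttle to 300 requests per minute

config = MinimaLlmConfig.from_env()
backend = OpenAIMinimaLlm(config)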

YAML Configuration

base_url: "https://api.openai.com/v1"
model: "gpt-4"
api_key: "sk-..."
cache_dir: "./cache"

# Optional batch settings
batch:
  num_workers: 64
  max_failures: 25
  heartbeat_s: 10.0

Load with:

config = MinimaLlmConfig.from_yaml("config.yml")

Prompt Caching

minima-llm includes an SQLite-backed prompt cache that stores LLM responses keyed by a SHA-256 hash of the request parameters (model, messages, temperature, max_tokens, extras). The database uses WAL mode for multi-process safety.
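The exact canonicalization is internal to minima-llm, but conceptually the key derivation looks something like this sketch (field names and ordering here are illustrative assumptions, not the library's actual code):

# Illustrative sketch of content-addressed cache keys (not the real implementation).
# The library hashes a canonical JSON form of the request parameters with SHA-256.
import hashlib
import json

def cache_key(model, messages, temperature, max_tokens, extras=None):
    canonical = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "extras": extras or {},
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

Because the key covers every request parameter, any change to the messages, sampling settings, or extras produces a different key and therefore a cache miss.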

Enable / Disable

  • Enable: Set cache_dir to a directory path via environment variable, YAML, or code. The cache database is created at {cache_dir}/minima_llm.db.
  • Disable: Leave cache_dir unset (default). No cache files are created.

cache_dir: "./my-cache"

Force Refresh

Force refresh bypasses cache reads but still writes new responses to the cache, useful for regenerating stale entries.

  • Config-wide: Set CACHE_FORCE_REFRESH=1 env var, or force_refresh: true in YAML.
  • Per-request: Pass force_refresh=True to generate():

result = await backend.generate(request, force_refresh=True)

Debug Tracing

To diagnose cache misses, set MINIMA_TRACE_FILE to a file path. Every cache key computation is logged as a JSONL line containing the canonical JSON used for hashing and the resulting SHA-256 key:

MINIMA_TRACE_FILE=trace.jsonl python my_script.py

Each line has the form {"key": "<sha256>", "canonical": "<json>"}. Compare the canonical JSON between runs to spot the differences that cause cache misses.
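As an illustrative sketch (assuming only the JSONL format described above), you can diff the trace files of two runs to find requests whose canonical payload, and therefore cache key, changed:

# Illustrative sketch: compare two MINIMA_TRACE_FILE logs to diagnose cache misses.
import json

def load_trace(path):
    entries = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            entries[rec["key"]] = rec["canonical"]
    return entries

run_a = load_trace("trace_run_a.jsonl")
run_b = load_trace("trace_run_b.jsonl")

# Keys present in only one run correspond to requests whose canonical JSON changed.
only_a = set(run_a) - set(run_b)
only_b = set(run_b) - set(run_a)
print(f"{len(only_a)} keys only in run A, {len(only_b)} keys only in run B")

# Inspect a few canonical payloads from run A to spot the differing field.
for key in list(only_a)[:3]:
    print(run_a[key])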

Architecture

minima_llm/
├── protocol.py      # AsyncMinimaLlmBackend protocol, Request/Response types
├── config.py        # MinimaLlmConfig, BatchConfig, ParasailBatchConfig
├── backend.py       # OpenAIMinimaLlm - full async backend with cache
├── batch.py         # run_batched_callable, Parasail batch support, batch management
├── proxy.py         # OpenAI-compatible HTTP proxy server (minimallm-proxy)
├── cli.py           # Command-line interface (minima-llm, minimallm-proxy)
└── dspy_adapter.py  # MinimaLlmDSPyLM, TolerantChatAdapter (optional)

Multi-Loop Support

The backend is designed to be reused across multiple asyncio.run() calls:

backend = OpenAIMinimaLlm(config)

# First asyncio.run() - batch1 is any coroutine that uses the backend
asyncio.run(batch1(backend))

# Second asyncio.run() - reusing the same backend on a new event loop works correctly
asyncio.run(batch2(backend))

This is achieved through lazy per-loop initialization of async primitives.
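The general pattern is sketched below (a minimal illustration of lazy per-loop initialization, not minima-llm's actual internals): loop-bound primitives are recreated whenever a different event loop is running.

# Minimal sketch of lazy per-loop initialization (illustrative, not the library's code).
# asyncio primitives such as locks are bound to the loop they are first used on,
# so they are recreated whenever the running event loop changes.
import asyncio

class PerLoopLock:
    def __init__(self):
        self._lock = None
        self._loop = None

    def _get(self):
        loop = asyncio.get_running_loop()
        if self._lock is None or self._loop is not loop:
            self._lock = asyncio.Lock()
            self._loop = loop
        return self._lock

    async def __aenter__(self):
        await self._get().acquire()

    async def __aexit__(self, *exc):
        self._lock.release()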

License

MIT

