Skip to main content

A thin, opinionated, local-first structured-output + logging layer over LiteLLM

Project description

llmkit

A thin, opinionated, local-first layer over LiteLLM (with instructor for structured output). It gives an application one provider-agnostic call surface across OpenRouter, Google, Anthropic, OpenAI, DeepSeek, AWS Bedrock, and local Ollama, with validated structured output, per-provider rate limiting (concurrency on by default; optional requests-/tokens-per-minute), agent-readable per-call logging, and transient-error retries on by default — all out of the box.

LiteLLM is the implementation of the HTTP providers; llmkit owns the ergonomic call surface, the structured-output mode pinning, the rate-limit policy, and the logging convention. It is not a gateway and does not reimplement transport — that is solved, and reimplementing it is the thing this library deliberately does not do.

Why llmkit

  • Structured output that actually validates. Each provider is pinned to its native JSON-schema mode (never instructor's auto-Mode.TOOLS, which silently regresses Gemini to empty shapes), and instructor's in-call validation-retry repairs truncated JSON. You pass a Pydantic model; you get a validated instance back.
  • Provider switching is config, not code. OpenRouter / Google / Anthropic / OpenAI / DeepSeek / AWS Bedrock / Ollama behind one Provider enum and one LLMClientConfig. Call sites never change when you switch.
  • Logging tuned for coding agents. Every call is logged verdict-first (see below) — the design assumption is that the reader is usually an LLM coding agent debugging a run, not a dashboard.
  • Local-first, zero infra. The default sink writes plain files to a directory. No collector, no account, no network. A pluggable LogSink lets you ship records anywhere later without touching call sites.

These four are the headline; PRINCIPLES.md states the full set of design principles behind the library.

Install

uv add omg-llmkit          # or: pip install omg-llmkit

The distribution is published as omg-llmkit (the bare llmkit name was already taken on PyPI), but the import name is just llmkit:

import llmkit

You install omg-llmkit but import llmkit — that split trips a natural post-install smoke test. A mistaken import omg_llmkit (the install name) raises a clear one-line redirect to import llmkit, not a bare ModuleNotFoundError that leaves you guessing.

Requires Python ≥ 3.13.

The core install routes OpenRouter, Google, OpenAI, DeepSeek, and Ollama with no extra dependencies. Two providers gate their dependencies behind opt-in extras so hosts pay only for what they call:

pip install "omg-llmkit[anthropic]"  # direct Anthropic (Claude) routing
pip install "omg-llmkit[bedrock]"    # Claude-on-Bedrock (pulls in [anthropic] too)

The Anthropic SDK is opt-in because instructor reaches it only at call time, on its ANTHROPIC_JSON usage-accounting path — plain import llmkit and a Google-only flow never touch it. Constructing the AnthropicProvider or BedrockProvider without the SDK raises a clear install omg-llmkit[anthropic] error at construction, not a cryptic failure on the first call.

Quick start

from pydantic import BaseModel
from llmkit import (
    LLMClientConfig,
    Provider,
    configure_llm_client,
    structured_llm_call,
)

# Point the library at a provider once, at startup.
configure_llm_client(lambda: LLMClientConfig(
    provider=Provider.OPENROUTER,
    model="google/gemini-2.5-flash",
    api_key="sk-or-...",
))

class Summary(BaseModel):
    title: str
    bullets: list[str]

result: Summary = await structured_llm_call(
    prompt="Summarize the attached report.",
    output_schema=Summary,
    feature="reports",      # groups calls in the logs
    label="exec_summary",   # names this specific call in the logs
)

The public call surface:

Function Use
structured_llm_call(prompt, output_schema, feature, label, ...) Async, returns a validated Pydantic instance
structured_llm_call_sync(...) Synchronous wrapper around the above
text_llm_call(prompt, feature, label, ...) Async, returns plain text (coerces provider list-content blocks)
text_llm_call_sync(...) Synchronous wrapper around the above
stream_text_with_log(prompt, feature, label, ...) Async generator yielding text chunks, logged on completion

Two defaults worth knowing up front.

  • temperature defaults to 0.2 — biased toward deterministic output. A creative caller must override it explicitly (e.g. temperature=1.0); it is otherwise quietly conservative.
  • Any call takes a per-call provider= override — route a single call through a different provider family, model, or credential without touching the global configure_llm_client(...) registration. See Constructing a provider for a per-call override.

Reusing call options

The call functions (structured_llm_call, structured_llm_call_sync, text_llm_call, text_llm_call_sync, and stream_text_with_log) take up to nine keyword arguments. When a feature module makes many calls with the same settings, repeating that block at every site is noise. Build an LLMCallOptions once and pass it as options=:

from llmkit import LLMCallOptions, structured_llm_call

# Built once per feature module.
RISK_OPTS = LLMCallOptions(
    temperature=0.0,
    model="gemini-2.5-flash",
    reasoning_effort="high",
    max_tokens=2048,
)

async def extract(prompt: str) -> RiskRegister:
    return await structured_llm_call(
        prompt, RiskRegister, feature="extraction", options=RISK_OPTS
    )

LLMCallOptions is frozen and carries any subset of temperature / model / max_tokens / reasoning_effort / retry / provider. Every field is optional and unset by default — an unset field defers to the call's keyword (and through it to the configured client), so a partially-filled LLMCallOptions only supplies the fields you set.

feature is intentionally not part of LLMCallOptions. It stays a required per-call keyword as a telemetry forcing function: it scopes the per-call log filename and the index.jsonl grouping operators grep, so it must be a conscious choice at each call site rather than something defaulted-away into a shared object.

The flat-keyword path is unchanged — pass no options and nothing about existing calls changes.

Call-vs-config precedence

model and reasoning_effort are dual-homed: they can be set both on LLMClientConfig (the app-wide default) and on the call surface. The precedence, lowest to highest, is:

config < options < explicit per-call keyword

So a value passed directly as a keyword wins; an LLMCallOptions field sits between the keyword and the config; and when neither the keyword nor options supplies a value, the configured LLMClientConfig default applies (e.g. model=None defers to the provider/config default). An unset LLMCallOptions field never overrides config — only a field you explicitly set on the options participates.

Contracts as JSON-schema dicts

If your structured-output contract is a JSON-schema dict — typically because the same schema is shared with a Node backend or a frontend — model_from_json_schema(schema) converts it to a Pydantic model at runtime, so you don't hand-write the converter (and re-discover its footguns). Build the model once and reuse it; structured_llm_call stays Pydantic-model-only and takes the result as output_schema.

from llmkit import model_from_json_schema, structured_llm_call

INVOICE_SCHEMA = {                       # shared with Node / the frontend
    "title": "Invoice",
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "total": {"type": "number"},
        "status": {"enum": ["open", "closed", "void"]},
        "note": {"type": ["string", "null"]},          # optional, nullable
        "lines": {"type": "array", "items": {"$ref": "#/$defs/Line"}},
    },
    "required": ["id", "total", "status", "lines"],
    "$defs": {
        "Line": {
            "type": "object",
            "properties": {"sku": {"type": "string"}, "qty": {"type": "integer"}},
            "required": ["sku"],
        }
    },
}

Invoice = model_from_json_schema(INVOICE_SCHEMA)   # build once, at import

result = await structured_llm_call(
    prompt="Extract the invoice.",
    output_schema=Invoice,                         # reuse on every call
    feature="billing",
)

Supported subset (anything outside it raises a clear ValueError naming the construct): object with properties and a required array; scalars (string / integer / number / boolean, plus null / nullable); array with items (including arrays of objects); enum (string or integer members); nested objects inline or via local $ref (#/$defs/...); and additionalProperties as true / false / absent (a typed additionalProperties map is rejected). A non-required field becomes an optional defaulting to None, and the generated model's model_dump / model_dump_json drop a None left in an optional field by default — so an omitted optional is absent, not "field": null (which would fail downstream re-validation against the same schema). The drop is scoped to optionals: a required-but-nullable field explicitly set to None is kept. Pass exclude_none=False to keep every null, or exclude_none=True to drop them all. A title-less schema still gets a valid default class name (JsonSchemaModel); pass name= to set it explicitly. Generated models default to extra="forbid", so a response carrying a key not in the schema is rejected rather than silently kept — for an LLM output contract you want a hallucinated extra field to fail loudly (stricter than JSON Schema's permissive additionalProperties default); "additionalProperties": true opts an object into extra="allow" (extra keys accepted and kept), while false or absent stays strict. An explicit "type": "object" with no properties raises rather than silently building a zero-field model that rejects every real response — set "additionalProperties": true for an intentionally free-form object.

Want plain data back, not a model instance? Call .model_dump() on the result — it inherits the optional-None drop above, so the dict matches the schema:

Person = model_from_json_schema(person_schema)   # build once, at import

result = await structured_llm_call(prompt, Person, feature="extraction")
data = result.model_dump()                        # {"name": "Ada", "age": 36}

Schema constraints

model_from_json_schema carries a small, fixed set of per-field JSON-schema constraints through to the generated Pydantic Field, so the model validates value bounds, not just shape. The supported set is exactly:

JSON schema Pydantic Field Applies to
minimum ge numbers / integers
maximum le numbers / integers
exclusiveMinimum gt numbers / integers
exclusiveMaximum lt numbers / integers
minLength min_length strings
maxLength max_length strings
minItems min_length arrays
maxItems max_length arrays
description Field(description=...) any field (surfaced to the model by instructor)
Score = model_from_json_schema(
    {
        "type": "object",
        "properties": {"score": {"type": "integer", "minimum": 1, "maximum": 5}},
        "required": ["score"],
    }
)
Score(score=3)   # ok
Score(score=6)   # raises pydantic.ValidationError

Bounds are resolved through $ref chains of any depth and through nullable wrappers, so a constraint declared inside a $def (even several $ref hops deep) or on the non-null branch of a nullable field is still enforced (and null itself still passes for a nullable field).

One form caveat: exclusiveMinimum / exclusiveMaximum are recognised in their numeric (Draft 2020-12) form only. The Draft-4 / OpenAPI-3.0 boolean form ("exclusiveMinimum": true qualifying a sibling "minimum") is not recognised and is dropped — the bound is enforced as the sibling's inclusive minimum/maximum. If your schema comes from an OpenAPI 3.0 document, rewrite exclusive bounds in the numeric form.

Anything outside the table above is silently droppedpattern, format, multipleOf, uniqueItems, const, and the rest are not enforced. This is deliberate: partial enforcement that looks complete is worse than none. If a schema relies on one of those, validate it elsewhere.

Rate limiting

Rate limiting is on by default, scoped per provider (keyed by the effective provider name, matching how logging records it), across three independent dimensions:

  • Concurrencyon by default, default cap 8 concurrent calls per provider: enough headroom for the fan-out workloads consumers actually run, while still bounding a self-inflicted burst; lower it for a tightly-metered account, raise it for a local Ollama server. The cap binds async callers and the *_sync wrappers alike — a thread-pool fan-out of sync calls shares one per-provider cap. One caveat: async callers (on a shared event loop) and sync callers (in other threads) are capped on independent semaphores, so a workload mixing both populations can momentarily hold up to 2 × the cap per provider; RPM/TPM budgets are shared across both.
  • Requests per minute (RPM)opt-in, off by default. A per-provider request-rate ceiling.
  • Tokens per minute (TPM)opt-in, off by default. A per-provider token-rate ceiling, debited by each call's measured token usage.

configure_rate_limit(max_concurrent=..., enabled=..., rpm=..., tpm=...) sets them; get_rate_limit_config() reads back the effective enabled / max_concurrent / rpm / tpm (handy to log or assert at startup); configure_llm_logging(sink) swaps the log sink (below).

from llmkit import configure_rate_limit

# Stay under a metered account's published per-minute limits:
configure_rate_limit(rpm=3_500, tpm=2_000_000)

RPM and TPM are opt-in because — unlike concurrency, which has a universally sane default of 8 — the right per-minute number is the metered limit of your account, with no safe default to assume. Leaving them unset sends a request byte-identical to the pre-feature behaviour (no throttle on those dimensions). The binding limit on a metered cloud account is usually RPM/TPM rather than concurrency, so a migrator coming from a requests-per-minute knob should set rpm= here — the concurrency cap does not stand in for an RPM limit (the two limit different things, and an old RPM tuning otherwise goes inert). Both use a per-provider token bucket, which tolerates a small burst above the configured ceiling and then smooths to the sustained rate. That burst is deliberately small — min(max_concurrent, rpm) requests for RPM, roughly one second of tokens for TPM — not a full minute's quota. Against a provider that enforces a strict fixed minute window, the burst is the worst-case overshoot, so its relative size scales with your limits: with the default max_concurrent=8 it is negligible at rpm=3_500 (~0.2%) but a meaningful fraction of a small limit (8 extra requests on rpm=50 is 16%). A tightly-metered account should lower max_concurrent (which shrinks the RPM burst with it) or set rpm= a little below the published number to leave headroom. (A streamed call usually reports no token usage, so it does not debit TPM — consistent with cost being None for streamed calls.)

Joining the global rate limit directly

llmkit's own call functions already pass every provider call through the global, per-provider limit (concurrency on by default; RPM/TPM when configured). If your app issues provider calls outside those functions — for example a LangChain chat-model wrapper that calls the provider itself — you can join the same per-provider budget by hand with the module-level acquire functions:

from llmkit.rate_limiting import (
    rate_limit_acquire_async,
    rate_limit_acquire_sync,
)

# Async path (e.g. an async _agenerate):
async with rate_limit_acquire_async("openai") as slot:
    response = ...  # one slot held against openai's budget
    slot.record_tokens(response.usage.total_tokens)  # debits TPM (no-op when off)

# Sync path (e.g. a synchronous _generate / _stream):
with rate_limit_acquire_sync("openai") as slot:
    response = ...  # one slot held against openai's budget
    slot.record_tokens(response.usage.total_tokens)

The argument is the provider name (provider.name, e.g. "openai", "ollama"); each provider has an independent budget on every dimension. Each context manager yields a RateLimitSlot; call its record_tokens(...) once you know the call's token usage to debit the TPM budget (a no-op when TPM is off). Both are no-ops when rate limiting is disabled, and they share the exact throttle llmkit's own call paths use, so a hand-joined slot counts against the same budgets.

To check whether limiting is currently active, read the effective config rather than reaching into the limiter:

from llmkit.rate_limiting import get_rate_limit_config

if get_rate_limit_config().enabled:
    ...

get_rate_limit_config().enabled is the public replacement for the old GlobalRateLimiter.is_enabled() check; GlobalRateLimiter itself is no longer part of the headline surface (it remains importable from llmkit.rate_limiting for internal use).

Logging: agent-readable by default

LocalYamlLogSink (the default) writes two things to data/llm-logs/:

  1. One YAML file per call, laid out verdict-first. The file opens with a one-line # header — ok/ERROR, feature/label, resolved model, schema, duration, approximate cost — so head -1 *.yaml triages a whole run. Small metadata is next; the large response and prompt blobs are last, so the head of the file is the whole story for most reads.
  2. A compact append-only index.jsonl — one JSON line per call (file, timestamp, feature, label, model, provider, schema, duration, cost, error). Cross-call questions — "which calls errored / were slowest / most expensive / the last call for feature X" — are a single small scan instead of globbing and parsing every YAML.
# ok | reports/exec_summary | google/gemini-2.5-flash | Summary | 1840ms | $0.0007
# 2026-06-05T14:22:31.004512

timestamp: '2026-06-05T14:22:31.004512'
feature: reports
label: exec_summary
model: google/gemini-2.5-flash
provider: openrouter
schema: Summary
temperature: 0.0
duration_ms: 1840.2
approximate_cost: 0.0007
error: null
response: ...
prompt: ...

approximate_cost is LiteLLM's per-response estimate for budget visibility — not a billing figure (and None when the provider does not report it, e.g. streamed calls).

Capturing call records

Every call function (structured_llm_call, structured_llm_call_sync, text_llm_call, text_llm_call_sync, and stream_text_with_log) builds an LLMCallRecord and hands it to the configured log sink. A higher-level orchestrator that needs to cross-reference those calls — to total approximate cost, attribute spend per feature, or weave per-call traces — has two additive capture primitives, neither of which requires authoring a sink.

capture_llm_records() — records (cost / metadata). Wrap a scope to receive the LLMCallRecord for every call made inside it. Each record carries approximate_cost (a best-effort USD estimate, None when the provider doesn't report it), the resolved model/provider, duration_ms, error, and the rest — so a host gets cost and metadata without writing a custom sink. Capture is sink-independent: it works even with logging disabled (configure_llm_logging(None)), and crosses the run_sync sync bridge, so structured_llm_call_sync is captured exactly like the async path. One record is appended per attempt (retries each produce their own).

from llmkit import capture_llm_records, structured_llm_call

with capture_llm_records() as records:
    result = await structured_llm_call(prompt, MySchema, feature="extraction")

total_cost = sum(r.approximate_cost or 0.0 for r in records)

capture_llm_log_paths() — file paths. Wrap a scope to receive the per-call log-file path written by the configured file sink. Only a file sink (the default LocalYamlLogSink) yields a path; with a third-party sink, or with logging disabled, the list stays empty — reach for capture_llm_records() when you want cost/metadata regardless of the sink.

from llmkit import capture_llm_log_paths, structured_llm_call

with capture_llm_log_paths() as paths:
    result = await structured_llm_call(prompt, MySchema, feature="extraction")
# paths -> [PosixPath("data/llm-logs/...yaml"), ...]

Write your own LogSink

LogSink is a Protocol with a single, file-agnostic method:

class LogSink(Protocol):
    def write(self, record: LLMCallRecord) -> None: ...

A custom sink (a database, a metrics pipe, an in-memory buffer) is a one-method object that returns nothing; records (LLMCallRecord, a frozen dataclass) are handed to it for every call, and failures are swallowed so logging can never break a call. To send records somewhere other than local YAML — a database, an HTTP collector, structured stdout — implement write and register it:

import logging
from llmkit import LLMCallRecord, configure_llm_logging

logger = logging.getLogger("llm-calls")

class StructuredStdoutSink:
    def write(self, record: LLMCallRecord) -> None:
        logger.info(
            "llm_call",
            extra={
                "feature": record.feature,
                "label": record.label,
                "model": record.model,
                "provider": record.provider,
                "schema": record.schema,
                "duration_ms": record.duration_ms,
                "approximate_cost": record.approximate_cost,
                "error": record.error,
            },
        )

configure_llm_logging(StructuredStdoutSink())   # pass None to disable logging entirely

The shipped LocalYamlLogSink additionally exposes the path it wrote via its own write_returning_path(record) -> Path | None method — that file detail stays off the shared LogSink contract, and it is what powers capture_llm_log_paths() internally.

An OpenTelemetry exporter (e.g. to Langfuse/Phoenix) is a natural future llmkit[otel] extra; the pluggable seam makes it a non-breaking addition.

Configuration

LLMClientConfig is flat and carries only what a call needs:

@dataclass(frozen=True)
class LLMClientConfig:
    provider: Provider               # OPENROUTER | OLLAMA | GOOGLE | ANTHROPIC | OPENAI | DEEPSEEK | BEDROCK
    model: str | None = None         # None -> the provider's own default model
    api_key: str | None = None
    base_url: str | None = None      # OpenRouter / OpenAI-compatible endpoints; unused by Google/Anthropic
    reasoning_effort: str | None = None  # "disable" | "low" | "medium" | "high"
    aws_region_name: str | None = None   # AWS Bedrock region; unused by every other provider

aws_region_name is the only AWS-shaped field, and it carries only the region. AWS Bedrock authenticates through the standard AWS credential chain (environment, shared config, or instance/role), so Bedrock secrets never pass through LLMClientConfig; leave the region None too and it resolves from the chain (AWS_REGION_NAME / AWS_REGION). Bedrock routing needs boto3 for request signing — install it with the opt-in extra:

pip install "omg-llmkit[bedrock]"

The default model is Claude Haiku 4.5 via its cross-region inference profile id (us.anthropic.claude-haiku-4-5-20251001-v1:0) — current Claude models on Bedrock are typically reached through inference profiles rather than plain on-demand ids. Pass a different profile- or partition-prefixed id as model (e.g. eu.anthropic.claude-...) when your account routes elsewhere.

Per-call model= overrides the default, so "strong/small/current" model roles are the host's concern — resolve them to a model string and pass it at the call site. The library has no opinion about roles.

reasoning_effort controls provider "thinking"/reasoning tokens, forwarded to LiteLLM. Leave it None (the default) for the provider's own behaviour — the outbound request is byte-identical to omitting it. Set it once (e.g. "disable") and every call inherits it; the call functions also take a reasoning_effort= override for a single call. This matters most for Gemini, whose thinking is on by default and spends reasoning tokens against max_tokensreasoning_effort="disable" turns it off so a small max_tokens cap doesn't truncate structured output.

Register the config with configure_llm_client(source), where source is a zero-arg callable returning an LLMClientConfig (re-read on each provider construction, so it tracks live settings changes).

Constructing a provider for a per-call override

Most callers configure one provider once via configure_llm_client(...) and let every call pick it up. To send a single call through a different provider family, model, or credential, build a provider on the fly and pass it as the per-call provider= override. make_provider is the one-liner for that — it builds straight from raw credentials, with no LLMClientConfig and no module-level config source:

from llmkit import make_provider, structured_llm_call_sync, Provider

provider = make_provider(Provider.ANTHROPIC, api_key=anthropic_key)
result = structured_llm_call_sync(
    prompt,
    output_schema=MyModel,
    feature="summarize",
    provider=provider,
)

make_provider accepts the knobs each provider actually reads — api_key, model, base_url, reasoning_effort, aws_region_name — and ignores the ones a given provider doesn't use (e.g. base_url for Anthropic, api_key for Ollama or Bedrock, which signs via the ambient AWS credential chain). Leave model unset to inherit the provider's own default; the assembled LiteLLM id is always well-formed (e.g. anthropic/claude-sonnet-4-6).

A fully per-call host needs no global config at all. If you pass provider= on every call, you don't have to call configure_llm_client(...) — there is no global source to register, the call runs on the per-call provider alone, and the log records that provider as the effective one. The "configure once globally" and "provide per call" models are independent: use either, or mix them (a global default with per-call overrides where needed). A call that passes neither a per-call provider= nor a registered global source raises a clear RuntimeError telling you to configure one.

To build from a full config instead, use build_provider(config):

from llmkit import build_provider, LLMClientConfig, Provider

provider = build_provider(LLMClientConfig(provider=Provider.OPENAI, api_key=key))

LLMClientConfig.model is optional. When it is None (or empty), the provider falls back to its own built-in default model rather than emitting a broken "<prefix>/" id.

Naming: get_* reads, build_* / make_* construct

The accessor verbs are split by intent:

  • build_provider(config) / make_provider(...) construct a provider — from a config, or from raw credentials.
  • describe_llm(config) (importable from llmkit.providers) and get_rate_limit_config() read effective state — a snapshot for display/telemetry; they construct nothing you keep.

describe_llm replaces the old get_llm_config, and build_provider replaces get_provider; both old names are gone from the public surface.

OpenRouter: schema-honoring routing

OpenRouter is a router — it forwards your request to one of several serving providers behind each model. There's a sharp edge for structured output: structured_outputs is a model-level capability, but the strict response_format is actually enforced by the serving endpoint the request lands on. A model can advertise the capability while one of its endpoints quietly ignores the schema and returns free-form JSON — which then surfaces only as a confusing downstream validation failure, with nothing pointing at routing as the cause.

OpenRouterProvider defends against this by default: it sets OpenRouter's provider.require_parameters routing preference, so a request only lands on a serving endpoint that honors every parameter sent — including the structured response_format. The trade-off is that restricting routing to capable endpoints can in principle reduce availability or shift cost. To opt out (and accept the silent-free-form-JSON risk), construct the provider directly:

from llmkit import structured_llm_call
from llmkit.providers import OpenRouterProvider

provider = OpenRouterProvider(api_key="sk-or-...", require_parameters=False)
result = await structured_llm_call(prompt, MySchema, feature="x", provider=provider)

Routing stays on for the config-driven path (configure_llm_client / build_provider); the direct constructor above is the way to turn it off.

Retries

Two retry layers, kept deliberately separate:

  • Transient-provider retries, on by default. Every call function (structured_llm_call, structured_llm_call_sync, text_llm_call, text_llm_call_sync, stream_text_with_log) retries transient provider errors on its own — you don't wrap anything. The recoverable set splits into two budgets the policy counts separately:

    • Transport errors (LLM_TRANSPORT_ERRORS: 429 / 503 / 5xx, network/timeout) get the full max_attempts budget — three attempts by default — since a retry on a fresh connection routinely succeeds.
    • Schema-validation errors (LLM_SCHEMA_ERRORS: pydantic ValidationError, instructor InstructorRetryException) get the lower validation_max_attempts budget — two attempts (one retry) by default — so a transiently-malformed JSON response is still recovered, but a deterministically-wrong schema can't burn the full transport budget on doomed re-asks. (instructor wraps transport failures in InstructorRetryException too; the retry layer unwraps it, so a wrapped 429/5xx/network error still gets the full transport budget, not this lower one — and a wrapped permanent error such as a 401/400/403 fails fast after a single attempt, never charged to either budget.)

    LLM_RECOVERABLE_ERRORS remains the union of the two — keep using it in except clauses; the split only changes how the retry layer budgets them. One footnote: so that import llmkit doesn't pay LiteLLM's multi-second import cost, the litellm-native 503 entry (litellm.exceptions.ServiceUnavailableError) is a lazy stand-in resolved at isinstance time. isinstance checks — what the retry layer uses — behave identically, but a bare except LLM_TRANSPORT_ERRORS: / except LLM_RECOVERABLE_ERRORS: clause cannot catch that one litellm-native class (Python's except matching bypasses the lazy check); every other member still catches as usual, and an openai-SDK 503 arrives as openai.InternalServerError, which matches. Both budgets use bounded full-jitter backoff: the sleep before retry n is a random delay in [0, min(backoff_base_seconds * 2**(n-1), max_backoff_seconds)], with the per-sleep cap (max_backoff_seconds) defaulting to 30s so a large attempt budget can't grow the worst-case sleep unboundedly. Programming errors (e.g. TypeError) are outside the recoverable set and propagate immediately, never retried. Each attempt is its own logged call, so data/llm-logs/ shows one record per attempt.

    Tune or opt out per call with the retry= argument:

    from llmkit import structured_llm_call, RetryPolicy, NO_RETRY
    
    # Opt this one call out of automatic retries (e.g. latency-sensitive):
    result = await structured_llm_call(
        prompt="Summarize the attached report.",
        output_schema=Summary,
        feature="reports",
        label="exec_summary",
        retry=NO_RETRY,
    )
    
    # Or tune the budget / backoff for this call:
    result = await structured_llm_call(
        prompt="Summarize the attached report.",
        output_schema=Summary,
        feature="reports",
        label="exec_summary",
        retry=RetryPolicy(max_attempts=5, backoff_base_seconds=1.0),
    )
    

    Streaming caveat: stream_text_with_log can only retry a transient failure that happens before the first chunk reaches the caller. Once any chunk has been yielded, a mid-stream error propagates unretried — a partially-consumed stream can't be safely restarted.

    with_retries() (imported from llmkit.retry; see retry.py) remains the explicit, composable advanced path for wrapping any awaitable — useful when you want to retry a unit of work that isn't a single call function. The attempt count is max_attempts (total attempts including the first, N not 1+N); the previously-deprecated max_retries alias has been removed outright, so passing it now raises TypeError. Wrap a retry_progress_callback(...) scope around the work to observe per-attempt failures (e.g. for a progress UI):

    from llmkit.retry import with_retries
    from llmkit import LLM_TRANSPORT_ERRORS
    
    result = await with_retries(
        lambda: do_some_work(),
        max_attempts=3,
        backoff_base_seconds=0.5,
        retry_on=LLM_TRANSPORT_ERRORS,
    )
    

    A RetryProgressCallback is invoked once per non-final failed attempt with keyword arguments label, attempt, max_attempts, and error — the callback keyword is max_attempts (it was previously max_retries; rename it):

    def on_retry(*, label: str, attempt: int, max_attempts: int, error: BaseException) -> None:
        print(f"{label}: attempt {attempt}/{max_attempts} failed: {error}")
    

    Don't double-wrap the call functions. They already retry internally, so with_retries(structured_llm_call, ...) would otherwise multiply the budgets (the 3 × 3 = 9 trap). with_retries guards against this — it detects an active inner llmkit retry loop and collapses the inner layer to a single pass (warning once), so the budgets don't multiply. To drive retries entirely from your own wrapper instead, opt the inner call out with retry=NO_RETRY.

  • instructor's own in-call schema repair re-asks the model to fix malformed JSON within a single call, before any ValidationError/InstructorRetryException reaches the retry layer. llmkit pins instructor's max_retries to 2 — instructor counts total attempts, so that is two in-call attempts, i.e. exactly one repair re-ask — and it is not a caller-facing knob. This stays separate from the cross-call retry layer above: instructor repairs within one attempt; the policy's validation_max_attempts (default 2) governs how many fresh attempts a persistent schema failure earns. The two budgets are never conflated, so attempts aren't double-counted.

Re-rolling on a semantically-bad result

A response can pass the schema and still be wrong — an empty register, a citation that doesn't resolve, a total that doesn't reconcile. Rather than hand-rolling an LLM-then-validate-then-re-roll loop around the call, pass an on_result hook: it's called with each attempt's result, and raising ResultValidationError from it rejects that result and re-rolls the call.

from llmkit import structured_llm_call, ResultValidationError

def _must_have_findings(report: Report) -> None:
    if not report.findings:
        raise ResultValidationError("empty report — re-roll")

result = await structured_llm_call(
    prompt, Report, feature="reports", on_result=_must_have_findings,
)

The re-roll is charged against the validation budget (RetryPolicy.validation_max_attempts, default 2) — the same budget a schema failure uses, and for the same reason: a deterministically-bad result shouldn't burn the full transport budget on doomed re-asks. When the budget is exhausted the last ResultValidationError propagates. Each attempt — including a rejected one — is its own logged call, so data/llm-logs/ shows the rejected response alongside the error.

on_result is available on structured_llm_call and text_llm_call, and on both sync wrappers (structured_llm_call_sync, text_llm_call_sync); the text-path hooks receive the response text. It is not part of LLMCallOptions — like feature, it stays a conscious per-call choice.

Development

uv sync --extra dev
uv run ruff check . && uv run ruff format --check .
uv run basedpyright          # recommended tier; clean with no baseline
uv run pytest

Status & support

llmkit is a small, opinionated, best-effort project, extracted from a real application and maintained in the open. It is used in production by its author but carries no support SLA. Bug reports and focused pull requests are welcome — see CONTRIBUTING.md. For security issues, see SECURITY.md.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

omg_llmkit-0.2.0.tar.gz (207.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

omg_llmkit-0.2.0-py3-none-any.whl (107.7 kB view details)

Uploaded Python 3

File details

Details for the file omg_llmkit-0.2.0.tar.gz.

File metadata

  • Download URL: omg_llmkit-0.2.0.tar.gz
  • Upload date:
  • Size: 207.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for omg_llmkit-0.2.0.tar.gz
Algorithm Hash digest
SHA256 0fd626133cbcfba6bece35d1a7ebeb6ab15486c122f6926435dd9aaa5543d9ff
MD5 df1741c9bb313082e927de2ee8b54d7f
BLAKE2b-256 f892594d5e11bc74693ad6dec8cd3c356aa456edee1b8f06e092ae24dc60aa1a

See more details on using hashes here.

Provenance

The following attestation bundles were made for omg_llmkit-0.2.0.tar.gz:

Publisher: publish.yml on OMGBrews/llmkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file omg_llmkit-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: omg_llmkit-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 107.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for omg_llmkit-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 66b3ffa7ae7f479c28e7c3e915687bd06938d1c803da18e69ad33f81b8c3670b
MD5 da71db84308274db28c821ae7fea38a5
BLAKE2b-256 8970a9298c7e47de52dd34c4995d7ad0fb8678f7b43def066ead21112a9c1a96

See more details on using hashes here.

Provenance

The following attestation bundles were made for omg_llmkit-0.2.0-py3-none-any.whl:

Publisher: publish.yml on OMGBrews/llmkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page