Client and Tools for LLMs

llmskit

llmskit provides a unified Python interface for chat, embeddings, and reranking across multiple LLM providers.

The current codebase exposes:

  • Unified sync and async chat wrappers
  • OpenAI-style streaming and completion responses
  • Provider adapters for openai, gemini, and claude
  • Canonical multimodal message parts and tool definitions
  • OpenAI-compatible embeddings helpers
  • Generic reranker clients

Installation

pip install llmskit

Public API

from llmskit import (
    AsyncChatLLM,
    AsyncOpenAIEmbeddings,
    AsyncReranker,
    ChatLLM,
    OpenAIEmbeddings,
    Reranker,
)

Chat Quick Start

Synchronous chat

from llmskit import ChatLLM

chat = ChatLLM.from_openai(
    model="gpt-4o-mini",
    api_key="YOUR_API_KEY",
    base_url="https://api.openai.com/v1",  # replace with your OpenAI-compatible endpoint
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce yourself in one sentence."},
]

response = chat.complete(messages=messages)
message = response["choices"][0]["message"]

print(message["content"])
print(message["reasoning_content"])
print(response["usage"])

ChatLLM is intended for blocking, synchronous code paths and uses native synchronous provider clients. In Jupyter notebooks, async web frameworks, or any code running inside an async def function, prefer AsyncChatLLM so that you do not block the active event loop.

Asynchronous chat

import asyncio

from llmskit import AsyncChatLLM


async def main() -> None:
    chat = AsyncChatLLM.from_gemini(
        model="gemini-2.5-flash",
        api_key="YOUR_API_KEY",
    )

    response = await chat.complete(
        messages=[
            {"role": "system", "content": "Answer briefly."},
            {"role": "user", "content": "What is llmskit?"},
        ]
    )

    print(response["choices"][0]["message"]["content"])


asyncio.run(main())

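If your runtime already has a running event loop, such as a Jupyter notebook or a FastAPI / Starlette request handler, prefer AsyncChatLLM so that the loop stays responsive. In those environments, await main() directly; asyncio.run(...) raises RuntimeError when called from a running loop.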
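
Because complete(...) on the async wrapper is awaitable, several requests can share one event loop. The sketch below fans out prompts with asyncio.gather. StubChat and ask_all are hypothetical names: StubChat only imitates the complete(messages=...) call shape shown above so the example runs without a network connection or an API key; in practice you would pass a real AsyncChatLLM instance.

```python
import asyncio
from typing import Any


class StubChat:
    """Hypothetical stand-in for AsyncChatLLM; echoes the last user message."""

    async def complete(self, messages: list[dict[str, Any]], **kwargs: Any) -> dict[str, Any]:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        text = messages[-1]["content"]
        return {
            "choices": [
                {"index": 0, "message": {"role": "assistant", "content": f"echo: {text}"}}
            ]
        }


async def ask_all(chat: Any, prompts: list[str]) -> list[str]:
    """Send one request per prompt concurrently; replies come back in prompt order."""
    pending = [chat.complete(messages=[{"role": "user", "content": p}]) for p in prompts]
    responses = await asyncio.gather(*pending)  # gather preserves argument order
    return [r["choices"][0]["message"]["content"] for r in responses]


if __name__ == "__main__":
    print(asyncio.run(ask_all(StubChat(), ["one", "two"])))
```

asyncio.gather returns results in the order the awaitables were passed, so the replies line up with the prompts even when the underlying requests finish out of order.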

Provider Factories

Use explicit factory methods when you already know the backend:

from llmskit import ChatLLM

openai_chat = ChatLLM.from_openai(
    model="gpt-4o-mini",
    api_key="YOUR_API_KEY",
    base_url="https://api.openai.com/v1",
)

gemini_chat = ChatLLM.from_gemini(
    model="gemini-2.5-flash",
    api_key="YOUR_API_KEY",
)

claude_chat = ChatLLM.from_claude(
    model="claude-sonnet-4-20250514",
    api_key="YOUR_API_KEY",
)

Claude requests require max_tokens at request time. Pass it to complete(...) or stream(...), not to the factory:

response = claude_chat.complete(
    messages=[{"role": "user", "content": "Summarize llmskit in one sentence."}],
    max_tokens=1024,
)
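
If the same calling code serves several providers, one way to avoid forgetting max_tokens is to fill it in just before the request. This is a minimal sketch, not part of llmskit: request_kwargs and DEFAULT_CLAUDE_MAX_TOKENS are hypothetical names, and the default value is only illustrative.

```python
from typing import Any

DEFAULT_CLAUDE_MAX_TOKENS = 1024  # illustrative default, not a library constant


def request_kwargs(provider: str, **options: Any) -> dict[str, Any]:
    """Return request options, adding max_tokens for Claude when the caller omitted it."""
    merged = dict(options)
    if provider == "claude":
        # setdefault keeps an explicit caller value and only fills the gap.
        merged.setdefault("max_tokens", DEFAULT_CLAUDE_MAX_TOKENS)
    return merged
```

The result can then be unpacked into the call, for example claude_chat.complete(messages=messages, **request_kwargs("claude", temperature=0.2)).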

Or choose the provider dynamically:

from llmskit import ChatLLM

chat = ChatLLM.create(
    provider="openai",
    model="gpt-4o-mini",
    api_key="YOUR_API_KEY",
    base_url="https://api.openai.com/v1",
)

Supported provider names for create(...):

  • openai
  • gemini
  • claude
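
When the provider name comes from configuration, it can help to validate it before calling create(...). The sketch below is an assumption-laden example: resolve_provider and the LLM_PROVIDER environment variable are hypothetical, and the tuple only covers the three built-in names listed above, not custom providers you register yourself.

```python
import os
from typing import Optional

SUPPORTED_PROVIDERS = ("openai", "gemini", "claude")  # built-in names listed above


def resolve_provider(name: Optional[str] = None, env_var: str = "LLM_PROVIDER") -> str:
    """Pick a provider name from an argument or an environment variable and validate it."""
    candidate = (name or os.environ.get(env_var, "")).strip().lower()
    if candidate not in SUPPORTED_PROVIDERS:
        allowed = ", ".join(SUPPORTED_PROVIDERS)
        raise ValueError(f"unknown provider {candidate!r}; expected one of: {allowed}")
    return candidate
```

The validated name can be passed straight through, for example ChatLLM.create(provider=resolve_provider(), model=..., api_key=...).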

Deprecated aliases still exist in code, but new code should prefer:

  • from_openai(...) instead of from_gpt(...) or from_local(...)
  • from_claude(...) instead of from_anthropic(...)

Factory methods take construction-time options such as base_url, client_logger, and retry_config. Request options such as temperature, max_tokens, or the provider-facing response_format belong on complete(...) / stream(...). Use result_format when you want llmskit itself to return the legacy compatibility object.

You can also register custom chat providers without editing llmskit.chat:

from typing import Any, AsyncIterator

from llmskit import AsyncChatLLM
from llmskit.clients import AsyncLLMClient
from llmskit.core import register_chat_provider
from llmskit.types import Message, ProviderEvent, ToolDefinition


class MyChatClient(AsyncLLMClient):
    provider = "my-provider"
    model = "demo-model"
    capabilities = {
        "tool_calling": False,
        "reasoning": False,
        "streaming": True,
        "vision": False,
        "audio_input": False,
        "audio_output": False,
        "document_input": False,
        "video_input": False,
        "native_multimodal_output": False,
    }

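    async def events(
        self,
        messages: list[Message],
        *,
        tools: list[ToolDefinition] | None = None,
        **kwargs: Any,
    ) -> AsyncIterator[ProviderEvent]:
        # Placeholder: a real client would call its backend and yield ProviderEvent objects.
        del messages, tools, kwargs
        # The unreachable yield makes this an async generator that emits no events.
        if False:  # pragma: no cover
            yield ProviderEvent()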


register_chat_provider(name="my-provider", async_client_factory=MyChatClient, replace=True)
chat = AsyncChatLLM.create("my-provider", model="demo-model")

If you also want ChatLLM.create("my-provider", ...) support, register a native sync client with sync_client_factory=... as well.

If your custom provider needs per-model capability differences, declare provider_capability_defaults and model_capability_catalog on the client class. For OpenAI-compatible private models, you can also override the shared model capability snapshot via from_openai(..., capability_overrides={...}).

Response Formats

ChatLLM.complete(...) and AsyncChatLLM.complete(...) return an OpenAI-style response by default.

response = chat.complete(messages=messages)

print(response["object"])  # chat.completion
print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["message"]["tool_calls"])
print(response["usage"])
print(response["provider_extensions"])

Most examples read choices[0] for convenience. When a provider returns multiple candidates, llmskit preserves them as separate OpenAI-style choices with their original index values. The legacy compatibility object exposes only the first choice.
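
When you do request several candidates, reading only choices[0] discards the rest. The helper below is a small sketch that works on the plain dictionary shape shown above; contents_by_index and the sample payload are illustrative, not part of llmskit.

```python
from typing import Any


def contents_by_index(response: dict[str, Any]) -> dict[int, str]:
    """Map each choice's index to its message content (empty string if missing)."""
    out: dict[int, str] = {}
    for position, choice in enumerate(response.get("choices", [])):
        index = choice.get("index", position)  # fall back to list position
        out[index] = choice.get("message", {}).get("content") or ""
    return out


# Hand-written sample in the OpenAI-style shape; not real provider output.
sample = {
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "First draft."}},
        {"index": 1, "message": {"role": "assistant", "content": "Second draft."}},
    ],
}
```

Calling contents_by_index(sample) yields one entry per candidate, keyed by the index the provider assigned.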

If you still need the old compatibility object, request result_format="legacy":

legacy_response = chat.complete(
    messages=messages,
    result_format="legacy",
)

print(legacy_response.content)
print(legacy_response.reasoning_content)
print(legacy_response.tool_calls)

To control the provider's output format, keep using response_format, for example:

response = chat.complete(
    messages=messages,
    response_format={"type": "json_object"},
)

Provider-native request options should go inside provider_options, for example:

chat.complete(
    messages=messages,
    provider_options={"reasoning_effort": "high"},  # OpenAI native
)

chat.complete(
    messages=messages,
    provider_options={"thinking": {"type": "enabled", "budget_tokens": 1024}},  # Claude native
    max_tokens=2048,  # Claude requires this top-level request option
)

chat.complete(
    messages=messages,
    provider_options={"candidate_count": 2},  # Gemini native
)

Keep shared llmskit options such as temperature, max_tokens, modalities, audio, and response_format at the top level. Unknown top-level provider kwargs now raise a validation error instead of being silently ignored, and provider_options cannot override llmskit-managed keys such as model, messages, or stream.
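
llmskit performs this validation itself; the sketch below only illustrates the rule so that application code can fail even earlier, for example when loading options from a config file. check_provider_options is a hypothetical helper, and MANAGED_KEYS holds just the three keys named above; the library's real list may be longer.

```python
from typing import Any

# Keys the text above names as llmskit-managed; the library's full list may be longer.
MANAGED_KEYS = frozenset({"model", "messages", "stream"})


def check_provider_options(provider_options: dict[str, Any]) -> dict[str, Any]:
    """Raise ValueError if provider_options tries to set a managed key."""
    clashes = sorted(MANAGED_KEYS.intersection(provider_options))
    if clashes:
        raise ValueError(
            "provider_options cannot override managed keys: " + ", ".join(clashes)
        )
    return provider_options
```

A valid mapping such as {"reasoning_effort": "high"} passes through unchanged, while {"stream": True} is rejected before any request is built.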

Provider-specific boundaries:

| Provider | Top-level request options | provider_options examples | Notes |
| --- | --- | --- | --- |
| OpenAI-compatible | temperature, max_tokens, top_p, stop, tool_choice, modalities, audio, response_format, timeout, extra_headers, extra_query, extra_body | reasoning_effort, n, provider-native extensions | n > 1 can return multiple choices; managed keys such as model, messages, stream, and tools are reserved. |
| Claude | temperature, max_tokens, top_p, top_k, stop_sequences, metadata, timeout | thinking and other Anthropic message options not managed by llmskit | max_tokens is required and must be top-level; URL image/document sources are sent as {"type": "url", "url": ...}. |
| Gemini | temperature, top_p, top_k, max_tokens, modalities, audio, response_format | candidate_count, safety_settings, response_schema | max_tokens maps to max_output_tokens; candidate_count > 1 can return multiple choices; response_format supports a MIME string or {"type": "json_object"}. |

response_format="legacy" still works as a deprecated compatibility alias for older code, but new code should prefer result_format="legacy".

Streaming

stream(...) yields OpenAI-style chat completion chunks.

from llmskit import ChatLLM

chat = ChatLLM.from_openai(
    model="gpt-4o-mini",
    api_key="YOUR_API_KEY",
    base_url="https://api.openai.com/v1",
)

for chunk in chat.stream(
    messages=[{"role": "user", "content": "Count from 1 to 3."}]
):
    choice = chunk["choices"][0]
    delta = choice["delta"]

    if delta.get("role"):
        print("role:", delta["role"])
    if delta.get("content"):
        print(delta["content"], end="")
    if delta.get("reasoning_content"):
        print("\nreasoning:", delta["reasoning_content"])
    if delta.get("tool_calls"):
        print("\ntool_calls:", delta["tool_calls"])
    if choice.get("finish_reason"):
        print("\nfinish_reason:", choice["finish_reason"])

If you request multiple candidates, use choice["index"] to group streamed chunks by candidate.
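
Grouping by index can be sketched as a small accumulator over the chunk dictionaries. collect_stream and sample_chunks are illustrative names; the sample chunks are hand-written in the OpenAI-style shape described above rather than captured from a provider.

```python
from collections import defaultdict
from typing import Any, Iterable


def collect_stream(chunks: Iterable[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Join streamed content per choice index and keep each choice's finish_reason."""
    texts: dict[int, list[str]] = defaultdict(list)
    finish_reasons: dict[int, Any] = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            index = choice.get("index", 0)
            delta = choice.get("delta") or {}
            texts[index].append(delta.get("content") or "")
            if choice.get("finish_reason"):
                finish_reasons[index] = choice["finish_reason"]
    return {
        index: {"content": "".join(parts), "finish_reason": finish_reasons.get(index)}
        for index, parts in sorted(texts.items())
    }


# Two interleaved candidates, as a provider might stream them.
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "1"}}]},
    {"choices": [{"index": 1, "delta": {"content": "one"}}]},
    {"choices": [{"index": 0, "delta": {"content": ", 2"}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    {"choices": [{"index": 1, "delta": {"content": ", two"}, "finish_reason": "stop"}]},
]
```

With a live client the same function can consume the iterator directly, for example collect_stream(chat.stream(messages=messages)).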

Tool Calling

Tool definitions use one canonical schema across providers:

tools = [
    {
        "name": "get_weather",
        "description": "Get the weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
            },
            "required": ["city"],
        },
    }
]

Pass them to complete(...) or stream(...):

response = chat.complete(
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
)

tool_calls = response["choices"][0]["message"]["tool_calls"]
print(tool_calls)

Returned tool calls are normalized to an OpenAI-style structure:

[
    {
        "id": "call_123",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": "{\"city\":\"Beijing\"}",
        },
    }
]
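
Because arguments arrives as a JSON string, it has to be decoded before a local function can be called. The following sketch shows one way to dispatch normalized tool calls; run_tool_calls, REGISTRY, and the get_weather stub are hypothetical, and how the results are sent back to the model is left out because it is not covered here.

```python
import json
from typing import Any, Callable, Optional


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"No live data for {city} in this sketch."


REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(
    tool_calls: Optional[list[dict[str, Any]]],
    registry: dict[str, Callable[..., Any]],
) -> list[dict[str, Any]]:
    """Decode each call's JSON arguments and invoke the matching local handler."""
    results = []
    for call in tool_calls or []:  # tool_calls may be None when no tool was requested
        function = call["function"]
        handler = registry.get(function["name"])
        if handler is None:
            raise KeyError(f"no local handler for tool {function['name']!r}")
        arguments = json.loads(function["arguments"] or "{}")
        results.append(
            {"id": call["id"], "name": function["name"], "result": handler(**arguments)}
        )
    return results
```

Keeping the handlers in an explicit registry means a model can only trigger functions you have deliberately exposed.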

Multimodal Messages

Message content can be either a plain string or a list of structured content parts.

Supported canonical content part types:

  • text
  • image_url
  • input_audio
  • file
  • video_url

Vision example

from llmskit import ChatLLM

chat = ChatLLM.from_gemini(
    model="gemini-2.5-flash",
    api_key="YOUR_API_KEY",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/cat.png",
                    "format": "image/png",
                },
            },
        ],
    }
]

response = chat.complete(messages=messages)
print(response["choices"][0]["message"]["content"])

Other content part shapes

audio_part = {
    "type": "input_audio",
    "input_audio": {
        "data": "<base64-audio-data>",
        "format": "wav",
    },
}

file_part = {
    "type": "file",
    "file": {
        "file_id": "gs://bucket/report.pdf",
        "format": "application/pdf",
    },
}

video_part = {
    "type": "video_url",
    "video_url": {
        "url": "gs://bucket/demo.mp4",
        "format": "video/mp4",
    },
}

The wrapper validates unsupported modalities early, so provider/model mismatches fail fast instead of being silently forwarded.

Inline binary data is validated strictly. Data URLs must include a ;base64 marker and a valid MIME type, such as data:image/png;base64,...; invalid base64 payloads are rejected before a provider request is made.
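
To make the data URL rules concrete, here is a standard-library sketch that builds a data URL and checks one against the same constraints (a MIME type, a ;base64 marker, and a decodable payload). make_data_url and parse_data_url are hypothetical helpers and do not reproduce llmskit's own validator.

```python
import base64
import binascii
import re

# data:<type>/<subtype>;base64,<payload>
_DATA_URL = re.compile(
    r"^data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<payload>.*)$", re.DOTALL
)


def make_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    return f"data:{mime_type};base64,{base64.b64encode(data).decode('ascii')}"


def parse_data_url(url: str) -> tuple[str, bytes]:
    """Return (mime_type, decoded_bytes) or raise ValueError for a malformed URL."""
    match = _DATA_URL.match(url)
    if match is None:
        raise ValueError("expected data:<mime>;base64,<payload>")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        payload = base64.b64decode(match["payload"], validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return match["mime"], payload
```

The string returned by make_data_url can be used as the url value of an image_url content part.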

You can inspect capabilities at runtime:

print(chat.capabilities)
print(chat.capability_snapshot())
print(chat.refresh_capabilities())
print(chat.supports_vision())
print(chat.supports_audio_input())
print(chat.supports_audio_output())
print(chat.supports_document_input())
print(chat.supports_video_input())

Where:

  • chat.capabilities is the backward-compatible boolean view.
  • chat.capability_snapshot() returns a model-level snapshot with state / source metadata.
  • chat.refresh_capabilities() re-resolves the shared snapshot for the current provider + model + base_url tuple and preserves runtime-learned corrections by default.
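
The boolean view can also drive a pre-flight check in application code. The sketch below assumes that chat.capabilities behaves like a dictionary with the flag names used in the custom client example (vision, audio_input, document_input, video_input); the mapping from content part types to flags and the unsupported_parts helper are assumptions made for illustration.

```python
from typing import Any

# Content part type -> capability flag (assumed mapping, see the note above).
PART_CAPABILITY = {
    "image_url": "vision",
    "input_audio": "audio_input",
    "file": "document_input",
    "video_url": "video_input",
}


def unsupported_parts(
    messages: list[dict[str, Any]], capabilities: dict[str, bool]
) -> list[str]:
    """Return the content part types in messages that the capabilities do not allow."""
    missing: list[str] = []
    for message in messages:
        content = message.get("content")
        if content is None or isinstance(content, str):
            continue  # plain strings need no extra capability
        for part in content:
            flag = PART_CAPABILITY.get(part.get("type"))
            if flag and not capabilities.get(flag, False) and part["type"] not in missing:
                missing.append(part["type"])
    return missing
```

An empty list means every structured part in the conversation maps to a capability the model reports as available.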

Provider Capability Overview

The table below describes default model-family capabilities for built-in providers. At runtime, the authoritative behavior is the model-level capability snapshot, not class-level static constants:

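| Provider | Tool calling | Reasoning | Vision | Audio input | Audio output | Document input | Video input |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI-compatible | Yes | Yes | Yes | Yes | Yes | No | No |
| Claude | Yes | Yes | Yes | No | No | Yes | No |
| Gemini | Yes | Yes | Yes | Yes | Yes | Yes | Yes |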

Embeddings

OpenAIEmbeddings and AsyncOpenAIEmbeddings target OpenAI-compatible embedding endpoints.

Synchronous embeddings

from llmskit import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    base_url="https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_key="YOUR_API_KEY",
    batch_size=16,
)

query_vector = embeddings.embed_query("What is llmskit?")
document_vectors = embeddings.embed_documents(
    [
        "llmskit wraps multiple chat providers.",
        "It also includes embeddings and reranking helpers.",
    ]
)

print(len(query_vector))
print(len(document_vectors))
print(embeddings.get_embedding_dimension())
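
A common next step is to compare the query vector with the document vectors. The sketch below uses plain Python lists and cosine similarity; cosine_similarity and rank_documents are illustrative helpers rather than llmskit functions, and they assume the vectors are sequences of floats as the embed_* calls above suggest.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(
    query_vector: Sequence[float], document_vectors: Sequence[Sequence[float]]
) -> list[tuple[int, float]]:
    """Return (document_index, similarity) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(document_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With the variables from the example above, rank_documents(query_vector, document_vectors) orders the two documents by similarity to the query.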

Asynchronous embeddings

import asyncio

from llmskit import AsyncOpenAIEmbeddings


async def main() -> None:
    embeddings = AsyncOpenAIEmbeddings(
        base_url="https://api.openai.com/v1",
        model="text-embedding-3-small",
        api_key="YOUR_API_KEY",
    )

    vector = await embeddings.embed_query("hello")
    print(len(vector))


asyncio.run(main())

Embedding helpers include:

  • batching
  • retry with exponential backoff
  • max input length truncation
  • cached dimension detection

Reranker

Reranker and AsyncReranker call a rerank service that exposes a /rerank endpoint.

Synchronous reranking

from llmskit import Reranker

reranker = Reranker(
    base_url="https://your-reranker-service",
    model="bge-reranker-v2-m3",
    api_key="YOUR_API_KEY",
)

result = reranker.rerank(
    query="python async http client",
    documents=[
        "httpx supports both sync and async clients",
        "Redis is an in-memory database",
        "Python generators can yield values lazily",
    ],
    top_n=2,
    threshold=0.0,
)

print(result.results)
print(result.usage)
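
The effect of top_n and threshold can be illustrated on a plain list of scores. This is a sketch under assumptions: select_results is a hypothetical helper, the scores are made up, and treating the threshold as inclusive is a guess rather than documented llmskit behavior.

```python
from typing import Optional


def select_results(
    scores: list[float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> list[tuple[int, float]]:
    """Sort (document_index, score) pairs by score, drop low scores, keep the first top_n."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= threshold]  # inclusive by assumption
    return ranked[:top_n] if top_n is not None else ranked
```

For three documents scored 0.91, 0.02, and 0.40, asking for top_n=2 with threshold=0.1 keeps the first and third documents, in that order.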

Asynchronous reranking

import asyncio

from llmskit import AsyncReranker


async def main() -> None:
    reranker = AsyncReranker(
        base_url="https://your-reranker-service",
        model="bge-reranker-v2-m3",
        api_key="YOUR_API_KEY",
    )

    result = await reranker.rerank(
        query="python async http client",
        documents=[
            "httpx supports both sync and async clients",
            "Redis is an in-memory database",
        ],
        top_n=1,
    )
    print(result.results)


asyncio.run(main())

Notes

  • ChatLLM and AsyncChatLLM normalize provider responses into OpenAI-style chunks and completion payloads.
  • The default non-streaming response is the new OpenAI-style dictionary, not the legacy dataclass.
  • OpenAIEmbeddings works with OpenAI-compatible services, including self-hosted endpoints that implement the embeddings API.
  • Retries are built in for transient network and server-side failures.
  • For local development and CI, run python -m pytest -q from the repository root.
  • The examples above use placeholder model names and endpoints; replace them with the values supported by your provider.
