# OmniModalKit

OmniModalKit is a lightweight Python toolkit for using text, image, and audio
capabilities behind a consistent OpenAI-style interface.

The current implementation focuses on OpenAI-compatible local providers,
especially llama.cpp, while keeping modality code separate from provider
transport code.
## Install

This project uses uv.

```shell
uv sync
```

Optional audio extras:

```shell
uv sync --extra audio
uv sync --extra tts-piper
uv sync --extra audio --extra tts-piper
```
- `audio` installs `openai-whisper` for speech-to-text.
- `tts-piper` installs `piper-tts` for local text-to-speech.

Whisper also needs `ffmpeg` available on `PATH`.
## Basic Text Usage

```python
from omnimodalkit import OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    max_tokens=64,
)
print(response.text)
```

Convenience:

```python
text = client.generate("Say hi", max_tokens=64)
```
## Conversation Memory

Memory is explicit and caller-owned. Pass a `ConversationMemory` object to keep
chat history across requests. New request messages and the first assistant
response are appended after each call.

```python
from omnimodalkit import ConversationMemory, OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")

memory = ConversationMemory.from_messages(
    [{"role": "system", "content": "Be brief."}],
    max_messages=12,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "My name is Ada."}],
    memory=memory,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is my name?"}],
    memory=memory,
)
print(response.text)
```

Convenience methods can use the same memory:

```python
text = client.generate("What is my name?", memory=memory, max_tokens=64)
```
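The caller-owned pattern with a message cap can be approximated with a plain list. A rough sketch of the `max_messages` behavior (illustrative only, not the library's implementation — a real implementation may pin the system message instead of dropping it):

```python
class SketchMemory:
    """Toy caller-owned memory with a hard cap on stored messages."""

    def __init__(self, messages, max_messages=12):
        self.max_messages = max_messages
        self.messages = list(messages)

    def append(self, message):
        self.messages.append(message)
        # Drop the oldest entries once the cap is exceeded.
        if len(self.messages) > self.max_messages:
            del self.messages[: len(self.messages) - self.max_messages]


memory = SketchMemory(
    [{"role": "system", "content": "Be brief."}], max_messages=3
)
for i in range(4):
    memory.append({"role": "user", "content": f"msg {i}"})
print(len(memory.messages))  # 3
```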
## Streaming

Chat completions can stream OpenAI-compatible server-sent events:

```python
chunks = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
)
for chunk in chunks:
    print(chunk.text, end="")
```
## Embeddings And Models

OpenAI-compatible embeddings and model listing are exposed through lightweight
namespaces:

```python
vectors = client.embeddings.create(input=["hello", "world"]).embeddings
models = client.models.list()
```

Convenience:

```python
vectors = client.embed("hello")
```
## Structured Output

Structured output helpers parse JSON objects from model text. Dataclass targets
are supported without adding a required validation dependency.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    name: str
    count: int

answer = client.generate_structured(
    "Return JSON with name and count.",
    target=Answer,
)
```
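Dataclass parsing of this kind reduces to `json.loads` plus field matching. A minimal sketch of the idea, using only the standard library (illustrative — not the library's actual parsing code):

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Answer:
    name: str
    count: int


def parse_structured(text: str, target):
    """Extract the first JSON object in model text into a dataclass."""
    # Models often wrap JSON in prose; grab the outermost {...} span.
    start, end = text.index("{"), text.rindex("}") + 1
    data = json.loads(text[start:end])
    # Keep only the keys the dataclass declares.
    names = {f.name for f in fields(target)}
    return target(**{k: v for k, v in data.items() if k in names})


answer = parse_structured('Sure: {"name": "widget", "count": 3}', Answer)
print(answer)  # Answer(name='widget', count=3)
```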
## Image Prompting

Image prompting uses OpenAI-compatible chat content parts. Local images are
converted to data URLs. For llama.cpp, image input is converted to PNG because
some server builds reject WebP data URLs.

```python
response = client.image.create(
    prompt="Briefly describe this image.",
    image="dog.webp",
    mime_type="image/webp",
)
print(response.text)
```

You can also convert an image path yourself:

```python
from omnimodalkit.image import image_path_to_data_url

data_url = image_path_to_data_url(
    "dog.webp",
    convert_to_mime_type="image/png",
)
```
## Speech-To-Text

Speech-to-text uses Whisper as an optional dependency.

```python
text = client.audio.transcriptions.create(
    file="audio.m4a",
    model="base",
    language="en",
)
```

Convenience:

```python
text = client.transcribe_audio("audio.m4a", model="base", language="en")
```

Return Whisper metadata when needed:

```python
result = client.transcribe_audio(
    "audio.m4a",
    model="base",
    language="en",
    return_metadata=True,
)
print(result.text)
print(result.segments)
```

The default Whisper model is `base`, a practical balance between resource use
and accuracy. Users can choose `tiny`, `base`, `small`, `medium`, or `large`.
## Text-To-Speech

Text-to-speech is engine-based. Piper is currently supported as an optional
local engine.

```python
from omnimodalkit.audio import PiperTextToSpeechEngine

engine = PiperTextToSpeechEngine(
    model_path="en_US-amy-medium.onnx",
)

speech = client.audio.speech.create(
    text="Hello from OmniModalKit.",
    engine=engine,
)
audio_bytes = speech.audio
```

To save the audio:

```python
speech.write_to("speech.wav")
```

Or pass `output_path`:

```python
speech = client.synthesize_speech(
    "Hello from OmniModalKit.",
    engine=engine,
    output_path="speech.wav",
)
```
## LLM Response With Voice

`get_response_with_voice` queries the LLM first, then sends the text response
to the configured text-to-speech engine.

```python
response = client.get_response_with_voice(
    "Say one short sentence about multimodal tools.",
    engine=engine,
    output_path="response.wav",
)
print(response.text)
audio_bytes = response.speech.audio
```
## Tools

Tools are explicit. OmniModalKit does not scan or execute files automatically.
The host application registers approved functions and passes their schemas to
the model.

```python
from omnimodalkit import ToolRegistry

tools = ToolRegistry()
tools.register_from_path(
    path="tools/web.py",
    function_name="search",
    name="web_search",
    description="Search the web.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools.schemas(),
    tool_choice="auto",
)

for call in response.tool_calls:
    result = tools.run(call)
    tool_message = tools.tool_result_message(call, result)
```

Append `tool_message` to the conversation and send another chat request to let
the model produce the final answer.
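The follow-up round uses standard OpenAI message ordering: the assistant turn that requested the tool calls, then one tool message per call echoing its id. A sketch of assembling that conversation with plain dicts (the literal ids and payloads here are made up for illustration):

```python
# OpenAI-style message list for the second round.
messages = [{"role": "user", "content": "Search for Python news"}]

# 1. The assistant turn that requested the tool call.
messages.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "web_search",
            "arguments": '{"query": "Python news"}',
        },
    }],
})

# 2. One tool message per call, echoing the call id.
messages.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": '{"results": ["..."]}',
})

# 3. Sending this list in a second chat request yields the final answer.
roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant', 'tool']
```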
A one-step helper is available for applications that want OmniModalKit to
handle the OpenAI message ordering while still using only explicitly registered
tools:

```python
result = client.run_tools_once(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools,
)
print(result.text)
```
## Async Usage

The async facade wraps the same provider behavior for applications that need
await-friendly calls:

```python
response = await client.async_client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
)
```
## Provider Capabilities

Provider adapters expose basic capability metadata:

```python
capabilities = client.capabilities()
print(capabilities.streaming, capabilities.embeddings)
```
## Architecture

Current layout:

```text
omnimodalkit/
  client.py                 # public facade
  memory.py                 # explicit in-memory conversation history
  embeddings.py             # embedding request/response models
  models.py                 # model list response models
  structured.py             # JSON/dataclass output parsing
  capabilities.py           # provider feature flags
  types.py                  # shared errors
  tools.py                  # explicit tool registry
  text/types.py             # chat, response, tool-call models
  image/types.py            # image prompt/data URL helpers
  audio/speech_to_text.py   # Whisper-backed transcription helper
  audio/text_to_speech.py   # TTS request/result types and Piper engine
  providers/base.py         # provider protocols
  providers/openai_compatible.py
  providers/llama_cpp.py
```

Provider-specific code belongs under `omnimodalkit/providers/`. Modality
packages should contain request/response shaping and modality helpers, not
provider-specific adapters.
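A provider boundary like this is naturally expressed as a structural protocol, so adapters need no common base class. An illustrative sketch (method names here are assumptions, not the actual contents of `providers/base.py`):

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class ChatProvider(Protocol):
    """Structural interface a provider adapter might satisfy (illustrative)."""

    def chat(self, messages: list, **options) -> str: ...
    def stream_chat(self, messages: list, **options) -> Iterable[str]: ...


class EchoProvider:
    """Toy adapter showing that conformance is structural, not inherited."""

    def chat(self, messages, **options):
        return messages[-1]["content"]

    def stream_chat(self, messages, **options):
        yield messages[-1]["content"]


provider = EchoProvider()
print(isinstance(provider, ChatProvider))  # True
```

Because the protocol is `runtime_checkable`, `isinstance` verifies only that the required methods exist, which keeps modality code decoupled from any concrete transport.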
## Tests

```shell
uv run pytest -q
```