OmniModalKit

omnimodalkit is a lightweight Python toolkit that exposes text, image, and audio capabilities behind a consistent OpenAI-style interface.

The current implementation focuses on OpenAI-compatible local providers, especially llama.cpp, while keeping modality code separate from provider transport code.

Install

This project uses uv.

uv sync

Optional audio extras:

uv sync --extra audio
uv sync --extra tts-piper
uv sync --extra audio --extra tts-piper
  • audio installs openai-whisper for speech-to-text.
  • tts-piper installs piper-tts for local text-to-speech.

Whisper also needs ffmpeg available on PATH.
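
A quick standard-library check that ffmpeg is reachable before transcribing:

from shutil import which

assert which("ffmpeg") is not None, "ffmpeg not found on PATH"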

Basic Text Usage

from omnimodalkit import OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    max_tokens=64,
)

print(response.text)

Convenience:

text = client.generate("Say hi", max_tokens=64)

Conversation Memory

Memory is explicit and caller-owned. Pass a ConversationMemory object to keep chat history across requests; after each call, the new request messages and the first assistant reply are appended to it.

from omnimodalkit import ConversationMemory, OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")
memory = ConversationMemory.from_messages(
    [{"role": "system", "content": "Be brief."}],
    max_messages=12,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "My name is Ada."}],
    memory=memory,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is my name?"}],
    memory=memory,
)

print(response.text)

Convenience methods can use the same memory:

text = client.generate("What is my name?", memory=memory, max_tokens=64)

Streaming

Chat completions can stream OpenAI-compatible server-sent events:

chunks = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
)

for chunk in chunks:
    print(chunk.text, end="")
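
A variant that also keeps the full reply while printing it incrementally:

parts = []
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
):
    parts.append(chunk.text)
    print(chunk.text, end="")

full_text = "".join(parts)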

Embeddings And Models

OpenAI-compatible embeddings and model listing are exposed through lightweight namespaces:

vectors = client.embeddings.create(input=["hello", "world"]).embeddings
models = client.models.list()

Convenience:

vectors = client.embed("hello")
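
Assuming each returned vector is a plain sequence of floats, as with OpenAI-compatible servers, similarity math needs no extra dependencies. A minimal cosine-similarity sketch over the two vectors from the call above:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

vec_hello, vec_world = client.embeddings.create(input=["hello", "world"]).embeddings
print(cosine_similarity(vec_hello, vec_world))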

Structured Output

Structured output helpers parse JSON objects from model text. Dataclass targets are supported without requiring an extra validation dependency.

from dataclasses import dataclass

@dataclass
class Answer:
    name: str
    count: int

answer = client.generate_structured(
    "Return JSON with name and count.",
    target=Answer,
)
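
The result is an Answer instance, so the parsed fields are typed attributes:

print(answer.name, answer.count)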

Image Prompting

Image prompting uses OpenAI-compatible chat content parts. Local images are converted to data URLs. For llama.cpp, image input is converted to PNG because some server builds reject WebP data URLs.

response = client.image.create(
    prompt="Briefly describe this image.",
    image="dog.webp",
    mime_type="image/webp",
)

print(response.text)

You can also convert an image path yourself:

from omnimodalkit.image import image_path_to_data_url

data_url = image_path_to_data_url(
    "dog.webp",
    convert_to_mime_type="image/png",
)
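
For formats the server already accepts, a data URL is just the base64-encoded file bytes behind a MIME prefix. A rough standard-library equivalent, without the PNG conversion the helper performs:

import base64

with open("dog.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

data_url = f"data:image/png;base64,{encoded}"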

Speech-To-Text

Speech-to-text uses Whisper as an optional dependency.

text = client.audio.transcriptions.create(
    file="audio.m4a",
    model="base",
    language="en",
)

Convenience:

text = client.transcribe_audio("audio.m4a", model="base", language="en")

Return Whisper metadata when needed:

result = client.transcribe_audio(
    "audio.m4a",
    model="base",
    language="en",
    return_metadata=True,
)

print(result.text)
print(result.segments)

The default Whisper model is base, a practical balance between resource use and accuracy. Available sizes are tiny, base, small, medium, and large.

Text-To-Speech

Text-to-speech is engine-based. Piper is currently supported as an optional local engine.

from omnimodalkit.audio import PiperTextToSpeechEngine

engine = PiperTextToSpeechEngine(
    model_path="en_US-amy-medium.onnx",
)

speech = client.audio.speech.create(
    text="Hello from OmniModalKit.",
    engine=engine,
)

audio_bytes = speech.audio

To save the audio:

speech.write_to("speech.wav")

Or pass output_path:

speech = client.synthesize_speech(
    "Hello from OmniModalKit.",
    engine=engine,
    output_path="speech.wav",
)

LLM Response With Voice

get_response_with_voice queries the LLM first, then sends the text response to the configured text-to-speech engine.

response = client.get_response_with_voice(
    "Say one short sentence about multimodal tools.",
    engine=engine,
    output_path="response.wav",
)

print(response.text)
audio_bytes = response.speech.audio

Tools

Tools are explicit. OmniModalKit does not scan or execute files automatically. The host application registers approved functions and passes their schemas to the model.

from omnimodalkit import ToolRegistry

tools = ToolRegistry()

tools.register_from_path(
    path="tools/web.py",
    function_name="search",
    name="web_search",
    description="Search the web.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools.schemas(),
    tool_choice="auto",
)

tool_messages = []
for call in response.tool_calls:
    result = tools.run(call)
    tool_messages.append(tools.tool_result_message(call, result))

Append the assistant's tool-call turn and the collected tool messages to the conversation, then send another chat request to let the model produce the final answer, as sketched below.
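
A sketch of that second round trip. How the assistant tool-call turn is read off the response depends on the response model; response.message below is an assumption, not a documented accessor:

messages = [{"role": "user", "content": "Search for Python news"}]
messages.append(response.message)  # assumed accessor for the assistant tool-call turn
messages.extend(tool_messages)

final = client.chat.completions.create(messages=messages)
print(final.text)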

A one-step helper is available for applications that want OmniModalKit to handle the OpenAI message ordering while still using only explicitly registered tools:

result = client.run_tools_once(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools,
)

print(result.text)

Async Usage

The async facade wraps the same provider behavior for applications that need await-friendly calls:

response = await client.async_client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
)
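
A complete, runnable form of the same call:

import asyncio

from omnimodalkit import OmniModalKit

async def main():
    client = OmniModalKit(base_url="http://127.0.0.1:8080")
    response = await client.async_client.chat.completions.create(
        messages=[{"role": "user", "content": "Say hi"}],
    )
    print(response.text)

asyncio.run(main())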

Provider Capabilities

Provider adapters expose basic capability metadata:

capabilities = client.capabilities()
print(capabilities.streaming, capabilities.embeddings)
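
These flags can gate optional code paths, for example falling back to a non-streaming request when the provider does not stream:

if capabilities.streaming:
    for chunk in client.chat.completions.create(
        messages=[{"role": "user", "content": "Say hi"}],
        stream=True,
    ):
        print(chunk.text, end="")
else:
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": "Say hi"}],
    )
    print(response.text)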

Architecture

Current layout:

omnimodalkit/
  client.py                  # public facade
  memory.py                  # explicit in-memory conversation history
  embeddings.py              # embedding request/response models
  models.py                  # model list response models
  structured.py              # JSON/dataclass output parsing
  capabilities.py            # provider feature flags
  types.py                   # shared errors
  tools.py                   # explicit tool registry
  text/types.py              # chat, response, tool-call models
  image/types.py             # image prompt/data URL helpers
  audio/speech_to_text.py    # Whisper-backed transcription helper
  audio/text_to_speech.py    # TTS request/result types and Piper engine
  providers/base.py          # provider protocols
  providers/openai_compatible.py
  providers/llama_cpp.py

Provider-specific code belongs under omnimodalkit/providers/. Modality packages should contain request/response shaping and modality helpers, not provider-specific adapters.
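
To illustrate that split, a provider contract can be thought of roughly as the sketch below. These names are hypothetical, not the actual contents of providers/base.py:

from typing import Protocol

class ChatProvider(Protocol):
    # Hypothetical method names, for illustration only.
    def chat(self, messages: list[dict], **options) -> dict: ...
    def embed(self, inputs: list[str]) -> list[list[float]]: ...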

Tests

uv run pytest -q
