# OmniModalKit

OmniModalKit is a lightweight Python toolkit for using text, image, and audio
capabilities behind a consistent OpenAI-style interface.

The current implementation focuses on OpenAI-compatible local providers,
especially llama.cpp, while keeping modality code separate from provider
transport code.
## Install

This project uses uv.

```shell
uv sync
```

Optional audio extras:

```shell
uv sync --extra audio
uv sync --extra tts-piper
uv sync --extra audio --extra tts-piper
```
- `audio` installs `openai-whisper` for speech-to-text.
- `tts-piper` installs `piper-tts` for local text-to-speech.

Whisper also needs `ffmpeg` available on `PATH`.
## Basic Text Usage

```python
from omnimodalkit import OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    max_tokens=64,
)
print(response.text)
```

Convenience:

```python
text = client.generate("Say hi", max_tokens=64)
```
## Conversation Memory

Memory is explicit and caller-owned. Pass a `ConversationMemory` object to keep
chat history across requests. New request messages and the first assistant
response are appended after each call.

```python
from omnimodalkit import ConversationMemory, OmniModalKit

client = OmniModalKit(base_url="http://127.0.0.1:8080")

memory = ConversationMemory.from_messages(
    [{"role": "system", "content": "Be brief."}],
    max_messages=12,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "My name is Ada."}],
    memory=memory,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is my name?"}],
    memory=memory,
)
print(response.text)
```

Convenience methods can use the same memory:

```python
text = client.generate("What is my name?", memory=memory, max_tokens=64)
```
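The caller-owned pattern with a message cap can be approximated with a plain list. A rough sketch of the `max_messages` behavior (illustrative only, not the library's implementation — a real implementation may pin the system message instead of dropping it):

```python
class SketchMemory:
    """Toy caller-owned memory with a hard cap on stored messages."""

    def __init__(self, messages, max_messages=12):
        self.max_messages = max_messages
        self.messages = list(messages)

    def append(self, message):
        self.messages.append(message)
        # Drop the oldest entries once the cap is exceeded.
        if len(self.messages) > self.max_messages:
            del self.messages[: len(self.messages) - self.max_messages]


memory = SketchMemory(
    [{"role": "system", "content": "Be brief."}], max_messages=3
)
for i in range(4):
    memory.append({"role": "user", "content": f"msg {i}"})
print(len(memory.messages))  # 3
```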
## Streaming

Chat completions can stream OpenAI-compatible server-sent events:

```python
chunks = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
)
for chunk in chunks:
    print(chunk.text, end="")
```
## Embeddings And Models

OpenAI-compatible embeddings and model listing are exposed through lightweight
namespaces:

```python
vectors = client.embeddings.create(input=["hello", "world"]).embeddings
models = client.models.list()
```

Convenience:

```python
vectors = client.embed("hello")
```
## Structured Output

Structured output helpers parse JSON objects from model text. Dataclass targets
are supported without adding a required validation dependency.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    name: str
    count: int

answer = client.generate_structured(
    "Return JSON with name and count.",
    target=Answer,
)
```
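Dataclass parsing of this kind reduces to `json.loads` plus field matching. A minimal sketch of the idea, using only the standard library (illustrative — not the library's actual parsing code):

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Answer:
    name: str
    count: int


def parse_structured(text: str, target):
    """Extract the first JSON object in model text into a dataclass."""
    # Models often wrap JSON in prose; grab the outermost {...} span.
    start, end = text.index("{"), text.rindex("}") + 1
    data = json.loads(text[start:end])
    # Keep only the keys the dataclass declares.
    names = {f.name for f in fields(target)}
    return target(**{k: v for k, v in data.items() if k in names})


answer = parse_structured('Sure: {"name": "widget", "count": 3}', Answer)
print(answer)  # Answer(name='widget', count=3)
```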
## Image Prompting

Image prompting uses OpenAI-compatible chat content parts. Local images are
converted to data URLs. For llama.cpp, image input is converted to PNG because
some server builds reject WebP data URLs.

```python
response = client.image.create(
    prompt="Briefly describe this image.",
    image="dog.webp",
    mime_type="image/webp",
)
print(response.text)
```

You can also convert an image path yourself:

```python
from omnimodalkit.image import image_path_to_data_url

data_url = image_path_to_data_url(
    "dog.webp",
    convert_to_mime_type="image/png",
)
```
## Speech-To-Text

Speech-to-text uses Whisper as an optional dependency.

```python
text = client.audio.transcriptions.create(
    file="audio.m4a",
    model="base",
    language="en",
)
```

Convenience:

```python
text = client.transcribe_audio("audio.m4a", model="base", language="en")
```

Return Whisper metadata when needed:

```python
result = client.transcribe_audio(
    "audio.m4a",
    model="base",
    language="en",
    return_metadata=True,
)
print(result.text)
print(result.segments)
```

The default Whisper model is `base`, a practical balance between resource use
and accuracy. Users can choose `tiny`, `base`, `small`, `medium`, or `large`.
## Text-To-Speech

Text-to-speech is engine-based. Piper is currently supported as an optional
local engine.

```python
from omnimodalkit.audio import PiperTextToSpeechEngine

engine = PiperTextToSpeechEngine(
    model_path="en_US-amy-medium.onnx",
)

speech = client.audio.speech.create(
    text="Hello from OmniModalKit.",
    engine=engine,
)
audio_bytes = speech.audio
```

To save the audio:

```python
speech.write_to("speech.wav")
```

Or pass `output_path`:

```python
speech = client.synthesize_speech(
    "Hello from OmniModalKit.",
    engine=engine,
    output_path="speech.wav",
)
```
## LLM Response With Voice

`get_response_with_voice` queries the LLM first, then sends the text response
to the configured text-to-speech engine.

```python
response = client.get_response_with_voice(
    "Say one short sentence about multimodal tools.",
    engine=engine,
    output_path="response.wav",
)
print(response.text)
audio_bytes = response.speech.audio
```
## Tools

Tools are explicit. OmniModalKit does not scan or execute files automatically.
The host application registers approved functions and passes their schemas to
the model.

```python
from omnimodalkit import ToolRegistry

tools = ToolRegistry()
tools.register_from_path(
    path="tools/web.py",
    function_name="search",
    name="web_search",
    description="Search the web.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools.schemas(),
    tool_choice="auto",
)

for call in response.tool_calls:
    result = tools.run(call)
    tool_message = tools.tool_result_message(call, result)
```

Append `tool_message` to the conversation and send another chat request to let
the model produce the final answer.
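The follow-up round uses standard OpenAI message ordering: the assistant turn that requested the tool calls, then one tool message per call echoing its id. A sketch of assembling that conversation with plain dicts (the literal ids and payloads here are made up for illustration):

```python
# OpenAI-style message list for the second round.
messages = [{"role": "user", "content": "Search for Python news"}]

# 1. The assistant turn that requested the tool call.
messages.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "web_search",
            "arguments": '{"query": "Python news"}',
        },
    }],
})

# 2. One tool message per call, echoing the call id.
messages.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": '{"results": ["..."]}',
})

# 3. Sending this list in a second chat request yields the final answer.
roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant', 'tool']
```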
A one-step helper is available for applications that want OmniModalKit to
handle the OpenAI message ordering while still using only explicitly registered
tools:

```python
result = client.run_tools_once(
    messages=[{"role": "user", "content": "Search for Python news"}],
    tools=tools,
)
print(result.text)
```
## Async Usage

The async facade wraps the same provider behavior for applications that need
await-friendly calls:

```python
response = await client.async_client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hi"}],
)
```
## Provider Capabilities

Provider adapters expose basic capability metadata:

```python
capabilities = client.capabilities()
print(capabilities.streaming, capabilities.embeddings)
```
## Architecture

Current layout:

```text
omnimodalkit/
  client.py                 # public facade
  memory.py                 # explicit in-memory conversation history
  embeddings.py             # embedding request/response models
  models.py                 # model list response models
  structured.py             # JSON/dataclass output parsing
  capabilities.py           # provider feature flags
  types.py                  # shared errors
  tools.py                  # explicit tool registry
  text/types.py             # chat, response, tool-call models
  image/types.py            # image prompt/data URL helpers
  audio/speech_to_text.py   # Whisper-backed transcription helper
  audio/text_to_speech.py   # TTS request/result types and Piper engine
  providers/base.py         # provider protocols
  providers/openai_compatible.py
  providers/llama_cpp.py
```

Provider-specific code belongs under `omnimodalkit/providers/`. Modality
packages should contain request/response shaping and modality helpers, not
provider-specific adapters.
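A provider boundary like this is naturally expressed as a structural protocol, so adapters need no common base class. An illustrative sketch (method names here are assumptions, not the actual contents of `providers/base.py`):

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class ChatProvider(Protocol):
    """Structural interface a provider adapter might satisfy (illustrative)."""

    def chat(self, messages: list, **options) -> str: ...
    def stream_chat(self, messages: list, **options) -> Iterable[str]: ...


class EchoProvider:
    """Toy adapter showing that conformance is structural, not inherited."""

    def chat(self, messages, **options):
        return messages[-1]["content"]

    def stream_chat(self, messages, **options):
        yield messages[-1]["content"]


provider = EchoProvider()
print(isinstance(provider, ChatProvider))  # True
```

Because the protocol is `runtime_checkable`, `isinstance` verifies only that the required methods exist, which keeps modality code decoupled from any concrete transport.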
## Tests

```shell
uv run pytest -q
```