Unified AI model serving framework with API streaming support
isA_Model - AI Model Serving & Training Platform
Operators: see docs/PRODUCTION_READINESS.md for the component-by-component status matrix (what's actually deployed vs. Helm-only vs. planned).
A comprehensive Python platform for AI model serving, training, and optimization. It provides a unified interface to multiple AI providers, intelligent model selection, LLM caching, multi-modal capabilities, and Lightning-based training workflows.
Current Version: 0.6.0
Table of Contents
- Architecture Overview
- Core Components
- Installation
- Quick Start
- AI Model Serving
- Lightning Training
- Multi-Modal Services
- Examples
- Documentation
- Development
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                       isA_Model Platform                        │
│                                                                 │
│   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────┐   │
│   │ Model Serving   │   │ Lightning       │   │ Core        │   │
│   │                 │   │ Training        │   │ Services    │   │
│   │ • Multi-Provider│   │                 │   │ • Config    │   │
│   │ • LLM Caching   │   │ • APO/GRPO      │   │ • Discovery │   │
│   │ • Tool Calling  │   │ • Closed-Loop   │   │ • Logging   │   │
│   │ • Multi-Modal   │   │ • Custom        │   │ • Events    │   │
│   └─────────────────┘   └─────────────────┘   └─────────────┘   │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. AI Model Serving (isa_model/inference/)
- Multi-Provider Support: OpenAI, Replicate, Ollama, Cerebras, OpenRouter
- Intelligent Caching: Production-grade LLM caching with Redis backend
- Tool Calling: OpenAI-compatible function calling interface
- Multi-Modal: Text, Vision, Audio, Video, Embeddings
- Streaming Support: Real-time streaming for all providers
2. Lightning Training (isa_model/training/lightning/)
- Algorithm Framework: APO, GRPO, Closed-Loop, Custom algorithms
- Data Pipeline: Automated trace collection and conversion
- Job Management: RESTful API for training lifecycle
- Event-Driven: NATS-based coordination and monitoring
- Storage Abstraction: Memory and PostgreSQL backends
3. Core Services (isa_model/core/)
- Configuration: Environment-based config management
- Discovery: Consul-based service registration
- Logging: Structured logging with Loki integration
- Pricing: Cost tracking and optimization
- Database: PostgreSQL gRPC client abstraction
4. Deployment (isa_model/deployment/)
- Kubernetes: Production-ready K8s manifests
- Docker: Multi-stage Dockerfiles for all components
- Modal: Serverless deployment support
- Triton: NVIDIA Triton Inference Server integration
Installation
Basic Installation
pip install isa_model
Installation with Optional Dependencies
# Cloud API providers (OpenAI, Replicate, Cerebras, Modal)
pip install isa_model[cloud]
# Local inference (PyTorch + transformers)
pip install isa_model[local]
# Audio processing
pip install isa_model[audio]
# Vision processing
pip install isa_model[vision]
# LangChain integration
pip install isa_model[langchain]
# Monitoring (MLflow, Prometheus, Redis)
pip install isa_model[monitoring]
# Full installation (all features)
pip install isa_model[all]
# Optimized for staging/production
pip install isa_model[staging]
Quick Start
Using the Async Client (Recommended)
The AsyncISAModel client provides an OpenAI-compatible interface:
from isa_model.inference_client import AsyncISAModel
import asyncio

async def main():
    async with AsyncISAModel(base_url="http://localhost:8082") as client:
        # Simple chat
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello!"}]
        )
        print(response.choices[0].message.content)

        # Streaming chat
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Count to 5"}],
            stream=True
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())
Using AIFactory (Direct Service Access)
For more control, use the AIFactory to get service instances:
from isa_model.inference.ai_factory import AIFactory
factory = AIFactory.get_instance()
# Use OpenAI with API key
llm = factory.get_llm(
model_name="gpt-4o-mini",
provider="openai",
api_key="your-openai-api-key-here"
)
# Use local Ollama model (no API key needed)
llm = factory.get_llm(model_name="llama3.1", provider="ollama")
Core Features
Multi-Modal AI Services
- LLM (Text Generation): OpenAI (GPT-4, GPT-4o-mini), Ollama (Llama, Qwen), Cerebras, OpenRouter (DeepSeek-R1)
- Vision: Image analysis (GPT-4o, ISA OmniParser), Image generation (DALL-E, Flux, Nano-Banana)
- Audio: Speech-to-Text (Whisper, GPT-4o-transcribe), Text-to-Speech (OpenAI TTS, Replicate)
- Video: Text-to-Video (ByteDance Seedance-1-Pro)
- Embeddings: Text embeddings (OpenAI, Ollama), Document reranking (Jina Reranker v2)
Intelligent Features
- Smart Model Selection: Automatically choose the best model based on task and input
- LLM Caching: Two-layer cache (streaming + non-streaming) with 50-100x speedup
- Tool Calling: Function calling with OpenAI-compatible interface
- Streaming Support: Real-time streaming for all text generation
- Format Negotiation: Supports OpenAI dict, LangChain message formats
Enterprise Features
- Cost Tracking: Automatic cost calculation and tracking
- Graceful Degradation: Cache failures don't break requests
- Feature Flags: Environment-based feature control
- Monitoring: Redis-backed metrics, hit rate tracking
- Multi-Provider: Easy provider switching without code changes
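The graceful-degradation behavior listed above can be pictured with a small wrapper. This is a minimal sketch, not the package's implementation: `FlakyCache` and `cached_call` are hypothetical names, and the only behavior taken from the feature list is that a cache failure must not fail the request.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("cache_sketch")


class FlakyCache:
    """Hypothetical cache stand-in whose backend is always unreachable."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache backend unreachable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache backend unreachable")


def cached_call(cache, key: str, compute: Callable[[], str]) -> str:
    """Return a cached value when possible; never let cache errors fail the call."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception as exc:  # read failed: fall through to the model call
        logger.warning("cache read failed, passing through: %s", exc)

    result = compute()

    try:
        cache.set(key, result)
    except Exception as exc:  # write failed: the caller still gets the result
        logger.warning("cache write failed, ignoring: %s", exc)
    return result


# The broken cache is skipped and the compute function supplies the answer.
answer = cached_call(FlakyCache(), "prompt-hash", lambda: "model response")
print(answer)
```

Catching broad exceptions is deliberate here: the cache is treated as an optimization, so any failure in it is logged and ignored.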
API Client Usage
Comprehensive Example
See docs/guidance/examples/model_client_examples_async.py for complete examples covering:
import asyncio
from isa_model.inference_client import AsyncISAModel

async def main():
    async with AsyncISAModel(base_url="http://localhost:8082") as client:
        # 1. Simple chat
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello!"}]
        )

        # 2. Streaming chat
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Tell a story"}],
            stream=True
        )
        async for chunk in stream:
            # Some chunks carry no text; skip them instead of printing "None"
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")

        # 3. JSON mode (structured output)
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Generate a person profile"}],
            response_format={"type": "json_object"}
        )

        # 4. Function calling
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "What's the weather?"}],
            tools=[{
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get weather for a location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"}
                        }
                    }
                }
            }]
        )

        # 5. Vision analysis
        vision = await client.vision.completions.create(
            image="https://example.com/image.jpg",
            prompt="Describe this image",
            model="gpt-4o-mini",
            provider="openai"
        )

        # 6. Image generation
        image = await client.images.generate(
            prompt="A beautiful sunset over mountains",
            model="dall-e-3",
            size="1024x1024",
            provider="openai"
        )

        # 7. Embeddings
        embedding = await client.embeddings.create(
            input="This is a test sentence",
            model="text-embedding-3-small"
        )

        # 8. Speech-to-Text
        transcription = await client.audio.transcriptions.create(
            file="audio.wav",
            model="gpt-4o-mini-transcribe"
        )

asyncio.run(main())
Client Test Results
Async Client: 11/11 examples passed (100% success rate)
Sync Client: 5/9 attempted (streaming and TTS have limitations)
Recommendation: Always use AsyncISAModel for production workloads.
LLM Caching
NEW in v0.5.7: Production-grade LLM inference caching with Phase 2 implementation complete.
Features
- Streaming Cache with Replay: 15ms per chunk delay for natural streaming feel
- Non-Streaming Cache: Instant responses (~5ms vs 500ms)
- Temperature-Based TTL: Smart expiration (temp=0 → 24h, temp=0.3 → 1h, temp=0.7 → 5min)
- Graceful Degradation: Cache failure = automatic pass-through to LLM
- Real-time Monitoring: Hit rate, replay stats, time saved tracking
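The temperature-based TTL rule above can be read as a simple lookup. The sketch below assumes the three listed temperatures are upper bounds of bands and that anything hotter is not cached; both are assumptions for illustration, and `ttl_seconds` is a hypothetical name rather than a function from the package.

```python
def ttl_seconds(temperature: float) -> int:
    """Map a sampling temperature to a cache TTL in seconds.

    Only the three points (0 -> 24h, 0.3 -> 1h, 0.7 -> 5min) come from the
    feature list. Treating them as band limits, and returning 0 above 0.7,
    are assumptions of this sketch.
    """
    if temperature <= 0.0:
        return 24 * 60 * 60  # deterministic output: keep for a day
    if temperature <= 0.3:
        return 60 * 60       # mostly stable output: keep for an hour
    if temperature <= 0.7:
        return 5 * 60        # varied output: keep for five minutes
    return 0                 # assumption: 0 means "do not cache"


for temp in (0.0, 0.3, 0.7, 1.0):
    print(temp, ttl_seconds(temp))
```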
Quick Setup
# Enable cache
export ENABLE_LLM_CACHE=true
export REDIS_HOST=localhost
export REDIS_PORT=50055
# Start service
python -m isa_model.serving.api.main
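To show how the variables above might be consumed, here is a minimal sketch of environment-based settings parsing. `CacheSettings` and `load_cache_settings` are hypothetical names, and the fallback values and accepted truthy spellings are assumptions; the package's real configuration loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CacheSettings:
    enabled: bool
    redis_host: str
    redis_port: int


def load_cache_settings(env: Optional[Mapping[str, str]] = None) -> CacheSettings:
    """Read the cache flags from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    flag = env.get("ENABLE_LLM_CACHE", "false").strip().lower()
    return CacheSettings(
        enabled=flag in {"1", "true", "yes"},          # assumed truthy spellings
        redis_host=env.get("REDIS_HOST", "localhost"),  # assumed fallback
        redis_port=int(env.get("REDIS_PORT", "6379")),  # assumed fallback
    )


settings = load_cache_settings({
    "ENABLE_LLM_CACHE": "true",
    "REDIS_HOST": "localhost",
    "REDIS_PORT": "50055",
})
print(settings)
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.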
Performance Gains
| Scenario | First Request | Cached Request | Speedup | Cost Saving |
|---|---|---|---|---|
| Non-streaming chat | 500ms | 5ms | 100x | 100% |
| Streaming chat | 3000ms | 2500ms | 1.2x | 100% |
| Code generation | 2000ms | 8ms | 250x | 100% |
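
The cached streaming row is much slower than the non-streaming one, presumably because cached chunks are replayed with the 15ms per-chunk delay listed under Features instead of being returned at once. A minimal sketch of such a replay loop follows; `replay_chunks` and `collect` are hypothetical names, not the package's API.

```python
import asyncio
from typing import AsyncIterator, List


async def replay_chunks(chunks: List[str], delay_s: float = 0.015) -> AsyncIterator[str]:
    """Yield cached chunks one by one, pausing between them to mimic live streaming."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)
        yield chunk


async def collect(chunks: List[str], delay_s: float = 0.015) -> List[str]:
    """Consume the replay and return the chunks in arrival order."""
    return [c async for c in replay_chunks(chunks, delay_s)]


replayed = asyncio.run(collect(["Hel", "lo", "!"]))
print("".join(replayed))
```

With this model, total replay time grows linearly with the number of cached chunks, which is why a cached stream saves cost far more than it saves latency.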
Expected Savings (40% hit rate, 1000 req/day):
- Daily: $0.40
- Monthly: $12
- Annual: $144
For high-traffic systems (100K req/day): $1,200/month savings
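The savings figures above follow from one multiplication. The sketch below reproduces them, assuming an average cost of $0.001 per uncached request (the value implied by $0.40/day at 1,000 requests and a 40% hit rate) and 30-day months; `daily_savings` is a hypothetical helper.

```python
COST_PER_REQUEST_USD = 0.001  # assumption implied by the figures above


def daily_savings(requests_per_day: int, hit_rate: float,
                  cost_per_request: float = COST_PER_REQUEST_USD) -> float:
    """Cost avoided per day: every cache hit skips one paid model call."""
    return requests_per_day * hit_rate * cost_per_request


small = daily_savings(1_000, 0.40)
large = daily_savings(100_000, 0.40)

print(f"1K req/day:   ${small:.2f}/day, ${small * 30:.0f}/month, ${small * 360:.0f}/year")
print(f"100K req/day: ${large * 30:,.0f}/month")
```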
Cache Management
# Get cache statistics
curl http://localhost:8082/api/v1/cache/stats
# Invalidate model cache (when model updates)
curl -X POST http://localhost:8082/api/v1/cache/invalidate/openai/gpt-4o-mini
# Clear all cache
curl -X POST http://localhost:8082/api/v1/cache/clear
See docs/CACHE_QUICKSTART.md for complete documentation.
DeepSeek-R1 Reasoning Model
NEW in v0.5.7: Support for DeepSeek-R1, a powerful reasoning model that shows its thought process.
Features
- Visible Reasoning: See the model's thinking with show_reasoning=True
- Streaming Tool Calling: Call tools while streaming reasoning process
- Token Tracking: Separate tracking for reasoning tokens vs completion tokens
- Cost Optimization: Reasoning tokens charged at input token rate ($0.55/1M)
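As a worked example of the pricing point above, the cost of the reasoning portion of a response can be estimated directly from the token count. This is a sketch using only the $0.55 per million rate stated here; `reasoning_cost_usd` is a hypothetical helper, and completion tokens are left out because their rate is not given.

```python
REASONING_RATE_PER_MILLION_USD = 0.55  # rate stated in the feature list


def reasoning_cost_usd(reasoning_tokens: int) -> float:
    """Estimated cost of reasoning tokens billed at the input token rate."""
    return reasoning_tokens * REASONING_RATE_PER_MILLION_USD / 1_000_000


# A response that spent 2,000 tokens on reasoning
print(f"${reasoning_cost_usd(2_000):.4f}")
```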
Basic Usage
import asyncio
from isa_model.inference.ai_factory import AIFactory

async def main():
    factory = AIFactory()
    llm = factory.get_llm(provider="openrouter", model_name="deepseek-r1")

    # Without reasoning (only final answer)
    response = await llm.ainvoke("If 2x + 5 = 11, what is x?", show_reasoning=False)

    # With reasoning (see thought process)
    response = await llm.ainvoke("If 2x + 5 = 11, what is x?", show_reasoning=True)
    # Output includes [思考: ...] tags ("思考" means "thinking") showing reasoning steps

    # Get token usage
    usage = llm.get_last_token_usage()
    print(f"Reasoning tokens: {usage['reasoning_tokens']}")
    print(f"Completion tokens: {usage['completion_tokens']}")

asyncio.run(main())
Streaming with Reasoning
# Run inside an async function, with llm created as in Basic Usage
async for chunk in llm.astream("Calculate 15 × 23", show_reasoning=True):
    if chunk.startswith('[思考:') and chunk.endswith(']'):
        # Reasoning tokens (gray text); drop the 4-character '[思考:' prefix and the closing ']'
        reasoning = chunk[4:-1]
        print(f"\033[90m{reasoning}\033[0m", end="", flush=True)
    else:
        # Normal content
        print(chunk, end="", flush=True)
See docs/guidance/examples/deepseek_r1_reasoning_example.py and docs/guidance/deepseek-r1.md for complete examples.
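The tag check in the streaming loop can be factored into a small helper that is easy to test without a model. This is a sketch that assumes, as the loop above does, that each reasoning chunk arrives as one complete `[思考: ...]` unit; `split_chunk` is a hypothetical name.

```python
from typing import Tuple

REASONING_PREFIX = "[思考:"  # "思考" means "thinking"


def split_chunk(chunk: str) -> Tuple[str, str]:
    """Classify a streamed chunk as ("reasoning", text) or ("content", text)."""
    if chunk.startswith(REASONING_PREFIX) and chunk.endswith("]"):
        # Strip the prefix and the closing bracket, then surrounding spaces
        return "reasoning", chunk[len(REASONING_PREFIX):-1].strip()
    return "content", chunk


for piece in ["[思考: subtract 5 from both sides]", "x = 3"]:
    kind, text = split_chunk(piece)
    print(f"{kind}: {text}")
```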
Multi-Modal Services
Speech-to-Text (4 Models)
# Basic transcription (fastest, cheapest)
transcription = await client.audio.transcriptions.create(
file="audio.wav",
model="gpt-4o-mini-transcribe" # NEW default model
)
# High quality transcription
transcription = await client.audio.transcriptions.create(
file="audio.wav",
model="gpt-4o-transcribe" # Highest quality
)
# With speaker diarization
transcription = await client.audio.transcriptions.create(
file="audio.wav",
model="gpt-4o-transcribe-diarize",
enable_diarization=True,
response_format="diarized_json"
)
# Returns: segments with speaker labels, timestamps
# Legacy Whisper model
transcription = await client.audio.transcriptions.create(
file="audio.wav",
model="whisper-1" # Legacy
)
Video Generation
# Text-to-Video with ByteDance Seedance-1-Pro
response = await client._underlying_client.invoke(
input_data="The sun rises slowly between tall buildings...",
task="generate",
service_type="video_generation",
provider="replicate",
model="seedance-1-pro",
duration=5,
fps=24,
resolution="1080p",
aspect_ratio="16:9"
)
Multi-Image Input
# Google Nano-Banana (Multi-Image Style Transfer)
response = await client._underlying_client.invoke(
input_data="Make the sheets in the style of the logo",
task="img2img",
service_type="image_generation",
provider="replicate",
model="nano-banana",
init_image=[
"https://example.com/image1.png",
"https://example.com/image2.png"
],
aspect_ratio="match_input_image"
)
ISA Proprietary Services
# ISA OmniParser - UI Detection
vision = await client.vision.completions.create(
image="https://example.com/ui-screenshot.jpg",
prompt="Detect UI elements",
model="isa-omniparser-ui-detection",
provider="isa"
)
# ISA Jina Reranker v2 - Document Reranking
response = await client._underlying_client.invoke(
input_data="What is machine learning?",
task="rerank",
service_type="embedding",
provider="isa",
model="isa-jina-reranker-v2-service",
documents=[
"Machine learning is a subset of AI...",
"Python is a programming language...",
"Neural networks are computational models..."
]
)
Tool Calling
OpenAI-Compatible Function Calling
import asyncio
import json
from isa_model.inference_client import AsyncISAModel

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}

async def main():
    async with AsyncISAModel() as client:
        # Request with tool
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
            tools=[WEATHER_TOOL]
        )

        # Check if tool was called
        if response.choices[0].message.tool_calls:
            tool_call = response.choices[0].message.tool_calls[0]
            print(f"Tool: {tool_call.function.name}")
            print(f"Args: {tool_call.function.arguments}")

            # Execute tool (get_weather is your own implementation)
            args = json.loads(tool_call.function.arguments)
            result = get_weather(**args)

            # Continue conversation with tool result
            messages = [
                {"role": "user", "content": "What's the weather in Tokyo?"},
                {
                    "role": "assistant",
                    "tool_calls": [{
                        "id": tool_call.id,
                        "type": "function",
                        "function": {
                            "name": tool_call.function.name,
                            "arguments": tool_call.function.arguments
                        }
                    }]
                },
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result)
                }
            ]
            final = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages
            )
            print(final.choices[0].message.content)

asyncio.run(main())
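The "execute tool" step is left to the caller. One way to handle it, sketched here under the assumption that tools are plain Python functions, is a name-to-function registry. The `get_weather` stub returns canned data, and `TOOLS` and `run_tool` are hypothetical names, not part of the client.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> Dict[str, Any]:
    """Placeholder tool: returns canned data instead of calling a weather API."""
    return {"location": location, "forecast": "unknown"}


# Registry mapping the tool names advertised to the model onto callables
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool(name: str, arguments_json: str) -> str:
    """Look up a tool, call it with decoded JSON arguments, return a JSON string."""
    func = TOOLS.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the function signature
        return json.dumps({"error": str(exc)})


print(run_tool("get_weather", '{"location": "Tokyo"}'))
```

Returning errors as JSON rather than raising lets the result be sent back as the `tool` message content, so the model can see what went wrong.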
Streaming Tool Calling (DeepSeek-R1)
# Tool calls appear at the end of stream in delta.tool_calls
stream = await client.chat.completions.create(
model="deepseek-r1",
provider="openrouter",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=[WEATHER_TOOL],
stream=True,
show_reasoning=True
)
tool_calls = []
async for chunk in stream:
delta = chunk.choices[0].delta
# Collect reasoning and content
if delta.content:
print(delta.content, end="", flush=True)
# Collect tool calls (appear at end)
if delta.tool_calls:
tool_calls.extend(delta.tool_calls)
# Execute tools after stream completes
for tc in tool_calls:
args = json.loads(tc.function.arguments)
result = execute_tool(**args)
See docs/guidance/examples/tool_call_streaming_example.py for complete agent loop implementation.
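The loop above assumes each tool call arrives complete. Some OpenAI-style streams instead split a call's arguments across several deltas that share an `index`; whether this service ever does so is not stated here. Under that assumption, a merge step could look like the sketch below, which uses plain dicts as stand-ins for delta objects. `merge_tool_call_deltas` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List


def merge_tool_call_deltas(deltas: Iterable[dict]) -> List[dict]:
    """Merge fragments that share an index into complete tool calls."""
    calls: Dict[int, dict] = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"id": None, "name": "", "arguments": ""})
        if delta.get("id"):
            call["id"] = delta["id"]
        function = delta.get("function") or {}
        # Name and argument text are appended in arrival order
        call["name"] += function.get("name") or ""
        call["arguments"] += function.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


fragments = [
    {"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"location": '}},
    {"index": 0, "function": {"arguments": '"Tokyo"}'}},
]
merged = merge_tool_call_deltas(fragments)
print(merged[0]["name"], json.loads(merged[0]["arguments"]))
```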
Examples
All runnable examples are in docs/guidance/examples/:
- model_client_examples_async.py: Comprehensive async client examples (11/11 passed)
  - Simple chat, streaming, multiple providers
  - JSON mode, function calling
  - Vision, embeddings, image generation
  - Format negotiation, error handling
  - Speech-to-Text, ISA services
- model_client_examples_sync.py: Sync client (basic usage only, has limitations)
- deepseek_r1_reasoning_example.py: DeepSeek-R1 reasoning examples
  - Basic math, complex problems
  - Streaming with reasoning
  - Code generation, multi-turn chat
- tool_call_streaming_example.py: Tool calling examples
  - Basic streaming tool calls
  - Complete agent loop
  - DeepSeek-R1 reasoning + tools
- nano_banana_example.py: Multi-image style transfer
- seedance_video_example.py: Text-to-video generation
See docs/guidance/examples/README.md for detailed documentation.
Documentation
Comprehensive documentation is available in the docs/ directory:
docs/
├── overview/ → Project vision, goals, architecture
├── research/ → Research findings and exploration
├── domain/ → Domain concepts and knowledge models
├── prd/ → Product requirements documents
├── design/ → Technical design specifications
└── guidance/ → Developer guides and tutorials
    └── examples/ → Runnable Python scripts
Getting Started
- Quick Start: Get started in 5 minutes
- LLM Services: Text generation and chat
- Tool Calling: Function calling guide
- Providers: Configure model providers
- Caching: Cache optimization
- DeepSeek R1: Reasoning model with tool calls
Project Documentation
- Project Overview: Vision, goals, architecture
- Product Requirements: Feature specifications
- Technical Design: System design documents
Development
Installing for Development
git clone <repository-url>
cd isA_Model
# Install with all dependencies
pip install -e ".[all]"
# Or install with specific extras
pip install -e ".[cloud,langchain,dev]"
Environment Setup
For local development, copy the example deployment env file into a gitignored local override:
cp deployment/environments/dev.env.example deployment/environments/dev.env
# or create deployment/environments/dev.local.env instead
Then fill in your local secrets, for example:
OPENAI_API_KEY=your-openai-key
REPLICATE_API_TOKEN=your-replicate-token
INTERNAL_SERVICE_SECRET=your-local-internal-secret
Running the Server
# Start the FastAPI server
python -m isa_model.serving.api.main
# Or with uvicorn
uvicorn isa_model.serving.api.fastapi_server:app --host 0.0.0.0 --port 8082
Running Tests
# Run async client examples (recommended)
python docs/guidance/examples/model_client_examples_async.py
# Run specific tests
python tests/test_stt_models.py
# Run cache tests
bash tests/cache_test.sh
Building and Publishing
# Update version in pyproject.toml
# Current version: 0.6.0
# Build the package
python -m build
# Upload to PyPI
python -m twine upload dist/isa_model-0.6.0* --username __token__ --password "$PYPI_API_TOKEN"
What's New in v0.5.7
LLM Caching (Phase 2 Complete)
- Streaming Cache + Replay: Natural streaming feel with 15ms/chunk delay
- Non-Streaming Cache: 100x speedup for deterministic queries
- Temperature-Based TTL: Smart caching based on output randomness
- Real-time Monitoring: Hit rate tracking, time saved statistics
- Production Ready: Feature flags, graceful degradation, zero-impact deployment
DeepSeek-R1 Support
- Visible Reasoning: See model's thought process with show_reasoning=True
- Streaming Tool Calling: Function calling with reasoning visibility
- Token Tracking: Separate reasoning token counting and cost tracking
- Agent Loop Support: Complete multi-turn conversation with tools
Enhanced Multi-Modal
- Speech-to-Text: 4 models (Whisper, gpt-4o-mini-transcribe, gpt-4o-transcribe, gpt-4o-transcribe-diarize)
- Video Generation: ByteDance Seedance-1-Pro text-to-video
- Multi-Image Input: Google Nano-Banana style transfer
- ISA Services: OmniParser UI detection, Jina Reranker v2
Client Improvements
- 100% Pass Rate: AsyncISAModel client (11/11 examples)
- Format Negotiation: OpenAI dict + LangChain message support
- Better Error Handling: Informative error messages and graceful failures
- Resource Cleanup: Proper context manager support
Infrastructure
- Consul Integration: Service discovery and dynamic routing
- Redis Caching: Production-grade caching backend
- Monitoring: Comprehensive metrics and logging
- Feature Flags: Environment-based feature control
Supported Providers
| Provider | LLM | Vision | Audio | Image Gen | Video | Embeddings |
|---|---|---|---|---|---|---|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Replicate | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Ollama | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Cerebras | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| OpenRouter | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ISA | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
Note: OpenRouter provider includes DeepSeek-R1 reasoning model.
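The support matrix above can also be expressed as data, which makes it easy to ask which providers cover a given capability. This is an illustrative sketch: the sets mirror the table, while the lowercase keys and the `providers_for` helper are assumptions and not identifiers from the package.

```python
from typing import Dict, List, Set

# Mirrors the Supported Providers table (checked cells only)
CAPABILITIES: Dict[str, Set[str]] = {
    "openai": {"llm", "vision", "audio", "image_gen", "embeddings"},
    "replicate": {"llm", "vision", "audio", "image_gen", "video"},
    "ollama": {"llm", "vision", "embeddings"},
    "cerebras": {"llm"},
    "openrouter": {"llm"},
    "isa": {"vision", "embeddings"},
}


def providers_for(capability: str) -> List[str]:
    """Return the providers that support a capability, sorted by name."""
    return sorted(name for name, caps in CAPABILITIES.items() if capability in caps)


print("video:", providers_for("video"))
print("embeddings:", providers_for("embeddings"))
```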
Cost Optimization
LLM Caching Benefits
With 40% cache hit rate on 1,000 requests/day:
- Daily savings: $0.40
- Monthly savings: $12
- Annual savings: $144
For high-traffic production (100K req/day):
- Monthly savings: $1,200+
Model Selection Strategy
- Development/Testing: Use gpt-4o-mini or ollama (local, free)
- Production: Cache with temperature=0 for deterministic queries
- Creative Tasks: Use higher temperature, shorter TTL
- Code Generation: Cache aggressively (24h TTL for temp=0)
Architecture
isa_model/
├── client.py # Unified ISAModelClient
├── inference_client.py # OpenAI-compatible client
├── inference/
│ ├── ai_factory.py # Service factory
│ ├── services/ # Service implementations
│ │ ├── llm/ # LLM services
│ │ ├── vision/ # Vision services
│ │ ├── audio/ # Audio services (STT/TTS)
│ │ ├── img/ # Image generation
│ │ ├── video/ # Video generation
│ │ └── embedding/ # Embedding services
│ └── cache/ # LLM caching layer
├── serving/
│ └── api/ # FastAPI server
├── core/
│ ├── config/ # Configuration management
│ ├── models/ # Model registry
│ └── services/ # Core services
└── deployment/ # Kubernetes, Docker configs
Roadmap
Phase 3: Semantic Caching (Planned)
- Embedding-based similarity matching
- Cache hits even with different wording
- Target: 60-80% hit rate (vs 40% exact match)
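Since this phase is only planned, the following is a sketch of the general idea and not a description of the eventual design: a lookup returns a cached response when the query embedding is close enough to a stored one. The 0.9 threshold, the toy two-dimensional vectors, and the function names are all assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query_vec: Sequence[float],
                    entries: List[Tuple[Sequence[float], str]],
                    threshold: float = 0.9) -> Optional[str]:
    """Return the response of the most similar entry, if it clears the threshold."""
    best_score, best_response = 0.0, None
    for vec, response in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


entries = [([1.0, 0.0], "cached answer A"), ([0.0, 1.0], "cached answer B")]
print(semantic_lookup([0.99, 0.05], entries))  # close to the first entry
print(semantic_lookup([1.0, 1.0], entries))    # equally far from both entries
```

A linear scan is fine for a sketch; a real deployment would more likely rely on a vector index.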
Future Features
- Cache warming on model updates
- Distributed locking for multi-instance consistency
- Per-user cache namespaces
- A/B testing framework
- Advanced cost analytics
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
Support
- Documentation: See docs/ directory
- Examples: See docs/guidance/examples/ directory
- Issues: Open an issue on GitHub
- Discussions: GitHub Discussions
Acknowledgments
Built with:
- FastAPI for high-performance API serving
- Redis for production-grade caching
- OpenAI SDK compatibility layer
- LangChain integration support
- Comprehensive provider ecosystem
Ready to get started? Check out docs/guidance/examples/ for comprehensive usage examples!