Unified AI model serving framework with API streaming support

isA_Model - AI Model Serving & Training Platform

Operators: see docs/PRODUCTION_READINESS.md for the component-by-component status matrix (what's actually deployed vs Helm-only vs planned).

A comprehensive Python platform for AI model serving, training, and optimization. It provides a unified interface to multiple AI providers, intelligent model selection, LLM caching, multi-modal capabilities, and Lightning-based training workflows.

Current Version: 0.6.0

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     isA_Model Platform                      │
│                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │ Model Serving   │  │ Lightning       │  │ Core        │  │
│  │                 │  │ Training        │  │ Services    │  │
│  │ • Multi-Provider│  │                 │  │ • Config    │  │
│  │ • LLM Caching   │  │ • APO/GRPO      │  │ • Discovery │  │
│  │ • Tool Calling  │  │ • Closed-Loop   │  │ • Logging   │  │
│  │ • Multi-Modal   │  │ • Custom        │  │ • Events    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────────┘

Core Components

1. AI Model Serving (isa_model/inference/)

  • Multi-Provider Support: OpenAI, Replicate, Ollama, Cerebras, OpenRouter
  • Intelligent Caching: Production-grade LLM caching with Redis backend
  • Tool Calling: OpenAI-compatible function calling interface
  • Multi-Modal: Text, Vision, Audio, Video, Embeddings
  • Streaming Support: Real-time streaming for all providers

2. Lightning Training (isa_model/training/lightning/)

  • Algorithm Framework: APO, GRPO, Closed-Loop, Custom algorithms
  • Data Pipeline: Automated trace collection and conversion
  • Job Management: RESTful API for training lifecycle
  • Event-Driven: NATS-based coordination and monitoring
  • Storage Abstraction: Memory and PostgreSQL backends

3. Core Services (isa_model/core/)

  • Configuration: Environment-based config management
  • Discovery: Consul-based service registration
  • Logging: Structured logging with Loki integration
  • Pricing: Cost tracking and optimization
  • Database: PostgreSQL gRPC client abstraction

4. Deployment (isa_model/deployment/)

  • Kubernetes: Production-ready K8s manifests
  • Docker: Multi-stage Dockerfiles for all components
  • Modal: Serverless deployment support
  • Triton: NVIDIA Triton Inference Server integration

Installation

Basic Installation

pip install isa_model

Installation with Optional Dependencies

# Cloud API providers (OpenAI, Replicate, Cerebras, Modal)
pip install isa_model[cloud]

# Local inference (PyTorch + transformers)
pip install isa_model[local]

# Audio processing
pip install isa_model[audio]

# Vision processing
pip install isa_model[vision]

# LangChain integration
pip install isa_model[langchain]

# Monitoring (MLflow, Prometheus, Redis)
pip install isa_model[monitoring]

# Full installation (all features)
pip install isa_model[all]

# Optimized for staging/production
pip install isa_model[staging]

Quick Start

Using the Async Client (Recommended)

The AsyncISAModel client provides an OpenAI-compatible interface:

from isa_model.inference_client import AsyncISAModel
import asyncio

async def main():
    async with AsyncISAModel(base_url="http://localhost:8082") as client:
        # Simple chat
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello!"}]
        )
        print(response.choices[0].message.content)

        # Streaming chat
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Count to 5"}],
            stream=True
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())

Using AIFactory (Direct Service Access)

For more control, use the AIFactory to get service instances:

from isa_model.inference.ai_factory import AIFactory

factory = AIFactory.get_instance()

# Use OpenAI with API key
llm = factory.get_llm(
    model_name="gpt-4o-mini", 
    provider="openai", 
    api_key="your-openai-api-key-here"
)

# Use local Ollama model (no API key needed)
llm = factory.get_llm(model_name="llama3.1", provider="ollama")

Core Features

Multi-Modal AI Services

  • LLM (Text Generation): OpenAI (GPT-4, GPT-4o-mini), Ollama (Llama, Qwen), Cerebras, OpenRouter (DeepSeek-R1)
  • Vision: Image analysis (GPT-4o, ISA OmniParser), Image generation (DALL-E, Flux, Nano-Banana)
  • Audio: Speech-to-Text (Whisper, GPT-4o-transcribe), Text-to-Speech (OpenAI TTS, Replicate)
  • Video: Text-to-Video (ByteDance Seedance-1-Pro)
  • Embeddings: Text embeddings (OpenAI, Ollama), Document reranking (Jina Reranker v2)

Intelligent Features

  • Smart Model Selection: Automatically choose the best model based on task and input
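  • LLM Caching: Two-layer cache (streaming + non-streaming) with 50-100x speedup on non-streaming cache hits; streaming hits are replayed chunk by chunk, so the gain there is smaller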
  • Tool Calling: Function calling with OpenAI-compatible interface
  • Streaming Support: Real-time streaming for all text generation
  • Format Negotiation: Supports OpenAI dict, LangChain message formats

Enterprise Features

  • Cost Tracking: Automatic cost calculation and tracking
  • Graceful Degradation: Cache failures don't break requests
  • Feature Flags: Environment-based feature control
  • Monitoring: Redis-backed metrics, hit rate tracking
  • Multi-Provider: Easy provider switching without code changes
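
The graceful-degradation behavior listed above can be sketched in a few lines. This is only an illustration of the pattern, not the library's implementation: `FlakyCache` and `cached_call` are hypothetical names, and the real cache layer may handle errors differently.

```python
from typing import Callable, Optional


class FlakyCache:
    """Stand-in cache whose backend is unreachable (hypothetical)."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache backend unavailable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache backend unavailable")


def cached_call(cache, key: str, compute: Callable[[], str]) -> str:
    """Return a cached value if possible; on any cache error, call compute()."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception:
        pass  # cache read failed: fall through to the model call
    value = compute()
    try:
        cache.set(key, value)
    except Exception:
        pass  # cache write failed: the caller still gets a response
    return value


# The cache is down, yet the request still succeeds.
result = cached_call(FlakyCache(), "prompt-hash", lambda: "model answer")
print(result)
```

Swallowing cache errors keeps the request path independent of Redis availability; in practice the `except` branches would also log or count the failure.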

API Client Usage

Comprehensive Example

See docs/guidance/examples/model_client_examples_async.py for complete examples covering:

from isa_model.inference_client import AsyncISAModel

async with AsyncISAModel(base_url="http://localhost:8082") as client:
    # 1. Simple chat
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    
    # 2. Streaming chat
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Tell a story"}],
        stream=True
    )
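    async for chunk in stream:
        # Some chunks carry no text (delta.content is None), so skip them
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")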
    
    # 3. JSON mode (structured output)
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Generate a person profile"}],
        response_format={"type": "json_object"}
    )
    
    # 4. Function calling
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What's the weather?"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    }
                }
            }
        }]
    )
    
    # 5. Vision analysis
    vision = await client.vision.completions.create(
        image="https://example.com/image.jpg",
        prompt="Describe this image",
        model="gpt-4o-mini",
        provider="openai"
    )
    
    # 6. Image generation
    image = await client.images.generate(
        prompt="A beautiful sunset over mountains",
        model="dall-e-3",
        size="1024x1024",
        provider="openai"
    )
    
    # 7. Embeddings
    embedding = await client.embeddings.create(
        input="This is a test sentence",
        model="text-embedding-3-small"
    )
    
    # 8. Speech-to-Text
    transcription = await client.audio.transcriptions.create(
        file="audio.wav",
        model="gpt-4o-mini-transcribe"
    )

Client Test Results

Async Client: 11/11 examples passed (100% success rate)
Sync Client: 5/9 attempted (streaming and TTS have limitations)

Recommendation: Always use AsyncISAModel for production workloads.

LLM Caching

NEW in v0.5.7: Production-grade LLM inference caching with Phase 2 implementation complete.

Features

  • Streaming Cache with Replay: Cached chunks are replayed with a 15ms delay per chunk for a natural streaming feel
  • Non-Streaming Cache: Near-instant responses (~5ms cached vs ~500ms uncached)
  • Temperature-Based TTL: Smart expiration (temp=0 → 24h, temp=0.3 → 1h, temp=0.7 → 5min)
  • Graceful Degradation: Cache failure = automatic pass-through to LLM
  • Real-time Monitoring: Hit rate, replay stats, time saved tracking
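
The temperature-based TTL rule above can be read as a simple step function. The sketch below is an assumption about how the three documented points (0 → 24h, 0.3 → 1h, 0.7 → 5min) might be connected; the function name `ttl_seconds` and the handling of in-between or higher temperatures are hypothetical.

```python
def ttl_seconds(temperature: float) -> int:
    """Map sampling temperature to a cache TTL in seconds (illustrative thresholds)."""
    if temperature <= 0.0:
        return 24 * 3600  # deterministic output: keep for 24 hours
    if temperature <= 0.3:
        return 3600       # low randomness: keep for 1 hour
    return 300            # assumed: anything above 0.3 gets 5 minutes


for temp in (0.0, 0.3, 0.7):
    print(temp, ttl_seconds(temp))
```

The idea is that the more random the output, the less useful a stored copy is, so it expires sooner.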

Quick Setup

# Enable cache
export ENABLE_LLM_CACHE=true
export REDIS_HOST=localhost
export REDIS_PORT=50055

# Start service
python -m isa_model.serving.api.main

Performance Gains

Scenario            First Request  Cached Request  Speedup  Cost Saving
Non-streaming chat  500ms          5ms             100x     100%
Streaming chat      3000ms         2500ms          1.2x     100%
Code generation     2000ms         8ms             250x     100%
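
The small speedup in the streaming row is presumably a consequence of replay: cached chunks are emitted with a per-chunk delay instead of all at once. A minimal sketch of such a replay loop follows; `replay` and `collect` are hypothetical helpers, not part of the isa_model API.

```python
import asyncio
from typing import AsyncIterator, List, Sequence


async def replay(chunks: Sequence[str], delay_s: float = 0.015) -> AsyncIterator[str]:
    """Yield cached chunks one by one, pausing delay_s seconds before each."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)
        yield chunk


async def collect(chunks: Sequence[str], delay_s: float) -> List[str]:
    """Drain the replay generator into a list (stands in for a streaming client)."""
    return [chunk async for chunk in replay(chunks, delay_s)]


# A short delay keeps the demo fast; the documented default is 15ms per chunk.
replayed = asyncio.run(collect(["Hel", "lo", "!"], delay_s=0.001))
print("".join(replayed))
```

At 15ms per chunk, a response of a few hundred chunks takes a few seconds to replay, which would explain why the cached streaming time stays close to the original.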

Expected Savings (40% hit rate, 1000 req/day):

  • Daily: $0.40
  • Monthly: $12
  • Annual: $144

For high-traffic systems (100K req/day): $1,200/month savings
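
The figures above are consistent with a cost of about $0.001 per avoided LLM call. That per-request cost is inferred from the numbers, not stated in the source, and `daily_savings` is a hypothetical helper used only to show the arithmetic.

```python
def daily_savings(requests_per_day: int, hit_rate: float, cost_per_request: float) -> float:
    """Money saved per day: every cache hit avoids one paid LLM call."""
    return requests_per_day * hit_rate * cost_per_request


COST_PER_REQUEST = 0.001  # assumed average cost in USD, inferred from the figures above

low = daily_savings(1_000, 0.40, COST_PER_REQUEST)
high = daily_savings(100_000, 0.40, COST_PER_REQUEST)

print(f"Daily:   ${low:.2f}")              # 400 hits x $0.001
print(f"Monthly: ${low * 30:.2f}")         # 30-day month
print(f"Annual:  ${low * 30 * 12:.2f}")    # 12 such months
print(f"100K req/day, monthly: ${high * 30:.2f}")
```

Savings scale linearly with traffic, hit rate, and per-request cost, so more expensive models or a higher hit rate raise the numbers proportionally.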

Cache Management

# Get cache statistics
curl http://localhost:8082/api/v1/cache/stats

# Invalidate model cache (when model updates)
curl -X POST http://localhost:8082/api/v1/cache/invalidate/openai/gpt-4o-mini

# Clear all cache
curl -X POST http://localhost:8082/api/v1/cache/clear

See docs/CACHE_QUICKSTART.md for complete documentation.

DeepSeek-R1 Reasoning Model

NEW in v0.5.7: Support for DeepSeek-R1, a powerful reasoning model that shows its thought process.

Features

  • Visible Reasoning: See the model's thinking with show_reasoning=True
  • Streaming Tool Calling: Call tools while streaming reasoning process
  • Token Tracking: Separate tracking for reasoning tokens vs completion tokens
  • Cost Optimization: Reasoning tokens charged at input token rate ($0.55/1M)

Basic Usage

import asyncio

from isa_model.inference.ai_factory import AIFactory


async def main():
    factory = AIFactory()
    llm = factory.get_llm(provider="openrouter", model_name="deepseek-r1")

    # Without reasoning (only the final answer)
    response = await llm.ainvoke("If 2x + 5 = 11, what is x?", show_reasoning=False)

    # With reasoning (see the thought process)
    response = await llm.ainvoke("If 2x + 5 = 11, what is x?", show_reasoning=True)
    # Output includes [思考: ...] tags ("思考" means "thinking") showing reasoning steps

    # Get token usage for the last call
    usage = llm.get_last_token_usage()
    print(f"Reasoning tokens: {usage['reasoning_tokens']}")
    print(f"Completion tokens: {usage['completion_tokens']}")


asyncio.run(main())

Streaming with Reasoning

async for chunk in llm.astream("Calculate 15 × 23", show_reasoning=True):
    if chunk.startswith('[思考:') and chunk.endswith(']'):
        # Reasoning tokens (gray text)
        reasoning = chunk[4:-1]
        print(f"\033[90m{reasoning}\033[0m", end="", flush=True)
    else:
        # Normal content
        print(chunk, end="", flush=True)

See docs/guidance/examples/deepseek_r1_reasoning_example.py and docs/guidance/deepseek-r1.md for complete examples.
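
The tag check in the streaming loop above can be factored into a small pure function, which makes it easy to test without a live model. This is a sketch: `split_chunk` is a hypothetical helper, and it assumes each reasoning chunk arrives as one complete `[思考:...]` string, as the loop above does.

```python
from typing import Tuple

REASONING_PREFIX = "[思考:"  # "思考" means "thinking"


def split_chunk(chunk: str) -> Tuple[str, str]:
    """Classify a streamed chunk as ("reasoning", text) or ("content", text)."""
    if chunk.startswith(REASONING_PREFIX) and chunk.endswith("]"):
        # Strip the prefix and the closing bracket
        return "reasoning", chunk[len(REASONING_PREFIX):-1]
    return "content", chunk


for raw in ["[思考:15 × 23 = 15 × 20 + 15 × 3]", "345"]:
    kind, text = split_chunk(raw)
    print(kind, text)
```

Using `len(REASONING_PREFIX)` instead of a hard-coded index keeps the slice correct if the prefix ever changes.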

Multi-Modal Services

Speech-to-Text (4 Models)

# Basic transcription (fastest, cheapest)
transcription = await client.audio.transcriptions.create(
    file="audio.wav",
    model="gpt-4o-mini-transcribe"  # NEW default model
)

# High quality transcription
transcription = await client.audio.transcriptions.create(
    file="audio.wav",
    model="gpt-4o-transcribe"  # Highest quality
)

# With speaker diarization
transcription = await client.audio.transcriptions.create(
    file="audio.wav",
    model="gpt-4o-transcribe-diarize",
    enable_diarization=True,
    response_format="diarized_json"
)
# Returns: segments with speaker labels, timestamps

# Legacy Whisper model
transcription = await client.audio.transcriptions.create(
    file="audio.wav",
    model="whisper-1"  # Legacy
)

Video Generation

# Text-to-Video with ByteDance Seedance-1-Pro
response = await client._underlying_client.invoke(
    input_data="The sun rises slowly between tall buildings...",
    task="generate",
    service_type="video_generation",
    provider="replicate",
    model="seedance-1-pro",
    duration=5,
    fps=24,
    resolution="1080p",
    aspect_ratio="16:9"
)

Multi-Image Input

# Google Nano-Banana (Multi-Image Style Transfer)
response = await client._underlying_client.invoke(
    input_data="Make the sheets in the style of the logo",
    task="img2img",
    service_type="image_generation",
    provider="replicate",
    model="nano-banana",
    init_image=[
        "https://example.com/image1.png",
        "https://example.com/image2.png"
    ],
    aspect_ratio="match_input_image"
)

ISA Proprietary Services

# ISA OmniParser - UI Detection
vision = await client.vision.completions.create(
    image="https://example.com/ui-screenshot.jpg",
    prompt="Detect UI elements",
    model="isa-omniparser-ui-detection",
    provider="isa"
)

# ISA Jina Reranker v2 - Document Reranking
response = await client._underlying_client.invoke(
    input_data="What is machine learning?",
    task="rerank",
    service_type="embedding",
    provider="isa",
    model="isa-jina-reranker-v2-service",
    documents=[
        "Machine learning is a subset of AI...",
        "Python is a programming language...",
        "Neural networks are computational models..."
    ]
)

Tool Calling

OpenAI-Compatible Function Calling

from isa_model.inference_client import AsyncISAModel
import json

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}

async with AsyncISAModel() as client:
    # Request with tool
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
        tools=[WEATHER_TOOL]
    )
    
    # Check if tool was called
    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")
        
        # Execute tool (your implementation)
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        
        # Continue conversation with tool result
        messages = [
            {"role": "user", "content": "What's the weather in Tokyo?"},
            {
                "role": "assistant",
                "tool_calls": [{
                    "id": tool_call.id,
                    "type": "function",
                    "function": {
                        "name": tool_call.function.name,
                        "arguments": tool_call.function.arguments
                    }
                }]
            },
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            }
        ]
        
        final = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        print(final.choices[0].message.content)

Streaming Tool Calling (DeepSeek-R1)

# Tool calls appear at the end of stream in delta.tool_calls
stream = await client.chat.completions.create(
    model="deepseek-r1",
    provider="openrouter",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[WEATHER_TOOL],
    stream=True,
    show_reasoning=True
)

tool_calls = []
async for chunk in stream:
    delta = chunk.choices[0].delta
    
    # Collect reasoning and content
    if delta.content:
        print(delta.content, end="", flush=True)
    
    # Collect tool calls (appear at end)
    if delta.tool_calls:
        tool_calls.extend(delta.tool_calls)

# Execute tools after stream completes
for tc in tool_calls:
    args = json.loads(tc.function.arguments)
    result = execute_tool(**args)

See docs/guidance/examples/tool_call_streaming_example.py for complete agent loop implementation.
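
When more than one tool is registered, the collected tool calls need to be routed by name. The following is a minimal dispatch sketch under stated assumptions: `TOOLS`, `dispatch_tool`, and the canned `get_weather` are hypothetical and not part of isa_model; the name and JSON argument string are what an OpenAI-style tool call carries in `function.name` and `function.arguments`.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> Dict[str, Any]:
    """Placeholder tool: returns canned data instead of calling a weather API."""
    return {"location": location, "forecast": "sunny"}


# Registry mapping tool names to Python callables (hypothetical)
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch_tool(name: str, arguments: str) -> str:
    """Look up a tool by name, call it with JSON-decoded arguments, return JSON."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments)
        return json.dumps(TOOLS[name](**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the tool signature
        return json.dumps({"error": str(exc)})


print(dispatch_tool("get_weather", '{"location": "Tokyo"}'))
```

Returning errors as JSON strings, rather than raising, lets the result go straight into a `"role": "tool"` message so the model can react to the failure.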

Examples

All runnable examples are in docs/guidance/examples/.

See docs/guidance/examples/README.md for detailed documentation.

Documentation

Comprehensive documentation is available in the docs/ directory:

docs/
├── overview/           → Project vision, goals, architecture
├── research/           → Research findings and exploration
├── domain/             → Domain concepts and knowledge models
├── prd/                → Product requirements documents
├── design/             → Technical design specifications
└── guidance/           → Developer guides and tutorials
    └── examples/       → Runnable Python scripts

Development

Installing for Development

git clone <repository-url>
cd isA_Model

# Install with all dependencies
pip install -e ".[all]"

# Or install with specific extras
pip install -e ".[cloud,langchain,dev]"

Environment Setup

For local development, copy the example deployment env file into a gitignored local override:

cp deployment/environments/dev.env.example deployment/environments/dev.env
# or create deployment/environments/dev.local.env instead

Then fill in your local secrets, for example:

OPENAI_API_KEY=your-openai-key
REPLICATE_API_TOKEN=your-replicate-token
INTERNAL_SERVICE_SECRET=your-local-internal-secret
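
A quick check before starting the server can catch secrets that were left unset. The sketch below treats the three variables above as required, which is an assumption (a purely local Ollama setup may not need them); `missing_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping

# Assumed to be required; adjust to the providers you actually use
REQUIRED = ("OPENAI_API_KEY", "REPLICATE_API_TOKEN", "INTERNAL_SERVICE_SECRET")


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real process environment
example_env = {"OPENAI_API_KEY": "your-openai-key"}
print(missing_vars(REQUIRED, example_env))
```

Calling `missing_vars(REQUIRED)` with no second argument checks the real process environment.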

Running the Server

# Start the FastAPI server
python -m isa_model.serving.api.main

# Or with uvicorn
uvicorn isa_model.serving.api.fastapi_server:app --host 0.0.0.0 --port 8082

Running Tests

# Run async client examples (recommended)
python docs/guidance/examples/model_client_examples_async.py

# Run specific tests
python tests/test_stt_models.py

# Run cache tests
bash tests/cache_test.sh

Building and Publishing

# Update version in pyproject.toml
# Current version: 0.6.0

# Build the package
python -m build

# Upload to PyPI
python -m twine upload dist/isa_model-0.6.0* --username __token__ --password "$PYPI_API_TOKEN"

What's New in v0.5.7

LLM Caching (Phase 2 Complete)

  • Streaming Cache + Replay: Natural streaming feel with 15ms/chunk delay
  • Non-Streaming Cache: 100x speedup for deterministic queries
  • Temperature-Based TTL: Smart caching based on output randomness
  • Real-time Monitoring: Hit rate tracking, time saved statistics
  • Production Ready: Feature flags, graceful degradation, zero-impact deployment

DeepSeek-R1 Support

  • Visible Reasoning: See model's thought process with show_reasoning=True
  • Streaming Tool Calling: Function calling with reasoning visibility
  • Token Tracking: Separate reasoning token counting and cost tracking
  • Agent Loop Support: Complete multi-turn conversation with tools

Enhanced Multi-Modal

  • Speech-to-Text: 4 models (Whisper, gpt-4o-mini-transcribe, gpt-4o-transcribe, gpt-4o-transcribe-diarize)
  • Video Generation: ByteDance Seedance-1-Pro text-to-video
  • Multi-Image Input: Google Nano-Banana style transfer
  • ISA Services: OmniParser UI detection, Jina Reranker v2

Client Improvements

  • 100% Pass Rate: AsyncISAModel client (11/11 examples)
  • Format Negotiation: OpenAI dict + LangChain message support
  • Better Error Handling: Informative error messages and graceful failures
  • Resource Cleanup: Proper context manager support

Infrastructure

  • Consul Integration: Service discovery and dynamic routing
  • Redis Caching: Production-grade caching backend
  • Monitoring: Comprehensive metrics and logging
  • Feature Flags: Environment-based feature control

Supported Providers

Provider LLM Vision Audio Image Gen Video Embeddings
OpenAI
Replicate
Ollama
Cerebras
OpenRouter
ISA

Note: OpenRouter provider includes DeepSeek-R1 reasoning model.

Cost Optimization

LLM Caching Benefits

With 40% cache hit rate on 1,000 requests/day:

  • Daily savings: $0.40
  • Monthly savings: $12
  • Annual savings: $144

For high-traffic production (100K req/day):

  • Monthly savings: $1,200+

Model Selection Strategy

  • Development/Testing: Use gpt-4o-mini or ollama (local, free)
  • Production: Cache with temperature=0 for deterministic queries
  • Creative Tasks: Use higher temperature, shorter TTL
  • Code Generation: Cache aggressively (24h TTL for temp=0)

Architecture

isa_model/
├── client.py                  # Unified ISAModelClient
├── inference_client.py        # OpenAI-compatible client
├── inference/
│   ├── ai_factory.py         # Service factory
│   ├── services/             # Service implementations
│   │   ├── llm/             # LLM services
│   │   ├── vision/          # Vision services
│   │   ├── audio/           # Audio services (STT/TTS)
│   │   ├── img/             # Image generation
│   │   ├── video/           # Video generation
│   │   └── embedding/       # Embedding services
│   └── cache/               # LLM caching layer
├── serving/
│   └── api/                 # FastAPI server
├── core/
│   ├── config/              # Configuration management
│   ├── models/              # Model registry
│   └── services/            # Core services
└── deployment/              # Kubernetes, Docker configs

Roadmap

Phase 3: Semantic Caching (Planned)

  • Embedding-based similarity matching
  • Cache hits even with different wording
  • Target: 60-80% hit rate (vs 40% exact match)
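
Since this phase is only planned, the following is a conceptual sketch of embedding-based matching rather than a description of the eventual design. `SemanticCache`, the 0.9 threshold, and the toy two-dimensional vectors are all assumptions; a real system would obtain embeddings from an embedding model and use a vector index instead of a linear scan.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored response when a query embedding is close enough to a cached one."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine(embedding, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "Machine learning is a subset of AI.")
print(cache.get([0.99, 0.1]))  # near-duplicate wording: similarity is above 0.9
print(cache.get([0.0, 1.0]))   # unrelated query: no hit
```

The threshold trades hit rate against the risk of serving an answer to a question that only looks similar, so it would need tuning per workload.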

Future Features

  • Cache warming on model updates
  • Distributed locking for multi-instance consistency
  • Per-user cache namespaces
  • A/B testing framework
  • Advanced cost analytics

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

Support

  • Documentation: See docs/ directory
  • Examples: See docs/guidance/examples/ directory
  • Issues: Open an issue on GitHub
  • Discussions: GitHub Discussions

Acknowledgments

Built with:

  • FastAPI for high-performance API serving
  • Redis for production-grade caching
  • OpenAI SDK compatibility layer
  • LangChain integration support
  • Comprehensive provider ecosystem

Ready to get started? Check out docs/guidance/examples/ for comprehensive usage examples!
