One tiny model, every LLM API. Drop-in test server for OpenAI, Anthropic, Bedrock, and Vertex.
Project description
LLM Katan - Lightweight LLM Server for Testing
A lightweight LLM serving package using FastAPI and HuggingFace transformers, designed for testing and development with real tiny models.
🎬 See Live Demo Interactive terminal showing multi-instance setup in action!
Features
- 🚀 FastAPI-based: High-performance async web server
- 🤗 HuggingFace Integration: Real model inference with transformers
- ⚡ Tiny Models: Ultra-lightweight models for fast testing (Qwen3-0.6B, etc.)
- 🔄 Multi-Provider: Serve the same model as OpenAI, Anthropic, and more (Bedrock, Vertex coming soon)
- 🎯 API Compatible: Drop-in replacement for provider endpoints with correct response formats
- 🔐 Auth Validation: Each provider requires its native auth header (e.g.,
Authorizationfor OpenAI,x-api-keyfor Anthropic) - 📦 PyPI Ready: Easy installation and distribution
- 🛠️ vLLM Support: Optional vLLM backend for production-like performance
Quick Start
Installation
Option 1: PyPI
pip install llm-katan
Option 2: Docker
# Pull and run the latest Docker image
docker pull ghcr.io/yossiovadia/llm-katan/llm-katan:latest
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest
# Or with custom model
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest \
llm-katan --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Setup
HuggingFace Token (Required)
LLM Katan uses HuggingFace transformers to download models. You'll need a HuggingFace token for:
- Private models
- Avoiding rate limits
- Reliable model downloads
Option 1: Environment Variable
export HUGGINGFACE_HUB_TOKEN="your_token_here"
Option 2: Login via CLI
huggingface-cli login
Option 3: Token file in home directory
# Create ~/.cache/huggingface/token file with your token
echo "your_token_here" > ~/.cache/huggingface/token
Get your token: Visit https://huggingface.co/settings/tokens
Basic Usage
# Echo mode — no model download, instant startup, no torch needed
llm-katan --model my-test-model --backend echo --providers openai,anthropic
# Start server with a tiny model (quantization enabled by default for speed)
llm-katan --model Qwen/Qwen3-0.6B --port 8000
# Start with custom served model name
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Disable quantization for higher accuracy (slower)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --no-quantize
# With vLLM backend (optional)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --backend vllm
Multi-Instance Testing
🎬 Live Demo See this in action with animated terminals!
Note: If GitHub Pages isn't enabled, you can also download and open the demo locally
📺 Preview (click to expand)
# Terminal 1: Installing and starting GPT-3.5-Turbo mock
$ pip install llm-katan
Successfully installed llm-katan-0.1.8
$ llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: gpt-3.5-turbo
✅ Server running on http://0.0.0.0:8000
# Terminal 2: Starting Claude-3-Haiku mock
$ llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: claude-3-haiku
✅ Server running on http://0.0.0.0:8001
# Terminal 3: Testing both endpoints
$ curl localhost:8000/v1/models | jq '.data[0].id'
"gpt-3.5-turbo"
$ curl localhost:8001/v1/models | jq '.data[0].id'
"claude-3-haiku"
# Same tiny model, different API names! 🎯
# Terminal 1: Mock GPT-3.5-Turbo
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
# Terminal 2: Mock Claude-3-Haiku
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
# Terminal 3: Test both endpoints
curl http://localhost:8000/v1/models # Returns "gpt-3.5-turbo"
curl http://localhost:8001/v1/models # Returns "claude-3-haiku"
Perfect for testing multi-provider scenarios with one tiny model!
How It Works
llm-katan does not proxy requests to real providers. There is no OpenAI SDK, no Anthropic SDK, no cloud API calls. Instead, each provider is a thin formatting layer around the same local model:
Request (any provider format)
|
v
Provider (openai.py / anthropic.py / ...)
- Parses the provider-specific request format
- Extracts: messages, max_tokens, temperature
- Normalizes to plain Python dicts
|
v
Backend (model.py)
- Converts messages to a prompt string
- Feeds it directly to the local model (e.g., Qwen3-0.6B)
- Returns: generated text + token counts
|
v
Provider (same one that handled the request)
- Wraps the raw text in the provider's native response format
- Returns to client
So Anthropic format in, Anthropic format out. OpenAI format in, OpenAI format out. The backend has zero knowledge of any provider format — it just generates text. No translation chain, no provider in the middle.
API Endpoints
Shared:
GET /health- Health check (shows active providers)GET /metrics- Prometheus metrics
OpenAI (--providers openai):
GET /v1/models- List available modelsPOST /v1/chat/completions- Chat completions (OpenAI compatible)
Anthropic (--providers anthropic):
POST /v1/messages- Messages API (Anthropic compatible, with SSE streaming)
Vertex AI / Gemini (--providers vertexai):
POST /v1beta/models/{model}:generateContent- Generate contentPOST /v1beta/models/{model}:streamGenerateContent- Streaming generate content- Also supports
/v1/prefix
AWS Bedrock (--providers bedrock):
POST /model/{modelId}/converse- Converse API (unified, model-agnostic)POST /model/{modelId}/converse-stream- Streaming Converse APIPOST /model/{modelId}/invoke- InvokeModel (Anthropic Claude format when model ID contains "anthropic" or "claude", Amazon Titan format otherwise)
Enable all providers at once: --providers openai,anthropic,vertexai,bedrock
Example API Usage
# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-0.5B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50,
"temperature": 0.7
}'
# Creative writing example
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "user", "content": "Write a short poem about coding"}
],
"max_tokens": 100,
"temperature": 0.8
}'
# Check available models
curl http://127.0.0.1:8000/v1/models
# Health check
curl http://127.0.0.1:8000/health
# Anthropic Messages API
curl -X POST http://127.0.0.1:8000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: test-key" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-test",
"max_tokens": 50,
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
# Vertex AI / Gemini API
curl -X POST http://127.0.0.1:8000/v1beta/models/gemini-pro:generateContent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-token" \
-d '{
"contents": [
{"role": "user", "parts": [{"text": "What is the capital of France?"}]}
]
}'
# AWS Bedrock Converse API
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/converse \
-H "Content-Type: application/json" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
-d '{
"messages": [
{"role": "user", "content": [{"text": "What is the capital of France?"}]}
]
}'
# AWS Bedrock InvokeModel (Anthropic Claude)
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/invoke \
-H "Content-Type: application/json" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
-d '{
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}'
CPU Optimization
LLM Katan includes automatic int8 quantization for CPU inference, providing significant performance improvements:
Performance Gains
- 2-4x faster inference on CPU (on supported platforms)
- 4x memory reduction
- Enabled by default for best testing experience
- Minimal quality impact (acceptable for testing scenarios)
- Platform support: Works best on Linux x86_64; may not be available on all platforms (e.g., Mac)
When to Use Quantization
✅ Enabled (default) - Recommended for:
- Fast E2E testing
- Development environments
- CI/CD pipelines
- Resource-constrained environments
❌ Disabled (--no-quantize) - Use when you need:
- Maximum accuracy (though tiny models have limited accuracy anyway)
- Debugging precision-sensitive issues
- Comparing with full-precision baselines
Example Performance
# Default: Fast with quantization (~50-100s per inference)
llm-katan --model Qwen/Qwen3-0.6B
# Slower but more accurate (~200s per inference)
llm-katan --model Qwen/Qwen3-0.6B --no-quantize
Note: Even with quantization, llm-katan is slower than production tools like LM Studio (which uses llama.cpp with extensive optimizations). For production workloads, use vLLM, Ollama, or similar solutions.
Use Cases
Strengths
- Fastest time-to-test: 30 seconds from install to running
- Optimized for CPU: Automatic int8 quantization for 2-4x speedup
- Minimal resource footprint: Designed for tiny models and efficient testing
- No GPU required: Runs on laptops, Macs, and any CPU-only environment
- CI/CD integration friendly: Lightweight and automation-ready
- Multiple instances: Run same model with different names on different ports
Ideal For
- Automated testing pipelines: Quick LLM endpoint setup for test suites
- Development environment mocking: Real inference without production overhead
- Quick prototyping: Fast iteration with actual model behavior
- Educational/learning scenarios: Easy setup for AI development learning
Not Ideal For
- Production workloads: Use Ollama or vLLM for production deployments
- Large model serving: Designed for tiny models (< 1B parameters)
- Complex multi-agent workflows: Use Semantic Kernel or similar frameworks
- High-performance inference: Use vLLM or specialized serving solutions
Configuration
Command Line Options
# All available options
llm-katan [OPTIONS]
Required:
-m, --model TEXT Model name to load (e.g., 'Qwen/Qwen3-0.6B') [required]
Optional:
-n, --name, --served-model-name TEXT
Model name to serve via API (defaults to model name)
-p, --port INTEGER Port to serve on (default: 8000)
-h, --host TEXT Host to bind to (default: 0.0.0.0)
-b, --backend [transformers|vllm|echo] Backend to use (default: transformers)
--max, --max-tokens INTEGER Maximum tokens to generate (default: 512)
-t, --temperature FLOAT Sampling temperature (default: 0.7)
-d, --device [auto|cpu|cuda] Device to use (default: auto)
--quantize/--no-quantize Enable int8 quantization for faster CPU inference (default: enabled)
--providers TEXT Comma-separated providers to enable (default: openai)
--max-concurrent INTEGER Max concurrent inference requests (default: 1)
--log-level [debug|info|warning|error] Log level (default: INFO)
--version Show version and exit
--help Show help and exit
Advanced Usage Examples
# Serve all provider endpoints (auth always required per provider)
llm-katan --model Qwen/Qwen3-0.6B --providers openai,anthropic,vertexai,bedrock
# Custom generation settings
llm-katan --model Qwen/Qwen3-0.6B --max-tokens 1024 --temperature 0.9
# Force specific device with full precision (no quantization)
llm-katan --model Qwen/Qwen3-0.6B --device cpu --no-quantize --log-level debug
# Custom host and port
llm-katan --model Qwen/Qwen3-0.6B --host 127.0.0.1 --port 9000
# Multiple servers with different settings
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --max-tokens 512 --temperature 0.1
llm-katan --model Qwen/Qwen3-0.6B --port 8001 \
--name "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --max-tokens 256 --temperature 0.9
Environment Variables
LLM_KATAN_MODEL: Default model to loadLLM_KATAN_PORT: Default port (8000)LLM_KATAN_BACKEND: Backend type (transformers|vllm)
Development
# Clone and install in development mode
git clone <repo>
cd llm-katan
pip install -e .
# Run with development dependencies
pip install -e ".[dev]"
License
Apache-2.0 License
Contributing
Contributions welcome! Please see the main repository for guidelines.
Created by Yossi Ovadia
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file llm_katan-0.5.1.tar.gz.
File metadata
- Download URL: llm_katan-0.5.1.tar.gz
- Upload date:
- Size: 40.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a4fd073cae499ee71c55e3e1ec691167d18bd5dc972dde4079a0fe756f61c286
|
|
| MD5 |
61812f633a8679b61a0f183be430b309
|
|
| BLAKE2b-256 |
59c2d0688c4519f29281336cd359ba5581a270a39534fcf614f5856ede57aea4
|
File details
Details for the file llm_katan-0.5.1-py3-none-any.whl.
File metadata
- Download URL: llm_katan-0.5.1-py3-none-any.whl
- Upload date:
- Size: 47.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eb7c20852ce13d64de5f0ad6b7b36889ce24943357e154c49b91a3a3f0ecf4e0
|
|
| MD5 |
b54441fc92231d525b53443014ce2c53
|
|
| BLAKE2b-256 |
fe8c8d40c35827ad67dff9556316f12b94aeb6967529305c50f85f0cdf5523ab
|