
One tiny model, every LLM API. Drop-in test server for OpenAI, Anthropic, Bedrock, and Vertex.

Project description

LLM Katan - Lightweight LLM Server for Testing

A lightweight LLM serving package using FastAPI and HuggingFace transformers, designed for testing and development with real tiny models.

🎬 See the live demo: an interactive terminal showing the multi-instance setup in action!

Features

  • 🚀 FastAPI-based: High-performance async web server
  • 🤗 HuggingFace Integration: Real model inference with transformers
  • Tiny Models: Ultra-lightweight models for fast testing (Qwen3-0.6B, etc.)
  • 🔄 Multi-Provider: Serve the same model as OpenAI, Anthropic, and more (Bedrock, Vertex coming soon)
  • 🎯 API Compatible: Drop-in replacement for provider endpoints with correct response formats
  • 🔐 Auth Validation: Each provider requires its native auth header (e.g., Authorization for OpenAI, x-api-key for Anthropic)
  • 📦 PyPI Ready: Easy installation and distribution
  • 🛠️ vLLM Support: Optional vLLM backend for production-like performance

Quick Start

Installation

Option 1: PyPI

pip install llm-katan

Option 2: Docker

# Pull and run the latest Docker image
docker pull ghcr.io/yossiovadia/llm-katan/llm-katan:latest
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest

# Or with custom model
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest \
  llm-katan --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Setup

HuggingFace Token

LLM Katan uses HuggingFace transformers to download models. A token is required for private models, and recommended in general for:

  • Avoiding rate limits
  • Reliable model downloads

Option 1: Environment Variable

export HUGGINGFACE_HUB_TOKEN="your_token_here"

Option 2: Login via CLI

huggingface-cli login

Option 3: Token file in home directory

# Create ~/.cache/huggingface/token file with your token
echo "your_token_here" > ~/.cache/huggingface/token

Get your token: Visit https://huggingface.co/settings/tokens

Basic Usage

# Echo mode — no model download, instant startup, no torch needed
llm-katan --model my-test-model --backend echo --providers openai,anthropic

# Start server with a tiny model (quantization enabled by default for speed)
llm-katan --model Qwen/Qwen3-0.6B --port 8000

# Start with custom served model name
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Disable quantization for higher accuracy (slower)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --no-quantize

# With vLLM backend (optional)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --backend vllm

Multi-Instance Testing

🎬 Live Demo: see this in action with animated terminals!

Note: If GitHub Pages isn't enabled, you can download and open the demo locally instead.

📺 Preview
# Terminal 1: Installing and starting GPT-3.5-Turbo mock
$ pip install llm-katan
Successfully installed llm-katan-0.1.8

$ llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: gpt-3.5-turbo
✅ Server running on http://0.0.0.0:8000

# Terminal 2: Starting Claude-3-Haiku mock
$ llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: claude-3-haiku
✅ Server running on http://0.0.0.0:8001

# Terminal 3: Testing both endpoints
$ curl localhost:8000/v1/models | jq '.data[0].id'
"gpt-3.5-turbo"

$ curl localhost:8001/v1/models | jq '.data[0].id'
"claude-3-haiku"

# Same tiny model, different API names! 🎯
# Terminal 1: Mock GPT-3.5-Turbo
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"

# Terminal 2: Mock Claude-3-Haiku
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"

# Terminal 3: Test both endpoints
curl http://localhost:8000/v1/models  # Returns "gpt-3.5-turbo"
curl http://localhost:8001/v1/models  # Returns "claude-3-haiku"

Perfect for testing multi-provider scenarios with one tiny model!
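In a test suite, the served name can be checked the same way the `jq '.data[0].id'` call above does it. This small helper is illustrative (not part of llm-katan); the response shape follows the OpenAI models-list format shown in the demo:

```python
# Given the JSON body from GET /v1/models, return the served model name.
# Mirrors `curl localhost:8000/v1/models | jq '.data[0].id'` from the demo.

def served_model_name(models_response: dict) -> str:
    return models_response["data"][0]["id"]

# Example payload in the shape the demo above returns:
payload = {"object": "list", "data": [{"id": "gpt-3.5-turbo", "object": "model"}]}
name = served_model_name(payload)
```

An assertion like `assert served_model_name(resp) == "gpt-3.5-turbo"` is then enough to verify each mock instance is wired to the right port.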

How It Works

llm-katan does not proxy requests to real providers. There is no OpenAI SDK, no Anthropic SDK, no cloud API calls. Instead, each provider is a thin formatting layer around the same local model:

Request (any provider format)
       |
       v
Provider (openai.py / anthropic.py / ...)
  - Parses the provider-specific request format
  - Extracts: messages, max_tokens, temperature
  - Normalizes to plain Python dicts
       |
       v
Backend (model.py)
  - Converts messages to a prompt string
  - Feeds it directly to the local model (e.g., Qwen3-0.6B)
  - Returns: generated text + token counts
       |
       v
Provider (same one that handled the request)
  - Wraps the raw text in the provider's native response format
  - Returns to client

So Anthropic format in, Anthropic format out. OpenAI format in, OpenAI format out. The backend has zero knowledge of any provider format — it just generates text. No translation chain, no provider in the middle.
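The flow above can be sketched in a few lines of Python. The function and field names here are illustrative, not llm-katan's actual internals; only the Anthropic request/response shapes are the real ones:

```python
# Minimal sketch of the provider-as-formatting-layer idea: parse, generate, wrap.

def anthropic_parse(request: dict) -> dict:
    """Normalize an Anthropic Messages API request to plain dicts."""
    return {
        "messages": request["messages"],
        "max_tokens": request.get("max_tokens", 512),
        "temperature": request.get("temperature", 0.7),
    }

def backend_generate(normalized: dict) -> dict:
    """Stand-in for the local model: knows nothing about provider formats."""
    prompt = "\n".join(f'{m["role"]}: {m["content"]}' for m in normalized["messages"])
    text = f"(generated reply to: {prompt!r})"  # a real backend runs the model here
    return {"text": text, "input_tokens": len(prompt.split()), "output_tokens": 5}

def anthropic_wrap(result: dict, model: str) -> dict:
    """Wrap the raw text back into Anthropic's native response shape."""
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": result["text"]}],
        "usage": {"input_tokens": result["input_tokens"],
                  "output_tokens": result["output_tokens"]},
    }

# Anthropic format in, Anthropic format out -- the backend never sees either.
req = {"model": "claude-test", "max_tokens": 50,
       "messages": [{"role": "user", "content": "Hi"}]}
resp = anthropic_wrap(backend_generate(anthropic_parse(req)), req["model"])
```

An `openai.py` provider would be the same sandwich with OpenAI's request and response shapes on the outside and the identical `backend_generate` in the middle.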

API Endpoints

Shared:

  • GET /health - Health check (shows active providers)
  • GET /metrics - Prometheus metrics

OpenAI (--providers openai):

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completions (OpenAI compatible)

Anthropic (--providers anthropic):

  • POST /v1/messages - Messages API (Anthropic compatible, with SSE streaming)

Vertex AI / Gemini (--providers vertexai):

  • POST /v1beta/models/{model}:generateContent - Generate content
  • POST /v1beta/models/{model}:streamGenerateContent - Streaming generate content
  • Also supports /v1/ prefix

AWS Bedrock (--providers bedrock):

  • POST /model/{modelId}/converse - Converse API (unified, model-agnostic)
  • POST /model/{modelId}/converse-stream - Streaming Converse API
  • POST /model/{modelId}/invoke - InvokeModel with all model families:

| Model Family     | Model ID Prefix | Request Format                                      |
|------------------|-----------------|-----------------------------------------------------|
| Anthropic Claude | anthropic.*     | Messages API (messages, max_tokens, content blocks) |
| Amazon Nova      | amazon.nova*    | Messages + content blocks + inferenceConfig         |
| Amazon Titan     | amazon.titan*   | inputText + textGenerationConfig                    |
| Meta Llama       | meta.llama*     | prompt + max_gen_len                                |
| Cohere Command   | cohere.*        | message + chat_history                              |
| Mistral          | mistral.*       | prompt + max_tokens                                 |
| DeepSeek         | deepseek.*      | prompt + max_tokens                                 |
| AI21 Jamba       | ai21.*          | messages (OpenAI-like)                              |
Unknown model IDs fall back to Amazon Titan format.
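The dispatch implied by the table reduces to a prefix match with a Titan fallback. This is a hypothetical sketch; llm-katan's real routing code may differ:

```python
# Prefix-based dispatch for InvokeModel request formats (illustrative names).

BEDROCK_FORMATS = [
    ("anthropic.",   "claude_messages"),
    ("amazon.nova",  "nova_messages"),
    ("amazon.titan", "titan_text"),
    ("meta.llama",   "llama_prompt"),
    ("cohere.",      "cohere_chat"),
    ("mistral.",     "mistral_prompt"),
    ("deepseek.",    "deepseek_prompt"),
    ("ai21.",        "jamba_messages"),
]

def request_format(model_id: str) -> str:
    for prefix, fmt in BEDROCK_FORMATS:
        if model_id.startswith(prefix):
            return fmt
    return "titan_text"  # unknown model IDs fall back to the Titan format
```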

Enable all providers at once: --providers openai,anthropic,vertexai,bedrock

Example API Usage

# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Creative writing example
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Write a short poem about coding"}
    ],
    "max_tokens": 100,
    "temperature": 0.8
  }'

# Check available models
curl http://127.0.0.1:8000/v1/models

# Health check
curl http://127.0.0.1:8000/health

# Anthropic Messages API
curl -X POST http://127.0.0.1:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: test-key" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-test",
    "max_tokens": 50,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

# Vertex AI / Gemini API
curl -X POST http://127.0.0.1:8000/v1beta/models/gemini-pro:generateContent \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-token" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ]
  }'

# AWS Bedrock Converse API
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/converse \
  -H "Content-Type: application/json" \
  -H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
  -d '{
    "messages": [
      {"role": "user", "content": [{"text": "What is the capital of France?"}]}
    ]
  }'

# AWS Bedrock InvokeModel (Anthropic Claude)
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/invoke \
  -H "Content-Type: application/json" \
  -H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
  -d '{
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

CPU Optimization

LLM Katan includes automatic int8 quantization for CPU inference, providing significant performance improvements:

Performance Gains

  • 2-4x faster inference on CPU (on supported platforms)
  • 4x memory reduction
  • Enabled by default for best testing experience
  • Minimal quality impact (acceptable for testing scenarios)
  • Platform support: Works best on Linux x86_64; may not be available on all platforms (e.g., Mac)
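A back-of-envelope check of the 4x figure: fp32 weights take 4 bytes per parameter, int8 takes 1. The parameter count below is a rough assumption for Qwen3-0.6B, and real memory use also includes activations and framework overhead:

```python
# Weight-memory estimate for fp32 vs. int8 quantized weights.

def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

n = 0.6e9                        # Qwen3-0.6B, roughly 0.6 billion parameters
fp32 = weight_memory_gb(n, 4)    # fp32: 4 bytes per parameter, ~2.2 GB
int8 = weight_memory_gb(n, 1)    # int8: 1 byte per parameter, ~0.56 GB
ratio = fp32 / int8              # the 4x memory reduction
```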

When to Use Quantization

Enabled (default) - Recommended for:

  • Fast E2E testing
  • Development environments
  • CI/CD pipelines
  • Resource-constrained environments

Disabled (--no-quantize) - Use when you need:

  • Maximum accuracy (though tiny models have limited accuracy anyway)
  • Debugging precision-sensitive issues
  • Comparing with full-precision baselines

Example Performance

# Default: Fast with quantization (~50-100s per inference)
llm-katan --model Qwen/Qwen3-0.6B

# Slower but more accurate (~200s per inference)
llm-katan --model Qwen/Qwen3-0.6B --no-quantize

Note: Even with quantization, llm-katan is slower than production tools like LM Studio (which uses llama.cpp with extensive optimizations). For production workloads, use vLLM, Ollama, or similar solutions.

Use Cases

Strengths

  • Fastest time-to-test: 30 seconds from install to running
  • Optimized for CPU: Automatic int8 quantization for 2-4x speedup
  • Minimal resource footprint: Designed for tiny models and efficient testing
  • No GPU required: Runs on laptops, Macs, and any CPU-only environment
  • CI/CD integration friendly: Lightweight and automation-ready
  • Multiple instances: Run same model with different names on different ports

Ideal For

  • Automated testing pipelines: Quick LLM endpoint setup for test suites
  • Development environment mocking: Real inference without production overhead
  • Quick prototyping: Fast iteration with actual model behavior
  • Educational/learning scenarios: Easy setup for AI development learning

Not Ideal For

  • Production workloads: Use Ollama or vLLM for production deployments
  • Large model serving: Designed for tiny models (< 1B parameters)
  • Complex multi-agent workflows: Use Semantic Kernel or similar frameworks
  • High-performance inference: Use vLLM or specialized serving solutions

Configuration

Command Line Options

# All available options
llm-katan [OPTIONS]

Required:
  -m, --model TEXT              Model name to load (e.g., 'Qwen/Qwen3-0.6B') [required]

Optional:
  -n, --name, --served-model-name TEXT
                                Model name to serve via API (defaults to model name)
  -p, --port INTEGER            Port to serve on (default: 8000)
  -h, --host TEXT               Host to bind to (default: 0.0.0.0)
  -b, --backend [transformers|vllm|echo]  Backend to use (default: transformers)
  --max, --max-tokens INTEGER   Maximum tokens to generate (default: 512)
  -t, --temperature FLOAT       Sampling temperature (default: 0.7)
  -d, --device [auto|cpu|cuda]  Device to use (default: auto)
  --quantize/--no-quantize      Enable int8 quantization for faster CPU inference (default: enabled)
  --providers TEXT               Comma-separated providers to enable (default: openai)
  --max-concurrent INTEGER      Max concurrent inference requests (default: 1)
  --log-level [debug|info|warning|error]  Log level (default: INFO)
  --version                     Show version and exit
  --help                        Show help and exit

Advanced Usage Examples

# Serve all provider endpoints (auth always required per provider)
llm-katan --model Qwen/Qwen3-0.6B --providers openai,anthropic,vertexai,bedrock

# Custom generation settings
llm-katan --model Qwen/Qwen3-0.6B --max-tokens 1024 --temperature 0.9

# Force specific device with full precision (no quantization)
llm-katan --model Qwen/Qwen3-0.6B --device cpu --no-quantize --log-level debug

# Custom host and port
llm-katan --model Qwen/Qwen3-0.6B --host 127.0.0.1 --port 9000

# Multiple servers with different settings
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --max-tokens 512 --temperature 0.1
llm-katan --model Qwen/Qwen3-0.6B --port 8001 \
  --name "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --max-tokens 256 --temperature 0.9

Environment Variables

  • LLM_KATAN_MODEL: Default model to load
  • LLM_KATAN_PORT: Default port (8000)
  • LLM_KATAN_BACKEND: Backend type (transformers|vllm)
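A sketch of how these variables might be resolved against the documented defaults. The variable names come from the list above; the exact precedence (CLI flag over environment over default) is an assumption, not confirmed behavior:

```python
import os

def resolve_config() -> dict:
    """Read llm-katan settings from the environment, with documented defaults."""
    return {
        "model": os.environ.get("LLM_KATAN_MODEL"),  # no default; --model is required
        "port": int(os.environ.get("LLM_KATAN_PORT", "8000")),
        "backend": os.environ.get("LLM_KATAN_BACKEND", "transformers"),
    }

cfg = resolve_config()
```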

Development

# Clone and install in development mode
git clone <repo>
cd llm-katan
pip install -e .

# Run with development dependencies
pip install -e ".[dev]"

License

Apache-2.0 License

Contributing

Contributions welcome! Please see the main repository for guidelines.


Created by Yossi Ovadia

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_katan-0.5.2.tar.gz (42.0 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_katan-0.5.2-py3-none-any.whl (48.6 kB)

Uploaded Python 3

File details

Details for the file llm_katan-0.5.2.tar.gz.

File metadata

  • Download URL: llm_katan-0.5.2.tar.gz
  • Upload date:
  • Size: 42.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for llm_katan-0.5.2.tar.gz

  • SHA256: 78e5fa2787a198e4db738d52015574d5ae44fcabdb33bd35488c2f5de1b9f243
  • MD5: c819e8a54a855d30a3a878f28a9f2636
  • BLAKE2b-256: b00d318135bf89786b392eb87ad69143c2ad2d8b0f0085328f4b67a846e45254


File details

Details for the file llm_katan-0.5.2-py3-none-any.whl.

File metadata

  • Download URL: llm_katan-0.5.2-py3-none-any.whl
  • Upload date:
  • Size: 48.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for llm_katan-0.5.2-py3-none-any.whl

  • SHA256: 0b55e4a2a7f0e4647062ecdadc0644a5b96a79f6eefbce095b3e4cad2abb2e93
  • MD5: b41f575a26a04f9d3a11436dabb0f9cf
  • BLAKE2b-256: 166389fc686ba81a43611f1a7350fb53850079098cd490aa2c9a29ff19e7b6a5

