One tiny model, every LLM API. Drop-in test server for OpenAI, Anthropic, Bedrock, and Vertex.

These details have not been verified by PyPI

Project links

Project description

LLM Katan - Lightweight LLM Server for Testing

A lightweight LLM serving package using FastAPI and HuggingFace transformers, designed for testing and development with real tiny models.

🎬 See Live Demo Interactive terminal showing multi-instance setup in action!

Features

🚀 FastAPI-based: High-performance async web server
🤗 HuggingFace Integration: Real model inference with transformers
⚡ Tiny Models: Ultra-lightweight models for fast testing (Qwen3-0.6B, etc.)
🔄 Multi-Provider: Serve the same model as OpenAI, Anthropic, and more (Bedrock, Vertex coming soon)
🎯 API Compatible: Drop-in replacement for provider endpoints with correct response formats
🔐 Auth Validation: Each provider requires its native auth header (e.g., Authorization for OpenAI, x-api-key for Anthropic)
📦 PyPI Ready: Easy installation and distribution
🛠️ vLLM Support: Optional vLLM backend for production-like performance

Quick Start

Installation

Option 1: PyPI

pip install llm-katan

Option 2: Docker

# Pull and run the latest Docker image
docker pull ghcr.io/yossiovadia/llm-katan/llm-katan:latest
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest

# Or with custom model
docker run -p 8000:8000 ghcr.io/yossiovadia/llm-katan/llm-katan:latest \
  llm-katan --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Setup

HuggingFace Token (Required)

LLM Katan uses HuggingFace transformers to download models. You'll need a HuggingFace token for:

Private models
Avoiding rate limits
Reliable model downloads

Option 1: Environment Variable

export HUGGINGFACE_HUB_TOKEN="your_token_here"

Option 2: Login via CLI

huggingface-cli login

Option 3: Token file in home directory

# Create ~/.cache/huggingface/token file with your token
echo "your_token_here" > ~/.cache/huggingface/token

Get your token: Visit https://huggingface.co/settings/tokens

Basic Usage

# Echo mode — no model download, instant startup, no torch needed
llm-katan --model my-test-model --backend echo --providers openai,anthropic

# Start server with a tiny model (quantization enabled by default for speed)
llm-katan --model Qwen/Qwen3-0.6B --port 8000

# Start with custom served model name
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Disable quantization for higher accuracy (slower)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --no-quantize

# With vLLM backend (optional)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --backend vllm

Multi-Instance Testing

🎬 Live Demo See this in action with animated terminals!

Note: If GitHub Pages isn't enabled, you can also download and open the demo locally

📺 Preview (click to expand)

# Terminal 1: Installing and starting GPT-3.5-Turbo mock
$ pip install llm-katan
Successfully installed llm-katan-0.1.8

$ llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: gpt-3.5-turbo
✅ Server running on http://0.0.0.0:8000

# Terminal 2: Starting Claude-3-Haiku mock
$ llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: claude-3-haiku
✅ Server running on http://0.0.0.0:8001

# Terminal 3: Testing both endpoints
$ curl localhost:8000/v1/models | jq '.data[0].id'
"gpt-3.5-turbo"

$ curl localhost:8001/v1/models | jq '.data[0].id'
"claude-3-haiku"

# Same tiny model, different API names! 🎯

# Terminal 1: Mock GPT-3.5-Turbo
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"

# Terminal 2: Mock Claude-3-Haiku
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"

# Terminal 3: Test both endpoints
curl http://localhost:8000/v1/models  # Returns "gpt-3.5-turbo"
curl http://localhost:8001/v1/models  # Returns "claude-3-haiku"

Perfect for testing multi-provider scenarios with one tiny model!

How It Works

llm-katan does not proxy requests to real providers. There is no OpenAI SDK, no Anthropic SDK, no cloud API calls. Instead, each provider is a thin formatting layer around the same local model:

Request (any provider format)
       |
       v
Provider (openai.py / anthropic.py / ...)
  - Parses the provider-specific request format
  - Extracts: messages, max_tokens, temperature
  - Normalizes to plain Python dicts
       |
       v
Backend (model.py)
  - Converts messages to a prompt string
  - Feeds it directly to the local model (e.g., Qwen3-0.6B)
  - Returns: generated text + token counts
       |
       v
Provider (same one that handled the request)
  - Wraps the raw text in the provider's native response format
  - Returns to client

So Anthropic format in, Anthropic format out. OpenAI format in, OpenAI format out. The backend has zero knowledge of any provider format — it just generates text. No translation chain, no provider in the middle.

API Endpoints

Shared:

GET /health - Health check (shows active providers)
GET /metrics - Prometheus metrics

OpenAI (--providers openai):

GET /v1/models - List available models
POST /v1/chat/completions - Chat completions (OpenAI compatible)

Anthropic (--providers anthropic):

POST /v1/messages - Messages API (Anthropic compatible, with SSE streaming)

Vertex AI / Gemini (--providers vertexai):

POST /v1beta/models/{model}:generateContent - Generate content
POST /v1beta/models/{model}:streamGenerateContent - Streaming generate content
Also supports /v1/ prefix

AWS Bedrock (--providers bedrock):

POST /model/{modelId}/converse - Converse API (unified, model-agnostic)
POST /model/{modelId}/converse-stream - Streaming Converse API
POST /model/{modelId}/invoke - InvokeModel (Anthropic Claude format when model ID contains "anthropic" or "claude", Amazon Titan format otherwise)

Enable all providers at once: --providers openai,anthropic,vertexai,bedrock

Example API Usage

# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Creative writing example
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Write a short poem about coding"}
    ],
    "max_tokens": 100,
    "temperature": 0.8
  }'

# Check available models
curl http://127.0.0.1:8000/v1/models

# Health check
curl http://127.0.0.1:8000/health

# Anthropic Messages API
curl -X POST http://127.0.0.1:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: test-key" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-test",
    "max_tokens": 50,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

# Vertex AI / Gemini API
curl -X POST http://127.0.0.1:8000/v1beta/models/gemini-pro:generateContent \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-token" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "What is the capital of France?"}]}
    ]
  }'

# AWS Bedrock Converse API
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/converse \
  -H "Content-Type: application/json" \
  -H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
  -d '{
    "messages": [
      {"role": "user", "content": [{"text": "What is the capital of France?"}]}
    ]
  }'

# AWS Bedrock InvokeModel (Anthropic Claude)
curl -X POST http://127.0.0.1:8000/model/anthropic.claude-v2/invoke \
  -H "Content-Type: application/json" \
  -H "Authorization: AWS4-HMAC-SHA256 Credential=test" \
  -d '{
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

CPU Optimization

LLM Katan includes automatic int8 quantization for CPU inference, providing significant performance improvements:

Performance Gains

2-4x faster inference on CPU (on supported platforms)
4x memory reduction
Enabled by default for best testing experience
Minimal quality impact (acceptable for testing scenarios)
Platform support: Works best on Linux x86_64; may not be available on all platforms (e.g., Mac)

When to Use Quantization

✅ Enabled (default) - Recommended for:

Fast E2E testing
Development environments
CI/CD pipelines
Resource-constrained environments

❌ Disabled (--no-quantize) - Use when you need:

Maximum accuracy (though tiny models have limited accuracy anyway)
Debugging precision-sensitive issues
Comparing with full-precision baselines

Example Performance

# Default: Fast with quantization (~50-100s per inference)
llm-katan --model Qwen/Qwen3-0.6B

# Slower but more accurate (~200s per inference)
llm-katan --model Qwen/Qwen3-0.6B --no-quantize

Note: Even with quantization, llm-katan is slower than production tools like LM Studio (which uses llama.cpp with extensive optimizations). For production workloads, use vLLM, Ollama, or similar solutions.

Use Cases

Strengths

Fastest time-to-test: 30 seconds from install to running
Optimized for CPU: Automatic int8 quantization for 2-4x speedup
Minimal resource footprint: Designed for tiny models and efficient testing
No GPU required: Runs on laptops, Macs, and any CPU-only environment
CI/CD integration friendly: Lightweight and automation-ready
Multiple instances: Run same model with different names on different ports

Ideal For

Automated testing pipelines: Quick LLM endpoint setup for test suites
Development environment mocking: Real inference without production overhead
Quick prototyping: Fast iteration with actual model behavior
Educational/learning scenarios: Easy setup for AI development learning

Not Ideal For

Production workloads: Use Ollama or vLLM for production deployments
Large model serving: Designed for tiny models (< 1B parameters)
Complex multi-agent workflows: Use Semantic Kernel or similar frameworks
High-performance inference: Use vLLM or specialized serving solutions

Configuration

Command Line Options

# All available options
llm-katan [OPTIONS]

Required:
  -m, --model TEXT              Model name to load (e.g., 'Qwen/Qwen3-0.6B') [required]

Optional:
  -n, --name, --served-model-name TEXT
                                Model name to serve via API (defaults to model name)
  -p, --port INTEGER            Port to serve on (default: 8000)
  -h, --host TEXT               Host to bind to (default: 0.0.0.0)
  -b, --backend [transformers|vllm|echo]  Backend to use (default: transformers)
  --max, --max-tokens INTEGER   Maximum tokens to generate (default: 512)
  -t, --temperature FLOAT       Sampling temperature (default: 0.7)
  -d, --device [auto|cpu|cuda]  Device to use (default: auto)
  --quantize/--no-quantize      Enable int8 quantization for faster CPU inference (default: enabled)
  --providers TEXT               Comma-separated providers to enable (default: openai)
  --max-concurrent INTEGER      Max concurrent inference requests (default: 1)
  --log-level [debug|info|warning|error]  Log level (default: INFO)
  --version                     Show version and exit
  --help                        Show help and exit

Advanced Usage Examples

# Serve all provider endpoints (auth always required per provider)
llm-katan --model Qwen/Qwen3-0.6B --providers openai,anthropic,vertexai,bedrock

# Custom generation settings
llm-katan --model Qwen/Qwen3-0.6B --max-tokens 1024 --temperature 0.9

# Force specific device with full precision (no quantization)
llm-katan --model Qwen/Qwen3-0.6B --device cpu --no-quantize --log-level debug

# Custom host and port
llm-katan --model Qwen/Qwen3-0.6B --host 127.0.0.1 --port 9000

# Multiple servers with different settings
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --max-tokens 512 --temperature 0.1
llm-katan --model Qwen/Qwen3-0.6B --port 8001 \
  --name "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --max-tokens 256 --temperature 0.9

Environment Variables

LLM_KATAN_MODEL: Default model to load
LLM_KATAN_PORT: Default port (8000)
LLM_KATAN_BACKEND: Backend type (transformers|vllm)

Development

# Clone and install in development mode
git clone <repo>
cd llm-katan
pip install -e .

# Run with development dependencies
pip install -e ".[dev]"

License

Apache-2.0 License

Contributing

Contributions welcome! Please see the main repository for guidelines.

Created by Yossi Ovadia

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.14.0

Apr 21, 2026

0.13.0

Apr 18, 2026

0.12.3

Apr 18, 2026

0.12.2

Apr 18, 2026

0.12.1

Apr 18, 2026

0.12.0

Apr 18, 2026

0.11.0

Apr 13, 2026

0.10.0

Apr 10, 2026

0.9.0

Mar 26, 2026

0.8.2

Mar 26, 2026

0.8.1

Mar 26, 2026

0.8.0

Mar 26, 2026

0.7.3

Mar 20, 2026

0.7.2

Mar 20, 2026

0.7.1

Mar 20, 2026

0.7.0

Mar 20, 2026

0.6.0

Mar 19, 2026

0.5.2

Mar 19, 2026

This version

0.5.1

Mar 19, 2026

0.5.0

Mar 19, 2026

0.4.0

Mar 19, 2026

0.3.2

Mar 19, 2026

0.3.1

Mar 19, 2026

0.3.0

Mar 19, 2026

0.2.1

Mar 19, 2026

0.2.0

Mar 19, 2026

0.1.10

Oct 29, 2025

0.1.9

Oct 6, 2025

0.1.8

Sep 26, 2025

0.1.7

Sep 26, 2025

0.1.6

Sep 26, 2025

0.1.5

Sep 26, 2025

0.1.4

Sep 26, 2025

0.1.3

Sep 26, 2025

0.1.2

Sep 26, 2025

0.1.1

Sep 26, 2025

0.1.0

Sep 26, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_katan-0.5.1.tar.gz (40.3 kB view details)

Uploaded Mar 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_katan-0.5.1-py3-none-any.whl (47.5 kB view details)

Uploaded Mar 19, 2026 Python 3

File details

Details for the file llm_katan-0.5.1.tar.gz.

File metadata

Download URL: llm_katan-0.5.1.tar.gz
Upload date: Mar 19, 2026
Size: 40.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for llm_katan-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`a4fd073cae499ee71c55e3e1ec691167d18bd5dc972dde4079a0fe756f61c286`
MD5	`61812f633a8679b61a0f183be430b309`
BLAKE2b-256	`59c2d0688c4519f29281336cd359ba5581a270a39534fcf614f5856ede57aea4`

See more details on using hashes here.

File details

Details for the file llm_katan-0.5.1-py3-none-any.whl.

File metadata

Download URL: llm_katan-0.5.1-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 47.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for llm_katan-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb7c20852ce13d64de5f0ad6b7b36889ce24943357e154c49b91a3a3f0ecf4e0`
MD5	`b54441fc92231d525b53443014ce2c53`
BLAKE2b-256	`fe8c8d40c35827ad67dff9556316f12b94aeb6967529305c50f85f0cdf5523ab`

See more details on using hashes here.

llm-katan 0.5.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

LLM Katan - Lightweight LLM Server for Testing

Features

Quick Start

Installation

Option 1: PyPI

Option 2: Docker

Setup

HuggingFace Token (Required)

Option 1: Environment Variable

Option 2: Login via CLI

Option 3: Token file in home directory

Basic Usage

Multi-Instance Testing

How It Works

API Endpoints

Example API Usage

CPU Optimization

Performance Gains

When to Use Quantization

Example Performance

Use Cases

Strengths

Ideal For

Not Ideal For

Configuration

Command Line Options

Advanced Usage Examples

Environment Variables

Development

License

Contributing

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes