InstaLLM
One-command deployment of OpenAI-compatible APIs for open-source LLMs.
InstaLLM is a developer tool that turns any open-source large language model into a production-ready API server with a single CLI command. It is designed for developers building AI applications who want the flexibility of open-source models with the convenience of the OpenAI API contract.
pip install installm
installm up --model Qwen/Qwen2.5-7B-Instruct
Your API is now live at http://localhost:8000. Point any OpenAI SDK or LangChain app at it:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
Features
| Feature | Description |
|---|---|
| One-command deployment | installm up --model <model> — that's it |
| OpenAI-compatible API | Drop-in replacement: change base_url, keep all your code |
| Four backends | vLLM (GPU), Transformers (CPU/MPS/CUDA), llama.cpp (GGUF), Ollama |
| Auto backend selection | Picks the best backend for your hardware automatically |
| SSE Streaming | Real-time token streaming via Server-Sent Events |
| Tool Calling | Native for vLLM/Ollama; prompt-and-parse fallback for Transformers/llama.cpp |
| Structured Outputs | json_object and json_schema with validate-and-retry fallback |
| Responses API | Semantic streaming events following the Open Responses spec |
| Multi-model | Deploy multiple models simultaneously, gateway routes by model field |
| Model Aliases | Map short names to long model IDs for convenience |
| API Key Authentication | Optional Bearer token auth, OpenAI-compatible |
| Docker support | CPU and GPU Dockerfiles included |
Installation
# Base install (Ollama backend only, Ollama must be installed separately)
pip install installm
# With Transformers backend (CPU / MPS / CUDA)
pip install "installm[transformers]"
# With vLLM backend (Linux + NVIDIA GPU only)
pip install "installm[vllm]"
# With llama.cpp backend (GGUF models)
pip install "installm[llamacpp]"
# Everything
pip install "installm[transformers,vllm,llamacpp]"
Requirements: Python 3.10+
Quick Start
1. Start a model
# Auto-selects the best backend for your hardware
installm up --model Qwen/Qwen2.5-7B-Instruct
# Force a specific backend
installm up --model Qwen/Qwen2.5-7B-Instruct --backend transformers
# Custom host and port
installm up --model Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8080
2. Use it — no code changes needed
from openai import OpenAI
# Just change base_url — everything else stays the same
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print(response.choices[0].message.content)
3. Streaming
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
4. Tool Calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Hong Kong?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
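To close the loop, append the tool result as a tool message and ask the model again, just as with the OpenAI API. The weather JSON below is a made-up placeholder for whatever your real function returns:

messages = [
    {"role": "user", "content": "What's the weather in Hong Kong?"},
    response.choices[0].message,  # the assistant turn that contains the tool call
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": '{"city": "Hong Kong", "temp_c": 29}',  # placeholder: your function's real output
    },
]
final = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)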
5. Structured Outputs
import json
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Give me a person with name and age"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
person = json.loads(response.choices[0].message.content)
print(person)  # {'name': 'Alice', 'age': 30}
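The simpler json_object mode from the feature table skips the schema and just constrains the reply to valid JSON. A minimal sketch (as with OpenAI, it helps to mention JSON in the prompt):

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Return a JSON object listing three primary colours under the key 'colours'."}],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))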
6. Model Aliases
# Create a short alias for a long model ID
installm alias qwen Qwen/Qwen2.5-7B-Instruct
# Now use the alias in API calls
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hi"}]}'
# Remove an alias
installm unalias qwen
7. Multi-model Serving
installm up --model Qwen/Qwen2.5-7B-Instruct --model mistralai/Mistral-7B-Instruct-v0.3
Both models are accessible through the same API — the gateway routes requests based on the model field.
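For example, one client can talk to both models just by switching the model field (model IDs as started above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model_id in ["Qwen/Qwen2.5-7B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"]:
    reply = client.chat.completions.create(
        model=model_id,  # the gateway routes to the matching backend
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(model_id, "->", reply.choices[0].message.content)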
8. Gated Models & HuggingFace Token
Some models (Llama, Gemma, Mistral-large, etc.) require you to accept the model licence on HuggingFace before downloading. Once accepted, generate a token at huggingface.co/settings/tokens and save it with InstaLLM:
# Save once — used automatically for all future downloads
installm token set hf_xxxxxxxxxxxxxxxxxxxxxxxx
# Check status
installm token status
# Remove saved token
installm token clear
The token is stored in ~/.installm/state.json. You can also pass it as a one-off flag or environment variable:
# One-off flag (not saved)
installm up --model meta-llama/Llama-3.1-8B-Instruct --token hf_xxxx
# Environment variable (always takes priority over saved token)
export HF_TOKEN=hf_xxxx
installm up --model meta-llama/Llama-3.1-8B-Instruct
Public models (Qwen, Phi, Mistral-7B, etc.) require no token at all.
9. Authentication
InstaLLM supports optional API key authentication that mirrors the OpenAI API pattern:
# Generate a key
installm auth create --label "dev-laptop"
# >> Key: sk-installm-a1b2c3d4... (save this!)
# Start the server with auth enabled
installm up --model Qwen/Qwen2.5-7B-Instruct --require-auth
# Or enable via environment variable
export INSTALLM_REQUIRE_AUTH=1
installm up --model Qwen/Qwen2.5-7B-Instruct
Clients authenticate exactly like they do with OpenAI — the api_key parameter just works:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-installm-a1b2c3d4...",  # Your generated key
)
Key management:
installm auth ls # List active keys (prefix only)
installm auth revoke <id> # Revoke a key
When auth is not enabled (the default), all requests pass through without any key — fully backward compatible.
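With auth enabled, a request with a missing or revoked key is rejected with a 401, which the OpenAI SDK surfaces as an AuthenticationError. A quick sketch:

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-installm-not-a-real-key")
try:
    client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Hi"}],
    )
except openai.AuthenticationError as err:
    print("rejected by the auth middleware:", err)  # HTTP 401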
10. Framework Compatibility
InstaLLM works with any framework that supports the OpenAI API:
LangChain:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen2.5-7B-Instruct",
)
print(llm.invoke("What is InstaLLM?").content)
CrewAI:
from crewai import LLM
llm = LLM(
    model="openai/Qwen/Qwen2.5-7B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
CLI Reference
| Command | Description |
|---|---|
| installm up --model <id> [--model <id>...] | Pull model(s) and start the API server |
| installm pull --model <id> | Download a model without starting the server |
| installm ls | List all downloaded models and aliases |
| installm down | Stop the running server |
| installm logs | Show recent server logs |
| installm alias <name> <model_id> | Create a short alias for a model ID |
| installm unalias <name> | Remove a model alias |
| installm token set <token> | Save HuggingFace token for gated model downloads |
| installm token status | Show whether a HuggingFace token is saved |
| installm token clear | Remove the saved HuggingFace token |
| installm auth create [--label] | Generate a new API key |
| installm auth ls | List active API keys (prefix only) |
| installm auth revoke <id> | Revoke an API key |
installm up options
| Option | Default | Description |
|---|---|---|
| --model, -m | (required) | HuggingFace model ID (repeatable for multi-model) |
| --host | 0.0.0.0 | Bind address |
| --port | 8000 | Port number |
| --backend | auto | Force a backend: transformers, vllm, ollama, or llamacpp |
| --require-auth | off | Require API key authentication for all requests |
| --token | (saved/env) | HuggingFace token for gated models (one-off; use installm token set to save) |
API Reference
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Liveness check |
| /v1/models | GET | List loaded models and aliases (OpenAI format) |
| /v1/chat/completions | POST | Chat completion (streaming and non-streaming) |
| /v1/embeddings | POST | Text embeddings |
| /v1/responses | POST | Responses API with semantic streaming events |
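Both extra endpoints are reachable through the same SDK. A short sketch: the embeddings call assumes the served model can produce embeddings (see the backend table below), and the Responses call assumes a recent openai package that ships client.responses:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# POST /v1/embeddings: one vector per input string
emb = client.embeddings.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input=["open-source LLMs", "OpenAI-compatible APIs"],
)
print(len(emb.data), "vectors of dimension", len(emb.data[0].embedding))

# POST /v1/responses: Responses-style request
resp = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input="Summarise InstaLLM in one sentence.",
)
print(resp.output_text)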
Backends
InstaLLM auto-selects the best available backend in this order:
- vLLM — highest throughput, requires Linux + NVIDIA GPU with CUDA
- Transformers — universal, works on CPU / Apple MPS / CUDA
- llama.cpp — efficient GGUF inference, works on CPU and GPU
- Ollama — requires Ollama to be installed separately
You can force a specific backend with --backend:
installm up --model Qwen/Qwen2.5-7B-Instruct --backend transformers
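The priority order above boils down to logic roughly like the sketch below; the arguments are hypothetical stand-ins for the real availability probes in src/installm/backends/__init__.py:

def pick_backend(installed, is_linux, has_cuda, model_is_gguf, ollama_running):
    # Illustrative only: mirrors the documented priority order.
    if "vllm" in installed and is_linux and has_cuda:
        return "vllm"          # highest throughput, Linux + NVIDIA CUDA only
    if "transformers" in installed:
        return "transformers"  # universal: CPU / Apple MPS / CUDA
    if "llamacpp" in installed and model_is_gguf:
        return "llamacpp"      # efficient GGUF inference
    if ollama_running:
        return "ollama"        # needs the separately installed Ollama daemon
    raise RuntimeError("No usable backend; install one, e.g. pip install 'installm[transformers]'")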
Backend capabilities
| Feature | vLLM | Transformers | llama.cpp | Ollama |
|---|---|---|---|---|
| Tool calling | Native | Gateway | Gateway | Native |
| Structured outputs | Native | Gateway | Gateway | Native |
| Streaming | Yes | Yes | Yes | Yes |
| Embeddings | Yes | Yes | Yes | Yes |
| GPU required | Yes | No | No | No |
| Platform | Linux | All | All | All |
Native means the inference engine enforces the constraint at the model level. Gateway means InstaLLM handles it transparently — tool calls are injected via a system prompt and parsed from the model output; structured outputs are validated against the schema with automatic retries. From the API caller's perspective, both modes are identical.
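As an illustration of the gateway approach for tool calling (not the exact prompt or parser used in gateway/tools.py), the idea is roughly:

import json
import re

def build_tool_prompt(tools):
    # Hypothetical sketch: advertise the tools in a system prompt and ask the
    # model to answer with a single JSON object when it wants to call one.
    specs = json.dumps([t["function"] for t in tools], indent=2)
    return (
        "You may call these tools:\n" + specs +
        '\nTo call one, reply with only: {"tool": "<name>", "arguments": {...}}'
    )

def parse_tool_call(text):
    # Pull the first JSON object out of the reply; anything else is a normal answer.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
        return call["tool"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError):
        return None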
Platform notes
- vLLM raises a clear error on Windows/macOS or when no CUDA GPU is detected
- Transformers auto-detects CUDA > MPS > CPU
- llama.cpp works with GGUF model files; auto-resolves from HuggingFace cache
- Ollama requires the Ollama daemon to be running (ollama serve)
Docker
CPU
docker build -t installm .
docker run -p 8000:8000 installm up --model Qwen/Qwen2.5-0.5B-Instruct
GPU (NVIDIA)
docker build --build-arg BASE=nvidia/cuda:12.1.0-runtime-ubuntu22.04 -t installm-gpu .
docker run --gpus all -p 8000:8000 installm-gpu up --model Qwen/Qwen2.5-7B-Instruct
Docker Compose
docker compose up
Project Structure
src/installm/
├── __init__.py          # Version
├── auth.py              # API key generation, hashing, validation
├── cli.py               # Click CLI (up, down, ls, pull, alias, auth, logs)
├── config.py            # State manifest, aliases (~/.installm/state.json)
├── download.py          # HuggingFace Hub integration
├── backends/
│   ├── __init__.py      # Backend registry and auto-selection
│   ├── base.py          # Abstract base class
│   ├── transformers.py  # HF Transformers backend
│   ├── vllm.py          # vLLM backend
│   ├── llamacpp.py      # llama.cpp backend (GGUF)
│   └── ollama.py        # Ollama backend
└── gateway/
    ├── __init__.py
    ├── app.py           # FastAPI app, backend registry, server launcher
    ├── middleware.py    # Auth middleware (Bearer token validation)
    ├── schemas.py       # Pydantic models (OpenAI contract)
    ├── streaming.py     # SSE helpers
    ├── tools.py         # Tool calling prompt injection and parsing
    ├── structured.py    # JSON mode and validate-and-retry
    └── routes/
        ├── __init__.py
        ├── models.py      # GET /v1/models
        ├── chat.py        # POST /v1/chat/completions
        ├── embeddings.py  # POST /v1/embeddings
        └── responses.py   # POST /v1/responses
Testing
# Install test dependencies
pip install "installm[transformers]" pytest pytest-asyncio httpx
# Run unit tests (fast, no model download, no GPU needed)
pytest tests/ --ignore=tests/test_integration_live.py --ignore=tests/test_e2e_qwen.py -v
# Run live integration tests (downloads a 2.5MB test model)
pytest tests/test_integration_live.py -v
# Run full e2e tests with OpenAI SDK + LangChain (downloads Qwen2.5-0.5B)
pytest tests/test_e2e_qwen.py -v
# Run everything
pytest tests/ -v
Test coverage
| Test file | What it covers |
|---|---|
| test_config.py | State manifest CRUD, server info lifecycle |
| test_alias.py | Alias set/remove/resolve, backward compat |
| test_cli.py | CLI help, ls, pull commands |
| test_download.py | HF Hub download, caching |
| test_backends/test_ollama.py | Ollama backend (mocked) |
| test_backends/test_transformers.py | Transformers backend (mocked) |
| test_backends/test_vllm.py | vLLM backend (mocked) |
| test_backends/test_llamacpp.py | llama.cpp backend (mocked) |
| test_gateway/test_health.py | Health endpoint |
| test_gateway/test_models.py | Models list endpoint |
| test_gateway/test_chat.py | Chat completions (non-streaming, streaming, tools, structured) |
| test_gateway/test_embeddings.py | Embeddings endpoint |
| test_gateway/test_responses.py | Responses API (non-streaming, streaming events) |
| test_gateway/test_tools_and_structured.py | Tool prompt builder, JSON parser, validate-and-retry |
| test_auth.py | Key CRUD, validation, middleware (401/200), CLI commands |
| test_integration_live.py | Live test with tiny-gpt2 model |
| test_e2e_qwen.py | Full e2e with OpenAI SDK, LangChain, tool calling, JSON mode |
Future Work
- Per-key rate limiting — throttle requests per API key
- Prometheus Metrics — /metrics endpoint for monitoring
- Model Routing — route requests to different models based on rules
- Observability Dashboard — request logs, latency metrics, token usage
- TensorRT-LLM backend — NVIDIA-optimised inference for production
License
MIT