InstaLLM

One-command deployment of OpenAI-compatible APIs for open-source LLMs.

InstaLLM is a developer tool that turns any open-source large language model into a production-ready API server with a single CLI command. It is designed for developers building AI applications who want the flexibility of open-source models with the convenience of the OpenAI API contract.

pip install installm
installm up --model Qwen/Qwen2.5-7B-Instruct

Your API is now live at http://localhost:8000. Point any OpenAI SDK or LangChain app at it:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

Features

Feature                  Description
One-command deployment   installm up --model <model> — that's it
OpenAI-compatible API    Drop-in replacement: change base_url, keep all your code
Four backends            vLLM (GPU), Transformers (CPU/MPS/CUDA), llama.cpp (GGUF), Ollama
Auto backend selection   Picks the best backend for your hardware automatically
SSE streaming            Real-time token streaming via Server-Sent Events
Tool calling             Native for vLLM/Ollama; prompt-and-parse fallback for Transformers/llama.cpp
Structured outputs       json_object and json_schema with validate-and-retry fallback
Responses API            Semantic streaming events following the Open Responses spec
Multi-model              Deploy multiple models simultaneously; the gateway routes by the model field
Model aliases            Map short names to long model IDs for convenience
API key authentication   Optional Bearer token auth, OpenAI-compatible
Docker support           CPU and GPU Dockerfiles included

Installation

# Base install (Ollama backend only, Ollama must be installed separately)
pip install installm

# With Transformers backend (CPU / MPS / CUDA)
pip install "installm[transformers]"

# With vLLM backend (Linux + NVIDIA GPU only)
pip install "installm[vllm]"

# With llama.cpp backend (GGUF models)
pip install "installm[llamacpp]"

# Everything
pip install "installm[transformers,vllm,llamacpp]"

Requirements: Python 3.10+


Quick Start

1. Start a model

# Auto-selects the best backend for your hardware
installm up --model Qwen/Qwen2.5-7B-Instruct

# Force a specific backend
installm up --model Qwen/Qwen2.5-7B-Instruct --backend transformers

# Custom host and port
installm up --model Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8080

2. Use it — no code changes needed

from openai import OpenAI

# Just change base_url — everything else stays the same
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print(response.choices[0].message.content)

3. Streaming

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

4. Tool Calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Hong Kong?"}],
    tools=tools,
    tool_choice="auto",
)
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
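
To close the loop, execute the function yourself and return the result as a tool-role message, then ask the model for its final answer. This follows the standard OpenAI tool-calling pattern; the weather payload below is a hard-coded stand-in for a real lookup:

import json

follow_up = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Hong Kong?"},
        response.choices[0].message,  # the assistant turn containing the tool call
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps({"temp_c": 28, "condition": "humid"}),
        },
    ],
    tools=tools,
)
print(follow_up.choices[0].message.content)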

5. Structured Outputs

import json

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Give me a person with name and age"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
person = json.loads(response.choices[0].message.content)
print(person)  # e.g. {'name': 'Alice', 'age': 30}

6. Model Aliases

# Create a short alias for a long model ID
installm alias qwen Qwen/Qwen2.5-7B-Instruct

# Now use the alias in API calls
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hi"}]}'

# Remove an alias
installm unalias qwen

7. Multi-model Serving

installm up --model Qwen/Qwen2.5-7B-Instruct --model mistralai/Mistral-7B-Instruct-v0.3

Both models are accessible through the same API — the gateway routes requests based on the model field.
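
For example, a single client can address both models simply by switching the model field (a quick sketch; the prompt is illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for model in ("Qwen/Qwen2.5-7B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"):
    reply = client.chat.completions.create(
        model=model,  # the gateway dispatches on this field
        messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    )
    print(f"{model}: {reply.choices[0].message.content}")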

8. Gated Models & HuggingFace Token

Some models (Llama, Gemma, Mistral-large, etc.) require you to accept the model licence on HuggingFace before downloading. Once accepted, generate a token at huggingface.co/settings/tokens and save it with InstaLLM:

# Save once — used automatically for all future downloads
installm token set hf_xxxxxxxxxxxxxxxxxxxxxxxx

# Check status
installm token status

# Remove saved token
installm token clear

The token is stored in ~/.installm/state.json. You can also pass it as a one-off flag or environment variable:

# One-off flag (not saved)
installm up --model meta-llama/Llama-3.1-8B-Instruct --token hf_xxxx

# Environment variable (always takes priority over saved token)
export HF_TOKEN=hf_xxxx
installm up --model meta-llama/Llama-3.1-8B-Instruct

Public models (Qwen, Phi, Mistral-7B, etc.) require no token at all.

9. Authentication

InstaLLM supports optional API key authentication that mirrors the OpenAI API pattern:

# Generate a key
installm auth create --label "dev-laptop"
# >> Key: sk-installm-a1b2c3d4...  (save this!)

# Start the server with auth enabled
installm up --model Qwen/Qwen2.5-7B-Instruct --require-auth

# Or enable via environment variable
export INSTALLM_REQUIRE_AUTH=1
installm up --model Qwen/Qwen2.5-7B-Instruct

Clients authenticate exactly like they do with OpenAI — the api_key parameter just works:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-installm-a1b2c3d4...",  # Your generated key
)

Key management:

installm auth ls            # List active keys (prefix only)
installm auth revoke <id>   # Revoke a key

When auth is not enabled (the default), all requests pass through without any key — fully backward compatible.
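
For clients that build HTTP requests by hand, the header is the standard Bearer scheme. A short sketch using requests; the key value is a placeholder:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-installm-a1b2c3d4..."},  # placeholder key
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hi"}],
    },
)
resp.raise_for_status()  # a 401 here means the key is missing, wrong, or revoked
print(resp.json()["choices"][0]["message"]["content"])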

10. Framework Compatibility

InstaLLM works with any framework that supports the OpenAI API:

LangChain:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="Qwen/Qwen2.5-7B-Instruct",
)
print(llm.invoke("What is InstaLLM?").content)

CrewAI:

from crewai import LLM

llm = LLM(
    model="openai/Qwen/Qwen2.5-7B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

CLI Reference

Command                                       Description
installm up --model <id> [--model <id>...]    Pull model(s) and start the API server
installm pull --model <id>                    Download a model without starting the server
installm ls                                   List all downloaded models and aliases
installm down                                 Stop the running server
installm logs                                 Show recent server logs
installm alias <name> <model_id>              Create a short alias for a model ID
installm unalias <name>                       Remove a model alias
installm token set <token>                    Save HuggingFace token for gated model downloads
installm token status                         Show whether a HuggingFace token is saved
installm token clear                          Remove the saved HuggingFace token
installm auth create [--label]                Generate a new API key
installm auth ls                              List active API keys (prefix only)
installm auth revoke <id>                     Revoke an API key

installm up options

Option           Default       Description
--model, -m      (required)    HuggingFace model ID (repeatable for multi-model)
--host           0.0.0.0       Bind address
--port           8000          Port number
--backend        auto          Force a backend: transformers, vllm, ollama, or llamacpp
--require-auth   off           Require API key authentication for all requests
--token          (saved/env)   HuggingFace token for gated models (one-off; use installm token set to save)

API Reference

Endpoint               Method   Description
/health                GET      Liveness check
/v1/models             GET      List loaded models and aliases (OpenAI format)
/v1/chat/completions   POST     Chat completion (streaming and non-streaming)
/v1/embeddings         POST     Text embeddings
/v1/responses          POST     Responses API with semantic streaming events
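
The chat endpoint is covered in the Quick Start. The embeddings and Responses endpoints can be exercised with the same SDK; a short sketch, assuming both accept the standard OpenAI request shapes:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# POST /v1/embeddings
emb = client.embeddings.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input="InstaLLM turns open models into APIs.",
)
print(len(emb.data[0].embedding))  # vector dimensionality

# POST /v1/responses
resp = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input="Say hello in one short sentence.",
)
print(resp.output_text)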

Backends

InstaLLM auto-selects the best available backend in this order:

  1. vLLM — highest throughput, requires Linux + NVIDIA GPU with CUDA
  2. Transformers — universal, works on CPU / Apple MPS / CUDA
  3. llama.cpp — efficient GGUF inference, works on CPU and GPU
  4. Ollama — requires Ollama to be installed separately
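
A minimal sketch of this selection logic; the availability checks below are illustrative assumptions, not InstaLLM's actual internals:

import importlib.util
import platform
import shutil

def importable(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

def has_cuda_gpu() -> bool:
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

def pick_backend() -> str:
    # 1. vLLM: Linux + NVIDIA GPU with CUDA
    if importable("vllm") and platform.system() == "Linux" and has_cuda_gpu():
        return "vllm"
    # 2. Transformers: universal (CPU / MPS / CUDA)
    if importable("transformers"):
        return "transformers"
    # 3. llama.cpp: GGUF inference on CPU or GPU
    if importable("llama_cpp"):
        return "llamacpp"
    # 4. Ollama: needs the external daemon on PATH
    if shutil.which("ollama"):
        return "ollama"
    raise RuntimeError("No usable backend found")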

You can force a specific backend with --backend:

installm up --model Qwen/Qwen2.5-7B-Instruct --backend transformers

Backend capabilities

Feature              vLLM     Transformers   llama.cpp   Ollama
Tool calling         Native   Gateway        Gateway     Native
Structured outputs   Native   Gateway        Gateway     Native
Streaming            Yes      Yes            Yes         Yes
Embeddings           Yes      Yes            Yes         Yes
GPU required         Yes      No             No          No
Platform             Linux    All            All         All

Native means the inference engine enforces the constraint at the model level. Gateway means InstaLLM handles it transparently — tool calls are injected via a system prompt and parsed from the model output; structured outputs are validated against the schema with automatic retries. From the API caller's perspective, both modes are identical.
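
The structured-output half of the Gateway path can be pictured as a small loop: ask for JSON, validate against the schema, and retry with the error message on failure. A sketch of that pattern using jsonschema; the function and its signature are illustrative, not InstaLLM internals:

import json

from jsonschema import ValidationError, validate

def generate_json(generate, schema, prompt, max_retries=2):
    """generate is any callable(prompt: str) -> str, e.g. a backend's text generator."""
    request = f"{prompt}\n\nRespond with JSON matching this schema:\n{json.dumps(schema)}"
    for _ in range(max_retries + 1):
        raw = generate(request)
        try:
            data = json.loads(raw)
            validate(instance=data, schema=schema)  # raises ValidationError on mismatch
            return data
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the failure back so the model can self-correct on the next attempt
            request = f"{request}\n\nThe previous reply was invalid ({err}). Try again."
    raise ValueError("Model failed to produce schema-valid JSON")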

Platform notes

  • vLLM raises a clear error on Windows/macOS or when no CUDA GPU is detected
  • Transformers auto-detects CUDA > MPS > CPU (see the sketch after this list)
  • llama.cpp works with GGUF model files; auto-resolves from HuggingFace cache
  • Ollama requires the Ollama daemon to be running (ollama serve)
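
The CUDA > MPS > CPU order mentioned above corresponds to standard PyTorch capability checks (an illustrative sketch, not InstaLLM's code):

import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"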

Docker

CPU

docker build -t installm .
docker run -p 8000:8000 installm up --model Qwen/Qwen2.5-0.5B-Instruct

GPU (NVIDIA)

docker build --build-arg BASE=nvidia/cuda:12.1.0-runtime-ubuntu22.04 -t installm-gpu .
docker run --gpus all -p 8000:8000 installm-gpu up --model Qwen/Qwen2.5-7B-Instruct

Docker Compose

docker compose up

Project Structure

src/installm/
├── __init__.py          # Version
├── auth.py              # API key generation, hashing, validation
├── cli.py               # Click CLI (up, down, ls, pull, alias, auth, logs)
├── config.py            # State manifest, aliases (~/.installm/state.json)
├── download.py          # HuggingFace Hub integration
├── backends/
│   ├── __init__.py      # Backend registry and auto-selection
│   ├── base.py          # Abstract base class
│   ├── transformers.py  # HF Transformers backend
│   ├── vllm.py          # vLLM backend
│   ├── llamacpp.py      # llama.cpp backend (GGUF)
│   └── ollama.py        # Ollama backend
└── gateway/
    ├── __init__.py
    ├── app.py           # FastAPI app, backend registry, server launcher
    ├── middleware.py    # Auth middleware (Bearer token validation)
    ├── schemas.py       # Pydantic models (OpenAI contract)
    ├── streaming.py     # SSE helpers
    ├── tools.py         # Tool calling prompt injection and parsing
    ├── structured.py    # JSON mode and validate-and-retry
    └── routes/
        ├── __init__.py
        ├── models.py        # GET /v1/models
        ├── chat.py          # POST /v1/chat/completions
        ├── embeddings.py    # POST /v1/embeddings
        └── responses.py     # POST /v1/responses

Testing

# Install test dependencies
pip install "installm[transformers]" pytest pytest-asyncio httpx

# Run unit tests (fast, no model download, no GPU needed)
pytest tests/ --ignore=tests/test_integration_live.py --ignore=tests/test_e2e_qwen.py -v

# Run live integration tests (downloads a 2.5MB test model)
pytest tests/test_integration_live.py -v

# Run full e2e tests with OpenAI SDK + LangChain (downloads Qwen2.5-0.5B)
pytest tests/test_e2e_qwen.py -v

# Run everything
pytest tests/ -v

Test coverage

Test file                                   What it covers
test_config.py                              State manifest CRUD, server info lifecycle
test_alias.py                               Alias set/remove/resolve, backward compat
test_cli.py                                 CLI help, ls, pull commands
test_download.py                            HF Hub download, caching
test_backends/test_ollama.py                Ollama backend (mocked)
test_backends/test_transformers.py          Transformers backend (mocked)
test_backends/test_vllm.py                  vLLM backend (mocked)
test_backends/test_llamacpp.py              llama.cpp backend (mocked)
test_gateway/test_health.py                 Health endpoint
test_gateway/test_models.py                 Models list endpoint
test_gateway/test_chat.py                   Chat completions (non-streaming, streaming, tools, structured)
test_gateway/test_embeddings.py             Embeddings endpoint
test_gateway/test_responses.py              Responses API (non-streaming, streaming events)
test_gateway/test_tools_and_structured.py   Tool prompt builder, JSON parser, validate-and-retry
test_auth.py                                Key CRUD, validation, middleware (401/200), CLI commands
test_integration_live.py                    Live test with tiny-gpt2 model
test_e2e_qwen.py                            Full e2e with OpenAI SDK, LangChain, tool calling, JSON mode

Future Work

  • Per-key rate limiting — throttle requests per API key
  • Prometheus metrics — /metrics endpoint for monitoring
  • Model routing — route requests to different models based on rules
  • Observability dashboard — request logs, latency metrics, token usage
  • TensorRT-LLM backend — NVIDIA-optimised inference for production

License

MIT
