# llmstack

**One command. Full LLM stack. Zero config.**

Stop wiring Docker containers. Start building AI apps.
## Quick Start

```bash
pip install llmstack-cli
llmstack init --preset rag
llmstack up
```
That's it. You now have 7 services running: inference, embeddings, vector DB, cache, API gateway, Prometheus, and Grafana.
```bash
# Test it immediately
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
```
Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
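For instance, pointing LangChain at the gateway takes one `base_url` override (a sketch; the model name and key are placeholders):

```python
from langchain_openai import ChatOpenAI  # pip install langchain-openai

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # llmstack gateway instead of api.openai.com
    api_key="YOUR_KEY",
    model="llama3.2",
)
print(llm.invoke("Hello!").content)
```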
## Who is this for?
- AI app developers who want local inference without Docker boilerplate
- Teams who need an OpenAI-compatible API backed by local models
- Hobbyists running LLMs locally who want vector search, caching, and monitoring out of the box
- Anyone tired of writing 200+ lines of docker-compose.yml every time
## What you get
```
                 llmstack up
                      |
           +----------v----------+
           |   Hardware Detect   |
           | NVIDIA / Apple / CPU|
           +----------+----------+
                      |
    +--------+--------+--------+---------+
    |        |        |        |         |
+---v---+ +--v---+ +--v---+ +--v--+ +----v-----+
|Qdrant | |Redis | |Ollama| | TEI | | Gateway  |
|Vector | |Cache | |  or  | |Embed| | FastAPI  |
|  DB   | |      | | vLLM | |     | | OpenAI-  |
+-------+ +------+ +------+ +-----+ |compatible|
  :6333    :6379    :11434   :8002  +----+-----+
                                         | :8000
                                   +-----v------+
                                   | Prometheus |
                                   | + Grafana  |
                                   +------------+
                                       :8080
```
| Layer | Service | Default | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |
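Every service is reachable on its own port, so you can poke each one directly. The first three endpoints below are the upstream defaults (Qdrant's `/healthz`, `redis-cli ping`, Ollama's `/api/tags`); the gateway route is an assumption, so rely on `llmstack status` for the real picture:

```bash
curl http://localhost:6333/healthz       # Qdrant liveness
redis-cli -p 6379 ping                   # Redis -> PONG
curl http://localhost:11434/api/tags     # Ollama: lists pulled models
curl http://localhost:8000/health        # Gateway (assumed route)
```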
## How it works

```bash
llmstack init          # Detects hardware, generates llmstack.yaml
                       # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise
llmstack up            # Boots services in order with health checks:
                       #   Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics
llmstack status        # Shows health of all running services
llmstack logs ollama   # Stream inference logs
llmstack down          # Stops everything
```
## Auto hardware detection
| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
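The actual selection code isn't shown here, but the rule in the table reduces to a few lines. This sketch uses `nvidia-smi` for VRAM detection and is illustrative, not llmstack's implementation:

```python
import shutil
import subprocess

def pick_backend() -> str:
    """Pick an inference backend following the table above (illustrative)."""
    if shutil.which("nvidia-smi"):
        # Query total VRAM in MiB for each GPU, take the largest.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        vram_mib = max(int(line) for line in out.stdout.split())
        return "vllm" if vram_mib >= 16 * 1024 else "ollama"
    # Apple Silicon (Metal) and plain CPUs both get Ollama.
    return "ollama"
```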
## Presets

```bash
llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts
```
## Configuration

One file: `llmstack.yaml`

```yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto          # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb
  gateway:
    port: 8000
    auth: api_key
    rate_limit: 100/min
    cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
```
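Since config parsing is built on Pydantic v2 (see tech stack below), validation amounts to a schema along these lines; the class names here are invented for illustration and do not mirror llmstack's internal models:

```python
from typing import Literal

import yaml  # pip install pyyaml
from pydantic import BaseModel

class ChatModel(BaseModel):
    name: str = "llama3.2"
    backend: Literal["auto", "ollama", "vllm"] = "auto"
    context_length: int = 8192

class Models(BaseModel):
    chat: ChatModel = ChatModel()

class StackConfig(BaseModel):
    version: str = "1"
    models: Models = Models()

with open("llmstack.yaml") as f:
    cfg = StackConfig.model_validate(yaml.safe_load(f))
print(cfg.models.chat.backend)  # "auto"
```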
## Interactive Chat

```bash
llmstack chat
```

```
LLMStack Chat — model: llama3.2
Type 'exit' or Ctrl+C to quit. '/clear' to reset conversation.

You: What is quantum computing?
Assistant: Quantum computing uses quantum mechanical phenomena like
superposition and entanglement to process information...

You: /clear
Conversation cleared.
```
Streaming responses, conversation history, and support for any model in your stack.
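Because the gateway speaks the OpenAI protocol, the same streaming behavior is available from any client; for example, with openai-python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```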
## Export to Docker Compose

Don't want to install llmstack? Generate a standalone docker-compose.yml:

```bash
llmstack export
# Exported 7 services to docker-compose.yml
# Run with: docker compose up -d
```
Share the generated file with your team — no llmstack dependency required.
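The exact file depends on your preset and llmstack.yaml; an excerpt from a rag export might look roughly like this (only the ports come from the table above, images and flags are illustrative):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
  redis:
    image: redis:7
    command: ["redis-server", "--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
    ports: ["6379:6379"]
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
  # ...embeddings, gateway, Prometheus, and Grafana follow the same pattern
```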
## Use the API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
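With the rag preset, the same client can drive retrieval end to end. A rough sketch, assuming the gateway proxies embeddings at the standard `/v1/embeddings` route (unverified) and using the official qdrant-client against the bundled Qdrant:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
qdrant = QdrantClient(url="http://localhost:6333")

doc = "Qdrant is a vector database."

# Embed the document and index it.
vec = llm.embeddings.create(model="bge-m3", input=doc).data[0].embedding
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=len(vec), distance=Distance.COSINE),
)
qdrant.upsert("docs", points=[PointStruct(id=1, vector=vec, payload={"text": doc})])

# Retrieve by similarity.
query = llm.embeddings.create(model="bge-m3", input="what is qdrant?").data[0].embedding
hits = qdrant.search(collection_name="docs", query_vector=query, limit=1)
print(hits[0].payload["text"])
```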
## CLI

| Command | Description |
|---|---|
| `llmstack init [--preset]` | Create config with smart defaults |
| `llmstack up [--attach]` | Start all services |
| `llmstack down [--volumes]` | Stop and clean up |
| `llmstack status` | Health check all services |
| `llmstack chat [--model]` | Interactive terminal chat |
| `llmstack export [--output]` | Generate docker-compose.yml |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
## Observability

When `observe.metrics: true` is set, llmstack boots Prometheus + Grafana with a pre-built dashboard:
- Request rate per endpoint
- Latency p50 / p99 histograms
- Token throughput (input + output)
- Error rate (4xx / 5xx)
- Service health (up/down)
Access at http://localhost:8080 (login: admin / llmstack)
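The p50/p99 panels are ordinary `histogram_quantile` queries under the hood; the metric name below is an assumption about what the gateway exports, so take the real one from the shipped dashboard:

```promql
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```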
## Why not just Docker Compose?

Here's what llmstack replaces:

```bash
# Without llmstack: ~200 lines of docker-compose.yml
# You have to configure each service, write health checks,
# set up networking, manage GPU passthrough, create Prometheus
# scrape configs, provision Grafana dashboards...

# With llmstack:
llmstack init && llmstack up
```
## Comparison

| | llmstack | Ollama | LocalAI | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | No | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Yes | No | Yes |
| Built-in vector DB | Yes | No | No | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
## Plugins

Extend llmstack with new backends via pip:

```bash
pip install llmstack-cli-plugin-chromadb
# Now: vectors.provider: chromadb in llmstack.yaml
```
Create your own: implement `ServiceBase` and register it via `entry_points`. See CONTRIBUTING.md.
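Since the plugin contract isn't spelled out here, treat this as a shape sketch only: the import path, method name, and entry-point group are all assumptions to be checked against CONTRIBUTING.md:

```python
# my_plugin.py — illustrative; ServiceBase's real interface may differ.
from llmstack.plugins import ServiceBase  # assumed import path

class ChromaDBService(ServiceBase):
    name = "chromadb"

    def container_spec(self) -> dict:
        # Image and port mapping handed to the Docker SDK (hypothetical method).
        return {"image": "chromadb/chroma", "ports": {"8001": 8001}}
```

```toml
# pyproject.toml — entry-point group name is assumed
[project.entry-points."llmstack.services"]
chromadb = "my_plugin:ChromaDBService"
```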
## Tech stack
- CLI: Typer + Rich
- Config: Pydantic v2
- Gateway: FastAPI
- Containers: Docker SDK for Python
- Metrics: Prometheus + Grafana
## Requirements
- Python 3.11+
- Docker
## Contributing
See CONTRIBUTING.md for development setup and guidelines.
## License
Apache-2.0