# llmstack

One command. Full LLM stack. Zero config.

Stop wiring Docker containers. Start building AI apps.
llmstack spins up a production-grade LLM stack locally with a single command. It auto-detects your hardware, picks the optimal inference backend, and wires everything together.
```bash
pip install llmstack-cli
llmstack init
llmstack up
```
That's it. You now have a full LLM API running locally.
## Architecture

```
                      llmstack up
                           |
                +----------v-----------+
                |   Hardware Detect    |
                | NVIDIA / Apple / CPU |
                +----------+-----------+
                           |
     +---------+---------+--+------+----------+
     |         |         |         |          |
 +---v----+ +--v----+ +--v-----+ +-v-----+ +--v---------+
 | Qdrant | | Redis | | Ollama | | TEI   | | Gateway    |
 | Vector | | Cache | |  or    | | Embed | | FastAPI,   |
 | DB     | |       | | vLLM   | |       | | OpenAI-    |
 +--------+ +-------+ +--------+ +-------+ | compatible |
  :6333      :6379     :11434     :8002    +-----+------+
                                                 | :8000
                                          +------v------+
                                          | Prometheus  |
                                          | + Grafana   |
                                          +-------------+
                                              :8080
```
## What you get
| Layer | Service | Default | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |
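The Redis layer caches repeated completions. As a minimal sketch (the key scheme below is an assumption for illustration, not llmstack's documented format), a cache key can be derived by hashing the deterministic parts of a request:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], temperature: float = 0.0) -> str:
    """Deterministic cache key for a chat-completion request (illustrative)."""
    # Canonical JSON (sorted keys, no whitespace) so equivalent requests collide.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()
```

Only deterministic requests (temperature 0) are safe to serve from cache; a real implementation would also set a TTL and lean on the 256MB LRU eviction policy above.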
## How it works
```bash
llmstack init           # Detects hardware, generates llmstack.yaml
                        # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise
llmstack up             # Boots services in order with health checks:
                        # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics
llmstack status         # Shows health of all running services
llmstack logs ollama    # Stream inference logs
llmstack down           # Stops everything
```
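The ordered boot with per-service health checks can be sketched like this (`check_health` stands in for the real probe, e.g. an HTTP ping; the names, timeout, and poll interval are illustrative, not llmstack's actual internals):

```python
import time

# Dependency order from `llmstack up` above.
BOOT_ORDER = ["qdrant", "redis", "inference", "embeddings", "gateway", "metrics"]

def boot(check_health, timeout: float = 60.0, poll: float = 0.5) -> list[str]:
    """Start services in order, blocking until each passes its health check."""
    healthy = []
    for name in BOOT_ORDER:
        deadline = time.monotonic() + timeout
        while not check_health(name):
            if time.monotonic() > deadline:
                raise TimeoutError(f"service {name!r} failed its health check")
            time.sleep(poll)
        healthy.append(name)
    return healthy
```

Booting strictly in dependency order means the gateway never comes up pointing at a vector DB or cache that isn't ready yet.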
## Use the API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
```
Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
```
## Auto hardware detection
| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
## Presets
```bash
llmstack init --preset chat     # Minimal: inference + cache + gateway
llmstack init --preset rag      # + Qdrant + embeddings for RAG apps
llmstack init --preset agent    # 70B model + 16K context + longer timeouts
```
## Configuration

One file: `llmstack.yaml`
```yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto        # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
```
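The gateway's `rate_limit: 100/min` spec is just a count plus a time window. A hedged sketch of how such a spec could be parsed (the supported unit names here are assumptions):

```python
def parse_rate_limit(spec: str) -> tuple[int, int]:
    """Parse a spec like '100/min' into (max_requests, window_seconds)."""
    windows = {"sec": 1, "min": 60, "hour": 3600}
    count, _, unit = spec.partition("/")
    return int(count), windows[unit]
```

A token-bucket or sliding-window limiter in the gateway would consume this `(count, window)` pair per API key.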
## CLI

| Command | Description |
|---|---|
| `llmstack init [--preset]` | Create config with smart defaults |
| `llmstack up [--attach]` | Start all services |
| `llmstack down [--volumes]` | Stop and clean up |
| `llmstack status` | Health check all services |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
## Observability

When `observe.metrics: true`, llmstack boots Prometheus + Grafana with a pre-built dashboard:
- Request rate per endpoint
- Latency p50 / p99 histograms
- Token throughput (input + output)
- Error rate (4xx / 5xx)
- Service health (up/down)
Access it at http://localhost:8080 (login: `admin` / `llmstack`).
## Plugins

Extend llmstack with new backends via pip:

```bash
pip install llmstack-cli-plugin-chromadb
# Now set vectors.provider: chromadb in llmstack.yaml
```
Create your own: implement `ServiceBase` and register it via `entry_points`. See CONTRIBUTING.md.
## Why llmstack?

| Feature | llmstack | Ollama | Harbor | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | Partial | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Varies | No | Yes |
| Built-in vector DB | Yes | No | Config needed | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
| SSE streaming | Yes | Yes | Yes | Yes | Yes |
## Tech stack
- CLI: Typer + Rich
- Config: Pydantic v2
- Gateway: FastAPI
- Containers: Docker SDK for Python
- Metrics: Prometheus + Grafana
## Requirements
- Python 3.11+
- Docker
## Contributing
See CONTRIBUTING.md for development setup and guidelines.
## License
Apache-2.0
## File details

Details for the file `llmstack_cli-0.1.0.tar.gz`.

### File metadata

- Download URL: llmstack_cli-0.1.0.tar.gz
- Size: 29.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1c6e634b5b9c9ce09d6bf46412b086f191e58b7a5903b1a8364d920ed253db2c` |
| MD5 | `3213e9c1fbc589f3589a01a6eabf80f1` |
| BLAKE2b-256 | `51577abebd22c9fed8740285ee0912e0dd97449bdf7f3c388ba88d34b166c473` |
### Provenance

The following attestation bundles were made for `llmstack_cli-0.1.0.tar.gz`:

Publisher: `release.yml` on mara-werils/llmstack

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmstack_cli-0.1.0.tar.gz
- Subject digest: `1c6e634b5b9c9ce09d6bf46412b086f191e58b7a5903b1a8364d920ed253db2c`
- Sigstore transparency entry: 1461302163
- Permalink: mara-werils/llmstack@be193ee39a17517af9da1aa92a98a725beb9079e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mara-werils
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be193ee39a17517af9da1aa92a98a725beb9079e
- Trigger Event: push
## File details

Details for the file `llmstack_cli-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: llmstack_cli-0.1.0-py3-none-any.whl
- Size: 43.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e10a5263b17085a9ed64d76a4818f47c7b97c16da8e000dd462aeaccadc888c` |
| MD5 | `1db0f4d17264c1188a1e71b71d48c1ae` |
| BLAKE2b-256 | `99f54dbbcfcabbbbf75ac044f19e870ea5b058648268a0d93993e25aa2d16d33` |
### Provenance

The following attestation bundles were made for `llmstack_cli-0.1.0-py3-none-any.whl`:

Publisher: `release.yml` on mara-werils/llmstack

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmstack_cli-0.1.0-py3-none-any.whl
- Subject digest: `1e10a5263b17085a9ed64d76a4818f47c7b97c16da8e000dd462aeaccadc888c`
- Sigstore transparency entry: 1461303252
- Permalink: mara-werils/llmstack@be193ee39a17517af9da1aa92a98a725beb9079e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mara-werils
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be193ee39a17517af9da1aa92a98a725beb9079e
- Trigger Event: push