llmstack
One command. Full LLM stack. Zero config.
Stop wiring Docker containers. Start building AI apps.
Quick Start
pip install llmstack-cli
llmstack init --preset rag
llmstack up
That's it. You now have 7 services running: inference, embeddings, vector DB, cache, API gateway, Prometheus, and Grafana.
# Chat completion (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
# Ingest a document for RAG
curl http://localhost:8000/v1/rag/ingest \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"text":"LLMStack is an open-source tool for...","source":"docs.txt"}'
# Query with RAG
curl http://localhost:8000/v1/rag/query \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"question":"What is LLMStack?"}'
Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
Who is this for?
- AI app developers who want local inference + RAG without Docker boilerplate
- Teams who need an OpenAI-compatible API backed by local models
- Hobbyists running LLMs locally who want vector search, caching, and monitoring out of the box
- Anyone tired of writing 200+ lines of docker-compose.yml every time
Architecture
           llmstack up
                |
      +---------v-----------+
      |   Hardware Detect   |
      | NVIDIA / Apple / CPU|
      +---------+-----------+
                |
   +-------+----+---+--------+-------+
   |       |        |        |       |
+--v----+ +v-----+ +v-----+ +v----+ +v-----------+
|Qdrant | |Redis | |Ollama| | TEI | |  Gateway   |
|Vector | |Cache | |  or  | |Embed| |  FastAPI   |
|  DB   | |+ Rate| | vLLM | |     | |  + RAG     |
|       | | Limit| |      | |     | |  + Cache   |
+-------+ +------+ +------+ +-----+ |  + Breaker |
 :6333     :6379   :11434   :8002   |  + Metrics |
                                    +-----+------+
                                          | :8000
                                    +-----v------+
                                    | Prometheus |
                                    | + Grafana  |
                                    +------------+
                                        :8080
| Layer | Service | What it does | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | LLM chat completions | 11434 |
| Embeddings | TEI / Ollama (auto) | Text embeddings for RAG | 8002 |
| Vector DB | Qdrant | Document storage + semantic search | 6333 |
| Cache | Redis | Response cache + rate limit state | 6379 |
| API Gateway | FastAPI | Routing, auth, caching, RAG, circuit breaker | 8000 |
| Dashboard | Grafana + Prometheus | Request rate, latency, tokens, errors | 8080 |
Gateway Features
The gateway is not a simple proxy — it's a production-grade API layer:
Semantic Response Cache (Redis)
Request → SHA-256(model + messages) → Redis lookup
HIT → Return cached response (< 1ms)
MISS → Forward to inference → Cache result → Return
- Only caches deterministic requests (temperature <= 0.1)
- TTL-based expiration (default: 1 hour)
- X-Cache: HIT/MISS response headers
- Cache stats in /healthz
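A minimal sketch of this flow in Python with redis-py; the forward callable, key prefix, and BYPASS label are illustrative assumptions, not the gateway's actual internals:

```python
import hashlib
import json

import redis

r = redis.Redis()  # the stack's Redis at localhost:6379
TTL = 3600         # default 1-hour expiration

def cache_key(model: str, messages: list[dict]) -> str:
    # Deterministic key: SHA-256 over the model name plus the exact messages
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llmcache:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(req: dict, forward):
    # Only near-deterministic requests (temperature <= 0.1) are cacheable
    if req.get("temperature", 1.0) > 0.1:
        return forward(req), "BYPASS"
    key = cache_key(req["model"], req["messages"])
    if (hit := r.get(key)) is not None:
        return json.loads(hit), "HIT"       # sub-millisecond path
    resp = forward(req)                     # MISS: call inference
    r.setex(key, TTL, json.dumps(resp))     # cache with TTL
    return resp, "MISS"
```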
Token Bucket Rate Limiter (Redis + Lua)
Request → Extract API key/IP → Redis EVALSHA (atomic Lua) → Allow/Reject
- Configurable: 100/min, 10/sec, 3600/hour
- Per-API-key rate limiting with IP fallback
- Atomic Lua script prevents race conditions
- In-memory fallback if Redis is unavailable
- Standard X-RateLimit-* and Retry-After headers
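A sketch of an atomic token bucket using redis-py's register_script (which runs via EVALSHA); the Lua body is a plausible reconstruction, not the gateway's exact script:

```python
import time

import redis

r = redis.Redis()

# Refill the bucket based on elapsed time, then try to take one token,
# all inside a single atomic Lua invocation (no read-modify-write race).
TOKEN_BUCKET = r.register_script("""
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])   -- tokens per second
local now      = tonumber(ARGV[3])
local bucket   = redis.call('HMGET', key, 'tokens', 'ts')
local tokens   = tonumber(bucket[1]) or capacity
local ts       = tonumber(bucket[2]) or now
tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return allowed
""")

def allow(api_key: str, limit_per_min: int = 100) -> bool:
    # One bucket per API key; 100/min means capacity 100, ~1.67 tokens/sec refill
    return bool(TOKEN_BUCKET(keys=[f"rl:{api_key}"],
                             args=[limit_per_min, limit_per_min / 60, time.time()]))
```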
Circuit Breaker (Inference Resilience)
CLOSED ──[5 failures]──> OPEN ──[timeout]──> HALF_OPEN ──[success]──> CLOSED
                          |                       |
                          └──[reject fast]        └──[failure]──> OPEN (backoff x2)
- Prevents cascading failures when inference is down
- Exponential backoff on recovery timeout
- Fail-fast with 503 Service Unavailable
- Metrics: state, failure count, rejections, time in state
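A sketch of that state machine; the 5-failure threshold mirrors the diagram, while the base timeout is an assumed default:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 5, base_timeout: float = 10.0):
        self.state = "CLOSED"
        self.failures = 0
        self.threshold = threshold
        self.timeout = base_timeout
        self.base_timeout = base_timeout
        self.opened_at = 0.0

    def _open(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: 503")  # reject fast
            self.state = "HALF_OPEN"  # timeout elapsed: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            if self.state == "HALF_OPEN":
                self.timeout *= 2  # failed probe: re-open with doubled backoff
                self._open()
            else:
                self.failures += 1
                if self.failures >= self.threshold:
                    self._open()
            raise
        # any success closes the circuit and resets counters
        self.state = "CLOSED"
        self.failures = 0
        self.timeout = self.base_timeout
        return result
```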
RAG Pipeline (Qdrant + Embeddings)
Ingest: Document → Chunk (512 words, 64 overlap) → Embed → Qdrant
Query: Question → Embed → Qdrant search → Build context → LLM generate
- POST /v1/rag/ingest: chunk, embed, and store documents
- POST /v1/rag/query: semantic search + augmented generation
- Source citations in responses
- Streaming support via SSE
- Deterministic chunk IDs (deduplication)
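The chunking step is simple enough to sketch: 512-word windows with a 64-word overlap and content-hashed IDs, as described above (the real pipeline's exact ID scheme may differ):

```python
import hashlib

def chunk_document(text: str, size: int = 512, overlap: int = 64):
    # Sliding word windows: neighbours share 64 words so sentences that
    # straddle a boundary stay retrievable from either side.
    words = text.split()
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        piece = " ".join(words[start:start + size])
        # Deterministic ID: re-ingesting identical content yields the same
        # point ID, so vector-store upserts dedupe instead of duplicating.
        yield hashlib.sha256(piece.encode()).hexdigest(), piece
```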
Structured Logging
{"ts":"2026-05-07T14:23:01","level":"INFO","msg":"POST /v1/chat/completions 200 1234.5ms","request_id":"a1b2c3d4","method":"POST","path":"/v1/chat/completions","status":200,"duration_ms":1234.5,"client_ip":"10.0.0.1"}
- X-Request-ID correlation headers
- JSON structured output for log aggregation
- Configurable level and format
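For example, a client can supply its own correlation ID and match it against the gateway's log line (this assumes the gateway echoes a client-provided X-Request-ID, a common convention for correlation headers):

```python
import uuid

import httpx

rid = uuid.uuid4().hex[:8]
resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "llama3.2", "messages": [{"role": "user", "content": "hi"}]},
    headers={"Authorization": "Bearer YOUR_KEY", "X-Request-ID": rid},
)
# The same value appears as request_id in the gateway's JSON logs
print(resp.headers.get("X-Request-ID"), resp.status_code)
```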
How it works
llmstack init # Detects hardware, generates llmstack.yaml
# Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise
llmstack up # Boots services in order with health checks:
# Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics
llmstack status # Shows health of all running services
llmstack chat # Interactive terminal chat with streaming
llmstack logs ollama # Stream inference logs
llmstack down # Stops everything
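The ordered boot is the usual poll-until-healthy pattern; a sketch (endpoints and timeout are illustrative):

```python
import time

import httpx

def wait_healthy(url: str, timeout: float = 60.0) -> None:
    # Poll a health endpoint until it answers 200 or the deadline passes
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(url, timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(1.0)
    raise TimeoutError(f"{url} not healthy after {timeout:.0f}s")

# Each service gates the next, matching the order above
wait_healthy("http://localhost:6333/healthz")  # Qdrant
wait_healthy("http://localhost:8000/healthz")  # Gateway
```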
Auto hardware detection
| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
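A plausible reconstruction of that selection logic (the real detector may check more than VRAM):

```python
import shutil
import subprocess

def pick_backend() -> str:
    # NVIDIA: query total VRAM via nvidia-smi; 16 GB or more favors vLLM
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        if out.returncode == 0 and out.stdout.strip():
            vram_mb = max(int(v) for v in out.stdout.split())
            return "vllm" if vram_mb >= 16 * 1024 else "ollama"
    # Apple Silicon (Metal) and plain CPUs (GGUF quantization) both run Ollama
    return "ollama"
```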
Presets
llmstack init --preset chat # Minimal: inference + cache + gateway
llmstack init --preset rag # + Qdrant + embeddings for RAG apps
llmstack init --preset agent # 70B model + 16K context + longer timeouts
Configuration
One file: llmstack.yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto          # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb
  gateway:
    port: 8000
    auth: api_key
    rate_limit: 100/min
    cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
API Reference
OpenAI-compatible endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completion (streaming + non-streaming) |
| /v1/embeddings | POST | Text embeddings |
| /v1/models | GET | List available models |
RAG endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/rag/ingest | POST | Ingest a document (chunk + embed + store) |
| /v1/rag/query | POST | Query with retrieval-augmented generation |
| /v1/rag/documents/{source} | DELETE | Delete documents by source |
| /v1/rag/status | GET | Collection statistics |
System endpoints
| Endpoint | Method | Description |
|---|---|---|
| /healthz | GET | Health check with circuit breaker + cache stats |
| /metrics | GET | Prometheus metrics |
Interactive Chat
llmstack chat
LLMStack Chat — model: llama3.2
Type 'exit' or Ctrl+C to quit. '/clear' to reset conversation.
You: What is quantum computing?
Assistant: Quantum computing uses quantum mechanical phenomena like
superposition and entanglement to process information...
You: /clear
Conversation cleared.
Export to Docker Compose
llmstack export
# Exported 7 services to docker-compose.yml
# Run with: docker compose up -d
Share the generated file with your team — no llmstack dependency required.
Use the API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
# Chat completion
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Embeddings
embeddings = client.embeddings.create(
model="bge-m3",
input=["Hello world"]
)
import httpx
# RAG: Ingest documents
httpx.post("http://localhost:8000/v1/rag/ingest", json={
"text": open("whitepaper.txt").read(),
"source": "whitepaper.txt",
}, headers={"Authorization": "Bearer YOUR_KEY"})
# RAG: Query
response = httpx.post("http://localhost:8000/v1/rag/query", json={
"question": "What are the key findings?",
"top_k": 5,
}, headers={"Authorization": "Bearer YOUR_KEY"})
print(response.json()["answer"])
print(response.json()["sources"])
CLI
| Command | Description |
|---|---|
| llmstack init [--preset] | Create config with smart defaults |
| llmstack up [--attach] | Start all services |
| llmstack down [--volumes] | Stop and clean up |
| llmstack status | Health check all services |
| llmstack chat [--model] | Interactive terminal chat |
| llmstack export [--output] | Generate docker-compose.yml |
| llmstack logs <service> | Stream service logs |
| llmstack doctor | Diagnose system issues |
Observability
When observe.metrics: true, llmstack boots Prometheus + Grafana with a pre-built dashboard:
- Request rate per endpoint
- Latency p50 / p99 histograms
- Token throughput (input + output)
- Error rate (4xx / 5xx)
- Cache hit rate
- Circuit breaker state
- Rate limit rejections
Access at http://localhost:8080 (login: admin / llmstack)
Comparison
| | llmstack | Ollama | LocalAI | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No | No | Partial | No |
| Built-in RAG pipeline | Yes | No | No | Bundled | No |
| Response caching | Yes | No | No | No | No |
| Circuit breaker | Yes | No | No | No | No |
| Rate limiting (Redis) | Yes | No | No | Yes | Yes |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Yes | No | Yes |
| Built-in vector DB | Yes | No | No | Bundled | No |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
Plugins
Extend llmstack with new backends via pip:
pip install llmstack-cli-plugin-chromadb
# Now: vectors.provider: chromadb in llmstack.yaml
Create your own: implement ServiceBase, register via entry_points. See CONTRIBUTING.md.
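A hypothetical skeleton of what that looks like; the module path, base-class API, and entry-point group name are assumptions patterned on the description above, so check CONTRIBUTING.md for the real interface:

```python
# llmstack_plugin_chromadb/__init__.py  (hypothetical package)
from llmstack.plugins import ServiceBase  # assumed import path

class ChromaDBService(ServiceBase):
    """Provides vectors.provider: chromadb by running a Chroma container."""
    name = "chromadb"

    def container_spec(self) -> dict:
        # Image and port mapping are illustrative values
        return {"image": "chromadb/chroma:latest", "ports": {"8000/tcp": 8001}}

# Registered in pyproject.toml via an entry point (group name assumed):
#
# [project.entry-points."llmstack.services"]
# chromadb = "llmstack_plugin_chromadb:ChromaDBService"
```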
Tech stack
- CLI: Typer + Rich
- Config: Pydantic v2
- Gateway: FastAPI + Redis + Qdrant
- Containers: Docker SDK for Python
- Cache: Redis with semantic hashing
- Rate Limiting: Token bucket with Redis Lua scripts
- Resilience: Circuit breaker with exponential backoff
- Metrics: Prometheus + Grafana
Requirements
- Python 3.11+
- Docker
Contributing
See CONTRIBUTING.md for development setup and guidelines.
License
Apache-2.0