
llmstack

One command. Full LLM stack. Zero config.

Stop wiring Docker containers. Start building AI apps.


Quick Start

pip install llmstack-cli
llmstack init --preset rag
llmstack up

That's it. You now have 7 services running: inference, embeddings, vector DB, cache, API gateway, Prometheus, and Grafana.

# Test it immediately
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'

Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
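
For example, pointing LangChain at the gateway takes one line of config. A minimal sketch, assuming the langchain-openai package and a key you issued for the gateway:

# LangChain against the llmstack gateway (assumes `pip install langchain-openai`;
# YOUR_KEY is whatever key the gateway accepts)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # the llmstack gateway
    api_key="YOUR_KEY",
    model="llama3.2",
)
print(llm.invoke("Hello!").content)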

Who is this for?

  • AI app developers who want local inference without Docker boilerplate
  • Teams who need an OpenAI-compatible API backed by local models
  • Hobbyists running LLMs locally who want vector search, caching, and monitoring out of the box
  • Anyone tired of writing 200+ lines of docker-compose.yml every time

What you get

                        llmstack up
                             |
                  +----------v-----------+
                  |   Hardware Detect    |
                  | NVIDIA / Apple / CPU |
                  +----------+-----------+
                             |
          +---------+--------+--------+----------+
          |         |        |        |          |
      +---v----+ +--v---+ +--v---+ +--v--+ +-----v------+
      | Qdrant | |Redis | |Ollama| | TEI | |  Gateway   |
      | Vector | |Cache | |  or  | |Embed| |  FastAPI   |
      |   DB   | |      | | vLLM | |     | |  OpenAI-   |
      +--------+ +------+ +------+ +-----+ | compatible |
        :6333     :6379    :11434   :8002  +-----+------+
                                                 | :8000
                                           +-----v------+
                                           | Prometheus |
                                           | + Grafana  |
                                           +------------+
                                                :8080
Layer        Service                       Default            Port
Inference    Ollama / vLLM (auto)          llama3.2           11434
Embeddings   TEI / Ollama (auto)           bge-m3             8002
Vector DB    Qdrant                        -                  6333
Cache        Redis                         256MB LRU          6379
API Gateway  FastAPI (OpenAI-compatible)   auth + rate limit  8000
Dashboard    Grafana + Prometheus          pre-built panels   8080
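
Each service is also reachable directly on its own port. For instance, the bundled Qdrant instance works with the official qdrant-client package (this is plain Qdrant usage, nothing llmstack-specific):

# Talk to the bundled Qdrant directly (assumes `pip install qdrant-client`)
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")
print(qdrant.get_collections())  # collections created by your RAG app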

How it works

llmstack init        # Detects hardware, generates llmstack.yaml
                     # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise

llmstack up          # Boots services in order with health checks:
                     # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics

llmstack status      # Shows health of all running services
llmstack logs ollama # Stream inference logs
llmstack down        # Stops everything
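
The boot sequence is plain dependency ordering: each service must pass a health check before the next one starts. A rough sketch of that loop; the health URLs here are placeholders, not llmstack internals:

# Illustrative boot-order sketch. Service order matches the docs above; the
# health-check URLs are assumptions, NOT llmstack's actual internals.
# (Redis is omitted: it is checked with PING rather than HTTP.)
import time
import urllib.request

BOOT_ORDER = [
    ("qdrant",     "http://localhost:6333/healthz"),
    ("inference",  "http://localhost:11434/"),
    ("embeddings", "http://localhost:8002/health"),
    ("gateway",    "http://localhost:8000/health"),
]

def wait_healthy(url, timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(url, timeout=2).status == 200:
                return True
        except OSError:
            time.sleep(1)
    return False

for name, url in BOOT_ORDER:
    if not wait_healthy(url):
        raise RuntimeError(f"{name} failed its health check")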

Auto hardware detection

Your hardware           Backend  Why
NVIDIA GPU, 16GB+ VRAM  vLLM     Max throughput, PagedAttention
NVIDIA GPU, <16GB VRAM  Ollama   Lower memory overhead
Apple Silicon (M1-M4)   Ollama   Metal acceleration
CPU only                Ollama   GGUF quantized models
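
The table above is simple enough to express directly. A rough equivalent sketch, not llmstack's actual code:

# Rough re-implementation of the backend-selection rule above -- illustrative
# only, NOT llmstack's actual code
import subprocess

def pick_backend():
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        vram_mb = int(out.split()[0])  # first GPU's VRAM in MiB
        return "vllm" if vram_mb >= 16 * 1024 else "ollama"
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No NVIDIA GPU: Apple Silicon (Metal) and plain CPU both get Ollama
        return "ollama"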

Presets

llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts

Configuration

One file: llmstack.yaml

version: "1"

models:
  chat:
    name: llama3.2
    backend: auto              # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
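
Because it's ordinary YAML, the file is easy to inspect or patch from scripts. A small sketch with PyYAML:

# Read the generated config programmatically (assumes `pip install pyyaml`)
import yaml

with open("llmstack.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["models"]["chat"]["backend"])  # "auto"
print(cfg["gateway"]["rate_limit"])      # "100/min"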

Interactive Chat

llmstack chat
LLMStack Chat — model: llama3.2
Type 'exit' or Ctrl+C to quit. '/clear' to reset conversation.

You: What is quantum computing?
Assistant: Quantum computing uses quantum mechanical phenomena like
superposition and entanglement to process information...

You: /clear
Conversation cleared.

Streaming responses, conversation history, works with any model in your stack.
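
The same streaming is available over the API; with the openai client it's just stream=True:

# Stream tokens from the gateway with the standard openai client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)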

Export to Docker Compose

Don't want llmstack as a runtime dependency? Generate a standalone docker-compose.yml:

llmstack export
# Exported 7 services to docker-compose.yml
# Run with: docker compose up -d

Share the generated file with your team — no llmstack dependency required.

Use the API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

CLI

Command                     Description
llmstack init [--preset]    Create config with smart defaults
llmstack up [--attach]      Start all services
llmstack down [--volumes]   Stop and clean up
llmstack status             Health check all services
llmstack chat [--model]     Interactive terminal chat
llmstack export [--output]  Generate docker-compose.yml
llmstack logs <service>     Stream service logs
llmstack doctor             Diagnose system issues

Observability

When observe.metrics: true, llmstack boots Prometheus + Grafana with a pre-built dashboard:

  • Request rate per endpoint
  • Latency p50 / p99 histograms
  • Token throughput (input + output)
  • Error rate (4xx / 5xx)
  • Service health (up/down)

Access at http://localhost:8080 (login: admin / llmstack)
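
Grafana reads from Prometheus, which you can also query directly. A sketch assuming Prometheus listens on its default port 9090 (only the Grafana port is documented above):

# Query Prometheus for service health. Port 9090 is Prometheus's default and
# an assumption here; the README only documents the Grafana port (8080).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9090/api/v1/query?query=up") as resp:
    for series in json.load(resp)["data"]["result"]:
        job = series["metric"].get("job", "?")
        print(job, "up" if series["value"][1] == "1" else "down")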

Why not just Docker Compose?

Here's what llmstack replaces:

# Without llmstack: ~200 lines of docker-compose.yml
# You have to configure each service, write health checks,
# set up networking, manage GPU passthrough, create Prometheus
# scrape configs, provision Grafana dashboards...

# With llmstack:
llmstack init && llmstack up

Comparison

Feature                   llmstack  Ollama               LocalAI  AnythingLLM  LiteLLM
One-command full stack    Yes       No (inference only)  No       Partial      No (proxy only)
Auto hardware detection   Yes       No                   No       No           No
OpenAI-compatible API     Yes       Yes                  Yes      No           Yes
Built-in vector DB        Yes       No                   No       Bundled      No
Built-in embeddings       Yes       No                   No       Bundled      No
Caching (Redis)           Yes       No                   No       No           No
Auth + rate limiting      Yes       No                   No       Yes          Yes
Observability dashboard   Yes       No                   Partial  No           Partial
Plugin ecosystem          Yes       No                   No       No           No

Plugins

Extend llmstack with new backends via pip:

pip install llmstack-cli-plugin-chromadb
# Now: vectors.provider: chromadb in llmstack.yaml

Create your own: implement ServiceBase, register via entry_points. See CONTRIBUTING.md.
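
The shape of a plugin, as a guess at the interface: ServiceBase is real (see CONTRIBUTING.md), but the import path, method name, and entry-point group below are all assumptions:

# Hypothetical plugin sketch. ServiceBase is named in the docs, but the import
# path, method names, and entry-point group here are guesses, not the real API.
from llmstack.plugins import ServiceBase  # import path is an assumption

class ChromaDBService(ServiceBase):
    name = "chromadb"

    def container_spec(self):
        # Whatever llmstack expects as a container definition (assumed shape)
        return {"image": "chromadb/chroma:latest", "ports": {8001: 8000}}

# Then register it in your plugin's pyproject.toml, e.g.:
#
#   [project.entry-points."llmstack.services"]
#   chromadb = "my_plugin:ChromaDBService"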

Requirements

  • Python 3.11+
  • Docker

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

Apache-2.0
