# llmstack

**One command. Full LLM stack. Zero config.**

Stop wiring Docker containers. Start building AI apps.
## Quick Start

```bash
pip install llmstack-cli
llmstack init --preset rag
llmstack up
```
That's it. You now have 7 services running: inference, embeddings, vector DB, cache, API gateway, Prometheus, and Grafana.
```bash
# Test it immediately
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
```
Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
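For instance, pointing LangChain at the gateway takes one `base_url` override (a sketch; the model name and key are placeholders):

```python
from langchain_openai import ChatOpenAI  # pip install langchain-openai

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # llmstack gateway instead of api.openai.com
    api_key="YOUR_KEY",
    model="llama3.2",
)
print(llm.invoke("Hello!").content)
```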
## Who is this for?
- AI app developers who want local inference without Docker boilerplate
- Teams who need an OpenAI-compatible API backed by local models
- Hobbyists running LLMs locally who want vector search, caching, and monitoring out of the box
- Anyone tired of writing 200+ lines of docker-compose.yml every time
## What you get
```
                 llmstack up
                      |
           +----------v----------+
           |   Hardware Detect   |
           | NVIDIA / Apple / CPU|
           +----------+----------+
                      |
    +--------+--------+--------+---------+
    |        |        |        |         |
+---v---+ +--v---+ +--v---+ +--v--+ +----v-----+
|Qdrant | |Redis | |Ollama| | TEI | | Gateway  |
|Vector | |Cache | |  or  | |Embed| | FastAPI  |
|  DB   | |      | | vLLM | |     | | OpenAI-  |
+-------+ +------+ +------+ +-----+ |compatible|
  :6333    :6379    :11434   :8002  +----+-----+
                                         | :8000
                                   +-----v------+
                                   | Prometheus |
                                   | + Grafana  |
                                   +------------+
                                       :8080
```
| Layer | Service | Default | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |
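Every service is reachable on its own port, so you can poke each one directly. The first three endpoints below are the upstream defaults (Qdrant's `/healthz`, `redis-cli ping`, Ollama's `/api/tags`); the gateway route is an assumption, so rely on `llmstack status` for the real picture:

```bash
curl http://localhost:6333/healthz       # Qdrant liveness
redis-cli -p 6379 ping                   # Redis -> PONG
curl http://localhost:11434/api/tags     # Ollama: lists pulled models
curl http://localhost:8000/health        # Gateway (assumed route)
```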
## How it works

```bash
llmstack init          # Detects hardware, generates llmstack.yaml
                       # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise
llmstack up            # Boots services in order with health checks:
                       #   Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics
llmstack status        # Shows health of all running services
llmstack logs ollama   # Stream inference logs
llmstack down          # Stops everything
```
## Auto hardware detection
| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
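The actual selection code isn't shown here, but the rule in the table reduces to a few lines. This sketch uses `nvidia-smi` for VRAM detection and is illustrative, not llmstack's implementation:

```python
import shutil
import subprocess

def pick_backend() -> str:
    """Pick an inference backend following the table above (illustrative)."""
    if shutil.which("nvidia-smi"):
        # Query total VRAM in MiB for each GPU, take the largest.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        vram_mib = max(int(line) for line in out.stdout.split())
        return "vllm" if vram_mib >= 16 * 1024 else "ollama"
    # Apple Silicon (Metal) and plain CPUs both get Ollama.
    return "ollama"
```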
## Presets

```bash
llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts
```
## Configuration

One file: `llmstack.yaml`

```yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto          # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb
  gateway:
    port: 8000
    auth: api_key
    rate_limit: 100/min
    cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
```
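Since config parsing is built on Pydantic v2 (see tech stack below), validation amounts to a schema along these lines; the class names here are invented for illustration and do not mirror llmstack's internal models:

```python
from typing import Literal

import yaml  # pip install pyyaml
from pydantic import BaseModel

class ChatModel(BaseModel):
    name: str = "llama3.2"
    backend: Literal["auto", "ollama", "vllm"] = "auto"
    context_length: int = 8192

class Models(BaseModel):
    chat: ChatModel = ChatModel()

class StackConfig(BaseModel):
    version: str = "1"
    models: Models = Models()

with open("llmstack.yaml") as f:
    cfg = StackConfig.model_validate(yaml.safe_load(f))
print(cfg.models.chat.backend)  # "auto"
```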
## Interactive Chat

```bash
llmstack chat
```

```
LLMStack Chat — model: llama3.2
Type 'exit' or Ctrl+C to quit. '/clear' to reset conversation.

You: What is quantum computing?
Assistant: Quantum computing uses quantum mechanical phenomena like
superposition and entanglement to process information...

You: /clear
Conversation cleared.
```
Streaming responses, conversation history, and support for any model in your stack.
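Because the gateway speaks the OpenAI protocol, the same streaming behavior is available from any client; for example, with openai-python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```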
## Export to Docker Compose

Don't want to install llmstack? Generate a standalone docker-compose.yml:

```bash
llmstack export
# Exported 7 services to docker-compose.yml
# Run with: docker compose up -d
```
Share the generated file with your team — no llmstack dependency required.
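The exact file depends on your preset and llmstack.yaml; an excerpt from a rag export might look roughly like this (only the ports come from the table above, images and flags are illustrative):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
  redis:
    image: redis:7
    command: ["redis-server", "--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
    ports: ["6379:6379"]
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
  # ...embeddings, gateway, Prometheus, and Grafana follow the same pattern
```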
## Use the API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
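With the rag preset, the same client can drive retrieval end to end. A rough sketch, assuming the gateway proxies embeddings at the standard `/v1/embeddings` route (unverified) and using the official qdrant-client against the bundled Qdrant:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
qdrant = QdrantClient(url="http://localhost:6333")

doc = "Qdrant is a vector database."

# Embed the document and index it.
vec = llm.embeddings.create(model="bge-m3", input=doc).data[0].embedding
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=len(vec), distance=Distance.COSINE),
)
qdrant.upsert("docs", points=[PointStruct(id=1, vector=vec, payload={"text": doc})])

# Retrieve by similarity.
query = llm.embeddings.create(model="bge-m3", input="what is qdrant?").data[0].embedding
hits = qdrant.search(collection_name="docs", query_vector=query, limit=1)
print(hits[0].payload["text"])
```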
## CLI

| Command | Description |
|---|---|
| `llmstack init [--preset]` | Create config with smart defaults |
| `llmstack up [--attach]` | Start all services |
| `llmstack down [--volumes]` | Stop and clean up |
| `llmstack status` | Health check all services |
| `llmstack chat [--model]` | Interactive terminal chat |
| `llmstack export [--output]` | Generate docker-compose.yml |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
## Observability

When `observe.metrics: true` is set, llmstack boots Prometheus + Grafana with a pre-built dashboard:
- Request rate per endpoint
- Latency p50 / p99 histograms
- Token throughput (input + output)
- Error rate (4xx / 5xx)
- Service health (up/down)
Access at http://localhost:8080 (login: admin / llmstack)
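The p50/p99 panels are ordinary `histogram_quantile` queries under the hood; the metric name below is an assumption about what the gateway exports, so take the real one from the shipped dashboard:

```promql
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```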
## Why not just Docker Compose?

Here's what llmstack replaces:

```bash
# Without llmstack: ~200 lines of docker-compose.yml
# You have to configure each service, write health checks,
# set up networking, manage GPU passthrough, create Prometheus
# scrape configs, provision Grafana dashboards...

# With llmstack:
llmstack init && llmstack up
```
## Comparison

| | llmstack | Ollama | LocalAI | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | No | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Yes | No | Yes |
| Built-in vector DB | Yes | No | No | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
## Plugins

Extend llmstack with new backends via pip:

```bash
pip install llmstack-cli-plugin-chromadb
# Now: vectors.provider: chromadb in llmstack.yaml
```
Create your own: implement `ServiceBase` and register it via `entry_points`. See CONTRIBUTING.md.
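Since the plugin contract isn't spelled out here, treat this as a shape sketch only: the import path, method name, and entry-point group are all assumptions to be checked against CONTRIBUTING.md:

```python
# my_plugin.py — illustrative; ServiceBase's real interface may differ.
from llmstack.plugins import ServiceBase  # assumed import path

class ChromaDBService(ServiceBase):
    name = "chromadb"

    def container_spec(self) -> dict:
        # Image and port mapping handed to the Docker SDK (hypothetical method).
        return {"image": "chromadb/chroma", "ports": {"8001": 8001}}
```

```toml
# pyproject.toml — entry-point group name is assumed
[project.entry-points."llmstack.services"]
chromadb = "my_plugin:ChromaDBService"
```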
## Tech stack
- CLI: Typer + Rich
- Config: Pydantic v2
- Gateway: FastAPI
- Containers: Docker SDK for Python
- Metrics: Prometheus + Grafana
## Requirements
- Python 3.11+
- Docker
## Contributing
See CONTRIBUTING.md for development setup and guidelines.
## License
Apache-2.0