# llmstack

One command. Full LLM stack. Zero config.

Stop wiring Docker containers. Start building AI apps.
llmstack spins up a production-grade LLM stack locally with a single command. It auto-detects your hardware, picks the optimal inference backend, and wires everything together.
```bash
pip install llmstack-cli
llmstack init
llmstack up
```
That's it. You now have a full LLM API running locally.
## Architecture

```
                      llmstack up
                           |
                +----------v-----------+
                |   Hardware Detect    |
                | NVIDIA / Apple / CPU |
                +----------+-----------+
                           |
     +---------+---------+--+------+----------+
     |         |         |         |          |
 +---v----+ +--v----+ +--v-----+ +-v-----+ +--v---------+
 | Qdrant | | Redis | | Ollama | | TEI   | | Gateway    |
 | Vector | | Cache | |  or    | | Embed | | FastAPI,   |
 | DB     | |       | | vLLM   | |       | | OpenAI-    |
 +--------+ +-------+ +--------+ +-------+ | compatible |
  :6333      :6379     :11434     :8002    +-----+------+
                                                 | :8000
                                          +------v------+
                                          | Prometheus  |
                                          | + Grafana   |
                                          +-------------+
                                              :8080
```
## What you get
| Layer | Service | Default | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |
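The Redis layer caches repeated completions. As a minimal sketch (the key scheme below is an assumption for illustration, not llmstack's documented format), a cache key can be derived by hashing the deterministic parts of a request:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], temperature: float = 0.0) -> str:
    """Deterministic cache key for a chat-completion request (illustrative)."""
    # Canonical JSON (sorted keys, no whitespace) so equivalent requests collide.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()
```

Only deterministic requests (temperature 0) are safe to serve from cache; a real implementation would also set a TTL and lean on the 256MB LRU eviction policy above.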
## How it works
```bash
llmstack init           # Detects hardware, generates llmstack.yaml
                        # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise
llmstack up             # Boots services in order with health checks:
                        # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics
llmstack status         # Shows health of all running services
llmstack logs ollama    # Stream inference logs
llmstack down           # Stops everything
```
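The ordered boot with per-service health checks can be sketched like this (`check_health` stands in for the real probe, e.g. an HTTP ping; the names, timeout, and poll interval are illustrative, not llmstack's actual internals):

```python
import time

# Dependency order from `llmstack up` above.
BOOT_ORDER = ["qdrant", "redis", "inference", "embeddings", "gateway", "metrics"]

def boot(check_health, timeout: float = 60.0, poll: float = 0.5) -> list[str]:
    """Start services in order, blocking until each passes its health check."""
    healthy = []
    for name in BOOT_ORDER:
        deadline = time.monotonic() + timeout
        while not check_health(name):
            if time.monotonic() > deadline:
                raise TimeoutError(f"service {name!r} failed its health check")
            time.sleep(poll)
        healthy.append(name)
    return healthy
```

Booting strictly in dependency order means the gateway never comes up pointing at a vector DB or cache that isn't ready yet.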
## Use the API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
```
Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
```
## Auto hardware detection
| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
## Presets
```bash
llmstack init --preset chat     # Minimal: inference + cache + gateway
llmstack init --preset rag      # + Qdrant + embeddings for RAG apps
llmstack init --preset agent    # 70B model + 16K context + longer timeouts
```
## Configuration

One file: `llmstack.yaml`
```yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto        # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
```
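The gateway's `rate_limit: 100/min` spec is just a count plus a time window. A hedged sketch of how such a spec could be parsed (the supported unit names here are assumptions):

```python
def parse_rate_limit(spec: str) -> tuple[int, int]:
    """Parse a spec like '100/min' into (max_requests, window_seconds)."""
    windows = {"sec": 1, "min": 60, "hour": 3600}
    count, _, unit = spec.partition("/")
    return int(count), windows[unit]
```

A token-bucket or sliding-window limiter in the gateway would consume this `(count, window)` pair per API key.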
## CLI

| Command | Description |
|---|---|
| `llmstack init [--preset]` | Create config with smart defaults |
| `llmstack up [--attach]` | Start all services |
| `llmstack down [--volumes]` | Stop and clean up |
| `llmstack status` | Health check all services |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
## Observability

When `observe.metrics: true`, llmstack boots Prometheus + Grafana with a pre-built dashboard:
- Request rate per endpoint
- Latency p50 / p99 histograms
- Token throughput (input + output)
- Error rate (4xx / 5xx)
- Service health (up/down)
Access it at http://localhost:8080 (login: `admin` / `llmstack`).
## Plugins

Extend llmstack with new backends via pip:

```bash
pip install llmstack-cli-plugin-chromadb
# Now set vectors.provider: chromadb in llmstack.yaml
```
Create your own: implement `ServiceBase` and register it via `entry_points`. See CONTRIBUTING.md.
## Why llmstack?

| Feature | llmstack | Ollama | Harbor | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | Partial | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Varies | No | Yes |
| Built-in vector DB | Yes | No | Config needed | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
| SSE streaming | Yes | Yes | Yes | Yes | Yes |
## Tech stack
- CLI: Typer + Rich
- Config: Pydantic v2
- Gateway: FastAPI
- Containers: Docker SDK for Python
- Metrics: Prometheus + Grafana
## Requirements
- Python 3.11+
- Docker
## Contributing
See CONTRIBUTING.md for development setup and guidelines.
## License
Apache-2.0
## File details

Details for the file `llmstack_cli-0.1.0.tar.gz`.

### File metadata

- Download URL: llmstack_cli-0.1.0.tar.gz
- Size: 29.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1c6e634b5b9c9ce09d6bf46412b086f191e58b7a5903b1a8364d920ed253db2c` |
| MD5 | `3213e9c1fbc589f3589a01a6eabf80f1` |
| BLAKE2b-256 | `51577abebd22c9fed8740285ee0912e0dd97449bdf7f3c388ba88d34b166c473` |
### Provenance

The following attestation bundles were made for `llmstack_cli-0.1.0.tar.gz`:

Publisher: `release.yml` on mara-werils/llmstack

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmstack_cli-0.1.0.tar.gz
- Subject digest: `1c6e634b5b9c9ce09d6bf46412b086f191e58b7a5903b1a8364d920ed253db2c`
- Sigstore transparency entry: 1461302163
- Permalink: mara-werils/llmstack@be193ee39a17517af9da1aa92a98a725beb9079e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mara-werils
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be193ee39a17517af9da1aa92a98a725beb9079e
- Trigger Event: push
## File details

Details for the file `llmstack_cli-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: llmstack_cli-0.1.0-py3-none-any.whl
- Size: 43.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e10a5263b17085a9ed64d76a4818f47c7b97c16da8e000dd462aeaccadc888c` |
| MD5 | `1db0f4d17264c1188a1e71b71d48c1ae` |
| BLAKE2b-256 | `99f54dbbcfcabbbbf75ac044f19e870ea5b058648268a0d93993e25aa2d16d33` |
### Provenance

The following attestation bundles were made for `llmstack_cli-0.1.0-py3-none-any.whl`:

Publisher: `release.yml` on mara-werils/llmstack

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmstack_cli-0.1.0-py3-none-any.whl
- Subject digest: `1e10a5263b17085a9ed64d76a4818f47c7b97c16da8e000dd462aeaccadc888c`
- Sigstore transparency entry: 1461303252
- Permalink: mara-werils/llmstack@be193ee39a17517af9da1aa92a98a725beb9079e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mara-werils
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be193ee39a17517af9da1aa92a98a725beb9079e
- Trigger Event: push