
llmstack

One command. Full LLM stack. Zero config.

Stop wiring Docker containers. Start building AI apps.

llmstack spins up a production-grade LLM stack locally with a single command. It auto-detects your hardware, picks the optimal inference backend, and wires everything together.

pip install llmstack-cli
llmstack init
llmstack up

That's it. You now have a full LLM API running locally.

Architecture

                      llmstack up
                           |
                +----------v-----------+
                |   Hardware Detect    |
                | NVIDIA / Apple / CPU |
                +----------+-----------+
                           |
         +--------+--------+--------+---------+
         |        |        |        |         |
     +---v---+ +--v---+ +--v---+ +--v--+ +----v-----+
     |Qdrant | |Redis | |Ollama| | TEI | | Gateway  |
     |Vector | |Cache | |  or  | |Embed| | FastAPI  |
     |  DB   | |      | | vLLM | |     | |  OpenAI  |
     +-------+ +------+ +------+ +-----+ |compatible|
       :6333    :6379    :11434   :8002  +----+-----+
                                              |:8000
                                         +----v-----+
                                         |Prometheus|
                                         |+ Grafana |
                                         +----------+
                                            :8080

What you get

| Layer | Service | Default | Port |
|---|---|---|---|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256 MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |

How it works

llmstack init       # Detects hardware, generates llmstack.yaml
                    # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise

llmstack up         # Boots services in order with health checks:
                    # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics

llmstack status     # Shows health of all running services
llmstack logs ollama # Stream inference logs
llmstack down       # Stops everything
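
llmstack status is the built-in health check, but you can also probe the services directly. A quick sketch using requests against the default ports from the table above (the gateway's /v1/models route is an assumption based on the OpenAI convention; substitute the API key llmstack generated):

import requests

# Qdrant answers REST calls on its HTTP port (default 6333)
print(requests.get("http://localhost:6333/collections").json())

# OpenAI-compatible gateways conventionally list models at /v1/models
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer YOUR_KEY"},
)
print(resp.json())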

Use the API

curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'

Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
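
The gateway also streams responses over SSE. With openai-python, pass stream=True and iterate over the chunks:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

# Tokens arrive incrementally instead of as one final message
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)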

Auto hardware detection

| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU, 16 GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU, <16 GB VRAM | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
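
The decision rule is simple enough to sketch. The following is illustrative only, not llmstack's actual detection code; it assumes nvidia-smi is on PATH when an NVIDIA GPU is present:

import subprocess

def pick_backend() -> str:
    """Illustrative backend selection mirroring the table above."""
    try:
        # Total VRAM in MiB, one line per GPU
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        vram_mib = max(int(line) for line in out.splitlines() if line.strip())
        # vLLM (PagedAttention) pays off with 16 GB+ of VRAM
        return "vllm" if vram_mib >= 16 * 1024 else "ollama"
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        # No NVIDIA GPU: Apple Silicon (Metal) and plain CPU both use Ollama
        return "ollama"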

Presets

llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts

Configuration

One file: llmstack.yaml

version: "1"

models:
  chat:
    name: llama3.2
    backend: auto              # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080

CLI

| Command | Description |
|---|---|
| llmstack init [--preset] | Create config with smart defaults |
| llmstack up [--attach] | Start all services |
| llmstack down [--volumes] | Stop and clean up |
| llmstack status | Health-check all services |
| llmstack logs <service> | Stream service logs |
| llmstack doctor | Diagnose system issues |

Observability

When observe.metrics: true, llmstack boots Prometheus + Grafana with a pre-built dashboard:

  • Request rate per endpoint
  • Latency p50 / p99 histograms
  • Token throughput (input + output)
  • Error rate (4xx / 5xx)
  • Service health (up/down)

Access at http://localhost:8080 (login: admin / llmstack)
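
Grafana is the front door, but Prometheus's standard query API works too if you want raw numbers. A sketch, assuming Prometheus listens on its default :9090 and that the latency histogram is named as below (both are assumptions; check the dashboard panels for the names actually exported):

import requests

# p99 request latency over the last 5 minutes, via PromQL
query = ('histogram_quantile(0.99, '
         'rate(http_request_duration_seconds_bucket[5m]))')
resp = requests.get(
    "http://localhost:9090/api/v1/query",  # Prometheus default port (assumed)
    params={"query": query},
)
print(resp.json()["data"]["result"])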

Plugins

Extend llmstack with new backends via pip:

pip install llmstack-cli-plugin-chromadb
# Then set vectors.provider: chromadb in llmstack.yaml

Create your own: implement ServiceBase and register it via entry_points. See CONTRIBUTING.md.
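
A skeletal plugin might look like the following. The import path and the method shown are assumptions for illustration; CONTRIBUTING.md documents the real ServiceBase interface:

# my_plugin.py -- hypothetical third-party vector backend
from llmstack.plugins import ServiceBase  # import path is an assumption

class ChromaDBService(ServiceBase):
    """Registers ChromaDB as a vectors provider."""

    name = "chromadb"

    def container_spec(self) -> dict:
        # Illustrative: describe the Docker container to launch
        return {"image": "chromadb/chroma", "port": 8001}

# Exposed via entry points in pyproject.toml, e.g.:
# [project.entry-points."llmstack.plugins"]
# chromadb = "my_plugin:ChromaDBService"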

Why llmstack?

| Feature | llmstack | Ollama | Harbor | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | Partial | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Varies | No | Yes |
| Built-in vector DB | Yes | No | Config needed | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
| SSE streaming | Yes | Yes | Yes | Yes | Yes |

Requirements

  • Python 3.11+
  • Docker

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

Apache-2.0
