
Project description

llmstack

One command. Full LLM stack. Zero config.

Stop wiring Docker containers. Start building AI apps.


Quick Start

pip install llmstack-cli
llmstack init --preset rag
llmstack up

That's it. You now have 7 services running: inference, embeddings, vector DB, cache, API gateway, Prometheus, and Grafana.

# Chat completion (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'

# Ingest a document for RAG
curl http://localhost:8000/v1/rag/ingest \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"LLMStack is an open-source tool for...","source":"docs.txt"}'

# Query with RAG
curl http://localhost:8000/v1/rag/query \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is LLMStack?"}'

Works with any OpenAI-compatible client: LangChain, LlamaIndex, Vercel AI SDK, openai-python.

Who is this for?

  • AI app developers who want local inference + RAG without Docker boilerplate
  • Teams who need an OpenAI-compatible API backed by local models
  • Hobbyists running LLMs locally who want vector search, caching, and monitoring out of the box
  • Anyone tired of writing 200+ lines of docker-compose.yml every time

Architecture

                         llmstack up
                              |
                    +---------v----------+
                    |   Hardware Detect   |
                    |  NVIDIA / Apple / CPU|
                    +---------+----------+
                              |
              +-------+-------+-------+-------+
              |       |       |       |       |
         +----v--+ +--v---+ +v-----+ +v----+ +v-----------+
         |Qdrant | |Redis | |Ollama| | TEI | |  Gateway    |
         |Vector | |Cache | | or   | |Embed| |  FastAPI    |
         |  DB   | |+ Rate| | vLLM | |     | |  + RAG      |
         |       | | Limit| |      | |     | |  + Cache    |
         +-------+ +------+ +------+ +-----+ |  + Breaker  |
              :6333   :6379   :11434   :8002  |  + Metrics  |
                                              +-----+------+
                                                    |:8000
                                              +-----v------+
                                              | Prometheus  |
                                              |  + Grafana  |
                                              +------------+
                                                    :8080
Layer       | Service              | What it does                                 | Port
Inference   | Ollama / vLLM (auto) | LLM chat completions                         | 11434
Embeddings  | TEI / Ollama (auto)  | Text embeddings for RAG                      | 8002
Vector DB   | Qdrant               | Document storage + semantic search           | 6333
Cache       | Redis                | Response cache + rate limit state            | 6379
API Gateway | FastAPI              | Routing, auth, caching, RAG, circuit breaker | 8000
Dashboard   | Grafana + Prometheus | Request rate, latency, tokens, errors        | 8080

Gateway Features

The gateway is not a simple proxy — it's a production-grade API layer:

Semantic Response Cache (Redis)

Request → SHA-256(model + messages) → Redis lookup
  HIT  → Return cached response (< 1ms)
  MISS → Forward to inference → Cache result → Return
  • Only caches deterministic requests (temperature <= 0.1)
  • TTL-based expiration (default: 1 hour)
  • X-Cache: HIT/MISS response headers
  • Cache stats in /healthz
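
A minimal sketch of the cache-key scheme described above, written against redis-py. The key prefix, function names, and TTL handling are illustrative, not llmstack's actual internals:

import hashlib, json
import redis

r = redis.Redis(host="localhost", port=6379)

CACHE_TTL = 3600  # default: 1 hour

def cache_key(payload: dict) -> str:
    # Key is derived from the model plus the full message list, so identical
    # deterministic requests map to the same Redis entry.
    raw = json.dumps({"model": payload["model"], "messages": payload["messages"]},
                     sort_keys=True)
    return "llm:cache:" + hashlib.sha256(raw.encode()).hexdigest()

def cached_completion(payload: dict, forward):
    # Only near-deterministic requests (temperature <= 0.1) are cacheable.
    if payload.get("temperature", 1.0) > 0.1:
        return forward(payload), "BYPASS"
    key = cache_key(payload)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit), "HIT"
    result = forward(payload)
    r.setex(key, CACHE_TTL, json.dumps(result))
    return result, "MISS"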

Token Bucket Rate Limiter (Redis + Lua)

Request → Extract API key/IP → Redis EVALSHA (atomic Lua) → Allow/Reject
  • Configurable: 100/min, 10/sec, 3600/hour
  • Per-API-key rate limiting with IP fallback
  • Atomic Lua script prevents race conditions
  • In-memory fallback if Redis is unavailable
  • Standard X-RateLimit-* and Retry-After headers
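
For reference, a token bucket like the one described can be expressed as a single Lua script registered through redis-py (register_script runs it via EVALSHA). This is a hand-rolled sketch with illustrative key names and refill math, not the script llmstack ships:

import time
import redis

r = redis.Redis()

TOKEN_BUCKET = r.register_script("""
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])   -- tokens refilled per second
local now      = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity
local ts     = tonumber(bucket[2]) or now

tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end

redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return allowed and 1 or 0
""")

def allow(api_key: str, capacity: int = 100, per_seconds: int = 60) -> bool:
    # e.g. 100/min -> capacity=100, refill rate = 100/60 tokens per second
    key = f"llm:ratelimit:{api_key}"
    return TOKEN_BUCKET(keys=[key],
                        args=[capacity, capacity / per_seconds, time.time()]) == 1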

Circuit Breaker (Inference Resilience)

CLOSED ──[5 failures]──> OPEN ──[timeout]──> HALF_OPEN ──[success]──> CLOSED
                           |                      |
                           └──[reject fast]       └──[failure]──> OPEN (backoff x2)
  • Prevents cascading failures when inference is down
  • Exponential backoff on recovery timeout
  • Fail-fast with 503 Service Unavailable
  • Metrics: state, failure count, rejections, time in state
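
Conceptually the breaker is a small state machine. The sketch below shows the general shape; the threshold, timeout, and doubling backoff are illustrative defaults, not llmstack's exact implementation:

import time

class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, probes again
    after a recovery timeout that doubles each time the probe fails."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.base_timeout = recovery_timeout
        self.timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                # Fail fast instead of piling requests onto a dead backend (-> 503).
                raise RuntimeError("circuit open: inference unavailable")
            self.state = "HALF_OPEN"   # let a single probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN":
                self.timeout *= 2      # failed probe: back off exponentially
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Success: close the circuit and reset counters and backoff.
        self.failures = 0
        self.timeout = self.base_timeout
        self.state = "CLOSED"
        return result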

RAG Pipeline (Qdrant + Embeddings)

Ingest: Document → Chunk (512 words, 64 overlap) → Embed → Qdrant
Query:  Question → Embed → Qdrant search → Build context → LLM generate
  • POST /v1/rag/ingest — chunk, embed, and store documents
  • POST /v1/rag/query — semantic search + augmented generation
  • Source citations in responses
  • Streaming support via SSE
  • Deterministic chunk IDs (deduplication)
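
The ingest side boils down to word-window chunking with stable IDs. A rough sketch, assuming the 512-word / 64-word-overlap defaults above and a SHA-256-derived chunk ID (the exact ID scheme is an assumption); each chunk would then be embedded and upserted into Qdrant under its ID:

import hashlib

def chunk_text(text: str, source: str, size: int = 512, overlap: int = 64):
    """Split a document into overlapping word windows with deterministic IDs,
    so re-ingesting the same text deduplicates instead of duplicating points."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        body = " ".join(words[start:start + size])
        if not body:
            break
        chunk_id = hashlib.sha256(f"{source}:{start}:{body}".encode()).hexdigest()
        chunks.append({"id": chunk_id, "text": body, "source": source})
        if start + size >= len(words):
            break
    return chunks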

Structured Logging

{"ts":"2026-05-07T14:23:01","level":"INFO","msg":"POST /v1/chat/completions 200 1234.5ms","request_id":"a1b2c3d4","method":"POST","path":"/v1/chat/completions","status":200,"duration_ms":1234.5,"client_ip":"10.0.0.1"}
  • X-Request-ID correlation headers
  • JSON structured output for log aggregation
  • Configurable level and format
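
In FastAPI terms this kind of access log is typically a single HTTP middleware. The sketch below mirrors the field layout of the sample line above; the middleware itself is illustrative, not llmstack's code:

import json, logging, time, uuid

from fastapi import FastAPI, Request

app = FastAPI()
log = logging.getLogger("gateway")

@app.middleware("http")
async def access_log(request: Request, call_next):
    # Reuse the caller's X-Request-ID if present, otherwise mint one,
    # and echo it back so client and server logs can be correlated.
    request_id = request.headers.get("X-Request-ID", uuid.uuid4().hex[:8])
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000
    response.headers["X-Request-ID"] = request_id
    log.info(json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "level": "INFO",
        "msg": f"{request.method} {request.url.path} {response.status_code} {duration_ms:.1f}ms",
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "duration_ms": round(duration_ms, 1),
        "client_ip": request.client.host if request.client else None,
    }))
    return response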

How it works

llmstack init       # Detects hardware, generates llmstack.yaml
                    # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise

llmstack up         # Boots services in order with health checks:
                    # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics

llmstack status     # Shows health of all running services
llmstack chat       # Interactive terminal chat with streaming
llmstack logs ollama # Stream inference logs
llmstack down       # Stops everything

Auto hardware detection

Your hardware         | Backend | Why
NVIDIA GPU 16GB+ VRAM | vLLM    | Max throughput, PagedAttention
NVIDIA GPU <16GB      | Ollama  | Lower memory overhead
Apple Silicon (M1-M4) | Ollama  | Metal acceleration
CPU only              | Ollama  | GGUF quantized models
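
A rough sketch of how that decision table can be implemented, probing nvidia-smi for VRAM and falling back to Ollama otherwise; llmstack's detector may check more than this:

import shutil
import subprocess

def pick_backend() -> str:
    """Pick an inference backend following the table above (illustrative only)."""
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        if out.returncode == 0 and out.stdout.strip():
            vram_mb = max(int(v) for v in out.stdout.split())
            return "vllm" if vram_mb >= 16 * 1024 else "ollama"
    # Apple Silicon (Metal) and plain CPU both run Ollama
    return "ollama"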

Presets

llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts

Configuration

One file: llmstack.yaml

version: "1"

models:
  chat:
    name: llama3.2
    backend: auto              # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
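
Since the config layer is Pydantic v2 (see Tech stack), llmstack.yaml maps naturally onto typed models. The classes and defaults below are an illustrative subset, not the project's actual schema:

from typing import Literal
import yaml
from pydantic import BaseModel

class ChatModel(BaseModel):
    name: str = "llama3.2"
    backend: Literal["auto", "ollama", "vllm"] = "auto"
    context_length: int = 8192

class Gateway(BaseModel):
    port: int = 8000
    auth: str = "api_key"
    rate_limit: str = "100/min"
    cors: list[str] = ["*"]

class Config(BaseModel):
    version: str = "1"
    models: dict = {"chat": ChatModel().model_dump()}
    gateway: Gateway = Gateway()

def load_config(path: str = "llmstack.yaml") -> Config:
    # Unknown keys and full validation are omitted in this sketch.
    with open(path) as f:
        return Config.model_validate(yaml.safe_load(f) or {})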

API Reference

OpenAI-compatible endpoints

Endpoint             | Method | Description
/v1/chat/completions | POST   | Chat completion (streaming + non-streaming)
/v1/embeddings       | POST   | Text embeddings
/v1/models           | GET    | List available models

RAG endpoints

Endpoint                   | Method | Description
/v1/rag/ingest             | POST   | Ingest a document (chunk + embed + store)
/v1/rag/query              | POST   | Query with retrieval-augmented generation
/v1/rag/documents/{source} | DELETE | Delete documents by source
/v1/rag/status             | GET    | Collection statistics

System endpoints

Endpoint | Method | Description
/healthz | GET    | Health check with circuit breaker + cache stats
/metrics | GET    | Prometheus metrics

Interactive Chat

llmstack chat
LLMStack Chat — model: llama3.2
Type 'exit' or Ctrl+C to quit. '/clear' to reset conversation.

You: What is quantum computing?
Assistant: Quantum computing uses quantum mechanical phenomena like
superposition and entanglement to process information...

You: /clear
Conversation cleared.

Export to Docker Compose

llmstack export
# Exported 7 services to docker-compose.yml
# Run with: docker compose up -d

Share the generated file with your team — no llmstack dependency required.

Use the API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")

# Chat completion
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Embeddings
embeddings = client.embeddings.create(
    model="bge-m3",
    input=["Hello world"]
)
import httpx

# RAG: Ingest documents
httpx.post("http://localhost:8000/v1/rag/ingest", json={
    "text": open("whitepaper.txt").read(),
    "source": "whitepaper.txt",
}, headers={"Authorization": "Bearer YOUR_KEY"})

# RAG: Query
response = httpx.post("http://localhost:8000/v1/rag/query", json={
    "question": "What are the key findings?",
    "top_k": 5,
}, headers={"Authorization": "Bearer YOUR_KEY"})

print(response.json()["answer"])
print(response.json()["sources"])

CLI

Command                    | Description
llmstack init [--preset]   | Create config with smart defaults
llmstack up [--attach]     | Start all services
llmstack down [--volumes]  | Stop and clean up
llmstack status            | Health check all services
llmstack chat [--model]    | Interactive terminal chat
llmstack export [--output] | Generate docker-compose.yml
llmstack logs <service>    | Stream service logs
llmstack doctor            | Diagnose system issues

Observability

When observe.metrics: true, llmstack boots Prometheus + Grafana with a pre-built dashboard:

  • Request rate per endpoint
  • Latency p50 / p99 histograms
  • Token throughput (input + output)
  • Error rate (4xx / 5xx)
  • Cache hit rate
  • Circuit breaker state
  • Rate limit rejections

Access at http://localhost:8080 (login: admin / llmstack)
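
If you want the same numbers without Grafana, the health and metrics endpoints can be polled directly. A small httpx example; the exact shape of the /healthz payload isn't specified here, so it is just printed as-is:

import httpx

BASE = "http://localhost:8000"

# /healthz includes circuit breaker and cache stats (see System endpoints)
print(httpx.get(f"{BASE}/healthz").json())

# /metrics is plain-text Prometheus exposition format
for line in httpx.get(f"{BASE}/metrics").text.splitlines():
    if line and not line.startswith("#"):
        print(line)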

Comparison

Feature                 | llmstack | Ollama | LocalAI | AnythingLLM | LiteLLM
One-command full stack  | Yes      | No     | No      | Partial     | No
Built-in RAG pipeline   | Yes      | No     | No      | Bundled     | No
Response caching        | Yes      | No     | No      | No          | No
Circuit breaker         | Yes      | No     | No      | No          | No
Rate limiting (Redis)   | Yes      | No     | No      | Yes         | Yes
Auto hardware detection | Yes      | No     | No      | No          | No
OpenAI-compatible API   | Yes      | Yes    | Yes     | No          | Yes
Built-in vector DB      | Yes      | No     | No      | Bundled     | No
Observability dashboard | Yes      | No     | Partial | No          | Partial
Plugin ecosystem        | Yes      | No     | No      | No          | No

Plugins

Extend llmstack with new backends via pip:

pip install llmstack-cli-plugin-chromadb
# Now: vectors.provider: chromadb in llmstack.yaml

Create your own: implement ServiceBase and register it via a Python entry point (a skeleton sketch follows). See CONTRIBUTING.md.
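
A skeleton of what such a plugin might look like. ServiceBase is named above, but the import path, the method to implement, and the entry-point group below are assumptions; check CONTRIBUTING.md for the real interface.

# my_chromadb_plugin.py (illustrative skeleton, not a published plugin)
from llmstack.services import ServiceBase   # import path is an assumption

class ChromaDBService(ServiceBase):
    """Registers a ChromaDB container as the `vectors` provider."""
    name = "chromadb"

    def container_spec(self) -> dict:
        # Whatever container definition ServiceBase expects (method name assumed).
        return {"image": "chromadb/chroma:latest", "ports": {"8001/tcp": 8000}}

# pyproject.toml (entry-point group name is an assumption):
# [project.entry-points."llmstack.plugins"]
# chromadb = "my_chromadb_plugin:ChromaDBService"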

Tech stack

  • CLI: Typer + Rich
  • Config: Pydantic v2
  • Gateway: FastAPI + Redis + Qdrant
  • Containers: Docker SDK for Python
  • Cache: Redis with semantic hashing
  • Rate Limiting: Token bucket with Redis Lua scripts
  • Resilience: Circuit breaker with exponential backoff
  • Metrics: Prometheus + Grafana

Requirements

  • Python 3.11+
  • Docker

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmstack_cli-0.3.0.tar.gz (988.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmstack_cli-0.3.0-py3-none-any.whl (65.7 kB)

Uploaded Python 3

File details

Details for the file llmstack_cli-0.3.0.tar.gz.

File metadata

  • Download URL: llmstack_cli-0.3.0.tar.gz
  • Upload date:
  • Size: 988.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llmstack_cli-0.3.0.tar.gz
Algorithm Hash digest
SHA256 87d516b32c125c905db492a07e63f2f80784942e147750bc4c84711e0d558fba
MD5 8dc693e2a3fd3a031974d5ba2b34ced6
BLAKE2b-256 1d82ec497753ac208fd6d4f885db0b1d1dd653efa13b0b2316288ca906c4ab73

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmstack_cli-0.3.0.tar.gz:

Publisher: release.yml on mara-werils/llmstack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llmstack_cli-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: llmstack_cli-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 65.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llmstack_cli-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2a3028651b8bdabcb4d909bb907f1493998b289cebcb3638552ef566f13ea044
MD5 da08381ad6cf6c501db772489aacd2a6
BLAKE2b-256 ae4af6374df9ddf81a38818aff862406e58d0c46d2af6988aa3d805bfa04d012

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmstack_cli-0.3.0-py3-none-any.whl:

Publisher: release.yml on mara-werils/llmstack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
