Interactive CLI chat client for vLLM inference servers with persistent sessions and automatic context management
Project description
Zorac - Self-Hosted Local LLM Chat Client
A fun terminal chat client for running local LLMs on consumer hardware. Chat with powerful AI models like Mistral-24B privately on your own RTX 4090/3090 - no cloud, no costs, complete privacy.
Perfect for developers who want a self-hosted ChatGPT alternative running on their gaming PC or homelab server. Also good for local AI coding assistants, agentic workflows and agent development.
Named after ZORAC, the intelligent Ganymean computer from James P. Hogan's The Gentle Giants of Ganymede.
Why Self-Host Your LLM?
- Zero ongoing costs - No API fees, run unlimited queries
- Complete privacy - Your data never leaves your machine
- Low latency - Sub-second responses on local hardware
- Use existing hardware - Your gaming GPU works great for AI
- Full control - Customize models, parameters, and behavior
- Work offline - No internet required after initial setup
Features
- Interactive CLI - Natural conversation flow with continuous input prompts
- Rich Terminal UI - Beautiful formatted output with optimized readability
- Left-aligned content with 60% width constraint for comfortable reading
- Syntax-highlighted code blocks and formatted markdown
- Clean, modern design without unnecessary clutter
- Streaming Responses - Real-time token streaming with live markdown display
- Persistent Sessions - Automatically saves and restores conversation history
- Smart Context Management - Automatically summarizes old messages when approaching token limits
- Token Tracking - Real-time monitoring of token usage with tiktoken
- Performance Metrics - Displays tokens/second, response time, and resource usage
- Configurable - Adjust all parameters via
.env, config file, or runtime commands
Demo
Rich Terminal UI with Live Streaming
Interactive chat with real-time streaming responses, markdown rendering, and performance metrics
Token Management & Commands
Built-in commands for session management and token tracking
Quick Start
1. Install Zorac
Homebrew (macOS/Linux):
brew tap chris-colinsky/zorac
brew install zorac
pip/pipx (All Platforms):
# Using pipx (recommended - isolated environment)
pipx install zorac
# Using pip
pip install zorac
# Using uv
uv tool install zorac
Windows Users: Use WSL (Windows Subsystem for Linux) and follow the Linux/pip instructions.
2. Set Up vLLM Server
You need a vLLM inference server running. See SERVER_SETUP.md for complete setup instructions.
3. Configure & Run
First Run:
When you start Zorac for the first time, you'll be greeted with a setup wizard:
$ zorac
███████╗ ██████╗ ██████╗ █████╗ ██████╗
╚══███╔╝██╔═══██╗██╔══██╗██╔══██╗██╔════╝
███╔╝ ██║ ██║██████╔╝███████║██║
███╔╝ ██║ ██║██╔══██╗██╔══██║██║
███████╗╚██████╔╝██║ ██║██║ ██║╚██████╗
╚══════╝ ╚═════╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝
intelligence running on localhost
────────────────────── Welcome to Zorac! ──────────────────────
This appears to be your first time running Zorac.
Let's configure your vLLM server connection.
Server Configuration:
Default: http://localhost:8000/v1
vLLM Server URL (or press Enter for default):
Default: stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
Model name (or press Enter for default):
✓ Configuration saved to ~/.zorac/config.json
You can change these settings anytime with /config
Viewing Configuration:
After setup, you can view or modify your configuration anytime:
# View all settings
You: /config list
Configuration:
VLLM_BASE_URL: http://localhost:8000/v1
VLLM_MODEL: stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
MAX_INPUT_TOKENS: 12000
MAX_OUTPUT_TOKENS: 4000
TEMPERATURE: 0.1
# Update a setting
You: /config set VLLM_BASE_URL http://YOUR_SERVER:8000/v1
✓ Updated VLLM_BASE_URL in ~/.zorac/config.json
# See all available commands
You: /help
Alternative (Source Users):
If running from source, you can also create a .env file:
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL=stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
Documentation
User Guides
- Installation Guide - All installation methods (binary, source, development)
- Configuration Guide - Server setup, token limits, model parameters
- Usage Guide - Commands, session management, tips & tricks
Technical Documentation
- Development Guide - Contributing, testing, release process
- Server Setup - Complete vLLM server installation and optimization
- Claude.md - AI assistant development guide
- Changelog - Version history and release notes
- Contributing - Contribution guidelines
Supported Hardware
This works on consumer gaming GPUs:
| GPU | VRAM | Model Size | Performance |
|---|---|---|---|
| RTX 4090 | 24GB | Up to 24B (AWQ) | 60-65 tok/s ⭐ |
| RTX 3090 Ti | 24GB | Up to 24B (AWQ) | 55-60 tok/s |
| RTX 3090 | 24GB | Up to 24B (AWQ) | 55-60 tok/s |
| RTX 4080 | 16GB | Up to 14B (AWQ) | 45-50 tok/s |
| RTX 4070 Ti | 12GB | Up to 7B (AWQ) | 40-45 tok/s |
| RTX 3080 | 10GB | Up to 7B (AWQ) | 35-40 tok/s |
Recommended configuration. See SERVER_SETUP.md for optimization details.
Use Cases
- Local ChatGPT alternative - Private conversations, no data collection
- Coding assistant - Works with Continue.dev, Cline, and other AI coding tools
- Agentic workflows - LangChain/LangGraph running entirely local
- Content generation - Write, summarize, analyze - all offline
- AI experimentation - Test prompts and models without API costs
- Learning AI/ML - Understand LLM inference without cloud dependencies
Why Mistral-Small-24B-AWQ?
This application is optimized for Mistral-Small-24B-Instruct-2501-AWQ:
- Superior Intelligence - 24B parameters offers significantly better reasoning than 7B/8B models
- Consumer Hardware Ready - 4-bit AWQ quantization fits in 24GB VRAM
- High Performance - AWQ with Marlin kernel enables 60-65 tok/s on RTX 4090
You can use any vLLM-compatible model by changing the VLLM_MODEL setting.
FAQ
Can I run this without a GPU?
No, this requires an NVIDIA GPU with at least 10GB VRAM. CPU-only inference is too slow for interactive chat (would take minutes per response).
How does this compare to running Ollama?
Zorac uses vLLM for faster inference (60+ tok/s vs Ollama's 20-30 tok/s on the same hardware) and supports more advanced features like tool calling for agentic workflows. Ollama is easier to set up but slower for production use.
Do I need to be online?
Only for the initial model download (~14GB for Mistral-24B-AWQ). After that, everything runs completely offline on your local machine.
Is this legal? Can I use this commercially?
Yes! Mistral-Small is Apache 2.0 licensed, which allows free commercial use. vLLM is also open source (Apache 2.0). No restrictions.
What about AMD GPUs or Mac M-series chips?
This guide is specifically for NVIDIA GPUs using CUDA. For AMD GPUs, you'd need ROCm support (experimental). For Mac M-series, check out MLX or llama.cpp instead.
How much does it cost to run?
Electricity cost for an RTX 4090 running at ~300W is roughly $0.05-0.10 per hour (depending on your electricity rates). Far cheaper than API costs for heavy usage.
What other models can I run?
Any model with vLLM support: Llama, Qwen, Phi, DeepSeek, etc. Just change the VLLM_MODEL setting. Check vLLM's supported models.
Requirements
- For Binary Users: Nothing! Just download and run.
- For Source Users: Python 3.13+,
uvpackage manager - For Server: NVIDIA GPU with 10GB+ VRAM, vLLM inference server
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Support
- Read the Documentation
- Report bugs via GitHub Issues
- Request features via GitHub Issues
- Check vLLM Documentation for server issues
Star this repo if you find it useful! ⭐
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file zorac-1.2.0.tar.gz.
File metadata
- Download URL: zorac-1.2.0.tar.gz
- Upload date:
- Size: 950.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a9cc33bcb644ada4fd1a4c799b5fd7062f66d47c0d9880f9aa3770db3b3cfbd5
|
|
| MD5 |
29742f7106f011ead95f811039482d87
|
|
| BLAKE2b-256 |
43ad2f3316edfb0ca63b9c3bacdaf201ab8adcea4d73847bb118366142b19829
|
Provenance
The following attestation bundles were made for zorac-1.2.0.tar.gz:
Publisher:
release.yml on chris-colinsky/Zorac
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
zorac-1.2.0.tar.gz -
Subject digest:
a9cc33bcb644ada4fd1a4c799b5fd7062f66d47c0d9880f9aa3770db3b3cfbd5 - Sigstore transparency entry: 907828586
- Sigstore integration time:
-
Permalink:
chris-colinsky/Zorac@3b0ff21767b41cde5b8de177faa271d8dc9a771f -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/chris-colinsky
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@3b0ff21767b41cde5b8de177faa271d8dc9a771f -
Trigger Event:
push
-
Statement type:
File details
Details for the file zorac-1.2.0-py3-none-any.whl.
File metadata
- Download URL: zorac-1.2.0-py3-none-any.whl
- Upload date:
- Size: 20.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
650971b31671383116a21dfc6c7e32ccc9b3ca4c72a0fa9cb0f5336b69c22fb9
|
|
| MD5 |
64856c7eb1eb4006d4699fefb7874379
|
|
| BLAKE2b-256 |
43d5e61e79adb80de811b29ad22b7732ea076e8363cf9309a209545da507605d
|
Provenance
The following attestation bundles were made for zorac-1.2.0-py3-none-any.whl:
Publisher:
release.yml on chris-colinsky/Zorac
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
zorac-1.2.0-py3-none-any.whl -
Subject digest:
650971b31671383116a21dfc6c7e32ccc9b3ca4c72a0fa9cb0f5336b69c22fb9 - Sigstore transparency entry: 907828597
- Sigstore integration time:
-
Permalink:
chris-colinsky/Zorac@3b0ff21767b41cde5b8de177faa271d8dc9a771f -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/chris-colinsky
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@3b0ff21767b41cde5b8de177faa271d8dc9a771f -
Trigger Event:
push
-
Statement type: