Skip to main content

Interactive CLI chat client for vLLM inference servers with persistent sessions and automatic context management

Project description

Zorac - Self-Hosted Local LLM Chat Client

PyPI Python License vLLM Platform GPU

A fun terminal chat client for running local LLMs on consumer hardware. Chat with powerful AI models like Mistral-24B privately on your own RTX 4090/3090 - no cloud, no costs, complete privacy.

Perfect for developers who want a self-hosted ChatGPT alternative running on their gaming PC or homelab server. Also good for local AI coding assistants, agentic workflows and agent development.

Named after ZORAC, the intelligent Ganymean computer from James P. Hogan's The Gentle Giants of Ganymede.

Why Self-Host Your LLM?

  • Zero ongoing costs - No API fees, run unlimited queries
  • Complete privacy - Your data never leaves your machine
  • Low latency - Sub-second responses on local hardware
  • Use existing hardware - Your gaming GPU works great for AI
  • Full control - Customize models, parameters, and behavior
  • Work offline - No internet required after initial setup

Features

  • Interactive CLI - Natural conversation flow with continuous input prompts
  • Rich Terminal UI - Beautiful formatted output with optimized readability
    • Left-aligned content with 60% width constraint for comfortable reading
    • Syntax-highlighted code blocks and formatted markdown
    • Clean, modern design without unnecessary clutter
  • Streaming Responses - Real-time token streaming with live markdown display
  • Persistent Sessions - Automatically saves and restores conversation history
  • Smart Context Management - Automatically summarizes old messages when approaching token limits
  • Token Tracking - Real-time monitoring of token usage with tiktoken
  • Performance Metrics - Displays tokens/second, response time, and resource usage
  • Configurable - Adjust all parameters via .env, config file, or runtime commands

Demo

Rich Terminal UI with Live Streaming

Interactive chat with real-time streaming responses, markdown rendering, and performance metrics

Zorac Chat Interface

Token Management & Commands

Built-in commands for session management and token tracking

Token Usage and Commands

Quick Start

1. Install Zorac

Homebrew (macOS/Linux):

brew tap chris-colinsky/zorac
brew install zorac

pip/pipx (All Platforms):

# Using pipx (recommended - isolated environment)
pipx install zorac

# Using pip
pip install zorac

# Using uv
uv tool install zorac

Windows Users: Use WSL (Windows Subsystem for Linux) and follow the Linux/pip instructions.

2. Set Up vLLM Server

You need a vLLM inference server running. See SERVER_SETUP.md for complete setup instructions.

3. Configure & Run

First Run:

When you start Zorac for the first time, you'll be greeted with a setup wizard:

$ zorac

     ███████╗ ██████╗ ██████╗  █████╗  ██████╗
     ╚══███╔╝██╔═══██╗██╔══██╗██╔══██╗██╔════╝
       ███╔╝ ██║   ██║██████╔╝███████║██║
      ███╔╝  ██║   ██║██╔══██╗██╔══██║██║
     ███████╗╚██████╔╝██║  ██║██║  ██║╚██████╗
     ╚══════╝ ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝
        intelligence running on localhost

────────────────────── Welcome to Zorac! ──────────────────────

This appears to be your first time running Zorac.
Let's configure your vLLM server connection.

Server Configuration:
  Default: http://localhost:8000/v1
  vLLM Server URL (or press Enter for default):

  Default: stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
  Model name (or press Enter for default):

✓ Configuration saved to ~/.zorac/config.json
You can change these settings anytime with /config

Viewing Configuration:

After setup, you can view or modify your configuration anytime:

# View all settings
You: /config list

Configuration:
  VLLM_BASE_URL:      http://localhost:8000/v1
  VLLM_MODEL:         stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
  MAX_INPUT_TOKENS:   12000
  MAX_OUTPUT_TOKENS:  4000
  TEMPERATURE:        0.1

# Update a setting
You: /config set VLLM_BASE_URL http://YOUR_SERVER:8000/v1
✓ Updated VLLM_BASE_URL in ~/.zorac/config.json

# See all available commands
You: /help

Alternative (Source Users):

If running from source, you can also create a .env file:

VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL=stelterlab/Mistral-Small-24B-Instruct-2501-AWQ

Documentation

User Guides

Technical Documentation

Supported Hardware

This works on consumer gaming GPUs:

GPU VRAM Model Size Performance
RTX 4090 24GB Up to 24B (AWQ) 60-65 tok/s ⭐
RTX 3090 Ti 24GB Up to 24B (AWQ) 55-60 tok/s
RTX 3090 24GB Up to 24B (AWQ) 55-60 tok/s
RTX 4080 16GB Up to 14B (AWQ) 45-50 tok/s
RTX 4070 Ti 12GB Up to 7B (AWQ) 40-45 tok/s
RTX 3080 10GB Up to 7B (AWQ) 35-40 tok/s

Recommended configuration. See SERVER_SETUP.md for optimization details.

Use Cases

  • Local ChatGPT alternative - Private conversations, no data collection
  • Coding assistant - Works with Continue.dev, Cline, and other AI coding tools
  • Agentic workflows - LangChain/LangGraph running entirely local
  • Content generation - Write, summarize, analyze - all offline
  • AI experimentation - Test prompts and models without API costs
  • Learning AI/ML - Understand LLM inference without cloud dependencies

Why Mistral-Small-24B-AWQ?

This application is optimized for Mistral-Small-24B-Instruct-2501-AWQ:

  • Superior Intelligence - 24B parameters offers significantly better reasoning than 7B/8B models
  • Consumer Hardware Ready - 4-bit AWQ quantization fits in 24GB VRAM
  • High Performance - AWQ with Marlin kernel enables 60-65 tok/s on RTX 4090

You can use any vLLM-compatible model by changing the VLLM_MODEL setting.

FAQ

Can I run this without a GPU?

No, this requires an NVIDIA GPU with at least 10GB VRAM. CPU-only inference is too slow for interactive chat (would take minutes per response).

How does this compare to running Ollama?

Zorac uses vLLM for faster inference (60+ tok/s vs Ollama's 20-30 tok/s on the same hardware) and supports more advanced features like tool calling for agentic workflows. Ollama is easier to set up but slower for production use.

Do I need to be online?

Only for the initial model download (~14GB for Mistral-24B-AWQ). After that, everything runs completely offline on your local machine.

Is this legal? Can I use this commercially?

Yes! Mistral-Small is Apache 2.0 licensed, which allows free commercial use. vLLM is also open source (Apache 2.0). No restrictions.

What about AMD GPUs or Mac M-series chips?

This guide is specifically for NVIDIA GPUs using CUDA. For AMD GPUs, you'd need ROCm support (experimental). For Mac M-series, check out MLX or llama.cpp instead.

How much does it cost to run?

Electricity cost for an RTX 4090 running at ~300W is roughly $0.05-0.10 per hour (depending on your electricity rates). Far cheaper than API costs for heavy usage.

What other models can I run?

Any model with vLLM support: Llama, Qwen, Phi, DeepSeek, etc. Just change the VLLM_MODEL setting. Check vLLM's supported models.

Requirements

  • For Binary Users: Nothing! Just download and run.
  • For Source Users: Python 3.13+, uv package manager
  • For Server: NVIDIA GPU with 10GB+ VRAM, vLLM inference server

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Support


Star this repo if you find it useful! ⭐

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zorac-1.2.0.tar.gz (950.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

zorac-1.2.0-py3-none-any.whl (20.6 kB view details)

Uploaded Python 3

File details

Details for the file zorac-1.2.0.tar.gz.

File metadata

  • Download URL: zorac-1.2.0.tar.gz
  • Upload date:
  • Size: 950.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for zorac-1.2.0.tar.gz
Algorithm Hash digest
SHA256 a9cc33bcb644ada4fd1a4c799b5fd7062f66d47c0d9880f9aa3770db3b3cfbd5
MD5 29742f7106f011ead95f811039482d87
BLAKE2b-256 43ad2f3316edfb0ca63b9c3bacdaf201ab8adcea4d73847bb118366142b19829

See more details on using hashes here.

Provenance

The following attestation bundles were made for zorac-1.2.0.tar.gz:

Publisher: release.yml on chris-colinsky/Zorac

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file zorac-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: zorac-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 20.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for zorac-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 650971b31671383116a21dfc6c7e32ccc9b3ca4c72a0fa9cb0f5336b69c22fb9
MD5 64856c7eb1eb4006d4699fefb7874379
BLAKE2b-256 43d5e61e79adb80de811b29ad22b7732ea076e8363cf9309a209545da507605d

See more details on using hashes here.

Provenance

The following attestation bundles were made for zorac-1.2.0-py3-none-any.whl:

Publisher: release.yml on chris-colinsky/Zorac

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page