Skip to main content

Hybrid LLM runtime — minimal VRAM, always-on GPU prefill, optimised CPU inference

Project description

Krasis

Rust + PyO3 MoE runtime for large mixture-of-experts LLMs. Runs 350B+ parameter models on commodity hardware with full GPU prefill and efficient CPU decode.

You can contact me here but please don't ask for help getting Krasis working. If a model doesn't work or a particular hardware config then you can try to narrow it down and then report an issue.

Krasis runs MoE LLMs fast on consumer level hardware

Krasis can run MoE language models that are much too large to fit in a consumer GPU (multi-hundred gigabyte modesl with 100 - 500+ billion parameters) on consumer or accessible server hardware you can actually buy without a second mortgage and your own personal power station.

Crucially, it runs these models at a speed that is usable.

Qwen3-Coder-Next / 856 tok/s prefill / 10.5 tok/s decode##

For example, running Qwen3-Coder-Next (80B params, 146GB BF16) on a single-cpu Epyc server (7742) with 2x Ada 2000 16GB, Krasis achieves 856 tokens/sec prefill and 10.5 tokens/sec decode

How LLMs work

LLM model operation consist of two key steps:

  1. Prefill (handling potentially large amounts of input coming into the model)
  2. Decode (handling the generation of text after processing the input data)

These are essentially the LLM reading (prefill) and writing (decode).

Prefill is best handled by the GPUs (large amounts of very parallel matrix multiplication, but on typical LLM runtimes its not possible to do more than offload a little of the large model onto the GPU.

The result is that you enter a simple chat prompt and it responds in a reasonable time, but if you hand it a file to read or try to work with it in an IDE, you wait minutes for it to even start generating text.

Krasis employs a different approach that utilises the GPU and system RAM more heavily which results in much faster prefill times. In practice this means the model will generate text at a similar speed (faster in some cases due to other optimisations) but you wait much less time for an answer, and the model can read files much more quickly.

Krasis tradeoffs

In order to achieve these speeds, Krasis has a few requirements.

  • Krasis uses more system RAM than other runtimes, you may need 2x the model weights worth of system ram (so to run a 100GB model you may need 200GB of system ram), but this is almost always far more achievable than the equivalent VRAM.
  • Krasis must be given the BF16 safetensors model* downloaded from (HuggingFace)[https://huggingface.co/]
  • Krasis can build everything it needs from this model or if you prefer you can give it a second GGUF model (in addition to the BF16 safetensors model) which takes advantage of more advanced quantisation (e.g. unsloth Q4_K models)
  • Krasis currently only works with NVidia GPUs
  • Krasis may take some time on the first run as it is doing a lot of pre-run work to optimise everything, major parts of this are cached for later runs though so they are generally much shorter startup times.
  • Krasis optimises models and caches them in .krasis, these can be large so you may need the original model x3 space or if you provide a GGUF in addition to the BF16 you may need 4x the space.

Known Supported Models and Benchmark Speeds

Speeds reported in the following models are benchmarked on the following hardware:

  • Epyc 7742
  • DDR4 2666 RAM (8x channels)
  • 2x RTX Ada 2000
Model Params BF16 Size Experts Attention Prefill Decode
Qwen3-Coder-Next 80B 148 GB 512 routed, top-10 Hybrid (36 linear + 12 GQA) 812 tok/s 10.5 tok/s
Qwen3-235B-A22B 235B 438 GB 128 routed, top-8 GQA 198 tok/s 1.65 tok/s
DeepSeek V2-Lite 16B 29 GB 64 + 2 shared, top-6 MLA 2,400 tok/s 5.8 tok/s
GLM-4.7 358B 667 GB 160 + 1 shared, top-8 GQA (partial RoPE, bias) untested untested

Quick Start

Option A: pipx install (recommended)

# Install pipx if you don't have it
sudo apt install pipx   # Ubuntu/Debian
# or: pip install --user pipx

# Install Krasis (isolated environment, no conflicts)
pipx install krasis
pipx ensurepath        # adds ~/.local/bin to PATH (restart terminal or source ~/.bashrc)

# PyTorch with CUDA is required — inject into the pipx environment
pipx inject krasis torch --index-url https://download.pytorch.org/whl/cu126

# Download a model into ~/.krasis/models/
huggingface-cli download Qwen/Qwen3-Coder-Next \
    --local-dir ~/.krasis/models/Qwen3-Coder-Next

# Launch
krasis

Alternative: If you prefer pip, create a venv first: python3 -m venv ~/.krasis-env && source ~/.krasis-env/bin/activate && pip install krasis torch --index-url https://download.pytorch.org/whl/cu126

Option B: from source

# Prerequisites (Ubuntu/Debian)
sudo apt update && sudo apt install python3.12-venv

# Clone and run — everything else is automatic
git clone https://github.com/brontoguana/krasis.git
cd krasis
./krasis

Building from Source

The ./krasis launcher handles building automatically on first run. For manual/development setup:

git clone https://github.com/brontoguana/krasis.git
cd krasis
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

# PyTorch must be installed separately
pip install torch --index-url https://download.pytorch.org/whl/cu126

Usage

Interactive Launcher (recommended)

krasis        # pip install
./krasis      # from source

The launcher walks you through a TUI with four screens:

  1. Model selection — scans ~/.krasis/models/ for safetensors models, shows architecture, layer count, expert count, and estimated RAM
  2. CPU expert source — build INT4 or INT8 from the native model, or select an existing GGUF file
  3. GPU selection — multi-select your GPUs (Space to toggle, Enter to confirm)
  4. Configuration editor — tune all quantization and runtime options with a live VRAM budget display showing per-GPU memory usage and estimated context length

All settings are saved to ~/.krasis/config and reloaded on subsequent launches.

On the final screen you can choose to launch immediately or run a benchmark first.

Non-Interactive Launch

# Use saved config from last TUI session
krasis --non-interactive

# Override specific settings
krasis --non-interactive --model-path /path/to/model --num-gpus 2 --benchmark

Benchmark Suite

Run all model × config combinations automatically from a single config file. Edit benchmarks/benchmark_suite.toml to define which models and hardware configurations to test:

[[config]]
num_gpus = 1
gpu_expert_bits = 4
cpu_expert_bits = 4

[[config]]
num_gpus = 2
gpu_expert_bits = 4
cpu_expert_bits = 4

[[model]]
name = "DeepSeek-V2-Lite"

[[model]]
name = "Qwen3-235B-A22B"
gguf_name = "Qwen3-235B-A22B-GGUF"   # searched in ~/.krasis/models/ subdirs

Model name is the directory name under ~/.krasis/models/. Use gguf_name to pair a native model with a GGUF for CPU experts (filename searched in models dir), or gguf_path for an absolute path. Config fields include num_gpus, gpu_expert_bits, cpu_expert_bits, attention_quant, kv_dtype, and more — see the config file comments for the full list.

Run the suite:

krasis --benchmark-suite                           # uses benchmarks/benchmark_suite.toml
krasis --benchmark-suite /path/to/custom.toml      # custom config

Each combination runs as an isolated subprocess. Per-combo logs are saved to benchmarks/suite_logs/ and a markdown summary table is generated at the end.

For launcher flags, per-component quantization options, and direct server usage, see ADVANCED.md.

Chat Client

krasis-chat                          # auto-discovers running servers
krasis-chat --port 8012              # connect to specific port
krasis-chat --url http://host:8012   # connect to remote server
krasis-chat --temperature 0.3        # override sampling temperature

The chat client auto-discovers running Krasis servers via ~/.krasis/servers/. Commands: /new (clear history), /system PROMPT (change system prompt), /exit.

API

The server exposes an OpenAI-compatible API at http://localhost:8012/v1/chat/completions with SSE streaming, compatible with Cursor, OpenCode, and any OpenAI SDK client.

Additional endpoints:

  • GET /health — server status
  • GET /v1/models — list loaded models
  • POST /v1/timing — toggle instrumentation at runtime

License

AGPL-3.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krasis-0.1.15.tar.gz (565.2 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

krasis-0.1.15-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

krasis-0.1.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

krasis-0.1.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

krasis-0.1.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

File details

Details for the file krasis-0.1.15.tar.gz.

File metadata

  • Download URL: krasis-0.1.15.tar.gz
  • Upload date:
  • Size: 565.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for krasis-0.1.15.tar.gz
Algorithm Hash digest
SHA256 60807bb3f3312db1559a7b149e498cefd908335f6d5a634c72f8697f6f15a62d
MD5 3b7211e4ea27bb3e45bc11b8fe95a945
BLAKE2b-256 eb6e91ce95b103447dd97efab4ec73af6543b069cf8c3622a2b58c6727d3d01a

See more details on using hashes here.

File details

Details for the file krasis-0.1.15-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for krasis-0.1.15-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 79f25deed4378238b740953acfc7c0b6c8a0c2c9ecf5c23ad9d0dbf2b4df8c72
MD5 1e26f1a21db87c30ce21aeac53ac4bbb
BLAKE2b-256 aba9c7d7fe753f38a99ab2b211f9193ef90058ae1a10a06cc7f482bbdab68414

See more details on using hashes here.

File details

Details for the file krasis-0.1.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for krasis-0.1.15-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b3e58f594646fbf4614bc26bda760e502041aa48a68276e237fd1560a137ab2f
MD5 f4f83dc7c0af5b4a37a3957af94ed633
BLAKE2b-256 2cffe7c69bf33f0de43a1fea1ae8fd25ddb9fd28a51c8b9ac16e67c66a5c5cca

See more details on using hashes here.

File details

Details for the file krasis-0.1.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for krasis-0.1.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 31853066eb01247020f2188a1a715b20af90263a2fdb537d5f5e370cdd4a6ce9
MD5 a4721dcc5a91b774fa4000ce55f32e69
BLAKE2b-256 e0ed23195ad71fa23164ba2af261c40f3e3e5576b96395b157fbf73330682936

See more details on using hashes here.

File details

Details for the file krasis-0.1.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for krasis-0.1.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5266c3e3ffb8067d644c6f0880608ab712452ddeb2c9ce4b4e51ac4bcde70208
MD5 fcea8bf9e0a6018100cf488d345c5944
BLAKE2b-256 ba0191cb552c1b82dbbfb0ce7093f75a76adcfc8ec4099498c5520f883c959d4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page