Hybrid LLM runtime — minimal VRAM, always-on GPU prefill, optimised CPU inference

Krasis

Rust + PyO3 runtime for large mixture-of-experts (MoE) LLMs. Runs 350B+ parameter models on commodity hardware with full GPU prefill and efficient CPU decode.

You can contact me here, but please don't ask for help getting Krasis working. If a model or a particular hardware configuration doesn't work, try to narrow the problem down and then report an issue.

Krasis runs MoE LLMs fast on consumer-level hardware

Krasis can run MoE language models that are far too large to fit in a consumer GPU (multi-hundred-gigabyte models with 100–500+ billion parameters) on consumer or accessible server hardware you can actually buy without a second mortgage and your own personal power station.

Crucially, it runs these models at a usable speed.

Qwen3-Coder-Next / 1,060 tok/s prefill / 14.8 tok/s decode

For example, running Qwen3-Coder-Next (80B params, 148 GB BF16) on a single-socket EPYC 7742 with 1x RTX 2000 Ada 16 GB, Krasis achieves 1,060 tokens/sec prefill and 14.8 tokens/sec decode.

How LLMs work

LLM inference consists of two key steps:

  1. Prefill (handling potentially large amounts of input coming into the model)
  2. Decode (handling the generation of text after processing the input data)

These are essentially the LLM reading (prefill) and writing (decode).

Prefill is best handled by the GPU (large amounts of highly parallel matrix multiplication), but on typical LLM runtimes it's not possible to do more than offload a small fraction of a large model onto the GPU.

The result is that you enter a simple chat prompt and it responds in a reasonable time, but if you hand it a file to read or try to work with it in an IDE, you wait minutes for it to even start generating text.

Krasis takes a different approach that uses the GPU and system RAM more heavily, which yields much faster prefill. In practice the model generates text at a similar speed (faster in some cases thanks to other optimisations), but you wait far less time for an answer, and the model can read files much more quickly.
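
A back-of-envelope sketch of why the split matters: decode must stream the active expert weights from RAM for every generated token, so it is bounded by memory bandwidth, while prefill amortises each weight read over many prompt tokens and is compute bound (hence GPU-friendly). All numbers below are illustrative assumptions, not Krasis measurements:

# Decode ceiling from memory bandwidth alone; every figure here is an assumption.
active_params = 3e9       # assume ~3B active parameters per token (MoE top-k routing)
bytes_per_param = 0.5     # INT4 quantisation: ~0.5 bytes per weight
ram_bandwidth = 170e9     # assume ~170 GB/s (8-channel DDR4-2666, rough peak)

bytes_per_token = active_params * bytes_per_param
print(f"decode ceiling ≈ {ram_bandwidth / bytes_per_token:.0f} tok/s")  # ~113 tok/s

# Prefill reads the same weights once per batch of prompt tokens, so it is
# compute bound instead — which is why pushing it to the GPU pays off.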

Krasis tradeoffs

In order to achieve these speeds, Krasis has a few requirements.

  • Krasis uses more system RAM than other runtimes: you may need 2x the model weights' worth of system RAM (so a 100 GB model may need 200 GB of system RAM), but this is almost always far more achievable than the equivalent VRAM.
  • Krasis must be given the BF16 safetensors model downloaded from [HuggingFace](https://huggingface.co/).
  • Krasis can build everything it needs from this model, or you can give it a second, GGUF model (in addition to the BF16 safetensors model) to take advantage of more advanced quantisation (e.g. Unsloth Q4_K models).
  • Krasis currently only works with NVIDIA GPUs.
  • Krasis may take some time on the first run as it does a lot of one-off optimisation work; major parts of this are cached, so later startups are generally much shorter.
  • Krasis caches optimised models in .krasis; these caches can be large, so you may need 3x the original model's disk space, or 4x if you also provide a GGUF (see the sizing sketch after this list).
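
A tiny sizing sketch based on the rules of thumb above (not an official Krasis tool, just the bullet points turned into arithmetic):

def estimate_requirements(bf16_size_gb, with_gguf=False):
    ram_gb = 2 * bf16_size_gb                          # ~2x weights in system RAM
    disk_gb = (4 if with_gguf else 3) * bf16_size_gb   # original model + caches
    return ram_gb, disk_gb

ram, disk = estimate_requirements(148)                 # Qwen3-Coder-Next, 148 GB BF16
print(f"~{ram:.0f} GB RAM, ~{disk:.0f} GB disk")       # ~296 GB RAM, ~444 GB disk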

Supported Models

| Model | Params | BF16 Size | Experts | Attention |
|---|---|---|---|---|
| Qwen3-Coder-Next | 80B | 148 GB | 512 routed, top-10 | Hybrid (36 linear + 12 GQA) |
| Qwen3-235B-A22B | 235B | 438 GB | 128 routed, top-8 | GQA |
| DeepSeek V2-Lite | 16B | 29 GB | 64 + 2 shared, top-6 | MLA |
| GLM-4.7 | 358B | 667 GB | 160 + 1 shared, top-8 | GQA (partial RoPE, bias) |

Benchmark: EPYC 7742 + 1x RTX 2000 Ada 16 GB

Hardware: AMD EPYC 7742 (64 cores, 4 NUMA nodes), DDR4-2666 8-channel, 1x NVIDIA RTX 2000 Ada 16 GB, PCIe 4.0 x8.

Config: BF16 attention, FP8 KV cache, INT8 shared/MLP/lm_head, LGS=2, 40 CPU threads, NUMA-aware thread pinning + interleaved allocation.

The benchmark uses 10K–50K-token prompts (prefill) and 64-token generation runs (decode). Prefill speed is the best of 20K/35K/50K-token runs; decode is the average of 3 runs with different prompts.
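
As a sanity check, the TTFT column in the table below follows directly from prompt length divided by prefill rate:

prompt_tokens = 20_000
prefill_tok_s = 1_060                                     # Qwen3-Coder-Next, INT4
print(f"TTFT ≈ {prompt_tokens / prefill_tok_s:.1f} s")    # ≈ 18.9 s, matching the table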

| Model | Expert Quant | Prefill (tok/s) | TTFT @ 20K | Decode (tok/s) | ms/tok |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | INT4 GPU + INT4 CPU | 1,060 | 18.9 s | 14.84 | 67.6 |
| Qwen3-Coder-Next | INT8 GPU + INT8 CPU | 873 | 40.1 s | 12.41 | 80.6 |
| DeepSeek V2-Lite | INT4 GPU + INT4 CPU | 1,477 | 13.6 s | 20.18 | 49.7 |
| DeepSeek V2-Lite | INT8 GPU + INT8 CPU | 1,317 | 15.2 s | 17.84 | 56.2 |

INT4 experts give ~20% faster decode and ~20% faster prefill than INT8 due to halved memory bandwidth requirements. INT4 quantization quality is validated in the perplexity table below.
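
The ~20% figure is simply the ratio of the table rows above:

print(f"prefill: {1060 / 873:.2f}x")     # ≈ 1.21x (Qwen3-Coder-Next, INT4 vs INT8)
print(f"decode:  {14.84 / 12.41:.2f}x")  # ≈ 1.20x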

Perplexity (Quantization Quality)

Measured with INT4 GPU + INT4 CPU experts, BF16 attention, INT8 shared/MLP/lm_head, FP8 KV cache. Sliding window (2048 tokens, stride 1024), GPU Marlin prefill.

| Model | Dataset | Tokens | PPL | BPC | Throughput |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | WikiText-2 | 299K | 10.64 | 3.41 | 121 tok/s |
| Qwen3-Coder-Next | C4 validation | 500K | 12.44 | 3.64 | 123 tok/s |
| DeepSeek V2-Lite | WikiText-2 | 307K | 6.03 | 2.59 | 593 tok/s |
| DeepSeek V2-Lite | C4 validation | 500K | 9.22 | 3.20 | 573 tok/s |
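
For reference, a minimal sketch of the sliding-window evaluation described above (window 2048, stride 1024). model.token_logprobs is a hypothetical scoring call standing in for whatever harness you use:

import math

def sliding_window_ppl(model, tokens, window=2048, stride=1024):
    nll, count = 0.0, 0
    for start in range(0, max(len(tokens) - window + 1, 1), stride):
        chunk = tokens[start:start + window]
        logprobs = model.token_logprobs(chunk)       # hypothetical: log p per token
        new = len(chunk) if start == 0 else stride   # score only the unseen tail
        nll -= sum(logprobs[-new:])
        count += new
    return math.exp(nll / count)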

Quick Start

Install

# Update APT
sudo apt update   # Ubuntu/Debian

# Install pipx if you don't have it
sudo apt install pipx   # Ubuntu/Debian
# or: pip install --user pipx

# Install Krasis
pipx install krasis
pipx ensurepath        # adds ~/.local/bin to PATH (restart terminal or source ~/.bashrc)

# Run setup — installs CUDA toolkit, PyTorch, FlashInfer, ninja
# (will prompt for your password when installing system packages)
krasis-setup

Download a model

# Install huggingface-cli if you don't have it
pip install huggingface-hub

# Download a model into ~/.krasis/models/
huggingface-cli download Qwen/Qwen3-Coder-Next \
    --local-dir ~/.krasis/models/Qwen3-Coder-Next

Run

krasis

That's it. The launcher walks you through model selection and configuration. First run takes longer as Krasis builds optimised weight caches.

WSL (Windows Subsystem for Linux)

Krasis works on WSL2. By default WSL only uses 50% of your system RAM, which is usually not enough for large models. Create or edit C:\Users\<YourUsername>\.wslconfig:

[wsl2]
memory=120GB

Adjust the value to leave ~8 GB for Windows. Then restart WSL from PowerShell:

wsl --shutdown

Then follow the install steps above inside WSL.

Alternative: pip in a venv

python3 -m venv ~/.krasis-env && source ~/.krasis-env/bin/activate
pip install krasis
krasis-setup

Alternative: from source

git clone https://github.com/brontoguana/krasis.git
cd krasis
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
krasis-setup
./krasis

Usage

Interactive Launcher

krasis

The launcher walks you through a TUI with four screens:

  1. Model selection — scans ~/.krasis/models/ for safetensors models, shows architecture, layer count, expert count, and estimated RAM
  2. CPU expert source — build INT4 or INT8 from the native model, or select an existing GGUF file
  3. GPU selection — multi-select your GPUs (Space to toggle, Enter to confirm)
  4. Configuration editor — tune all quantization and runtime options with a live VRAM budget display showing per-GPU memory usage and estimated context length

All settings are saved to ~/.krasis/config and reloaded on subsequent launches.

On the final screen you can choose to launch immediately or run a benchmark first.

Non-Interactive Launch

# Use saved config from last TUI session
krasis --non-interactive

# Override specific settings
krasis --non-interactive --model-path /path/to/model --num-gpus 2 --benchmark

Benchmark Suite

Run all model × config combinations automatically from a single config file. Edit benchmarks/benchmark_suite.toml to define which models and hardware configurations to test:

[[config]]
num_gpus = 1
gpu_expert_bits = 4
cpu_expert_bits = 4

[[config]]
num_gpus = 2
gpu_expert_bits = 4
cpu_expert_bits = 4

[[model]]
name = "DeepSeek-V2-Lite"

[[model]]
name = "Qwen3-235B-A22B"
gguf_name = "Qwen3-235B-A22B-GGUF"   # searched in ~/.krasis/models/ subdirs

Model name is the directory name under ~/.krasis/models/. Use gguf_name to pair a native model with a GGUF for CPU experts (filename searched in models dir), or gguf_path for an absolute path. Config fields include num_gpus, gpu_expert_bits, cpu_expert_bits, attention_quant, kv_dtype, and more — see the config file comments for the full list.

Run the suite:

krasis --benchmark-suite                           # uses benchmarks/benchmark_suite.toml
krasis --benchmark-suite /path/to/custom.toml      # custom config

Each combination runs as an isolated subprocess. Per-combo logs are saved to benchmarks/suite_logs/ and a markdown summary table is generated at the end.

For launcher flags, per-component quantization options, and direct server usage, see ADVANCED.md.

Chat Client

krasis-chat                          # auto-discovers running servers
krasis-chat --port 8012              # connect to specific port
krasis-chat --url http://host:8012   # connect to remote server
krasis-chat --temperature 0.3        # override sampling temperature

The chat client auto-discovers running Krasis servers via ~/.krasis/servers/. Commands: /new (clear history), /system PROMPT (change system prompt), /exit.

API

The server exposes an OpenAI-compatible API at http://localhost:8012/v1/chat/completions with SSE streaming, compatible with Cursor, OpenCode, and any OpenAI SDK client.

Additional endpoints:

  • GET /health — server status
  • GET /v1/models — list loaded models
  • POST /v1/timing — toggle instrumentation at runtime
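
For example, streaming a completion with the OpenAI Python SDK (pip install openai). The model name below is a placeholder; GET /v1/models returns the real one:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8012/v1", api_key="unused")

stream = client.chat.completions.create(
    model="krasis",    # placeholder; use the name returned by /v1/models
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,       # SSE streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)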

License

SSPL-1.0
