Ollama-style GGUF runner with SSD-backed llama.cpp KV cache
DiskLLM
Run a 7B LLM in 258 MB of private RAM at 65K context, where stock llama.cpp needs about 6,000 MB. DiskLLM keeps the KV cache on your SSD instead.
DiskLLM is an Ollama-style Python CLI and patched llama.cpp backend that keeps the KV cache on SSD through mmap instead of allocating the full cache in private RAM. It launches an OpenAI-compatible llama-server, downloads GGUF models from HuggingFace, and persists context sessions across restarts.
Results
| Model | Context | Stock RAM | DiskLLM RAM | Reduction | tok/s |
|---|---|---|---|---|---|
| Qwen2.5 3B | 65K | ~2,000 MB | ~200 MB | 10x | 8.9 |
| Qwen2.5 7B | 65K | ~6,000 MB | 258 MB | 23x | 4.5 |
| Qwen2.5 7B | 256K | 14,900 MB | 258 MB | 57x | 2.5 |
| LFM2.5 8B | 128K | 1,845 MB | 172 MB | 10.7x | 10.0 |
| Qwen2.5 14B | 65K | ~10,000 MB | 289 MB | 34x | 1.7 |
| Qwen2.5 32B | 65K | ~20,000 MB | 424 MB | 47x | 0.1 |
| Metric | Value |
|---|---|
| Max RAM reduction | 57x (256K context) |
| Min private RAM | 172 MB (LFM2.5 8B) |
| Best tok/s | 10.0 (LFM2.5 8B) |
| Largest model run | 32B on 2.5GB free RAM |
| KV cache on SSD | 112 MB (vs GBs in RAM) |
| Session file size | 7.33 MB (persists across restarts) |
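The Reduction column reads as stock RAM divided by DiskLLM RAM. The sketch below recomputes it for three rows copied from the table; for these rows, truncating the ratio reproduces the published figure. That rounding rule is an inference from the numbers, not something the project states.

```python
import math

# (model, context, stock MB, DiskLLM MB) copied from the results table.
ROWS = [
    ("Qwen2.5 7B", "65K", 6000, 258),
    ("Qwen2.5 7B", "256K", 14900, 258),
    ("Qwen2.5 32B", "65K", 20000, 424),
]


def reduction(stock_mb: float, diskllm_mb: float) -> float:
    """Return how many times smaller the DiskLLM footprint is."""
    return stock_mb / diskllm_mb


for model, ctx, stock, disk in ROWS:
    ratio = reduction(stock, disk)
    # Truncation (not rounding) is an assumption that happens to fit these rows.
    print(f"{model} @ {ctx}: {ratio:.2f}x, truncated to {math.floor(ratio)}x")
```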
Quick Start
pip install -e .
diskllm pull qwen2.5:7b
diskllm run qwen2.5:7b --session demo --headless
diskllm chat qwen2.5:7b --session demo --no-start
The server exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1.
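When the server is started in the background, a client usually has to wait until the model has loaded before sending requests. The sketch below polls `GET /health`, which upstream llama-server answers with HTTP 200 once it is ready; it assumes the patched server keeps that endpoint. The function names and the injectable `probe` parameter are illustrative and not part of DiskLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(base_url: str = "http://127.0.0.1:8080") -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while loading.
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = server_is_healthy,
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
) -> bool:
    """Call probe() repeatedly until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_healthy()` after `diskllm run ... --headless` and before the first chat request is one way to avoid connection errors during model load.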
How It Works
Stock llama.cpp allocates KV tensors in RAM. At long context, that KV allocation can dominate system memory even when model weights are memory-mapped.
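To see why the KV cache dominates at long context, it helps to estimate its size. The sketch below uses the standard formula for a dense transformer cache: two tensors (K and V) per layer, each holding one vector per KV head per token. The model dimensions in the example are illustrative values for a 7B-class model with grouped-query attention; they are assumptions, so check the GGUF metadata of the model you actually run.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: int = 2,  # 2 bytes per element for an f16 cache
) -> int:
    """Estimate the size of a dense KV cache in bytes."""
    # Factor 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed dimensions: 28 layers, 4 KV heads, head size 128, 65,536-token context.
size = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_ctx=65536)
print(f"{size / 1024**2:.0f} MiB")  # prints "3584 MiB"
```

The estimate grows linearly with context length, so quadrupling the context quadruples the cache while the weights stay the same size.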
DiskLLM adds a patched llama.cpp --kv-backend ssd mode. The backend creates a CPU-addressable mmap file under ~/.diskllm/kv_cache, keeps only the active attention window hot, and lets the OS page the rest through NVMe storage. The patch is wired through dense KV, SWA, hybrid, and iSWA cache paths.
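The underlying OS mechanism can be illustrated in a few lines. The sketch below is not the patch itself, which lives inside llama.cpp; it only shows how a file-backed mmap gives addressable storage whose pages the OS loads and evicts on demand. The class name, record size, and file name are hypothetical.

```python
import mmap
import os
import tempfile

BYTES_PER_TOKEN = 64  # hypothetical record size, far smaller than a real KV entry


class DiskBackedCache:
    """Fixed-size per-token records stored in a memory-mapped file."""

    def __init__(self, path: str, n_tokens: int, record_size: int = BYTES_PER_TOKEN):
        self.record_size = record_size
        self.n_tokens = n_tokens
        # Reuse an existing file so contents survive a restart.
        mode = "r+b" if os.path.exists(path) else "w+b"
        self._file = open(path, mode)
        self._file.truncate(n_tokens * record_size)
        self._map = mmap.mmap(self._file.fileno(), n_tokens * record_size)

    def _offset(self, pos: int) -> int:
        if not 0 <= pos < self.n_tokens:
            raise IndexError(f"token position {pos} out of range")
        return pos * self.record_size

    def write(self, pos: int, record: bytes) -> None:
        if len(record) != self.record_size:
            raise ValueError("record has the wrong size")
        off = self._offset(pos)
        self._map[off:off + self.record_size] = record

    def read(self, pos: int) -> bytes:
        off = self._offset(pos)
        return self._map[off:off + self.record_size]

    def close(self) -> None:
        self._map.flush()  # push dirty pages to the file
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kv.bin")
    cache = DiskBackedCache(path, n_tokens=1024)
    cache.write(1000, b"\x07" * BYTES_PER_TOKEN)
    cache.close()
    # Reopening the same file sees the record written earlier.
    reopened = DiskBackedCache(path, n_tokens=1024)
    restored = reopened.read(1000)
    reopened.close()
```

Only the pages that are actually touched need to be resident, which is why a small hot window can coexist with a much larger cache file.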
diskllm run always starts the patched server with:
-ngl 0
--kv-backend ssd
--kv-path ~/.diskllm/kv_cache
--kv-window 2048
--host 0.0.0.0
--port 8080
--no-repack
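A launcher could assemble these flags into an argument list along the following lines. This is a sketch, not DiskLLM's actual code: the function name is hypothetical, and `-m` is upstream llama-server's model flag, which the list above does not show but a real launch needs.

```python
from pathlib import Path


def build_server_args(binary: str, model_path: str, home: Path) -> list[str]:
    """Return an argv list for the patched llama-server."""
    return [
        binary,
        "-m", model_path,                      # assumption: upstream model flag
        "-ngl", "0",                           # offload no layers to the GPU
        "--kv-backend", "ssd",
        "--kv-path", str(home / "kv_cache"),
        "--kv-window", "2048",
        "--host", "0.0.0.0",                   # listen on all interfaces
        "--port", "8080",
        "--no-repack",
    ]


args = build_server_args(
    "llama-server", "model.gguf", Path.home() / ".diskllm"
)
print(" ".join(args))
```

Passing the list to `subprocess.Popen(args)` avoids shell quoting problems with paths that contain spaces. Because `--host 0.0.0.0` listens on every network interface, the server is also reachable from other machines on the same network unless a firewall blocks the port.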
When --session NAME is used, DiskLLM also passes --slot-save-path ~/.diskllm/sessions and uses llama-server slot save/restore APIs to persist context as ~/.diskllm/sessions/<name>.mmap.
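Upstream llama-server exposes slot persistence as `POST /slots/{id}?action=save` and `?action=restore`, with the target file name in the JSON body. The sketch below builds those requests without sending them. It assumes the patched server keeps the upstream endpoint shape and that the session lives in slot 0; the helper name is hypothetical, and the source does not show how DiskLLM issues these calls internally.

```python
import json
import urllib.request


def slot_request(
    action: str,
    session: str,
    slot_id: int = 0,  # assumption: single-slot server
    base_url: str = "http://127.0.0.1:8080",
) -> urllib.request.Request:
    """Build a slot save or restore request for a named session."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    # The file name is resolved relative to --slot-save-path on the server.
    body = json.dumps({"filename": f"{session}.mmap"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = slot_request("save", "demo")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` only works while the server is running with `--slot-save-path` set.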
Read more in docs/how-it-works.md.
Benchmarks
Benchmarks were run CPU-only on Windows 11 with an Intel i5-12450H, 16 GB of RAM, and an NVMe SSD.
| Test | Result |
|---|---|
| Qwen2.5 7B, 65K context, headless startup | 258 MB private RAM |
| Qwen2.5 7B, restored session startup | 3.46s to healthy plus restore |
| Qwen2.5 7B, restored prompt reuse | 34 cached prompt tokens |
| Qwen2.5 7B, restored generation | 7.13 tok/s |
See docs/benchmarks.md for the full tables and hardware notes.
Model Registry
qwen2.5:3b bartowski/Qwen2.5-3B-Instruct-GGUF Q4_K_M
qwen2.5:7b bartowski/Qwen2.5-7B-Instruct-GGUF Q4_K_M
qwen2.5:14b bartowski/Qwen2.5-14B-Instruct-GGUF Q4_K_M
lfm2.5:8b LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M
DiskLLM can also pull any HuggingFace GGUF repo. It selects a quant automatically unless you name one after a colon:
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M
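Because registry aliases such as `qwen2.5:7b` also contain a colon, a resolver has to check the registry before splitting off a quant suffix. The sketch below shows one way to do that. The function and type names are hypothetical, and a missing quant is returned as `None` because the automatic selection policy is not described here.

```python
from typing import NamedTuple, Optional

# Aliases copied from the model registry table.
REGISTRY = {
    "qwen2.5:3b": ("bartowski/Qwen2.5-3B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:7b": ("bartowski/Qwen2.5-7B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:14b": ("bartowski/Qwen2.5-14B-Instruct-GGUF", "Q4_K_M"),
    "lfm2.5:8b": ("LiquidAI/LFM2.5-8B-A1B-GGUF", "Q4_K_M"),
}


class PullTarget(NamedTuple):
    repo: str
    quant: Optional[str]  # None means "let the tool choose"


def resolve_pull_spec(spec: str) -> PullTarget:
    """Map an alias or an 'owner/repo[:QUANT]' string to a repo and quant."""
    if spec in REGISTRY:
        return PullTarget(*REGISTRY[spec])
    if "/" not in spec:
        raise ValueError(f"unknown model alias: {spec!r}")
    repo, sep, quant = spec.partition(":")
    return PullTarget(repo, quant if sep and quant else None)


print(resolve_pull_spec("qwen2.5:7b"))
print(resolve_pull_spec("bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M"))
```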
Storage
~/.diskllm/models/ downloaded GGUF files
~/.diskllm/kv_cache/ SSD KV mmap files
~/.diskllm/sessions/ persistent slot KV sessions
~/.diskllm/config.json settings and installed model index
~/.diskllm/logs/ background server logs
~/.diskllm/bin/ patched llama-server binary
Override the home directory with DISKLLM_HOME.
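A tool that needs to locate these directories can resolve them the same way. The sketch below assumes `DISKLLM_HOME` replaces `~/.diskllm` itself rather than its parent directory; that reading, and the helper names, are assumptions.

```python
import os
from pathlib import Path

SUBDIRS = ("models", "kv_cache", "sessions", "logs", "bin")


def diskllm_home() -> Path:
    """Return DISKLLM_HOME if set, otherwise ~/.diskllm."""
    override = os.environ.get("DISKLLM_HOME")
    return Path(override).expanduser() if override else Path.home() / ".diskllm"


def layout() -> dict[str, Path]:
    """Map each storage role to its path under the DiskLLM home."""
    home = diskllm_home()
    paths = {name: home / name for name in SUBDIRS}
    paths["config"] = home / "config.json"
    return paths


for role, path in layout().items():
    print(f"{role:10} {path}")
```

Pointing `DISKLLM_HOME` at a directory on the fastest available drive is a natural choice, since the KV cache files live under it.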
Patched llama.cpp
The llama.cpp/ directory is tracked as a git submodule containing the patched fork. DiskLLM finds the patched server binary in this order:
1. ~/.diskllm/bin/llama-server.exe
2. DISKLLM_LLAMA_SERVER
3. llama_server in ~/.diskllm/config.json
4. bundled package path diskllm/bin/windows/llama-server.exe
5. development checkout path llama.cpp/build/bin/llama-server.exe
If a binary is found outside ~/.diskllm, DiskLLM copies it into ~/.diskllm/bin and runs that copy.
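The lookup and copy steps can be sketched as follows. This is an illustration rather than DiskLLM's code: it assumes `DISKLLM_LLAMA_SERVER` is an environment variable and `llama_server` is a top-level key in `config.json`, and both function names are hypothetical.

```python
import json
import os
import shutil
from pathlib import Path
from typing import Optional

EXE = "llama-server.exe"


def find_llama_server(home: Path, package_dir: Path, repo_dir: Path) -> Optional[Path]:
    """Return the first existing server binary, following the documented order."""
    candidates = [home / "bin" / EXE]
    env_path = os.environ.get("DISKLLM_LLAMA_SERVER")  # assumed env variable
    if env_path:
        candidates.append(Path(env_path))
    config = home / "config.json"
    if config.is_file():
        configured = json.loads(config.read_text(encoding="utf-8")).get("llama_server")
        if configured:
            candidates.append(Path(configured))
    candidates.append(package_dir / "bin" / "windows" / EXE)
    candidates.append(repo_dir / "llama.cpp" / "build" / "bin" / EXE)
    return next((c for c in candidates if c.is_file()), None)


def ensure_local_copy(found: Path, home: Path) -> Path:
    """Copy a binary found elsewhere into <home>/bin and return that copy."""
    target = home / "bin" / found.name
    if found.resolve() != target.resolve():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(found, target)
    return target
```

Running from the copy means a later rebuild of the development checkout does not change the binary in use until the copy is refreshed.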
OpenAI Client
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="diskllm")
reply = client.chat.completions.create(
    model="diskllm",
    messages=[{"role": "user", "content": "What is DiskLLM?"}],
)
print(reply.choices[0].message.content)
More examples are in examples/.
Contributing
Contributions are welcome. Start with CONTRIBUTING.md, keep changes small, and include benchmark or test evidence for runtime changes.
License
DiskLLM is released under the MIT License. See LICENSE.