Ollama-style GGUF runner with SSD-backed llama.cpp KV cache

DiskLLM

Run a 7B LLM at 65K context in 258 MB of private RAM, where stock llama.cpp needs about 6,000 MB. DiskLLM keeps the KV cache on your SSD instead.

[Figure: DiskLLM architecture]

DiskLLM is an Ollama-style Python CLI and patched llama.cpp backend that keeps the KV cache on SSD through mmap instead of allocating the full cache in private RAM. It launches an OpenAI-compatible llama-server, downloads GGUF models from HuggingFace, and persists context sessions across restarts.

Results

Model         Context  Stock RAM   DiskLLM RAM  Reduction  tok/s
Qwen2.5 3B    65K      ~2,000 MB   ~200 MB      10x        8.9
Qwen2.5 7B    65K      ~6,000 MB   258 MB       23x        4.5
Qwen2.5 7B    256K     14,900 MB   258 MB       57x        2.5
LFM2.5 8B     128K     1,845 MB    172 MB       10.7x      10.0
Qwen2.5 14B   65K      ~10,000 MB  289 MB       34x        1.7
Qwen2.5 32B   65K      ~20,000 MB  424 MB       47x        0.1

Metric              Value
Max RAM reduction   57x (256K context)
Min private RAM     172 MB (LFM2.5 8B)
Best tok/s          10.0 (LFM2.5 8B)
Largest model run   32B on 2.5 GB free RAM
KV cache on SSD     112 MB (vs GBs in RAM)
Session file size   7.33 MB (persists across restarts)
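
The Reduction column is the ratio of stock RAM to DiskLLM RAM. The short check below recomputes it from the figures in the results table. Values marked ~ are approximate, so the ratios are too, and the table appears to truncate rather than round for most rows.

```python
# (model, context, stock_mb, diskllm_mb) copied from the results table.
ROWS = [
    ("Qwen2.5 3B", "65K", 2000, 200),
    ("Qwen2.5 7B", "65K", 6000, 258),
    ("Qwen2.5 7B", "256K", 14900, 258),
    ("LFM2.5 8B", "128K", 1845, 172),
    ("Qwen2.5 14B", "65K", 10000, 289),
    ("Qwen2.5 32B", "65K", 20000, 424),
]


def reduction(stock_mb: float, diskllm_mb: float) -> float:
    """Return how many times smaller the DiskLLM footprint is."""
    return stock_mb / diskllm_mb


for model, ctx, stock, disk in ROWS:
    print(f"{model:12s} {ctx:>5s} {reduction(stock, disk):5.1f}x")
```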

Quick Start

pip install -e .
diskllm pull qwen2.5:7b
diskllm run qwen2.5:7b --session demo --headless
diskllm chat qwen2.5:7b --session demo --no-start

The server exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1.
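
Because the server is started in the background before a client connects, a script may want to wait until it is ready. The sketch below polls the GET /health endpoint that upstream llama-server provides. Whether the patched build behaves identically is an assumption, and the helper names probe_health and wait_until_healthy are hypothetical, not part of DiskLLM.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_healthy(base_url, deadline_s=60.0, interval_s=0.5, probe=probe_health):
    """Poll until the probe succeeds or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if probe(base_url):
            return True
        time.sleep(interval_s)
    return False


# Usage against a running server (not executed here):
# ready = wait_until_healthy("http://127.0.0.1:8080")
```

The probe is injectable so the polling loop can be exercised without a live server.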

How It Works

Stock llama.cpp allocates KV tensors in RAM. At long context, that KV allocation can dominate system memory even when model weights are memory-mapped.
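
To see why the KV allocation can dominate, the usual estimate multiplies two tensors (K and V) by layer count, KV heads, head dimension, context length, and element size. The sketch below applies that formula. The architecture numbers (28 layers, 4 KV heads, head dimension 128) and the 16-bit element size are assumptions chosen for illustration, not values taken from this project, and the result covers the KV tensors only, so it is not expected to match the measured RAM figures exactly.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: K and V each hold n_kv_heads * head_dim
    values per layer for every token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed model shape (hypothetical, for illustration only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128

for n_ctx in (65_536, 262_144):
    mib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx) / 2**20
    print(f"n_ctx={n_ctx:>7d}  KV cache ~ {mib:,.0f} MiB")
```

Under these assumptions the cache grows linearly with context length, which is why long contexts are the case where moving it out of private RAM matters most.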

DiskLLM adds a patched llama.cpp --kv-backend ssd mode. The backend creates a CPU-addressable mmap file under ~/.diskllm/kv_cache, keeps only the active attention window hot, and lets the OS page the rest through NVMe storage. The patch is wired through dense KV, SWA, hybrid, and iSWA cache paths.
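
The idea of a file-backed cache can be illustrated in a few lines of Python. This is a toy sketch of the general technique, not the patched C++ backend: the class name FileBackedKV, the row size, and the window width are all hypothetical. It shows only how token positions map to offsets in an mmap file; which pages stay resident is left to the operating system.

```python
import mmap
import os
import tempfile

BYTES_PER_TOKEN = 64  # hypothetical; real KV rows are far larger
N_CTX = 1024          # hypothetical context length
WINDOW = 128          # hypothetical hot window width


class FileBackedKV:
    """Fixed-size KV store addressed by token position, backed by a file."""

    def __init__(self, path, n_ctx, bytes_per_token):
        self.bytes_per_token = bytes_per_token
        size = n_ctx * bytes_per_token
        self._file = open(path, "w+b")
        self._file.truncate(size)  # reserve the full cache on disk
        self._map = mmap.mmap(self._file.fileno(), size)

    def write(self, pos, row):
        if len(row) != self.bytes_per_token:
            raise ValueError("row has the wrong size")
        start = pos * self.bytes_per_token
        self._map[start:start + self.bytes_per_token] = row

    def window(self, end_pos, width):
        """Return the rows for the last `width` positions before end_pos."""
        start_pos = max(0, end_pos - width)
        return self._map[start_pos * self.bytes_per_token:
                         end_pos * self.bytes_per_token]

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    kv = FileBackedKV(os.path.join(tmp, "kv.mmap"), N_CTX, BYTES_PER_TOKEN)
    for pos in range(300):
        kv.write(pos, bytes([pos % 256]) * BYTES_PER_TOKEN)
    hot = kv.window(end_pos=300, width=WINDOW)
    kv.close()

print(len(hot) // BYTES_PER_TOKEN, "tokens in the hot window")
```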

diskllm run always starts the patched server with:

-ngl 0
--kv-backend ssd
--kv-path ~/.diskllm/kv_cache
--kv-window 2048
--host 0.0.0.0
--port 8080
--no-repack
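
For scripting, the same flag set can be assembled as an argument list. This is a sketch only: the function build_server_args is hypothetical, -m is the model flag from upstream llama-server, and the --kv-* flags are the ones listed above, which exist only in the patched build. How DiskLLM itself constructs the command is not shown in this README.

```python
from pathlib import Path


def build_server_args(model_path: str, kv_path: str, host: str = "0.0.0.0",
                      port: int = 8080, kv_window: int = 2048) -> list[str]:
    """Return an argv list mirroring the flags that diskllm run passes."""
    return [
        "llama-server",            # assumed binary name on PATH
        "-m", model_path,          # upstream flag: model file to load
        "-ngl", "0",               # no layers offloaded to GPU
        "--kv-backend", "ssd",     # patched flag: file-backed KV cache
        "--kv-path", kv_path,
        "--kv-window", str(kv_window),
        "--host", host,
        "--port", str(port),
        "--no-repack",
    ]


args = build_server_args("model.gguf", str(Path.home() / ".diskllm" / "kv_cache"))
print(" ".join(args))
```

A list like this can be handed to subprocess.Popen without shell quoting concerns.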

When --session NAME is used, DiskLLM also passes --slot-save-path ~/.diskllm/sessions and uses llama-server slot save/restore APIs to persist context as ~/.diskllm/sessions/<name>.mmap.
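
Upstream llama-server exposes slot persistence as POST /slots/{id}?action=save and ?action=restore with a JSON body naming the file, relative to the --slot-save-path directory. The sketch below builds such a request without sending it. The slot id 0, the helper name slot_request, and the assumption that DiskLLM calls the endpoint in exactly this form are illustrative; the file name follows the <name>.mmap pattern described above.

```python
import json
import urllib.request


def slot_request(base_url: str, slot_id: int, action: str, filename: str):
    """Build a POST request that saves or restores one server slot."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = slot_request("http://127.0.0.1:8080", 0, "save", "demo.mmap")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running server.
```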

Read more in docs/how-it-works.md.

Benchmarks

Benchmarks were run CPU-only on Windows 11 with an Intel i5-12450H, 16 GB RAM, and an NVMe SSD.

Test                                          Result
Qwen2.5 7B, 65K context, headless startup     258 MB private RAM
Qwen2.5 7B, restored session startup          3.46s to healthy plus restore
Qwen2.5 7B, restored prompt reuse             34 cached prompt tokens
Qwen2.5 7B, restored generation               7.13 tok/s

See docs/benchmarks.md for the full tables and hardware notes.

Model Registry

qwen2.5:3b   bartowski/Qwen2.5-3B-Instruct-GGUF Q4_K_M
qwen2.5:7b   bartowski/Qwen2.5-7B-Instruct-GGUF Q4_K_M
qwen2.5:14b  bartowski/Qwen2.5-14B-Instruct-GGUF Q4_K_M
lfm2.5:8b    LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M

DiskLLM can also pull any HuggingFace GGUF repo and select a quant automatically:

diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M
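
Registry aliases and repo specs both use a colon, so a resolver has to tell them apart. One plausible approach, sketched below, treats anything containing a slash as a HuggingFace repo with an optional :QUANT suffix and everything else as a registry alias. The function resolve_model is hypothetical and DiskLLM's actual parsing may differ; the registry entries are copied from the list above, and None stands for "select a quant automatically".

```python
from typing import Optional

REGISTRY = {
    "qwen2.5:3b": ("bartowski/Qwen2.5-3B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:7b": ("bartowski/Qwen2.5-7B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:14b": ("bartowski/Qwen2.5-14B-Instruct-GGUF", "Q4_K_M"),
    "lfm2.5:8b": ("LiquidAI/LFM2.5-8B-A1B-GGUF", "Q4_K_M"),
}


def resolve_model(spec: str) -> tuple[str, Optional[str]]:
    """Map a model spec to (repo, quant); quant is None when unspecified."""
    if "/" in spec:
        # owner/repo or owner/repo:QUANT
        repo, _, quant = spec.partition(":")
        return repo, (quant or None)
    try:
        return REGISTRY[spec.lower()]
    except KeyError:
        raise ValueError(f"unknown model alias: {spec!r}") from None


for spec in ("qwen2.5:7b",
             "bartowski/Qwen2.5-7B-Instruct-GGUF",
             "bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M"):
    print(spec, "->", resolve_model(spec))
```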

Storage

~/.diskllm/models/      downloaded GGUF files
~/.diskllm/kv_cache/    SSD KV mmap files
~/.diskllm/sessions/    persistent slot KV sessions
~/.diskllm/config.json  settings and installed model index
~/.diskllm/logs/        background server logs
~/.diskllm/bin/         patched llama-server binary

Override the home directory with DISKLLM_HOME.
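
A tool that needs to locate these directories can apply the same override rule. The sketch below is an assumption about how the lookup could be written, with hypothetical helper names diskllm_home and layout; the directory names come from the list above.

```python
import os
from pathlib import Path


def diskllm_home(env=None) -> Path:
    """Return DISKLLM_HOME if set, else ~/.diskllm."""
    env = os.environ if env is None else env
    override = env.get("DISKLLM_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".diskllm"


def layout(home: Path) -> dict:
    """Return the documented storage locations under the given home."""
    return {
        "models": home / "models",
        "kv_cache": home / "kv_cache",
        "sessions": home / "sessions",
        "config": home / "config.json",
        "logs": home / "logs",
        "bin": home / "bin",
    }


print(diskllm_home({"DISKLLM_HOME": "/data/diskllm"}))
```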

Patched llama.cpp

The llama.cpp/ directory is tracked as a git submodule containing the patched fork. DiskLLM finds the patched server binary in this order:

  1. ~/.diskllm/bin/llama-server.exe
  2. DISKLLM_LLAMA_SERVER
  3. llama_server in ~/.diskllm/config.json
  4. bundled package path diskllm/bin/windows/llama-server.exe
  5. development checkout path llama.cpp/build/bin/llama-server.exe

If a binary is found outside ~/.diskllm, DiskLLM copies it into ~/.diskllm/bin and runs that copy.
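
The lookup order and the copy step can be sketched as follows. This is an illustration of the documented behaviour, not DiskLLM's code: the function names are hypothetical, the environment and config are passed in as plain dicts, and "outside ~/.diskllm" is simplified to "not already the target file".

```python
import shutil
import tempfile
from pathlib import Path

EXE = "llama-server.exe"


def candidate_paths(home, env, config, package_dir, checkout_dir):
    """Yield candidate binaries in the documented lookup order."""
    yield home / "bin" / EXE                                   # 1
    if env.get("DISKLLM_LLAMA_SERVER"):                        # 2
        yield Path(env["DISKLLM_LLAMA_SERVER"])
    if config.get("llama_server"):                             # 3
        yield Path(config["llama_server"])
    yield package_dir / "diskllm" / "bin" / "windows" / EXE    # 4
    yield checkout_dir / "llama.cpp" / "build" / "bin" / EXE   # 5


def find_server(home, env, config, package_dir, checkout_dir):
    """Return the binary under home/bin, copying the first match there."""
    target = home / "bin" / EXE
    for path in candidate_paths(home, env, config, package_dir, checkout_dir):
        if not path.is_file():
            continue
        if path.resolve() != target.resolve():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
        return target
    raise FileNotFoundError("no patched llama-server binary found")


# Demo with a stub file standing in for the real binary.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stub = root / "elsewhere" / EXE
    stub.parent.mkdir()
    stub.write_bytes(b"stub")
    home = root / "home"
    found = find_server(home, {"DISKLLM_LLAMA_SERVER": str(stub)}, {},
                        root / "pkg", root / "dev")
    found_rel = found.relative_to(home)
    copied_ok = found.read_bytes() == b"stub"

print(found_rel, copied_ok)
```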

OpenAI Client

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="diskllm")
reply = client.chat.completions.create(
    model="diskllm",
    messages=[{"role": "user", "content": "What is DiskLLM?"}],
)
print(reply.choices[0].message.content)

More examples are in examples/.

Contributing

Contributions are welcome. Start with CONTRIBUTING.md, keep changes small, and include benchmark or test evidence for runtime changes.

License

DiskLLM is released under the MIT License. See LICENSE.
