Ollama-style GGUF runner with SSD-backed llama.cpp KV cache

DiskLLM

Run a 7B LLM at 65K context in 258 MB of private RAM. Stock llama.cpp needs about 6,000 MB for the same setup. DiskLLM keeps the KV cache on your SSD instead.

[Figure: DiskLLM architecture]

DiskLLM is an Ollama-style Python CLI and patched llama.cpp backend that keeps the KV cache on SSD through mmap instead of allocating the full cache in private RAM. It launches an OpenAI-compatible llama-server, downloads GGUF models from Ollama or HuggingFace, and persists context sessions across restarts.

Results

Model        Context  Stock RAM   DiskLLM RAM  Reduction  tok/s
Qwen2.5 3B   65K      ~2,000 MB   ~200 MB      10x        8.9
Qwen2.5 7B   65K      ~6,000 MB   258 MB       23x        4.5
Qwen2.5 7B   256K     14,900 MB   258 MB       57x        2.5
LFM2.5 8B    128K     1,845 MB    172 MB       10.7x      10.0
Qwen2.5 14B  65K      ~10,000 MB  289 MB       34x        1.7
Qwen2.5 32B  65K      ~20,000 MB  424 MB       47x        0.1

Metric             Value
Max RAM reduction  57x (256K context)
Min private RAM    172 MB (LFM2.5 8B)
Best tok/s         10.0 (LFM2.5 8B)
Largest model run  32B on 2.5 GB free RAM
KV cache on SSD    112 MB (vs GBs in RAM)
Session file size  7.33 MB (persists across restarts)
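
The Reduction column is the stock RAM figure divided by the DiskLLM RAM figure. The short sketch below recomputes it from the numbers in the results table; the dictionary keys are labels made up for this example, and the "~" values are treated as exact for the arithmetic. The table appears to round most ratios down to a whole number.

```python
# (stock MB, DiskLLM MB) pairs copied from the results table.
# Values marked "~" in the table are approximate; they are used as-is here.
rows = {
    "Qwen2.5 3B @ 65K": (2000, 200),
    "Qwen2.5 7B @ 65K": (6000, 258),
    "Qwen2.5 7B @ 256K": (14900, 258),
    "LFM2.5 8B @ 128K": (1845, 172),
    "Qwen2.5 14B @ 65K": (10000, 289),
    "Qwen2.5 32B @ 65K": (20000, 424),
}


def reduction(stock_mb, diskllm_mb):
    """Return how many times smaller the DiskLLM footprint is."""
    return stock_mb / diskllm_mb


ratios = {name: reduction(*pair) for name, pair in rows.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```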

Quick Start

pip install diskllm
diskllm pull qwen2.5:7b
diskllm run qwen2.5:7b --session demo --headless
diskllm chat qwen2.5:7b --session demo --no-start

The server exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1.

How It Works

Stock llama.cpp allocates KV tensors in RAM. At long context, that KV allocation can dominate system memory even when model weights are memory-mapped.
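
To see why the KV cache dominates at long context, here is a rough size estimate. It is a sketch under stated assumptions: the layer count, KV head count, and head dimension below are taken from the publicly documented Qwen2.5 7B configuration rather than from this project, and the cache is assumed to be stored as 16-bit floats. The 256K result lands in the same range as the 14,900 MB stock figure in the results table.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate the size of a dense KV cache in bytes.

    The leading factor of 2 accounts for one K and one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed Qwen2.5 7B shape: 28 layers, 4 KV heads, head dim 128, f16 cache.
ASSUMED_SHAPE = dict(n_layers=28, n_kv_heads=4, head_dim=128)

for n_ctx in (65536, 262144):
    size_mib = kv_cache_bytes(n_ctx=n_ctx, **ASSUMED_SHAPE) / 2**20
    print(f"context {n_ctx}: about {size_mib:.0f} MiB of KV cache")
```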

DiskLLM adds a patched llama.cpp --kv-backend ssd mode. The backend creates a CPU-addressable mmap file under ~/.diskllm/kv_cache, keeps only the active attention window hot, and lets the OS page the rest through NVMe storage. The patch is wired through dense KV, SWA, hybrid, and iSWA cache paths.
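
The general idea of a file-backed, CPU-addressable buffer can be illustrated with Python's standard mmap module. This is only a toy sketch of the technique, not the patch itself: the real backend lives in the patched llama.cpp C++ code, and the slot size, context length, and file name here are hypothetical. The point is that the buffer is addressed like memory while the OS decides which pages stay resident.

```python
import mmap
import os
import shutil
import tempfile

BYTES_PER_TOKEN = 64  # hypothetical slot size, tiny for the demo
N_CTX = 1024          # hypothetical context length


def open_kv_file(path, size):
    """Create a file of the given size and map it into memory."""
    f = open(path, "w+b")
    f.truncate(size)  # extend the file so the whole range is mappable
    f.flush()
    return f, mmap.mmap(f.fileno(), size)


def write_token(buf, pos, payload):
    """Store one token's KV bytes at its slot in the mapped file."""
    start = pos * BYTES_PER_TOKEN
    buf[start:start + BYTES_PER_TOKEN] = payload


def read_token(buf, pos):
    """Read one token's KV bytes back; the OS pages it in if needed."""
    start = pos * BYTES_PER_TOKEN
    return bytes(buf[start:start + BYTES_PER_TOKEN])


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "kv_demo.bin")

f, buf = open_kv_file(path, N_CTX * BYTES_PER_TOKEN)
write_token(buf, 1000, b"\x01" * BYTES_PER_TOKEN)
buf.flush()  # push dirty pages to the backing file
roundtrip = read_token(buf, 1000)
file_size = os.path.getsize(path)

buf.close()
f.close()
shutil.rmtree(tmpdir)

print(file_size, roundtrip[:4])
```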

diskllm run always starts the patched server with:

-ngl 0
--kv-backend ssd
--kv-path ~/.diskllm/kv_cache
--kv-window 2048
--host 0.0.0.0
--port 8080
--no-repack
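
For readers who want to launch the patched server by hand, the flag list above can be assembled as an argument vector. This is a minimal sketch: the function name and the model file name are hypothetical, and passing the model with -m is an assumption based on stock llama-server rather than something this README states.

```python
from pathlib import Path


def build_server_args(model_path, kv_path="~/.diskllm/kv_cache",
                      kv_window=2048, host="0.0.0.0", port=8080):
    """Return llama-server arguments matching the flags listed above."""
    return [
        "-m", str(model_path),      # assumption: stock llama-server model flag
        "-ngl", "0",                # offload no layers to a GPU (CPU-only)
        "--kv-backend", "ssd",      # DiskLLM patch: keep the KV cache in an mmap file
        "--kv-path", str(Path(kv_path).expanduser()),
        "--kv-window", str(kv_window),
        "--host", host,
        "--port", str(port),
        "--no-repack",
    ]


# Hypothetical model file name, used only for illustration.
args = build_server_args("qwen2.5-7b-instruct-q4_k_m.gguf")
print(" ".join(args))
```

The resulting list can be passed to subprocess.Popen together with the path to the patched llama-server binary.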

When --session NAME is used, DiskLLM also passes --slot-save-path ~/.diskllm/sessions and uses llama-server slot save/restore APIs to persist context as ~/.diskllm/sessions/<name>.mmap.
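
Stock llama-server exposes slot persistence as POST /slots/{id}?action=save and action=restore, with a JSON body naming the file relative to --slot-save-path. The sketch below builds such a request without sending it. The helper name, the use of slot 0, and the exact file name DiskLLM passes are assumptions; only the <name>.mmap pattern comes from the text above.

```python
import json
import urllib.request


def slot_request(action, session, slot_id=0, base_url="http://127.0.0.1:8080"):
    """Build a llama-server slot save/restore request for a named session."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported slot action: {action!r}")
    # The file name is resolved by the server relative to --slot-save-path.
    body = json.dumps({"filename": f"{session}.mmap"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = slot_request("save", "demo")
print(req.get_method(), req.full_url)
```

With a server running, urllib.request.urlopen(req) would send the request.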

Read more in docs/how-it-works.md.

Benchmarks

Benchmarks were run CPU-only on Windows 11 with an Intel i5-12450H, 16 GB RAM, and an NVMe SSD.

Test                                       Result
Qwen2.5 7B, 65K context, headless startup  258 MB private RAM
Qwen2.5 7B, restored session startup       3.46 s to healthy, plus restore
Qwen2.5 7B, restored prompt reuse          34 cached prompt tokens
Qwen2.5 7B, restored generation            7.13 tok/s

See docs/benchmarks.md for the full tables and hardware notes.

Model Registry

qwen2.5:3b   bartowski/Qwen2.5-3B-Instruct-GGUF Q4_K_M
qwen2.5:7b   bartowski/Qwen2.5-7B-Instruct-GGUF Q4_K_M
qwen2.5:14b  bartowski/Qwen2.5-14B-Instruct-GGUF Q4_K_M
lfm2.5:8b    LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M

DiskLLM can pull models from Ollama or HuggingFace. By default, diskllm pull tries Ollama first when the ollama CLI is installed, then falls back to HuggingFace. Force a source with:

diskllm pull qwen2.5:7b --source ollama
diskllm pull qwen2.5:7b --source huggingface
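
The source selection described above can be sketched as a small function. This is an illustration of the stated behavior, not DiskLLM's actual code; the function name and parameters are hypothetical.

```python
import shutil


def source_order(force=None, ollama_installed=None):
    """Return the download sources to try, in order."""
    if force is not None:
        if force not in ("ollama", "huggingface"):
            raise ValueError(f"unknown source: {force!r}")
        return [force]
    if ollama_installed is None:
        # Default behavior: detect the ollama CLI on PATH.
        ollama_installed = shutil.which("ollama") is not None
    if ollama_installed:
        return ["ollama", "huggingface"]
    return ["huggingface"]


print(source_order(ollama_installed=True))
print(source_order(force="huggingface"))
```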

For HuggingFace GGUF repos, pass the repo name directly. DiskLLM lists repo files and picks Q4_K_M by default:

diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M
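
One way to read the repo[:QUANT] form is shown below. It is a sketch only: both helper names are hypothetical, the file names in the example are made up, and DiskLLM's real matching rules may differ.

```python
def parse_hf_spec(spec, default_quant="Q4_K_M"):
    """Split 'owner/repo[:QUANT]' into (repo, quant), defaulting the quant."""
    repo, sep, quant = spec.partition(":")
    return repo, (quant if sep and quant else default_quant)


def pick_file(filenames, quant):
    """Return the first .gguf file whose name contains the quant tag."""
    for name in filenames:
        lowered = name.lower()
        if lowered.endswith(".gguf") and quant.lower() in lowered:
            return name
    return None


# Hypothetical repo listing, used only for illustration.
files = ["README.md", "model-Q4_K_M.gguf", "model-Q5_K_M.gguf"]

repo, quant = parse_hf_spec("bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M")
print(repo, quant, pick_file(files, quant))
```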

Storage

~/.diskllm/models/      downloaded GGUF files
~/.diskllm/kv_cache/    SSD KV mmap files
~/.diskllm/sessions/    persistent slot KV sessions
~/.diskllm/config.json  settings and installed model index
~/.diskllm/logs/        background server logs
~/.diskllm/bin/         patched llama-server binary

Override the home directory with DISKLLM_HOME.
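
Scripts that need to locate these directories can resolve them the same way. The sketch below assumes DISKLLM_HOME replaces ~/.diskllm entirely, which is how the sentence above reads; the helper names are hypothetical.

```python
import os
from pathlib import Path

SUBDIRS = ("models", "kv_cache", "sessions", "logs", "bin")


def diskllm_home(env=None):
    """Return the DiskLLM home directory, honoring DISKLLM_HOME if set."""
    env = os.environ if env is None else env
    override = env.get("DISKLLM_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".diskllm"


def layout(home):
    """Map each storage role listed above to its path under the home dir."""
    paths = {name: home / name for name in SUBDIRS}
    paths["config"] = home / "config.json"
    return paths


for role, path in layout(diskllm_home()).items():
    print(f"{role:9} {path}")
```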

Patched llama.cpp

The llama.cpp/ directory is tracked as a git submodule containing the patched fork. DiskLLM finds the patched server binary in this order:

  1. ~/.diskllm/bin/llama-server.exe
  2. DISKLLM_LLAMA_SERVER
  3. llama_server in ~/.diskllm/config.json
  4. bundled package path diskllm/bin/windows/llama-server.exe
  5. development checkout path llama.cpp/build/bin/llama-server.exe

If a binary is found outside ~/.diskllm, DiskLLM copies it into ~/.diskllm/bin and runs that copy.
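
The lookup order and the copy step can be sketched as follows. This illustrates the rules stated above rather than DiskLLM's actual implementation: the function names are hypothetical, and package_dir and checkout_dir stand in for wherever the installed package and the development checkout live.

```python
import shutil
import tempfile
from pathlib import Path

EXE = "llama-server.exe"


def candidate_paths(home, env, config, package_dir, checkout_dir):
    """List binary locations in the documented search order."""
    return [
        home / "bin" / EXE,
        env.get("DISKLLM_LLAMA_SERVER"),
        config.get("llama_server"),
        package_dir / "diskllm" / "bin" / "windows" / EXE,
        checkout_dir / "llama.cpp" / "build" / "bin" / EXE,
    ]


def find_server(home, env, config, package_dir, checkout_dir):
    """Return the managed binary path, copying it into home/bin if needed."""
    for cand in candidate_paths(home, env, config, package_dir, checkout_dir):
        if cand and Path(cand).is_file():
            found = Path(cand)
            managed = home / "bin" / found.name
            if found != managed:
                managed.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(found, managed)
            return managed
    return None


# Demo with a stub file standing in for the real binary.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fake = root / "elsewhere" / EXE
    fake.parent.mkdir()
    fake.write_bytes(b"stub")
    home = root / "home"
    resolved = find_server(home, {"DISKLLM_LLAMA_SERVER": str(fake)}, {},
                           root / "pkg", root / "src")
    copied_ok = (resolved == home / "bin" / EXE
                 and resolved.read_bytes() == b"stub")

print(copied_ok)
```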

OpenAI Client

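from openai import OpenAI

# Point the client at the local DiskLLM server instead of the OpenAI API.
# The client requires an api_key value; this example passes a placeholder.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="diskllm")
reply = client.chat.completions.create(
    model="diskllm",
    messages=[{"role": "user", "content": "What is DiskLLM?"}],
)
print(reply.choices[0].message.content)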

More examples are in examples/.

Contributing

Contributions are welcome. Start with CONTRIBUTING.md, keep changes small, and include benchmark or test evidence for runtime changes.

License

DiskLLM is released under the MIT License. See LICENSE.
