Ollama-style GGUF runner with SSD-backed llama.cpp KV cache

DiskLLM

Run a 7B LLM in 258 MB of private RAM. Stock llama.cpp needs about 6,000 MB for the same model at 65K context. DiskLLM keeps the KV cache on your SSD instead.

[Figure: DiskLLM architecture]

DiskLLM is an Ollama-style Python CLI and patched llama.cpp backend that keeps the KV cache on SSD through mmap instead of allocating the full cache in private RAM. It launches an OpenAI-compatible llama-server, downloads GGUF models from Ollama or HuggingFace, and persists context sessions across restarts.

Results

Model	Context	Stock RAM	DiskLLM RAM	Reduction	tok/s
Qwen2.5 3B	65K	~2,000 MB	~200 MB	10x	8.9
Qwen2.5 7B	65K	~6,000 MB	258 MB	23x	4.5
Qwen2.5 7B	256K	14,900 MB	258 MB	57x	2.5
LFM2.5 8B	128K	1,845 MB	172 MB	10.7x	10.0
Qwen2.5 14B	65K	~10,000 MB	289 MB	34x	1.7
Qwen2.5 32B	65K	~20,000 MB	424 MB	47x	0.1
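The Reduction column is the stock figure divided by the DiskLLM figure. The short check below recomputes it from the rows above; the variable and function names are illustrative, and the approximate ("~") values are treated as exact for the arithmetic.

```python
# Stock vs. DiskLLM private RAM in MB, copied from the Results table.
ROWS = [
    ("Qwen2.5 3B", "65K", 2000, 200),
    ("Qwen2.5 7B", "65K", 6000, 258),
    ("Qwen2.5 7B", "256K", 14900, 258),
    ("LFM2.5 8B", "128K", 1845, 172),
    ("Qwen2.5 14B", "65K", 10000, 289),
    ("Qwen2.5 32B", "65K", 20000, 424),
]


def reduction(stock_mb: float, diskllm_mb: float) -> float:
    """Return how many times smaller the DiskLLM footprint is."""
    return stock_mb / diskllm_mb


if __name__ == "__main__":
    for model, ctx, stock, disk in ROWS:
        print(f"{model:12} {ctx:>5} {reduction(stock, disk):5.1f}x")
```

Most table entries match the computed ratio with the fractional part dropped, for example 14,900 / 258 is about 57.8 and is listed as 57x.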

Metric	Value
Max RAM reduction	57x (256K context)
Min private RAM	172 MB (LFM2.5 8B)
Best tok/s	10.0 (LFM2.5 8B)
Largest model run	32B on 2.5GB free RAM
KV cache on SSD	112 MB (vs GBs in RAM)
Session file size	7.33 MB (persists across restarts)

Quick Start

pip install diskllm
diskllm pull qwen2.5:7b
diskllm run qwen2.5:7b --session demo --headless
diskllm chat qwen2.5:7b --session demo --no-start

The server exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1.
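As a dependency-free way to try that endpoint, the sketch below builds a chat completion request with the Python standard library only. The helper names `chat_request` and `ask` are illustrative; the model name "diskllm" and the response shape follow the usual OpenAI chat completions format and are assumed to apply to this server.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"


def chat_request(prompt: str, model: str = "diskllm") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is DiskLLM?")` requires the server from the Quick Start to be running; the generous timeout is a guess to allow for slow CPU-only generation.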

How It Works

Stock llama.cpp allocates KV tensors in RAM. At long context, that KV allocation can dominate system memory even when model weights are memory-mapped.
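To see why the KV cache grows so quickly, a rough size estimate helps. The sketch below uses the standard formula for a dense transformer cache (keys plus values, per layer, per token). The model shape in the example (28 layers, 4 KV heads, head dimension 128, 16-bit cache entries) is an assumption chosen to resemble a 7B grouped-query model, not a figure taken from DiskLLM.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Estimate dense KV cache size in bytes.

    Each layer stores one key and one value vector of
    n_kv_heads * head_dim elements for every token in the context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Assumed model shape; adjust to the model you actually run.
    for n_ctx in (65_536, 262_144):
        size = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_ctx=n_ctx)
        print(f"{n_ctx:>7} tokens -> {size / 2**20:,.0f} MiB")
```

The estimate scales linearly with context length, so quadrupling the context quadruples the cache while the weights stay the same size.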

DiskLLM adds a --kv-backend ssd mode to its patched llama.cpp. The backend creates a CPU-addressable mmap file under ~/.diskllm/kv_cache, keeps only the active attention window hot in RAM, and lets the OS page the rest to and from NVMe storage. The patch is wired through the dense KV, SWA (sliding-window attention), hybrid, and iSWA cache paths.
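The general idea of a file-backed, memory-mapped buffer can be illustrated in a few lines of Python. This is a conceptual sketch only, not the patched llama.cpp code: the row size, context length, and function names are all hypothetical, and the real backend works on tensors in C/C++.

```python
import mmap
import os
import tempfile

ROW_BYTES = 1024   # hypothetical size of one token's K+V data
N_CTX = 4096       # hypothetical context length in tokens


def open_kv_file(path: str, n_ctx: int = N_CTX) -> mmap.mmap:
    """Create a file big enough for the whole cache and map it."""
    with open(path, "w+b") as f:
        f.truncate(n_ctx * ROW_BYTES)    # reserve the full cache size on disk
        return mmap.mmap(f.fileno(), 0)  # the mapping outlives the file object


def write_row(buf: mmap.mmap, pos: int, row: bytes) -> None:
    """Store one token's row; the OS writes dirty pages back to the file."""
    if len(row) != ROW_BYTES:
        raise ValueError("row has the wrong size")
    buf[pos * ROW_BYTES:(pos + 1) * ROW_BYTES] = row


def read_row(buf: mmap.mmap, pos: int) -> bytes:
    """Load one token's row; untouched pages are read from disk on demand."""
    return buf[pos * ROW_BYTES:(pos + 1) * ROW_BYTES]


def demo() -> bool:
    """Round-trip a single row through a temporary mapped file."""
    with tempfile.TemporaryDirectory() as tmp:
        buf = open_kv_file(os.path.join(tmp, "kv.bin"))
        row = b"\x01" * ROW_BYTES
        write_row(buf, 10, row)
        ok = read_row(buf, 10) == row
        buf.close()  # unmap before the temporary directory is removed
        return ok
```

Because the mapping is backed by a file rather than anonymous memory, the operating system is free to evict pages that are not being touched, which is the property the SSD backend relies on.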

diskllm run always starts the patched server with:

-ngl 0
--kv-backend ssd
--kv-path ~/.diskllm/kv_cache
--kv-window 2048
--host 0.0.0.0
--port 8080
--no-repack
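For readers who want to script the same launch, the flag list above can be assembled as an argument vector. This is a sketch: `server_args` is a hypothetical helper, and the `-m` model flag is the standard llama.cpp option, assumed to be unchanged in the patched fork.

```python
from pathlib import Path
from typing import List


def server_args(binary: Path, model: Path, home: Path) -> List[str]:
    """Build an argument list mirroring the flags listed above."""
    return [
        str(binary),
        "-m", str(model),                      # assumption: standard llama.cpp model flag
        "-ngl", "0",                           # no layers offloaded to a GPU
        "--kv-backend", "ssd",
        "--kv-path", str(home / "kv_cache"),
        "--kv-window", "2048",
        "--host", "0.0.0.0",
        "--port", "8080",
        "--no-repack",
    ]


if __name__ == "__main__":
    home = Path.home() / ".diskllm"
    print(" ".join(server_args(home / "bin" / "llama-server.exe",
                               home / "models" / "model.gguf", home)))
```

The resulting list can be passed to `subprocess.Popen` as-is. Note that `--host 0.0.0.0` listens on every network interface, so the server is reachable from other machines unless a firewall blocks it.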

When --session NAME is used, DiskLLM also passes --slot-save-path ~/.diskllm/sessions and uses llama-server slot save/restore APIs to persist context as ~/.diskllm/sessions/<name>.mmap.
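Upstream llama-server exposes slot persistence as `POST /slots/{id}?action=save` and `action=restore`, with a JSON body naming a file relative to the slot save path. The sketch below builds such a request. It assumes the patched server keeps the upstream endpoint and that the session file is named `<name>.mmap`, as the path above suggests; `slot_request` is a hypothetical helper, not part of DiskLLM.

```python
import json
import urllib.request


def slot_request(action: str, session: str, slot_id: int = 0,
                 base: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build (but do not send) a slot save or restore request."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    # Assumption: the file name is resolved against --slot-save-path.
    body = json.dumps({"filename": f"{session}.mmap"}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending `slot_request("save", "demo")` with `urllib.request.urlopen` after a conversation, and the matching restore request after a restart, mirrors what the `--session` flag is described as doing.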

Benchmarks

Benchmarks were run CPU-only on Windows 11 with an Intel i5-12450H, 16 GB of RAM, and an NVMe SSD.

Test	Result
Qwen2.5 7B, 65K context, headless startup	258 MB private RAM
Qwen2.5 7B, restored session startup	3.46s to healthy plus restore
Qwen2.5 7B, restored prompt reuse	34 cached prompt tokens
Qwen2.5 7B, restored generation	7.13 tok/s

See docs/benchmarks.md for the full tables and hardware notes.

Model Registry

qwen2.5:3b   bartowski/Qwen2.5-3B-Instruct-GGUF Q4_K_M
qwen2.5:7b   bartowski/Qwen2.5-7B-Instruct-GGUF Q4_K_M
qwen2.5:14b  bartowski/Qwen2.5-14B-Instruct-GGUF Q4_K_M
lfm2.5:8b    LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M

DiskLLM can pull models from Ollama or HuggingFace. By default, diskllm pull tries Ollama first when the ollama CLI is installed, then falls back to HuggingFace. Force a source with:

diskllm pull qwen2.5:7b --source ollama
diskllm pull qwen2.5:7b --source huggingface
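The source selection described above can be sketched as a small function. `pick_sources` is a hypothetical name, and using `shutil.which` to detect the ollama CLI is an assumption about how "installed" is determined.

```python
import shutil
from typing import List, Optional

VALID_SOURCES = ("ollama", "huggingface")


def pick_sources(forced: Optional[str] = None,
                 ollama_installed: Optional[bool] = None) -> List[str]:
    """Return the download sources to try, in order."""
    if forced is not None:
        if forced not in VALID_SOURCES:
            raise ValueError(f"unknown source: {forced}")
        return [forced]  # --source disables the fallback
    if ollama_installed is None:
        # Assumption: the CLI counts as installed when it is on PATH.
        ollama_installed = shutil.which("ollama") is not None
    return ["ollama", "huggingface"] if ollama_installed else ["huggingface"]
```

Returning an ordered list keeps the fallback explicit: the caller tries each source in turn and stops at the first successful download.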

For HuggingFace GGUF repos, pass the repo name directly. DiskLLM lists repo files and picks Q4_K_M by default:

diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M
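The `repo[:QUANT]` form used above is easy to parse. The sketch below splits the spec and then picks a matching file from a list of repository file names. Both helpers are hypothetical, and matching on a case-insensitive substring is an assumption about how the quantization is identified in file names.

```python
from typing import List, Optional, Tuple


def parse_hf_spec(spec: str, default_quant: str = "Q4_K_M") -> Tuple[str, str]:
    """Split 'owner/repo[:QUANT]' into (repo, quant)."""
    repo, _, quant = spec.partition(":")
    if repo.count("/") != 1:
        raise ValueError("expected a repo of the form owner/name")
    return repo, quant or default_quant


def pick_file(files: List[str], quant: str) -> Optional[str]:
    """Return the first .gguf file whose name mentions the quantization."""
    for name in files:
        lowered = name.lower()
        if lowered.endswith(".gguf") and quant.lower() in lowered:
            return name
    return None
```

For example, `parse_hf_spec("bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M")` yields the repo name and `Q5_K_M`, and omitting the suffix falls back to `Q4_K_M`.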

Storage

~/.diskllm/models/      downloaded GGUF files
~/.diskllm/kv_cache/    SSD KV mmap files
~/.diskllm/sessions/    persistent slot KV sessions
~/.diskllm/config.json  settings and installed model index
~/.diskllm/logs/        background server logs
~/.diskllm/bin/         patched llama-server binary

Override the home directory with DISKLLM_HOME.
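A minimal sketch of how that layout could be resolved in Python follows. It assumes DISKLLM_HOME is read as an environment variable, and the function names are illustrative rather than part of the DiskLLM package.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

SUBDIRS = ("models", "kv_cache", "sessions", "logs", "bin")


def diskllm_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the DiskLLM home directory, honoring DISKLLM_HOME if set."""
    env = os.environ if env is None else env
    override = env.get("DISKLLM_HOME")  # assumption: an environment variable
    return Path(override) if override else Path.home() / ".diskllm"


def layout(home: Path) -> Dict[str, Path]:
    """Map each storage role listed above to its path under home."""
    paths = {name: home / name for name in SUBDIRS}
    paths["config"] = home / "config.json"
    return paths
```

Pointing DISKLLM_HOME at a directory on the fastest available drive moves the models, KV cache files, and sessions there together.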

Patched llama.cpp

The llama.cpp/ directory is tracked as a git submodule containing the patched fork. DiskLLM finds the patched server binary in this order:

~/.diskllm/bin/llama-server.exe
DISKLLM_LLAMA_SERVER
llama_server in ~/.diskllm/config.json
bundled package path diskllm/bin/windows/llama-server.exe
development checkout path llama.cpp/build/bin/llama-server.exe

If a binary is found outside ~/.diskllm, DiskLLM copies it into ~/.diskllm/bin and runs that copy.
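The lookup order can be expressed as an ordered candidate list. The sketch below is an approximation under stated assumptions: DISKLLM_LLAMA_SERVER is treated as an environment variable, `llama_server` as a top-level key in config.json, and `package_dir` and `checkout_dir` are hypothetical roots for the bundled and development paths.

```python
import json
from pathlib import Path
from typing import Iterable, List, Mapping, Optional


def candidate_paths(home: Path, env: Mapping[str, str],
                    package_dir: Path, checkout_dir: Path) -> List[Path]:
    """List possible llama-server locations in the documented order."""
    candidates = [home / "bin" / "llama-server.exe"]
    if env.get("DISKLLM_LLAMA_SERVER"):
        candidates.append(Path(env["DISKLLM_LLAMA_SERVER"]))
    config_file = home / "config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        if config.get("llama_server"):
            candidates.append(Path(config["llama_server"]))
    candidates.append(package_dir / "diskllm" / "bin" / "windows" / "llama-server.exe")
    candidates.append(checkout_dir / "llama.cpp" / "build" / "bin" / "llama-server.exe")
    return candidates


def find_server(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, if any."""
    return next((p for p in candidates if p.is_file()), None)
```

A caller could follow `find_server` with `shutil.copy2` into the home `bin` directory to reproduce the copy step described above.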

OpenAI Client

from openai import OpenAI

# Point the client at the local server started by `diskllm run`.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="diskllm")
reply = client.chat.completions.create(
    model="diskllm",
    messages=[{"role": "user", "content": "What is DiskLLM?"}],
)
# Print the assistant's reply text from the first choice.
print(reply.choices[0].message.content)

More examples are in examples/.

Contributing

Contributions are welcome. Start with CONTRIBUTING.md, keep changes small, and include benchmark or test evidence for runtime changes.

License

DiskLLM is released under the MIT License. See LICENSE.

Release history

0.1.1 (this version): Jun 4, 2026
0.1.0: Jun 4, 2026

Download files

Source distribution: diskllm-0.1.1.tar.gz (31.5 kB), uploaded Jun 4, 2026
Built distribution: diskllm-0.1.1-py3-none-any.whl (32.3 kB), uploaded Jun 4, 2026, Python 3
File details

Details for the file diskllm-0.1.1.tar.gz.

File metadata

Download URL: diskllm-0.1.1.tar.gz
Upload date: Jun 4, 2026
Size: 31.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for diskllm-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`bb3ab7ef186a017bc62dd0df3a58beeba77986903c0beffceb300255d31513c9`
MD5	`65367b8050d37a27115aa42fb55969d7`
BLAKE2b-256	`da65faeb44aaeac55fbe4f991442e8beb2afbda8fe093b2115551b621489a5cc`


File details

Details for the file diskllm-0.1.1-py3-none-any.whl.

File metadata

Download URL: diskllm-0.1.1-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 32.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for diskllm-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09ef167d256d0eae9eccec674d30dfaaa84943717b2e3e5cdabd3442216d22bc`
MD5	`3b532287427d6bbae37483ae6cdd24d5`
BLAKE2b-256	`4c7ed6bf4d29ea02fe4b3ec1727d9c0658ff18e5908d3205e06a4c26c43960c7`

