
LLM Inference for Large-Context Offline Workloads

About

oLLM is a lightweight Python library for large-context LLM inference, built on top of Transformers and PyTorch. It can run models such as Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU (8 GB VRAM) by offloading layers to SSD. Example performance: ~20 min to the first token, then ~17 s per subsequent token. No quantization is used; everything runs in fp16.

Memory usage for 100k-context inference on an 8 GB RTX 3060 Ti:

| Model | Weights | KV cache | Hidden states | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| llama3-1B-chat | 2 GB (fp16) | 12.6 GB | 0.4 GB | ~16 GB | ~5 GB | 18 GB |
| llama3-3B-chat | 7 GB (fp16) | 34.1 GB | 0.61 GB | ~42 GB | ~5.3 GB | 45 GB |
| llama3-8B-chat | 16 GB (fp16) | 52.4 GB | 0.8 GB | ~71 GB | ~6.6 GB | 75 GB |
| gpt-oss-20B | 13 GB (MXFP4) | Coming.. | Coming.. | | | |

By "Baseline" we mean typical inference without any offloading. Its VRAM figure does not include materializing the full attention matrix, which alone would require ~600 GB at this context length.
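As a sanity check on the table above, the llama3-8B-chat KV-cache figure follows directly from the model's shape. The arithmetic below assumes 32 layers, 32 attention heads of dimension 128, and fp16 (2 bytes per value); these shape assumptions are ours, not stated in the table:

```python
# KV cache: 2 tensors (K and V) x layers x heads x head_dim x tokens, in fp16
layers, heads, head_dim, tokens, fp16_bytes = 32, 32, 128, 100_000, 2
kv_bytes = 2 * layers * heads * head_dim * tokens * fp16_bytes
print(kv_bytes / 1e9)    # 52.4 (GB), matching the llama3-8B-chat row

# One layer's full attention score matrix at 100k context, all heads, in fp16
attn_bytes = heads * tokens * tokens * fp16_bytes
print(attn_bytes / 1e9)  # 640.0 (GB), the scale behind the ~600 GB figure
```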

How do we achieve this:

  • Loading layer weights from SSD directly to the GPU, one layer at a time
  • Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
  • Chunked attention with online softmax, so the full attention matrix is never materialized
  • Chunked MLP: the intermediate up-projection can get large, so we chunk the MLP as well
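The chunked-attention idea can be sketched in a few lines of PyTorch. This is a minimal illustration of an online (running) softmax over key/value chunks, not oLLM's actual implementation; the function name and chunk size are ours:

```python
import torch

def chunked_attention(q, k, v, chunk=1024):
    """Attention over k/v in chunks with an online softmax, so the
    full (q_len x kv_len) score matrix is never materialized.
    q: (heads, q_len, d); k, v: (heads, kv_len, d)."""
    scale = q.shape[-1] ** -0.5
    m = torch.full(q.shape[:-1], float("-inf"))  # running row-max of scores
    l = torch.zeros(q.shape[:-1])                # running softmax normalizer
    acc = torch.zeros_like(q)                    # running weighted sum of V
    for i in range(0, k.shape[1], chunk):
        kc, vc = k[:, i:i + chunk], v[:, i:i + chunk]
        s = (q @ kc.transpose(-1, -2)) * scale   # scores for this chunk only
        m_new = torch.maximum(m, s.amax(dim=-1))
        alpha = torch.exp(m - m_new)             # rescale previous partial sums
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        acc = acc * alpha.unsqueeze(-1) + p @ vc
        m = m_new
    return acc / l.unsqueeze(-1)
```

Peak memory for the score tensor drops from q_len x kv_len to q_len x chunk, while the result stays numerically identical to ordinary softmax attention.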

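MLP chunking can be sketched the same way. Because the MLP acts on each token independently, splitting over the sequence dimension is exact; this is a simplified SwiGLU-style block of our own, not oLLM's code:

```python
import torch

def chunked_mlp(x, w_gate, w_up, w_down, chunk=8192):
    """Apply a SwiGLU MLP chunk by chunk over the sequence, so the full
    (seq_len x intermediate_size) activation is never materialized."""
    outs = []
    for i in range(0, x.shape[0], chunk):
        xc = x[i:i + chunk]
        h = torch.nn.functional.silu(xc @ w_gate) * (xc @ w_up)  # per-chunk intermediate
        outs.append(h @ w_down)
    return torch.cat(outs, dim=0)
```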
Getting Started

It is recommended to create a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with `pip install ollm` or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
pip install kvikio-cu{cuda_version}   # e.g. kvikio-cu12 for CUDA 12

Example

Sample code:

from ollm import Inference, KVCache
o = Inference("llama3-1B-chat", device="cuda:0")  # only GPU is supported
o.ini_model(models_dir="./models/", force_download=False)
messages = [{"role": "system", "content": "You are a helpful AI assistant"}, {"role": "user", "content": "List planets"}]
prompt = o.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = o.tokenizer(prompt, return_tensors="pt").to(o.device)
past_key_values = KVCache(cache_dir="./kv_cache/", stats=o.stats)  # pass None if the context is small
outputs = o.model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20).cpu()
answer = o.tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:])
print(answer)

Or run the sample script as: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py

Contact us

If there's a model you'd like to see supported, feel free to reach out at anuarsh@ailabs.us and I'll do my best to make it happen.

