LLM Inference for Large-Context Offline Workloads
About
oLLM is a lightweight Python library for large-context LLM inference, built on top of Transformers and PyTorch. It enables running models like Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU (8 GB VRAM) by offloading layers to SSD. Example performance: ~20 min to the first token, then ~17 s per subsequent token. No quantization is used; everything runs in fp16 precision.
Memory usage for 100k-context inference on an 8 GB RTX 3060 Ti:
| Model | Weights | KV cache | Hidden states | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| llama3-1B-chat | 2 GB (fp16) | 12.6 GB | 0.4 GB | ~16 GB | ~5 GB | 18 GB |
| llama3-3B-chat | 7 GB (fp16) | 34.1 GB | 0.61 GB | ~42 GB | ~5.3 GB | 45 GB |
| llama3-8B-chat | 16 GB (fp16) | 52.4 GB | 0.8 GB | ~71 GB | ~6.6 GB | 75 GB |
| gpt-oss-20B | 13 GB (MXFP4) | Coming.. | Coming.. | | | |
By "baseline" we mean typical inference without any offloading. Its VRAM figure does not include materializing the full attention matrix (that alone would require ~600 GB).
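A rough back-of-the-envelope calculation shows where a number of that order comes from; the head count below is our assumption for an 8B-class model, not a figure from the project:

```python
# Sketch: cost of materializing full attention scores at 100k context.
# The head count (32) is an assumed value for an 8B-class model.
ctx = 100_000        # context length in tokens
bytes_fp16 = 2       # bytes per fp16 element
n_heads = 32         # assumed number of attention heads

per_head = ctx * ctx * bytes_fp16   # one head's (ctx x ctx) score matrix
per_layer = per_head * n_heads      # all heads in a single layer

print(f"per head:  {per_head / 1e9:.0f} GB")   # 20 GB
print(f"per layer: {per_layer / 1e9:.0f} GB")  # 640 GB
```

Even a single layer's score matrices dwarf any consumer GPU, which is why chunked attention (below) never builds them.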
How we achieve this:
- Loading layer weights from SSD directly to GPU one by one
- Offloading KV cache to SSD and loading back directly to GPU, no quantization or PagedAttention
- Chunked attention with online softmax. Full attention matrix is never materialized.
- Chunked MLP: the intermediate up-projection can get large, so we chunk the MLP as well
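The chunked-attention idea can be illustrated with a minimal NumPy sketch (not oLLM's actual implementation; the function name and chunk size are ours): keys and values are processed chunk by chunk, and an online softmax keeps running statistics so the full score matrix never exists in memory.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=1024):
    """Attention for one query `q` (d,) over keys/values `k`, `v` (n, d).

    Keeps a running max `m`, softmax denominator `s`, and output
    accumulator `o`, rescaling them whenever a chunk raises the max
    (online softmax), so only `chunk` scores exist at any one time.
    """
    m, s = -np.inf, 0.0
    o = np.zeros(v.shape[-1])
    for i in range(0, len(k), chunk):
        scores = k[i:i + chunk] @ q / np.sqrt(q.shape[-1])
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)      # rescale old stats to the new max
        p = np.exp(scores - m_new)     # unnormalized weights for this chunk
        s = s * scale + p.sum()
        o = o * scale + p @ v[i:i + chunk]
        m = m_new
    return o / s
```

Against a naive full-softmax reference this matches to floating-point precision, while peak memory scales with the chunk size instead of the context length.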
Getting Started
It is recommended to create a venv or conda environment first:

```shell
python3 -m venv ollm_env
source ollm_env/bin/activate
```
Install oLLM from source:

```shell
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
pip install kvikio-cu{cuda_version}  # e.g., kvikio-cu12
```
Example
```python
from ollm import Inference, KVCache

o = Inference("llama3-1B-chat", device="cuda:0")  # only GPU is supported
o.ini_model(models_dir="./models/", force_download=False)

messages = [{"role": "system", "content": "You are helpful AI assistant"}, {"role": "user", "content": "List planets"}]
prompt = o.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = o.tokenizer(prompt, return_tensors="pt").to(o.device)
past_key_values = KVCache(cache_dir="./kv_cache/", stats=o.stats)  # None if context is small
outputs = o.model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20).cpu()
answer = o.tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:])
print(answer)
```
Or run the sample script:

```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py
```
Contact us
If there’s a model you’d like to see supported, feel free to reach out at anuarsh@ailabs.us and I’ll do my best to make it happen.
Project details
Release history
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file ollm-0.1.3.tar.gz.
File metadata
- Download URL: ollm-0.1.3.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `18ef1c373ee524846bd32fd0516005808cd1e2cb4f70423b811c912ddd7a2e32` |
| MD5 | `18b1eac775bae063723db9587e89eb68` |
| BLAKE2b-256 | `8743596de505daae69fee295b974b34bccc41c21664e588537234af8408610dc` |
File details
Details for the file ollm-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ollm-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6719a3c9a63e1d6e78946c633a251d42536fe1f7b8955cdc4a3c006d6ae719ec` |
| MD5 | `a58d15915d7853700a8951eb3f901dc2` |
| BLAKE2b-256 | `4a35be099e264b2497fd5980d4d9b423fa465c4928fff0795f5d922216c7ef50` |