LLM Inference for Large-Context Offline Workloads
oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models like gpt-oss-20B, qwen3-next-80B or Llama-3.1-8B-Instruct on 100k context using a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used; weights stay in fp16/bf16 precision.
Latest updates (0.5.2) 🔥
- Multimodal voxtral-small-24B (audio+text) added. [sample with audio]
- Multimodal gemma3-12B (image+text) added. [sample with image]
- qwen3-next-80B DiskCache support added
- qwen3-next-80B (160GB model) added with ⚡️1tok/2s throughput (our fastest model so far)
- gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage
- gpt-oss-20B chunked MLP added to reduce VRAM usage
Inference memory usage on an 8 GB Nvidia 3060 Ti:
| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB (bf16) | 50k | 20 GB | ~190 GB | ~7.5 GB | 180 GB |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| gemma3-12B | 25 GB (bf16) | 50k | 18.5 GB | ~45 GB | ~6.7 GB | 43 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |
By "Baseline" we mean typical inference without any offloading.
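A quick sanity check on the table: the oLLM Disk column is close to the sum of the Weights and KV cache columns, which is what one would expect if both live on the SSD. This is an observation from the published numbers, not a documented formula; `rows` and `disk_gap` are names made up for this sketch.

```python
# (weights GB, KV cache GB, reported oLLM disk GB), copied from the table above
rows = {
    "qwen3-next-80B": (160.0, 20.0, 180.0),
    "gpt-oss-20B": (13.0, 1.4, 15.0),
    "gemma3-12B": (25.0, 18.5, 43.0),
    "llama3-1B-chat": (2.0, 12.6, 15.0),
    "llama3-3B-chat": (7.0, 34.1, 42.0),
    "llama3-8B-chat": (16.0, 52.4, 69.0),
}


def disk_gap(weights_gb, kv_cache_gb, reported_disk_gb):
    """Reported disk usage minus (weights + KV cache), in GB."""
    return reported_disk_gb - (weights_gb + kv_cache_gb)


for name, (weights, kv, disk) in rows.items():
    print(f"{name}: weights + KV = {weights + kv:.1f} GB, reported disk = {disk:.0f} GB")
```

For every row the gap stays under 1 GB, so weights plus KV cache is a reasonable first estimate of the free SSD space a run needs.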
How we achieve this:
- Loading layer weights from SSD directly to GPU one by one
- Offloading KV cache to SSD and loading back directly to GPU, no quantization or PagedAttention
- Offloading layer weights to CPU if needed
- FlashAttention-2 with online softmax. The full attention matrix is never materialized.
- Chunked MLP. The intermediate up-projection activations can get large, so the MLP is computed in chunks as well.
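To make the "online softmax" bullet concrete, here is a minimal pure-Python sketch for a single query vector. It is an illustration of the idea only, not oLLM's or FlashAttention-2's actual kernel; the function names and the `block_size` parameter are assumptions made for this example. Keys and values are consumed block by block, and only a running maximum, a running normalizer, and a running weighted sum are kept.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query: all scores exist at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_online(q, keys, values, block_size=2):
    """Same result, computed block by block with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 0.25]
keys = [[0.1, 0.2, 0.3], [1.0, -0.5, 0.0], [-0.3, 0.8, 0.4], [0.7, 0.7, -0.2], [0.0, 1.5, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
print(attention_reference(q, keys, values))
print(attention_online(q, keys, values, block_size=2))
```

Both calls print the same vector up to floating-point rounding. The blockwise version never holds more than `block_size` scores at a time, which is why the memory cost stops growing with the square of the context length.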
Typical use cases include:
- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
Supported Nvidia GPUs: Ampere (RTX 30xx, A30, A4000, A10), Ada Lovelace (RTX 40xx, L4), Hopper (H100), and newer
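To check whether a local GPU falls into that list, one option is to look at its CUDA compute capability. The sketch below assumes that "Ampere and newer" corresponds to compute capability 8.0 or higher (RTX 30xx reports 8.6, Ada Lovelace 8.9, Hopper 9.0); whether oLLM performs a check like this itself is not stated here, and `is_ampere_or_newer` and `check_local_gpu` are hypothetical helper names.

```python
def is_ampere_or_newer(major: int, minor: int = 0) -> bool:
    """Assumption: Ampere and later architectures report compute capability >= 8.0."""
    return (major, minor) >= (8, 0)


def check_local_gpu() -> str:
    """Describe the first visible CUDA device, if PyTorch and CUDA are available."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "looks supported" if is_ampere_or_newer(major, minor) else "older than Ampere"
    return f"{name} (compute capability {major}.{minor}): {verdict}"


print(check_local_gpu())
```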
Getting Started
It is recommended to create a venv or conda environment first:
python3 -m venv ollm_env
source ollm_env/bin/activate
Install oLLM with pip install ollm or from source:
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
pip install kvikio-cu{cuda_version}  # for example, kvikio-cu12
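If you are unsure which suffix to use, the CUDA version your PyTorch build was compiled against is a reasonable starting point. The sketch below assumes the package suffix is the CUDA major version, as the kvikio-cu12 example suggests; `kvikio_package` is a hypothetical helper, and you should still confirm that the resulting package exists on your package index.

```python
def kvikio_package(cuda_version: str) -> str:
    """Map a CUDA version string such as "12.4" to a name such as "kvikio-cu12"."""
    major = cuda_version.split(".")[0]
    if not major.isdigit():
        raise ValueError(f"unexpected CUDA version: {cuda_version!r}")
    return f"kvikio-cu{major}"


try:
    import torch
    cuda_version = torch.version.cuda  # None on CPU-only builds
except ImportError:
    cuda_version = None

if cuda_version:
    print(f"pip install {kvikio_package(cuda_version)}")
else:
    print("No CUDA-enabled PyTorch found; pick the package matching your CUDA toolkit")
```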
💡 Note
voxtral-small-24B requires additional pip dependencies, installed with pip install "mistral-common[audio]" and pip install librosa
Example
Code snippet sample
from ollm import Inference, file_get_contents, TextStreamer
o = Inference("llama3-1B-chat", device="cuda:0", logging=True)  # other options: llama3-3B-chat, llama3-8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2)  # optional: offload some layers to CPU for a speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # use past_key_values = None if the context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)
messages = [{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)
Or run the sample Python script with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py
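If you launch from a notebook or another Python program rather than a shell, the same allocator setting can be applied from code. This is a minimal sketch: the variable has to be in place before PyTorch first initializes CUDA, so setting it before importing torch is the simplest way to be safe.

```python
import os

# setdefault keeps any value the user already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Imports that touch CUDA would follow here, for example:
# from ollm import Inference, TextStreamer

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```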
More samples
Roadmap
For visibility of what's coming next (subject to change)
- Qwen3-Next quantized version
- Qwen3-VL or alternative vision model
- Qwen3-Next MultiTokenPrediction in R&D
Contact us
If there’s a model you’d like to see supported, feel free to suggest it in the discussion — I’ll do my best to make it happen.