LLM Inference for Large-Context Offline Workloads

oLLM

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models such as gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with contexts of up to 100k tokens on a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used; weights stay in fp16/bf16 precision.

Latest updates (1.0.3) 🔥

AutoInference with any Llama3 / gemma3 model + PEFT adapter support
kvikio and flash-attn are now optional, so there are no hardware restrictions beyond those of HF Transformers
Llama3 models now use the original HF files (make sure to delete any existing llama3-* model folder before running)
Multimodal voxtral-small-24B (audio+text) added. [sample with audio]
Multimodal gemma3-12B (image+text) added. [sample with image]
qwen3-next-80B (160GB model) added with ⚡️1tok/2s throughput (our fastest model so far)
gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage
gpt-oss-20B chunked MLP added to reduce VRAM usage

8GB Nvidia 3060 Ti Inference memory usage:

Model	Weights	Context length	KV cache	Baseline VRAM (no offload)	oLLM GPU VRAM	oLLM Disk (SSD)
qwen3-next-80B	160 GB (bf16)	50k	20 GB	~190 GB	~7.5 GB	180 GB
gpt-oss-20B	13 GB (packed bf16)	10k	1.4 GB	~40 GB	~7.3 GB	15 GB
gemma3-12B	25 GB (bf16)	50k	18.5 GB	~45 GB	~6.7 GB	43 GB
llama3-1B-chat	2 GB (bf16)	100k	12.6 GB	~16 GB	~5 GB	15 GB
llama3-3B-chat	7 GB (bf16)	100k	34.1 GB	~42 GB	~5.3 GB	42 GB
llama3-8B-chat	16 GB (bf16)	100k	52.4 GB	~71 GB	~6.6 GB	69 GB

By "Baseline" we mean typical inference with all weights and the KV cache held in GPU memory, without any offloading.
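
To get a feel for why the KV cache dominates memory at long context, its size can be estimated with the usual back-of-the-envelope formula: one key and one value vector per layer, head, and token. The sketch below is an illustration only. The function name and the model dimensions are assumptions, and the source does not state how the KV cache column in the table was computed.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * num_tokens


# Assumed dimensions for an 8B Llama-3-style model (not taken from the oLLM source):
# 32 layers, 32 heads of dimension 128, bf16 (2 bytes per value), 100k tokens.
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, num_tokens=100_000)
print(f"{size / 1e9:.1f} GB")  # prints "52.4 GB"
```

With these assumed dimensions the estimate lands close to the llama3-8B-chat row above. Models that store fewer key/value heads than attention heads would need proportionally less.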

How we achieve this:

Loading layer weights from SSD directly to GPU, one layer at a time
Offloading the KV cache to SSD and loading it back directly to GPU, with no quantization or PagedAttention
Offloading layer weights to CPU if needed
FlashAttention-2 with online softmax. The full attention matrix is never materialized.
Chunked MLP. The intermediate up-projection outputs can get large, so the MLP is computed in chunks as well
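
The online-softmax idea mentioned in the list above can be illustrated with a minimal pure-Python sketch. This is not oLLM's kernel (real implementations operate on GPU tensors); the function names and the block size are hypothetical. It only shows how attention for one query can be accumulated block by block, so that only one block of scores exists at a time.

```python
import math


def attention_online_softmax(q, keys, values, block_size=2):
    """Single-query attention computed block by block with an online softmax.

    Only one block of scores is held at a time; the full score row is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of value vectors

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale the partial results accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]


def attention_reference(q, keys, values):
    """Standard softmax attention that materializes every score, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


# Toy data: one 2-dimensional query against five key/value pairs.
q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [2.0, 0.0], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
print(attention_online_softmax(q, keys, values))
print(attention_reference(q, keys, values))
```

Both calls should print the same vector up to floating-point rounding. The blockwise version keeps only a running maximum, a running denominator, and one accumulator, regardless of how many keys there are.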

Typical use cases include:

Analyze contracts, regulations, and compliance reports in one pass
Summarize or extract insights from massive patient histories or medical literature
Process very large log files or threat reports locally
Analyze historical chats to extract the most common issues/questions users have

Supported GPUs: NVIDIA (with additional performance benefits from kvikio and flash-attn), AMD, and Apple Silicon (MacBook).

Getting Started

We recommend creating a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with pip install --no-build-isolation ollm or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install --no-build-isolation -e .

# For NVIDIA GPUs with CUDA (optional, speeds up inference).
# Replace {cuda_version} with your CUDA major version, e.g. kvikio-cu12:
pip install kvikio-cu{cuda_version}

💡 Note
voxtral-small-24B requires additional dependencies. Install them with pip install "mistral-common[audio]" and pip install librosa

See Troubleshooting if you run into any installation issues.

Example

Code snippet sample

from ollm import Inference, file_get_contents, TextStreamer
o = Inference("llama3-1B-chat", device="cuda:0", logging=True)  # or llama3-3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2)  # (optional) offload some layers to CPU for a speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # set to None if the context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

messages = [{"role":"system", "content":"You are helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
# Decode only the newly generated tokens (everything after the prompt)
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)

Alternatively, run the sample Python script with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py

# With AutoInference, you can run any Llama3/gemma3 model, with PEFT adapter support
# pip install peft
from ollm import AutoInference
o = AutoInference("./models/gemma3-12B", # any llama3 or gemma3 model
  adapter_dir="./myadapter/checkpoint-20", # PEFT adapter checkpoint if available
  device="cuda:0", multimodality=False, logging=True)
...

More samples

Knowledge base

Documentation
Community articles, video, blogs
Troubleshooting

Roadmap

What's coming next (subject to change):

Qwen3-Next quantized version
Qwen3-VL or alternative vision model
Qwen3-Next MultiTokenPrediction in R&D

Contact us

If there’s a model you’d like to see supported, feel free to suggest it in the discussion; I’ll do my best to make it happen.

Release history

Version	Release date
1.0.3 (this version)	Oct 31, 2025
1.0.2	Oct 28, 2025
1.0.1	Oct 20, 2025
1.0.0	Oct 13, 2025
0.5.2	Oct 8, 2025
0.5.0	Oct 1, 2025
0.4.2	Sep 28, 2025
0.4.0	Sep 19, 2025
0.3.0	Sep 10, 2025
0.2.1	Sep 4, 2025
0.1.3	Aug 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ollm-1.0.3.tar.gz (32.2 kB)

Uploaded Oct 31, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ollm-1.0.3-py3-none-any.whl (34.8 kB)

Uploaded Oct 31, 2025 Python 3

File details

Details for the file ollm-1.0.3.tar.gz.

File metadata

Download URL: ollm-1.0.3.tar.gz
Upload date: Oct 31, 2025
Size: 32.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ollm-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`5b349d775dbad59b8db3d2c1d38645d618de169ef5158da5540da6d24ee56070`
MD5	`0660220b8321243370423b4dcf049478`
BLAKE2b-256	`8f0b29dcc6ac56a1182020b4775dd007bb9c9d7c34d844b597100b5618ed084e`

See more details on using hashes here.
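
As one way to use the digests listed above, a downloaded file can be checked locally with Python's standard hashlib module. This is a minimal sketch; the helper names and the local file path are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 matches the published digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For example, after downloading the source distribution into the current directory, verify_download("ollm-1.0.3.tar.gz", "5b349d775dbad59b8db3d2c1d38645d618de169ef5158da5540da6d24ee56070") should return True if the file is intact.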

File details

Details for the file ollm-1.0.3-py3-none-any.whl.

File metadata

Download URL: ollm-1.0.3-py3-none-any.whl
Upload date: Oct 31, 2025
Size: 34.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ollm-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7707724ff215a030853fbe1012807bfc673bb58bab40351d63a542f2b9f824f`
MD5	`0b0d52f4a08af1095979852f2119b3bb`
BLAKE2b-256	`4865a5d53865afd340c95701fd89625574771d9ed9df981fc724ee7f19022452`

See more details on using hashes here.
