LLM Inference for Large-Context Offline Workloads

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models like gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with 100k-token context on a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used; weights stay in fp16/bf16 precision.

Latest updates (1.0.0) 🔥

  • kvikio and flash-attn are now optional, so there are no hardware restrictions beyond those of HF Transformers
  • Llama3 models now use the original HF files (make sure to re-download the model with force_download=True)
  • Multimodal voxtral-small-24B (audio+text) added. [sample with audio]
  • Multimodal gemma3-12B (image+text) added. [sample with image]
  • qwen3-next-80B (160GB model) added with ⚡️1tok/2s throughput (our fastest model so far)
  • gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage
  • gpt-oss-20B chunked MLP added to reduce VRAM usage

Inference memory usage on an 8 GB NVIDIA 3060 Ti:

Model          | Weights             | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD)
qwen3-next-80B | 160 GB (bf16)       | 50k            | 20 GB    | ~190 GB                    | ~7.5 GB       | 180 GB
gpt-oss-20B    | 13 GB (packed bf16) | 10k            | 1.4 GB   | ~40 GB                     | ~7.3 GB       | 15 GB
gemma3-12B     | 25 GB (bf16)        | 50k            | 18.5 GB  | ~45 GB                     | ~6.7 GB       | 43 GB
llama3-1B-chat | 2 GB (bf16)         | 100k           | 12.6 GB  | ~16 GB                     | ~5 GB         | 15 GB
llama3-3B-chat | 7 GB (bf16)         | 100k           | 34.1 GB  | ~42 GB                     | ~5.3 GB       | 42 GB
llama3-8B-chat | 16 GB (bf16)        | 100k           | 52.4 GB  | ~71 GB                     | ~6.6 GB       | 69 GB

By "Baseline" we mean typical inference without any offloading.
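
The KV cache column grows linearly with context length, which is why it dominates at 50k to 100k tokens. As a rough sketch, the usual back-of-the-envelope estimate counts one key and one value vector per layer, per head, per token. The function name and the configuration values below are hypothetical and chosen only for illustration; they are not meant to reproduce the figures in the table, which depend on each model's actual attention layout.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration (an assumption, not one of the models above).
size = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=64, seq_len=50_000)
print(f"{size / 1e9:.1f} GB")  # prints "4.9 GB"
```

Doubling seq_len doubles the result, so a cache that fits in VRAM at short context can quickly exceed it at long context.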

How we achieve this:

  • Loading layer weights from SSD directly to GPU one by one
  • Offloading KV cache to SSD and loading back directly to GPU, no quantization or PagedAttention
  • Offloading layer weights to CPU if needed
  • FlashAttention-2 with online softmax. Full attention matrix is never materialized.
  • Chunked MLP. The intermediate up-projection activations can get large, so the MLP is processed in chunks as well
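
Two of the ideas in the list above, online softmax and chunked MLP, can be illustrated with a minimal pure-Python sketch. This is not oLLM's implementation (which runs on GPU tensors); all function names are hypothetical, the attention handles a single query without masking, and the toy MLP uses ReLU rather than the gated activations real models use. The point is only that both blockwise versions give the same result as the all-at-once versions while holding less intermediate data at a time.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def attention_reference(q, keys, values):
    """Standard softmax attention for one query: all scores are held at once."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_online(q, keys, values, block_size=2):
    """Same output, but keys/values are consumed block by block.

    Only a running max, a running normalizer, and a running weighted sum
    are kept, so the full score vector is never stored.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        scores = [_score(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


def mlp(rows, w_up, w_down):
    """Toy MLP over all rows at once: up-projection, ReLU, down-projection.

    w_up[i] is the weight vector of hidden unit i; w_down[j] that of output j.
    """
    hidden = [[max(0.0, sum(x * w for x, w in zip(row, unit))) for unit in w_up]
              for row in rows]
    return [[sum(h * w for h, w in zip(h_row, unit)) for unit in w_down]
            for h_row in hidden]


def mlp_chunked(rows, w_up, w_down, chunk_size=2):
    """Same output, but only chunk_size rows of hidden activations exist at a time."""
    out = []
    for start in range(0, len(rows), chunk_size):
        out.extend(mlp(rows[start:start + chunk_size], w_up, w_down))
    return out
```

Because each token's MLP output depends only on that token's input, chunking along the sequence dimension changes peak memory but not the result. The online softmax needs the rescaling step because earlier blocks were accumulated relative to an older maximum.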

Typical use cases include:

  • Analyze contracts, regulations, and compliance reports in one pass
  • Summarize or extract insights from massive patient histories or medical literature
  • Process very large log files or threat reports locally
  • Analyze historical chats to extract the most common issues/questions users have

Supported GPUs: NVIDIA (with additional performance benefits from kvikio and flash-attn), AMD, and Apple Silicon (MacBook).

Getting Started

It is recommended to create a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with pip install --no-build-isolation ollm or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install --no-build-isolation -e .

# For NVIDIA GPUs with CUDA (optional):
pip install kvikio-cu{cuda_version}  # e.g. kvikio-cu12; speeds up inference

💡 Note
voxtral-small-24B requires additional dependencies; install them with pip install "mistral-common[audio]" and pip install librosa

Check out the Troubleshooting section if you run into any installation issues.

Example

Code snippet sample

from ollm import Inference, file_get_contents, TextStreamer

# Supported model names: llama3-1B/3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2)  # (optional) offload some layers to CPU for a speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # set to None if the context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

messages = [{"role":"system", "content":"You are helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
# Decode only the newly generated tokens (everything after the prompt)
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)

Or run the sample Python script with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py
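
If setting the variable on the command line is inconvenient (for example in a notebook), the same allocator option can be set from Python. This is a minimal sketch that assumes the variable is set before PyTorch initializes CUDA; placing it at the very top of the script, ahead of any torch or ollm import, is the safe choice.

```python
import os

# Keep any value the user already exported; otherwise enable expandable segments.
# This must run before torch (or ollm) initializes the CUDA allocator.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Import torch / ollm only after this point, e.g.:
# from ollm import Inference
```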

More samples

Knowledge base

Roadmap

For visibility of what's coming next (subject to change)

  • Qwen3-Next quantized version
  • Qwen3-VL or alternative vision model
  • Qwen3-Next MultiTokenPrediction in R&D

Contact us

If there’s a model you’d like to see supported, feel free to suggest it in the discussion — I’ll do my best to make it happen.
