
LLM Inference for Large-Context Offline Workloads

About

oLLM is a lightweight Python library for large-context LLM inference, built on top of Transformers and PyTorch. It can run models such as Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU (8 GB VRAM) by offloading layers to SSD. Example performance: ~20 min to the first token, then ~17 s per subsequent token. No quantization is used; everything runs in fp16.

Memory usage for 100k-context inference on an 8 GB RTX 3060 Ti:

| Model | Weights | KV cache | Hidden states | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| llama3-1B-chat | 2 GB (fp16) | 12.6 GB | 0.4 GB | ~16 GB | ~5 GB | 18 GB |
| llama3-3B-chat | 7 GB (fp16) | 34.1 GB | 0.61 GB | ~42 GB | ~5.3 GB | 45 GB |
| llama3-8B-chat | 16 GB (fp16) | 52.4 GB | 0.8 GB | ~71 GB | ~6.6 GB | 75 GB |
| gpt-oss-20B | 13 GB (MXFP4) | Coming.. | Coming.. | | | |

By "Baseline" we mean typical inference without any offloading. Its VRAM figure does not include materializing the full attention matrix, which alone would require ~600 GB at this context length.
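As a sanity check on the table above, the llama3-8B-chat KV-cache figure follows directly from the model's shape. The arithmetic below assumes 32 layers, 32 attention heads of dimension 128, and fp16 (2 bytes per value); these shape assumptions are ours, not stated in the table:

```python
# KV cache: 2 tensors (K and V) x layers x heads x head_dim x tokens, in fp16
layers, heads, head_dim, tokens, fp16_bytes = 32, 32, 128, 100_000, 2
kv_bytes = 2 * layers * heads * head_dim * tokens * fp16_bytes
print(kv_bytes / 1e9)    # 52.4 (GB), matching the llama3-8B-chat row

# One layer's full attention score matrix at 100k context, all heads, in fp16
attn_bytes = heads * tokens * tokens * fp16_bytes
print(attn_bytes / 1e9)  # 640.0 (GB), the scale behind the ~600 GB figure
```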

How do we achieve this:

  • Loading layer weights from SSD directly to the GPU, one layer at a time
  • Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
  • Chunked attention with online softmax, so the full attention matrix is never materialized
  • Chunked MLP: the intermediate up-projection can get large, so we chunk the MLP as well
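The chunked-attention idea can be sketched in a few lines of PyTorch. This is a minimal illustration of an online (running) softmax over key/value chunks, not oLLM's actual implementation; the function name and chunk size are ours:

```python
import torch

def chunked_attention(q, k, v, chunk=1024):
    """Attention over k/v in chunks with an online softmax, so the
    full (q_len x kv_len) score matrix is never materialized.
    q: (heads, q_len, d); k, v: (heads, kv_len, d)."""
    scale = q.shape[-1] ** -0.5
    m = torch.full(q.shape[:-1], float("-inf"))  # running row-max of scores
    l = torch.zeros(q.shape[:-1])                # running softmax normalizer
    acc = torch.zeros_like(q)                    # running weighted sum of V
    for i in range(0, k.shape[1], chunk):
        kc, vc = k[:, i:i + chunk], v[:, i:i + chunk]
        s = (q @ kc.transpose(-1, -2)) * scale   # scores for this chunk only
        m_new = torch.maximum(m, s.amax(dim=-1))
        alpha = torch.exp(m - m_new)             # rescale previous partial sums
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        acc = acc * alpha.unsqueeze(-1) + p @ vc
        m = m_new
    return acc / l.unsqueeze(-1)
```

Peak memory for the score tensor drops from q_len x kv_len to q_len x chunk, while the result stays numerically identical to ordinary softmax attention.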

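MLP chunking can be sketched the same way. Because the MLP acts on each token independently, splitting over the sequence dimension is exact; this is a simplified SwiGLU-style block of our own, not oLLM's code:

```python
import torch

def chunked_mlp(x, w_gate, w_up, w_down, chunk=8192):
    """Apply a SwiGLU MLP chunk by chunk over the sequence, so the full
    (seq_len x intermediate_size) activation is never materialized."""
    outs = []
    for i in range(0, x.shape[0], chunk):
        xc = x[i:i + chunk]
        h = torch.nn.functional.silu(xc @ w_gate) * (xc @ w_up)  # per-chunk intermediate
        outs.append(h @ w_down)
    return torch.cat(outs, dim=0)
```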
Getting Started

It is recommended to create a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with `pip install ollm` or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
pip install kvikio-cu{cuda_version}   # e.g. kvikio-cu12 for CUDA 12

Example

Sample code:

from ollm import Inference, KVCache
o = Inference("llama3-1B-chat", device="cuda:0")  # only GPU is supported
o.ini_model(models_dir="./models/", force_download=False)
messages = [{"role": "system", "content": "You are a helpful AI assistant"}, {"role": "user", "content": "List planets"}]
prompt = o.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = o.tokenizer(prompt, return_tensors="pt").to(o.device)
past_key_values = KVCache(cache_dir="./kv_cache/", stats=o.stats)  # pass None if the context is small
outputs = o.model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20).cpu()
answer = o.tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:])
print(answer)

Or run the sample script as: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py

Contact us

If there's a model you'd like to see supported, feel free to reach out at anuarsh@ailabs.us and I'll do my best to make it happen.

