
LLM Inference for Large-Context Offline Workloads

Project description

oLLM


About

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It can run models such as Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU with 8 GB of VRAM. gpt-oss-20B is also supported (large-context support is coming soon). No quantization is used, only fp16/bf16 precision.

100k-context inference memory usage on an 8 GB Nvidia RTX 3060 Ti:

| Model | Weights | KV cache | Hidden states | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM disk (SSD) |
|---|---|---|---|---|---|---|
| llama3-1B-chat | 2 GB (fp16) | 12.6 GB | 0.4 GB | ~16 GB | ~5 GB | 18 GB |
| llama3-3B-chat | 7 GB (fp16) | 34.1 GB | 0.61 GB | ~42 GB | ~5.3 GB | 45 GB |
| llama3-8B-chat | 16 GB (fp16) | 52.4 GB | 0.8 GB | ~71 GB | ~6.6 GB | 75 GB |
| gpt-oss-20B | 13 GB (packed bf16) | — | 0.6 GB | — | ~6.4 GB (large-context support is on the way) | 20 GB |

By "Baseline" we mean typical inference without any offloading.
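The KV-cache figures in the table are roughly consistent with the standard fp16 sizing formula, assuming the cache stores keys and values for all attention heads; the model shape below is an illustrative assumption, not taken from the library:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """fp16/bf16 KV-cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-8B-like shape (assumed: 32 layers, 32 heads, head_dim 128) at 100k tokens
print(kv_cache_bytes(32, 32, 128, 100_000) / 1e9)  # -> 52.4288, close to the ~52.4 GB row
```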

How we achieve this:

  • Layer weights are loaded from SSD directly to the GPU, one layer at a time
  • The KV cache is offloaded to SSD and loaded back directly to the GPU, with no quantization or PagedAttention
  • Layer weights can additionally be offloaded to CPU if needed
  • FlashAttention-2 with online softmax, so the full attention matrix is never materialized
  • Chunked MLP: the intermediate up-projection activations can get large, so the MLP is chunked as well
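As a rough illustration of the online-softmax idea (a NumPy sketch, not oLLM's actual kernel), the code below processes K/V in chunks while keeping only running max/sum statistics, so the full attention matrix never exists in memory:

```python
import numpy as np

def chunked_attention(q, k, v, chunk=64):
    """Attention via online softmax: K/V are consumed chunk by chunk; only a
    running row-max (m), normalizer (l), and output accumulator (o) persist."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)          # running row-max of scores
    l = np.zeros(q.shape[0])                  # running softmax denominator
    o = np.zeros((q.shape[0], v.shape[-1]))   # unnormalized output accumulator
    for s in range(0, k.shape[0], chunk):
        kc, vc = k[s:s + chunk], v[s:s + chunk]
        scores = q @ kc.T / np.sqrt(d)        # shape (n_q, chunk), never (n_q, n_k)
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)             # rescale old statistics to the new max
        p = np.exp(scores - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        o = o * scale[:, None] + p @ vc
        m = m_new
    return o / l[:, None]
```

The result is numerically identical to materializing the full softmax, which is why chunking trades memory for extra rescaling work rather than accuracy.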

Typical use cases include:

  • Analyze contracts, regulations, and compliance reports in one pass
  • Summarize or extract insights from massive patient histories or medical literature
  • Process very large log files or threat reports locally
  • Analyze historical chats to extract the most common issues/questions users have

Supported Nvidia GPUs: Ampere and newer (RTX 30xx, RTX 40xx, L4, A10, etc.); Turing GPUs (T4, RTX 20-series, Quadro RTX 6000/8000) are supported for Llama3 models only.

Getting Started

We recommend creating a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with pip install ollm, or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .
pip install kvikio-cu{cuda_version}   # e.g. kvikio-cu12 for CUDA 12
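If you're unsure which CUDA major version you have, one way to derive the matching kvikio wheel name is to parse the output of nvcc (the helper function and sed pattern below are assumptions of this sketch, not part of the project):

```shell
# Derive the matching kvikio wheel name from `nvcc --version` output.
kvikio_wheel() {
  nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9]*\)\..*/kvikio-cu\1/p'
}
# Then: pip install "$(kvikio_wheel)"   # e.g. kvikio-cu12 on CUDA 12.x
```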

Example


from ollm import Inference, KVCache
o = Inference("llama3-1B-chat", device="cuda:0")  # llama3-1B/3B/8B-chat or gpt-oss-20B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2)  # (optional) offload some layers to CPU for a speed increase
past_key_values = KVCache(cache_dir="./kv_cache/", stats=o.stats)  # pass None for small contexts

messages = [{"role": "system", "content": "You are a helpful AI assistant"}, {"role": "user", "content": "List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=20).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)

Or run the sample script with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py

Contact us

If there’s a model you’d like to see supported, feel free to reach out at anuarsh@ailabs.us and I’ll do my best to make it happen.


Download files

Download the file for your platform.

Source Distribution

ollm-0.2.1.tar.gz (21.3 kB)


Built Distribution


ollm-0.2.1-py3-none-any.whl (22.8 kB)


File details

Details for the file ollm-0.2.1.tar.gz.

File metadata

  • Download URL: ollm-0.2.1.tar.gz
  • Upload date:
  • Size: 21.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ollm-0.2.1.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 367179c33c248cd6a0d33f70b7096ffa50d7328bdc091ebeabcd0c872b448325 |
| MD5 | 859efac1b84d3f5de473b692203b9316 |
| BLAKE2b-256 | 4543dc13982fdae9f301dd60d5e70297cf14eee20488ee2b097f3423d0e0e6e7 |


File details

Details for the file ollm-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: ollm-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 22.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ollm-0.2.1-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f188d29c027840c6de995fe9f80ca4256f820a97cecc55e9f124597bb767ff8 |
| MD5 | 5b7d4d71b8d6dfb6142a446458a06409 |
| BLAKE2b-256 | b7f120b57f4c42d054e92c50d127c4fdc216a1e4a673f62d753496e7a77fbfed |

