Efficient LLM inference with .oom format - 2x smaller than GGUF. Dual GPU support, RoPE, KV-Cache & Flash Attention! pip install oomllama[cuda]
OomLlama
Efficient LLM inference with .oom format - 2x smaller than GGUF
from oomllama import OomLlama
llm = OomLlama("humotica-32b")
response = llm.generate("What is the meaning of life?")
print(response)
What's New in v0.4.0
Major Release - Correct Output at Last!
- Interleaved RoPE Fix: Fixed a critical bug caused by GGUF weights using an interleaved format for Q/K projections
  - GGUF stores Q/K as [x0, x32, x1, x33, ...] instead of [x0, x1, ..., x31, x32, ...]
  - Now using the correct apply_rope() (interleaved) instead of apply_rope_llama() (non-interleaved)
  - This was the root cause of position-dependent errors in attention output
- Verified Output: TinyLlama now correctly outputs "2 + 2 = 4" and coherent text
- Clean Production Code: Removed all debug output for production deployment
- Qwen 7B/32B Support: Added support for Qwen 2.5 models (Q3_K_M, Q8_0)
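The layout issue behind the RoPE fix can be illustrated with a small pure-Python sketch. It is not the library's code: rope_half_split, rope_interleaved, and interleave are hypothetical helpers, and the base of 10000 and head size of 64 are assumptions. It shows that rotating adjacent pairs on the interleaved layout gives the same result as rotating (x[i], x[i + 32]) pairs on the original layout, while applying the half-split rotation to interleaved data does not.

```python
import math

def rope_half_split(x, pos, base=10000.0):
    """Non-interleaved RoPE: rotate pairs (x[i], x[i + d/2])."""
    half = len(x) // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def rope_interleaved(x, pos, base=10000.0):
    """Interleaved RoPE: rotate adjacent pairs (x[2i], x[2i + 1])."""
    out = list(x)
    for i in range(len(x) // 2):
        theta = pos * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def interleave(x):
    """Reorder [x0, x1, ..., x31, x32, ...] into [x0, x32, x1, x33, ...]."""
    half = len(x) // 2
    out = []
    for i in range(half):
        out.extend((x[i], x[i + half]))
    return out

head = [float(i) for i in range(64)]  # one toy attention head
pos = 5

# Matching rotation for the interleaved layout.
via_interleaved = rope_interleaved(interleave(head), pos)
# Reference: rotate in the original layout, then reorder for comparison.
reference = interleave(rope_half_split(head, pos))
# Mismatched rotation: half-split RoPE applied to interleaved data.
wrong = rope_half_split(interleave(head), pos)

matches = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(via_interleaved, reference))
mismatch = any(not math.isclose(a, b, abs_tol=1e-9) for a, b in zip(wrong, reference))
print(matches, mismatch)
```

Because the rotation angle depends on the position, a mismatched pairing produces errors that grow with the token position, which is consistent with the position-dependent symptoms described above.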
v0.3.7
- Layer Pinning Enabled: Hot layers stay in VRAM
- 20GB VRAM Budget: Dual RTX 3060 config with first/last 4 layers pinned
v0.3.6
- Q6_K Dequantizer Fix: Fixed critical bug in GGUF Q6_K tensor dequantization that caused inf values
- Q4 Format: Upgraded from Q2 to Q4 quantization (4 bits = 16 levels) for better precision
- Correct Logits: Model now outputs proper logit values (~8-10 range vs millions before)
v0.3.5
- Dual GPU Support: Automatic layer striping across 2 GPUs
- Per-GPU RoPE: Each GPU has its own RoPE tensors
v0.3.3
- RoPE (Rotary Position Embedding): Proper position encoding for accurate text generation
- KV-Cache: 10-50x speedup by caching attention keys/values
- Flash Attention: Memory-efficient attention computation
- Smart Layer Pinning: Keep hot layers in VRAM with auto-eviction
- Qwen 2.5 Support: Optimized config for 32B/70B Qwen models
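As a rough illustration of the KV-cache idea, the sketch below runs single-head attention over a toy sequence twice: once appending each new token's key and value to a cache, and once rebuilding the full key/value lists at every step. The KVCache class and attend function are hypothetical and are not OomLlama's API; the toy vectors stand in for real projected queries, keys, and values.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    """Hypothetical cache holding one key and one value per processed token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the new token's key/value are appended; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy stand-ins for projected q/k/v vectors of six tokens.
tokens = [[0.1 * (i + j) for j in range(4)] for i in range(6)]

cache = KVCache()
cached_out = [cache.step(t, t, t) for t in tokens]

# Reference without a cache: rebuild all keys/values at every step.
naive_out = [attend(tokens[i], tokens[: i + 1], tokens[: i + 1]) for i in range(len(tokens))]

same = all(
    math.isclose(a, b)
    for x, y in zip(cached_out, naive_out)
    for a, b in zip(x, y)
)
print(same)
```

Both paths produce the same outputs; the cached path simply avoids recomputing keys and values for tokens it has already seen.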
Why OomLlama?
| Feature | GGUF (Q4_K_M) | OOM (Q4) |
|---|---|---|
| 70B Model Size | ~40 GB | ~35 GB |
| 32B Model Size | ~20 GB | ~17 GB |
| RAM Usage | High | Lazy Loading |
| Format | Open | Open (MIT) |
OomLlama uses Q4 quantization (4 bits = 16 levels per weight) with lazy layer loading to run large models on consumer hardware.
Installation
Pre-built Wheel (Recommended for GPU)
# CUDA 12.x pre-built wheel (includes all dependencies)
pip install https://brein.jaspervandemeent.nl/static/wheels/oomllama-0.3.6-cp313-cp313-manylinux_2_39_x86_64.whl
From PyPI (builds from source)
# Basic installation - requires Rust toolchain + CUDA toolkit
pip install oomllama
# With NVIDIA runtime libraries
pip install oomllama[cuda]
Build Requirements:
- Python 3.8+
- Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
- CUDA Toolkit 12.x (for GPU support)
- 8GB+ RAM for compilation
Troubleshooting Build:
# If nvidia-smi detection fails, set the compute capability for your GPU (use only one):
export CUDA_COMPUTE_CAP=86 # RTX 30xx
# export CUDA_COMPUTE_CAP=89 # RTX 40xx
pip install oomllama
Quick Start
Download a Model
from oomllama import download_model
# Download from HuggingFace
model_path = download_model("humotica-32b")
Generate Text
from oomllama import OomLlama
llm = OomLlama("humotica-32b")
# Simple generation
response = llm.generate("Explain quantum computing in simple terms")
print(response)
# With parameters
response = llm.generate(
    "Write a haiku about AI",
    max_tokens=50,
    temperature=0.8,
    top_p=0.9,
)
Chat Mode
messages = [
    ("user", "Hello! Who are you?"),
    ("assistant", "I'm OomLlama, an efficient LLM."),
    ("user", "What makes you efficient?"),
]
response = llm.chat(messages)
print(response)
Available Models
| Model | Parameters | Size (.oom) | HuggingFace |
|---|---|---|---|
| humotica-32b | 33B | ~17 GB | Link |
| llamaohm-70b | 70B | ~35 GB | Link |
| tinyllama-1b | 1.1B | ~600 MB | Link |
The .oom Format
OOM (OomLlama Model) is a compact model format:
┌──────────────────────────────────────┐
│ Header: OOML (magic) + metadata      │
├──────────────────────────────────────┤
│ Tensors: Q4 quantized (4 bits/weight)│
│ - Scale + Min per 256-weight block   │
│ - 136 bytes per block                │
└──────────────────────────────────────┘
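The block figures above imply a fixed storage cost per weight, which the following sketch works out. The split of the 8 non-payload bytes into a scale and a minimum of 4 bytes each is an assumption, as is ignoring the header, metadata, and any tensors that are not stored as Q4; estimate_tensor_bytes is a hypothetical helper.

```python
WEIGHTS_PER_BLOCK = 256
BLOCK_BYTES = 136  # from the layout above

# 256 weights at 4 bits each occupy 128 bytes.
packed_bytes = WEIGHTS_PER_BLOCK * 4 // 8
# The remaining 8 bytes hold the per-block scale and min (assumed 4 bytes each).
overhead_bytes = BLOCK_BYTES - packed_bytes
# Effective storage cost, including the per-block overhead.
bits_per_weight = BLOCK_BYTES * 8 / WEIGHTS_PER_BLOCK

def estimate_tensor_bytes(n_weights):
    """Estimate Q4 tensor storage, rounding up to whole blocks."""
    blocks = -(-n_weights // WEIGHTS_PER_BLOCK)  # ceiling division
    return blocks * BLOCK_BYTES

size_70b_gib = estimate_tensor_bytes(70_000_000_000) / 2**30
print(packed_bytes, overhead_bytes, bits_per_weight, round(size_70b_gib, 1))
```

Under these assumptions a 70B-parameter model needs a little under 35 GiB for its tensors, in line with the ~35 GB listed for the 70B model.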
Convert GGUF to OOM
# Using the CLI tool
gguf2oom model.gguf model.oom
# Check model info
gguf2oom --info model.gguf
Technical Details
Q4 Quantization
Each weight is stored as 4 bits (0-15) with per-block scale and minimum:
weight = q4_value * scale + min
Q4 provides 16 quantization levels per weight, balancing compression with model quality.
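A minimal round-trip sketch of this scheme is shown below. The byte layout (little-endian float32 scale, then float32 min, then 128 bytes with the lower-index weight in the low nibble) is an assumption chosen to match the 136-byte block size; the actual .oom layout may differ, and quantize_block and dequantize_block are hypothetical names.

```python
import math
import struct

BLOCK = 256  # weights per block

def quantize_block(weights):
    """Pack 256 floats into 8 bytes of scale/min plus 128 bytes of 4-bit values."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant blocks
    q = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<ff", scale, lo) + packed

def dequantize_block(blob):
    """Apply weight = q4_value * scale + min to every 4-bit value."""
    scale, lo = struct.unpack_from("<ff", blob)
    out = []
    for byte in blob[8:]:
        out.append((byte & 0x0F) * scale + lo)
        out.append((byte >> 4) * scale + lo)
    return out

weights = [math.sin(i) for i in range(BLOCK)]
blob = quantize_block(weights)
restored = dequantize_block(blob)

step = struct.unpack_from("<f", blob)[0]  # stored scale = size of one level
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blob), max_err <= step / 2 + 1e-6)
```

With 16 levels spread evenly between the block minimum and maximum, the reconstruction error of any weight is bounded by about half a quantization step.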
Lazy Layer Loading
OomLlama loads transformer layers on-demand, keeping only the active layer in memory:
Forward Pass:
Layer 0: Load → Compute → Unload
Layer 1: Load → Compute → Unload
...
Layer N: Load → Compute → Unload
This enables running 70B models on 24GB GPU RAM.
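The load, compute, unload cycle can be sketched as follows, combined with the layer pinning mentioned in the v0.3.7 notes (first and last 4 layers kept resident). LazyLayerRunner and fake_load are hypothetical; a real loader would read tensors from the .oom file and move them to the GPU rather than return a toy function.

```python
class LazyLayerRunner:
    """Hypothetical runner: loads layers on demand and keeps pinned ones resident."""

    def __init__(self, n_layers, load_layer, pinned=()):
        self.n_layers = n_layers
        self.load_layer = load_layer
        self.pinned = set(pinned)
        self.resident = {}  # layer index -> loaded layer
        self.loads = 0      # number of loads performed so far

    def _get(self, idx):
        if idx not in self.resident:
            self.resident[idx] = self.load_layer(idx)
            self.loads += 1
        return self.resident[idx]

    def forward(self, x):
        for idx in range(self.n_layers):
            layer = self._get(idx)      # Load (skipped if already resident)
            x = layer(x)                # Compute
            if idx not in self.pinned:  # Unload everything that is not pinned
                del self.resident[idx]
        return x

def fake_load(idx):
    """Stand-in for reading a layer from disk: returns a toy layer function."""
    return lambda x: x + idx

n = 12
pinned = set(range(4)) | set(range(n - 4, n))  # first and last 4 layers
runner = LazyLayerRunner(n, fake_load, pinned)

first = runner.forward(0)
second = runner.forward(0)
print(first, second, runner.loads, sorted(runner.resident))
```

In this sketch the second forward pass reloads only the unpinned layers, so peak memory stays near the size of the pinned set plus one active layer.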
Credits
- Model Format: Gemini IDD & Root AI (Humotica AI Lab)
- Quantization: OomLlama.rs by Humotica
- Base Models: Meta Platforms, Inc. (Llama 3.3)
License
- OomLlama Code: MIT License
- Model Weights: Subject to original model licenses (e.g., Llama 3.3 Community License)
Links
- Humotica
- HuggingFace Models
- PyPI Package
- Issue Tracker
One Love, One fAmIly
Built by Humotica AI Lab - Jasper, Claude, Gemini, Codex
File details
Details for the file oomllama-0.4.0.tar.gz.
File metadata
- Download URL: oomllama-0.4.0.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e2d0ed8459ba0a6633b7d4c9e8c15dd6b94409a2cd2fb377d0ac8b3b4af1a891 |
| MD5 | 5c3d4d9c17067f8cbce64db554dc99e4 |
| BLAKE2b-256 | 970350ea4a2620aaba4f73dbb06ac8935147b8b6a3d0c15ee25b52d57a7e8e89 |
File details
Details for the file oomllama-0.4.0-cp311-cp311-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: oomllama-0.4.0-cp311-cp311-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 21.2 MB
- Tags: CPython 3.11, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 272533c4797ddd13a612da60e78a292ef4527613e76820a46cd59b8a78af834a |
| MD5 | e43c2eadfe05d6adf1844238f66a3b1d |
| BLAKE2b-256 | 0aaf704f2da3ac7d1eff10e20874397b84e9692627894437eb9c1ee91e48f8da |