Efficient LLM inference with the .oom format - about 2x smaller than Q4 GGUF
OomLlama
Quick Start
from oomllama import OomLlama
llm = OomLlama("humotica-7b")
response = llm.generate("What is the meaning of life?")
print(response)
Why OomLlama?
| Feature | GGUF (Q4) | OOM (Q2) |
|---|---|---|
| 70B Model Size | ~40 GB | ~20 GB |
| 32B Model Size | ~20 GB | ~10 GB |
| RAM Usage | High | Lower (lazy layer loading) |
| Format | Open | Open (MIT) |
OomLlama uses Q2 quantization with lazy layer loading to run large models on consumer hardware.
Installation
pip install oomllama
Features
- Q2 Quantization: 2-bit weights with per-block scale/min
- Lazy Layer Loading: Only the active layer is held in memory
- Interleaved RoPE: Proper Qwen model support (no gibberish!)
- HuggingFace Integration: Download models directly
- GPU Inference: CUDA support via Candle
Available Models
| Model | Parameters | Size (.oom) | HuggingFace |
|---|---|---|---|
| humotica-7b | 7B | ~2.5 GB | Link |
| humotica-32b | 32B | ~10 GB | Link |
| LlamaOhm-70B | 70B | ~20 GB | Link |
The .oom Format
OOM (OomLlama Model) is a compact model format:
+--------------------------------------+
| Header: OOML (magic) + metadata |
+--------------------------------------+
| Tensors: Q2 quantized (2 bits/weight)|
| - Scale + Min per 256-weight block |
| - 68 bytes per block |
+--------------------------------------+
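The block figures above can be checked with a little arithmetic. The sketch below assumes the 68 bytes are made up of 64 bytes of packed 2-bit values plus 4 bytes for the scale and min (for example, two 16-bit values); that split is inferred from the numbers, not stated by the format description. The helper `estimated_size_gb` is a hypothetical name used only for illustration.

```python
# Figures taken from the format description above.
BLOCK_WEIGHTS = 256   # weights per block
BLOCK_BYTES = 68      # bytes per block

# 256 weights * 2 bits = 512 bits = 64 bytes of packed values.
packed_bytes = BLOCK_WEIGHTS * 2 // 8
# The remainder is assumed to hold the per-block scale and min.
overhead_bytes = BLOCK_BYTES - packed_bytes
# Effective storage cost per weight, including the scale/min overhead.
bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def estimated_size_gb(n_params):
    """Rough size of the quantized tensors alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("7B", 7e9), ("32B", 32e9), ("70B", 70e9)]:
    print(f"{name}: ~{estimated_size_gb(n_params):.1f} GB")
```

This gives 2.125 bits per weight and estimates of roughly 1.9, 8.5, and 18.6 GB. The listed model sizes are somewhat larger, which would be expected if the header or some tensors take additional space.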
CLI Usage
# Run inference
oomllama generate "Hello, world!"
# Convert GGUF to OOM
gguf2oom model.gguf model.oom --quant q2
# Check model info
oomllama info model.oom
Technical Details
Q2 Quantization
Each weight is stored as 2 bits (0, 1, 2, or 3) with per-block scale and minimum:
weight = q2_value * scale + min
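As a minimal sketch of the formula above, the following quantizes one 256-weight block and reconstructs it. The packing order (four 2-bit values per byte, lowest bits first) and the function names `quantize_block` and `dequantize_block` are assumptions for illustration; the actual .oom byte layout may differ.

```python
BLOCK_WEIGHTS = 256


def quantize_block(weights):
    """Quantize 256 floats to (scale, min, 64 packed bytes)."""
    assert len(weights) == BLOCK_WEIGHTS
    w_min = min(weights)
    # Three steps span the range min..max; fall back to 1.0 for constant blocks.
    scale = (max(weights) - w_min) / 3 or 1.0
    packed = bytearray(BLOCK_WEIGHTS // 4)
    for i, w in enumerate(weights):
        q = max(0, min(3, round((w - w_min) / scale)))  # clamp to 0..3
        packed[i // 4] |= q << (2 * (i % 4))            # assumed bit order
    return scale, w_min, bytes(packed)


def dequantize_block(scale, w_min, packed):
    """Apply weight = q2_value * scale + min to every packed value."""
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            q2_value = (byte >> shift) & 0b11
            out.append(q2_value * scale + w_min)
    return out
```

With only four levels per block, the reconstruction error for any weight is at most half of `scale`, so blocks with a narrow value range are reproduced more accurately than blocks with outliers.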
Interleaved RoPE (Qwen Fix)
OomLlama supports both LLaMA-style and Qwen-style RoPE:
- LLaMA-style: Split at half_dim
  [x0:half, x1:half]
- Qwen-style (interleaved): Even/odd pairs
  [x0, x1, x0, x1, ...]
This fix prevents the "Chinese characters / gibberish" issue with Qwen models.
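The difference between the two pairings can be sketched in plain Python. This is an illustration of the two layouts as named above, not OomLlama's implementation; the function names and the frequency base of 10000 (a common default) are assumptions.

```python
import math


def _angles(dim, pos, base=10000.0):
    """One rotation angle per pair of dimensions at sequence position `pos`."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_split_half(x, pos, base=10000.0):
    """Split layout: rotate the pairs (x[i], x[i + half_dim])."""
    dim, half = len(x), len(x) // 2
    out = list(x)
    for i, theta in enumerate(_angles(dim, pos, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def rope_interleaved(x, pos, base=10000.0):
    """Interleaved layout: rotate the even/odd pairs (x[2i], x[2i + 1])."""
    dim = len(x)
    out = list(x)
    for i, theta in enumerate(_angles(dim, pos, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Both functions apply the same rotations but to different pairs of elements, so they agree only up to a permutation of the dimensions. Applying the layout a model was not trained with rotates the wrong pairs, which is consistent with the garbled output described above.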
Lazy Layer Loading
Forward Pass:
Layer 0: Load -> Compute -> Unload
Layer 1: Load -> Compute -> Unload
...
Layer N: Load -> Compute -> Unload
This enables running 70B models on a GPU with 24 GB of memory.
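The load, compute, unload cycle can be sketched with a context manager that releases each layer as soon as it has been applied. `LayerStore` and `forward_lazy` are hypothetical names, and each "layer" here is just a number to add, standing in for real weights read from disk.

```python
from contextlib import contextmanager


class LayerStore:
    """Toy stand-in for an on-disk model; tracks how many layers are resident."""

    def __init__(self, layers):
        self._layers = layers
        self.resident = 0
        self.peak_resident = 0

    @contextmanager
    def loaded(self, idx):
        # Load: a real implementation would read and dequantize the layer here.
        self.resident += 1
        self.peak_resident = max(self.peak_resident, self.resident)
        try:
            yield self._layers[idx]
        finally:
            # Unload: release the layer even if the compute step raised.
            self.resident -= 1


def forward_lazy(store, num_layers, x):
    """Run the forward pass with at most one layer loaded at a time."""
    for idx in range(num_layers):
        with store.loaded(idx) as layer:
            x = x + layer  # Compute (placeholder for the real layer math)
    return x


store = LayerStore([1, 2, 3, 4])
result = forward_lazy(store, 4, 0)
```

After the pass, `store.peak_resident` is 1, so peak memory in this scheme is set by the largest single layer plus activations rather than by the whole model. The trade-off is that every layer is reloaded on each forward pass.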
Credits
- Model Format: Gemini IDD & Root AI (Humotica AI Lab)
- Quantization: OomLlama.rs by Humotica
- Interleaved RoPE Fix: Root AI & Jasper
- Base Models: Meta (Llama), Alibaba (Qwen)
License
- OomLlama Code: MIT License
- Model Weights: Subject to original model licenses
One Love, One fAmIly
Built by Humotica AI Lab - Jasper, Claude, Gemini