Efficient LLM inference with .oom format - 2x smaller than GGUF
🦙 OomLlama
from oomllama import OomLlama
llm = OomLlama("humotica-32b")
response = llm.generate("What is the meaning of life?")
print(response)
Why OomLlama?
| Feature | GGUF (Q4) | OOM (Q2) |
|---|---|---|
| 70B Model Size | ~40 GB | ~20 GB |
| 32B Model Size | ~20 GB | ~10 GB |
| RAM Usage | High | Lower (lazy layer loading) |
| Format | Open | Open (MIT) |
OomLlama uses Q2 quantization with lazy layer loading to run large models on consumer hardware.
Installation
pip install oomllama
Quick Start
Download a Model
from oomllama import download_model
# Download from HuggingFace
model_path = download_model("humotica-32b")
Generate Text
from oomllama import OomLlama
llm = OomLlama("humotica-32b")
# Simple generation
response = llm.generate("Explain quantum computing in simple terms")
print(response)
# With parameters
response = llm.generate(
    "Write a haiku about AI",
    max_tokens=50,
    temperature=0.8,
    top_p=0.9,
)
print(response)
Chat Mode
messages = [
    ("user", "Hello! Who are you?"),
    ("assistant", "I'm OomLlama, an efficient LLM."),
    ("user", "What makes you efficient?"),
]
response = llm.chat(messages)
print(response)
Available Models
| Model | Parameters | Size (.oom) | HuggingFace |
|---|---|---|---|
| humotica-32b | 33B | ~10 GB | Link |
| llamaohm-70b | 70B | ~20 GB | Link |
| tinyllama-1b | 1.1B | ~400 MB | Link |
The .oom Format
OOM (OomLlama Model) is a compact model format:
┌──────────────────────────────────────┐
│ Header: OOML (magic) + metadata      │
├──────────────────────────────────────┤
│ Tensors: Q2 quantized (2 bits/weight)│
│  - Scale + Min per 256-weight block  │
│  - 68 bytes per block                │
└──────────────────────────────────────┘
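The block figures in the layout can be checked with a little arithmetic. The sketch below uses only the two numbers stated there (256 weights and 68 bytes per block); the function names are hypothetical and not part of the oomllama package. The results are lower bounds, since they ignore the header, metadata, and any tensors that might not be stored in Q2.

```python
WEIGHTS_PER_BLOCK = 256  # from the layout description
BYTES_PER_BLOCK = 68     # from the layout description


def bits_per_weight() -> float:
    """Effective storage cost per weight, including per-block scale and min."""
    return BYTES_PER_BLOCK * 8 / WEIGHTS_PER_BLOCK


def estimate_q2_bytes(num_params: float) -> float:
    """Lower-bound size in bytes if every parameter lived in a Q2 block."""
    return num_params / WEIGHTS_PER_BLOCK * BYTES_PER_BLOCK


if __name__ == "__main__":
    print(f"{bits_per_weight():.3f} bits per weight")
    # Parameter counts taken from the Available Models table.
    for name, params in [
        ("tinyllama-1b", 1.1e9),
        ("humotica-32b", 33e9),
        ("llamaohm-70b", 70e9),
    ]:
        print(f"{name}: at least {estimate_q2_bytes(params) / 1e9:.2f} GB")
```

The 2 bits per weight account for 64 of the 68 bytes; the remaining 4 bytes hold the per-block scale and minimum.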
Convert GGUF to OOM
# Using the CLI tool
gguf2oom model.gguf model.oom
# Check model info
gguf2oom --info model.gguf
Technical Details
Q2 Quantization
Each weight is stored as 2 bits (0, 1, 2, or 3) with per-block scale and minimum:
weight = q2_value * scale + min
At 68 bytes per 256-weight block, this works out to about 2.1 bits per weight, roughly half the size of a 4-bit (Q4) format, with acceptable quality loss for most tasks.
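As an illustration of the formula, here is a minimal pure-Python sketch of quantizing and dequantizing one 256-weight block. It is not the OomLlama.rs implementation: the rounding rule, the bit order within each byte, and the use of two 16-bit floats for scale and minimum are all assumptions, chosen only because they add up to the documented 68 bytes per block.

```python
import struct

BLOCK_SIZE = 256  # weights per block, as in the format description


def quantize_block(weights):
    """Map 256 floats to 2-bit codes (0..3) plus a per-block scale and min."""
    assert len(weights) == BLOCK_SIZE
    lo, hi = min(weights), max(weights)
    # Four levels span three steps; fall back to 1.0 for a constant block.
    scale = (hi - lo) / 3 or 1.0
    codes = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Apply weight = q2_value * scale + min to every code."""
    return [q * scale + lo for q in codes]


def pack_block(codes, scale, lo):
    """Assumed layout: fp16 scale, fp16 min, then four codes per byte."""
    packed = bytearray()
    for i in range(0, BLOCK_SIZE, 4):
        a, b, c, d = codes[i:i + 4]
        packed.append(a | (b << 2) | (c << 4) | (d << 6))
    return struct.pack("<ee", scale, lo) + bytes(packed)


if __name__ == "__main__":
    # Synthetic weights spread evenly over [-0.5, 0.5].
    weights = [((i * 37) % 256) / 255.0 - 0.5 for i in range(BLOCK_SIZE)]
    codes, scale, lo = quantize_block(weights)
    restored = dequantize_block(codes, scale, lo)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"block bytes: {len(pack_block(codes, scale, lo))}")
    print(f"worst-case error: {worst:.4f} (step size {scale:.4f})")
```

With only four levels per block, the reconstruction error of each weight can be as large as half the step size, which is where the quality loss relative to 4-bit formats comes from.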
Lazy Layer Loading
OomLlama loads transformer layers on-demand, keeping only the active layer in memory:
Forward Pass:
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer N: Load → Compute → Unload
This enables running 70B models on a GPU with 24 GB of memory.
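The load, compute, unload cycle can be sketched as a plain loop. This is a toy illustration of the idea, not OomLlama's actual loader: `forward_lazy`, `load_layer`, and `make_toy_loader` are hypothetical names, and the "layers" here are simple functions standing in for real transformer blocks read from a .oom file.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def forward_lazy(
    x: List[float],
    num_layers: int,
    load_layer: Callable[[int], Layer],
) -> List[float]:
    """Run a forward pass while holding at most one layer at a time."""
    for index in range(num_layers):
        layer = load_layer(index)  # Load: fetch this layer's weights
        x = layer(x)               # Compute: apply it to the activations
        del layer                  # Unload: drop the reference before the next load
    return x


def make_toy_loader(log: List[int]) -> Callable[[int], Layer]:
    """Stand-in loader: layer i adds i to every activation and logs the load."""
    def load_layer(index: int) -> Layer:
        log.append(index)
        return lambda xs: [v + index for v in xs]
    return load_layer


if __name__ == "__main__":
    loads: List[int] = []
    out = forward_lazy([0.0, 1.0], num_layers=3, load_layer=make_toy_loader(loads))
    print(out, loads)
```

The trade-off of this design is that each layer is loaded again on every forward pass, so peak memory drops at the cost of extra I/O per generated token.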
Credits
- Model Format: Gemini IDD & Root AI (Humotica AI Lab)
- Quantization: OomLlama.rs by Humotica
- Base Models: Meta Platforms, Inc. (Llama 3.3)
License
- OomLlama Code: MIT License
- Model Weights: Subject to original model licenses (e.g., Llama 3.3 Community License)
Links
- Humotica
- HuggingFace Models
- PyPI Package
- Issue Tracker
One Love, One fAmIly
Built by Humotica AI Lab - Jasper, Claude, Gemini, Codex
Source Distribution
File details
Details for the file oomllama-0.3.0.tar.gz.
File metadata
- Download URL: oomllama-0.3.0.tar.gz
- Upload date:
- Size: 115.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 95d7212f339c8f40c690252aa0c327a6e194edaf86fd31fdc6c2fb97f6b2feb7 |
| MD5 | 8140eafae32e2d18dde761c196446321 |
| BLAKE2b-256 | ed768f6b3097ccd89baa01374b17131523b6d72ee480ea39e1d136c29c4cc255 |