Efficient LLM inference engine with .oom format - from-scratch Rust, SafeTensors/GGUF converters
OomLlama
Efficient LLM inference engine with .oom format - compact, fast, from-scratch Rust
What is OomLlama?
OomLlama is a from-scratch LLM inference engine written in Rust. It includes:
- Custom binary model format (.oom) with Q2/Q4/Q8 quantization
- Full transformer inference: attention, RoPE, RMSNorm, SwiGLU, all in pure Rust
- Two converters: SafeTensors → OOM (recommended) and GGUF → OOM
- GPU acceleration via CUDA/Candle with KV-cache (turbo mode)
- Python bindings for easy integration
Quick Start
from oomllama import OomLlama
llm = OomLlama("/path/to/model.oom")
response = llm.generate("What is the meaning of life?")
print(response)
Installation
pip install oomllama
Or build from source (Rust):
cargo build --release
# Binaries: oomllama, safetensors2oom, gguf2oom
Converting Models
SafeTensors → OOM (Recommended)
Convert any HuggingFace model directly from SafeTensors format. This is the recommended path because it performs a single bf16→Q8 quantization step, preserving maximum accuracy.
Python converter:
python safetensors2oom.py /path/to/model/ output.oom
Rust converter (faster):
safetensors2oom /path/to/model/ output.oom # Default: Q8
safetensors2oom /path/to/model/ output.oom --q4 # Q4 (smaller)
safetensors2oom /path/to/model/ output.oom --q2 # Q2 (smallest)
Supported source models: Qwen2.5, LLaMA, Phi, Mistral - any model using SafeTensors format.
GGUF → OOM
gguf2oom model.gguf model.oom
gguf2oom --info model.gguf # Show GGUF metadata
Note: The GGUF path applies a second quantization on top of GGUF's existing quantization (e.g., Q3_K → Q8), which can compound errors through deep networks. The SafeTensors path is preferred for best quality.
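The effect of quantizing twice can be illustrated with a toy experiment. The sketch below uses a plain uniform min/scale quantizer as a stand-in; this is an assumption for illustration, not GGUF's Q3_K scheme and not necessarily the quantizer the converters use. It compares a single 8-bit pass against a 3-bit pass followed by an 8-bit pass on one block of random weights.

```python
import random

def quantize_roundtrip(values, bits):
    """Uniformly quantize to `bits` bits with a min/scale pair, then dequantize."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]

def max_abs_error(reference, approx):
    return max(abs(x - y) for x, y in zip(reference, approx))

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # one toy block of 256 weights

# Analogue of the SafeTensors path: one quantization step.
direct_q8 = quantize_roundtrip(weights, 8)
# Analogue of the GGUF path: a coarse quantization, then a second one on top.
via_q3_then_q8 = quantize_roundtrip(quantize_roundtrip(weights, 3), 8)

print("single 8-bit pass, max error:", max_abs_error(weights, direct_q8))
print("3-bit then 8-bit, max error: ", max_abs_error(weights, via_q3_then_q8))
```

The second quantization cannot recover information already lost in the first, so the error of the two-step path is dominated by the coarser step.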
The .oom Binary Format
+--------------------------------------------------+
| Magic: "OOML" (4 bytes) |
| Version: u32 (currently 1) |
| Num Tensors: u32 |
+--------------------------------------------------+
| For each tensor: |
| Name Length: u32 |
| Name: UTF-8 bytes |
| Quant Type: u8 (0=F32, 1=Q8, 2=Q4, 3=Q2) |
| Num Blocks: u32 |
| Total Values: u32 |
| For each block of 256 values: |
| Scale: f32 |
| Min: f32 |
| Data Length: u32 |
| Quantized bytes |
+--------------------------------------------------+
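As a rough illustration of the layout above, the following sketch walks an .oom stream with Python's `struct` module and lists its tensors without decoding any weights. It assumes little-endian byte order and no padding between fields, neither of which the diagram states. `read_oom_index` and `toy_file` are hypothetical helpers written for this example; they are not part of the `oomllama` package.

```python
import io
import struct

MAGIC = b"OOML"
QUANT_NAMES = {0: "F32", 1: "Q8", 2: "Q4", 3: "Q2"}

def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data

def read_oom_index(stream):
    """Return (version, tensor summaries); quantized payloads are skipped."""
    if read_exact(stream, 4) != MAGIC:
        raise ValueError("bad magic: not an .oom stream")
    version, num_tensors = struct.unpack("<II", read_exact(stream, 8))
    tensors = []
    for _ in range(num_tensors):
        (name_len,) = struct.unpack("<I", read_exact(stream, 4))
        name = read_exact(stream, name_len).decode("utf-8")
        quant, num_blocks, total_values = struct.unpack("<BII", read_exact(stream, 9))
        payload_bytes = 0
        for _ in range(num_blocks):
            _scale, _minimum, data_len = struct.unpack("<ffI", read_exact(stream, 12))
            read_exact(stream, data_len)  # skip the quantized bytes
            payload_bytes += data_len
        tensors.append({
            "name": name,
            "quant": QUANT_NAMES.get(quant, f"unknown({quant})"),
            "num_blocks": num_blocks,
            "total_values": total_values,
            "payload_bytes": payload_bytes,
        })
    return version, tensors

def toy_file():
    """Build a one-tensor, one-block stream in memory for demonstration."""
    name = b"toy.weight"
    block = struct.pack("<ffI", 0.5, -1.0, 4) + bytes([0, 1, 2, 3])
    tensor = struct.pack("<I", len(name)) + name + struct.pack("<BII", 1, 1, 4) + block
    return io.BytesIO(MAGIC + struct.pack("<II", 1, 1) + tensor)

version, tensors = read_oom_index(toy_file())
print(version, tensors)
```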
Quantization levels:
| Level | Bits/Weight | Block Size | Quality | Size (7B) |
|---|---|---|---|---|
| Q8 | 8 bits | 256 | Best | ~8 GB |
| Q4 | 4 bits | 256 | Good | ~4 GB |
| Q2 | 2 bits | 256 | Usable | ~2.5 GB |
| F32 | 32 bits | N/A | Lossless | ~28 GB |
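The sizes in the table can be sanity-checked from the block layout: every block of 256 values carries 12 bytes of metadata (scale, min, data length). The sketch below is a back-of-the-envelope estimate only. It assumes sub-byte values are tightly packed, that F32 tensors are stored as raw 4-byte values, and a nominal 7,000,000,000 parameters; per-tensor headers and tensors kept in F32 are ignored.

```python
import math

BLOCK_SIZE = 256
BLOCK_OVERHEAD_BYTES = 4 + 4 + 4  # scale (f32) + min (f32) + data length (u32)

def estimated_bytes(num_params, bits_per_weight):
    """Estimate payload size for a given parameter count and bit width."""
    if bits_per_weight == 32:
        return num_params * 4  # assumed: no per-block overhead for F32
    blocks = math.ceil(num_params / BLOCK_SIZE)
    data_bytes_per_block = BLOCK_SIZE * bits_per_weight // 8
    return blocks * (data_bytes_per_block + BLOCK_OVERHEAD_BYTES)

NOMINAL_7B = 7_000_000_000  # assumed parameter count for illustration

for name, bits in [("Q8", 8), ("Q4", 4), ("Q2", 2), ("F32", 32)]:
    print(f"{name}: {estimated_bytes(NOMINAL_7B, bits) / 1e9:.2f} GB")
```

Under these assumptions the per-block metadata adds about 0.375 bits per weight, which matters most at Q2, where it is a noticeable fraction of the total. Real file sizes also depend on the model's actual parameter count and on which tensors stay in F32.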
Dequantization: value = quantized_byte * scale + min
Norms and biases are always stored as F32 for numerical stability.
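The dequantization formula pairs naturally with a quantizer that picks `min` as the block minimum and `scale` as the block range divided by the number of levels. The sketch below shows that round trip for one block. How the converters actually choose `scale` and `min` is not stated here, so this choice is an assumption, and `quantize_block` / `dequantize_block` are illustrative names.

```python
def quantize_block(values, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] using a min/scale pair."""
    levels = (1 << bits) - 1
    minimum = min(values)
    span = max(values) - minimum
    scale = span / levels if span > 0 else 1.0
    codes = [round((v - minimum) / scale) for v in values]
    return scale, minimum, codes

def dequantize_block(scale, minimum, codes):
    """Apply value = code * scale + min to every code."""
    return [q * scale + minimum for q in codes]

block = [-0.5, -0.25, 0.0, 0.25, 0.5]
scale, minimum, codes = quantize_block(block, bits=8)
restored = dequantize_block(scale, minimum, codes)

print("codes:   ", codes)
print("restored:", restored)
```

With this choice the reconstruction error of each value is at most half of `scale`, so fewer bits (larger `scale`) means coarser weights.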
Inference Engine
Architecture
The inference engine implements a complete transformer decoder:
- Token Embedding - Vocabulary lookup (152K tokens for Qwen)
- 28 Decoder Layers, each with:
  - RMSNorm (pre-attention + pre-FFN)
  - Grouped Query Attention (28 Q-heads, 4 KV-heads, head_dim=128)
  - SwiGLU Feed-Forward Network (hidden=3584 → intermediate=18944)
  - Rotary Position Embeddings (RoPE, θ=1,000,000)
- Final RMSNorm → LM Head → logits → token selection
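The Grouped Query Attention numbers above imply that several query heads share each key/value head. The sketch below works out that sharing and the resulting projection widths. It assumes the common convention that consecutive query heads map to the same KV head; the engine's actual head ordering is not specified in this document.

```python
NUM_Q_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
HIDDEN = 3584

# Number of query heads that read from each key/value head.
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS

def kv_head_for(q_head):
    """Index of the KV head used by a given query head (assumed grouping)."""
    return q_head // GROUP_SIZE

q_width = NUM_Q_HEADS * HEAD_DIM    # output width of the query projection
kv_width = NUM_KV_HEADS * HEAD_DIM  # output width of each of the K and V projections

print("query heads per KV head:", GROUP_SIZE)
print("Q projection:", HIDDEN, "->", q_width)
print("K/V projection:", HIDDEN, "->", kv_width)
print("mapping:", [kv_head_for(h) for h in range(NUM_Q_HEADS)])
```

Because keys and values are only `kv_width` wide, the K and V projections and any cached keys/values are a fraction of the size they would be with one KV head per query head.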
Lazy Layer Loading
Only one decoder layer's weights are in memory at a time:
Forward Pass:
Embed tokens
Layer 0: Load → Compute → Unload
Layer 1: Load → Compute → Unload
...
Layer 27: Load → Compute → Unload
LM Head → next token
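Because only one layer's weights are resident at a time, peak weight memory is roughly the size of a single decoder layer rather than the whole model, at the cost of reloading each layer on every forward pass. This enables running 7B models on minimal RAM and 70B models on a 24GB GPU.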
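The load, compute, unload pattern can be sketched in a few lines. `ToyLayerStore` below is a hypothetical stand-in for the model file, and the "compute" step is a placeholder addition rather than real attention and feed-forward math; the point is only that at most one layer is resident at any moment.

```python
class ToyLayerStore:
    """Stands in for an .oom file: hands out one layer's weights on request."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.resident = 0        # layers currently held in memory
        self.peak_resident = 0   # highest value `resident` ever reached

    def load(self, index):
        self.resident += 1
        self.peak_resident = max(self.peak_resident, self.resident)
        return {"index": index, "bias": float(index)}  # toy "weights"

    def unload(self, layer):
        layer.clear()
        self.resident -= 1

def forward(store, hidden):
    for i in range(store.num_layers):
        layer = store.load(i)
        try:
            # Placeholder for attention + feed-forward on this layer.
            hidden = [h + layer["bias"] for h in hidden]
        finally:
            store.unload(layer)  # release before the next layer is loaded
    return hidden

store = ToyLayerStore(num_layers=28)
out = forward(store, [0.0, 1.0])
print("output:", out, "peak resident layers:", store.peak_resident)
```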
GPU Turbo Mode
When a CUDA GPU is available, OomLlama uses:
- KV-Cache: Key/value pairs cached per layer and reused across decoding steps during autoregressive generation
- Candle CUDA kernels: Matrix multiplication on GPU
- Flash-style attention: Efficient attention computation
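A KV-cache keeps each layer's keys and values for tokens already processed, so every decoding step only computes projections for the newest token. The sketch below shows a minimal per-layer cache and a memory estimate using the architecture figures given earlier (28 layers, 4 KV heads, head_dim 128). The `KVCache` class is illustrative and the 2 bytes per value (16-bit storage) is an assumption; neither reflects the actual Candle-based implementation.

```python
class KVCache:
    """Per-layer lists of key and value vectors, one entry per cached token."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        """Store the new token's key/value and return the full history for this layer."""
        self.keys[layer].append(key)
        self.values[layer].append(value)
        return self.keys[layer], self.values[layer]

    def seq_len(self):
        return len(self.keys[0])

def kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_value=2):
    # Factor 2 accounts for storing both a key and a value.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

cache = KVCache(num_layers=28)
for step in range(3):                  # three decoding steps
    for layer in range(28):
        key = [float(step)] * 4        # toy vectors; real ones are num_kv_heads * head_dim wide
        value = [float(step)] * 4
        keys, values = cache.append(layer, key, value)

print("cached tokens:", cache.seq_len())
print("bytes per cached token:", kv_bytes_per_token())
```

Under these assumptions the cache grows linearly with sequence length, at a fixed cost per generated token.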
RoPE Variants
OomLlama supports both RoPE styles:
- LLaMA-style: Split at half_dim [x0:half, x1:half]
- Qwen-style (interleaved): Even/odd pairs [x0, x1, x0, x1, ...]
Auto-detected based on model architecture.
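The two layouts differ only in which elements of a head vector are paired for rotation. The sketch below implements both on plain Python lists, using the standard RoPE frequency schedule with θ=1,000,000. The function names are illustrative, and this is not the engine's code; it is a minimal reference for what "split at half" and "even/odd pairs" mean.

```python
import math

def rope_angles(position, half_dim, theta=1_000_000.0):
    """Rotation angle for each of the half_dim pairs at a given position."""
    return [position * theta ** (-i / half_dim) for i in range(half_dim)]

def rope_split_half(x, position, theta=1_000_000.0):
    """Pair element i with element i + half (split layout)."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i, angle in enumerate(rope_angles(position, half, theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def rope_interleaved(x, position, theta=1_000_000.0):
    """Pair element 2i with element 2i + 1 (even/odd layout)."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i, angle in enumerate(rope_angles(position, half, theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

x = [1.0, 2.0, 3.0, 4.0]
print("split-half: ", rope_split_half(x, position=1))
print("interleaved:", rope_interleaved(x, position=1))
```

Both variants are pure rotations, so they preserve the vector's norm, but they produce different outputs for the same input. Applying the wrong variant to a model's weights therefore corrupts attention scores, which is why the layout has to match the model.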
Verified Models
| Model | Source | Quantization | Output Quality |
|---|---|---|---|
| Qwen2.5-7B-Instruct | SafeTensors (bf16) | Q8 | Correct |
| Qwen2.5-7B-Instruct | GGUF (Q3_K) | Q8 | Degraded* |
* GGUF path applies double quantization. Use SafeTensors source for best results.
Project Structure
src/
oomllama.rs # Core inference engine (CPU)
oomllama_turbo.rs # GPU inference with KV-cache
quant.rs # Q2/Q4/Q8/F32 dequantization
gguf2oom.rs # GGUF→OOM converter + OomWriter
safetensors2oom.rs # SafeTensors→OOM converter (Rust)
lib.rs # Library exports
bin/
oomllama.rs # CLI inference binary
gguf2oom.rs # CLI GGUF converter
safetensors2oom.rs # CLI SafeTensors converter
safetensors2oom.py # Python SafeTensors converter
python/
oomllama/__init__.py # Python bindings
Credits
- Engine + Format: Root AI & Jasper (Humotica AI Lab)
- Quantization Research: Gemini IDD & Root AI
- Interleaved RoPE Fix: Root AI & Jasper
- Base Models: Alibaba (Qwen), Meta (LLaMA)
License
- OomLlama Code: MIT License
- Model Weights: Subject to original model licenses
One Love, One fAmIly
Built by Humotica AI Lab - Jasper, Claude, Gemini