Efficient LLM inference engine with .oom format - from-scratch Rust, SafeTensors/GGUF converters

OomLlama

Efficient LLM inference engine with .oom format - compact, fast, from-scratch Rust

What is OomLlama?

OomLlama is a from-scratch LLM inference engine written in Rust. It includes:

  • Custom binary model format (.oom) with Q2/Q4/Q8 quantization
  • Full transformer inference - attention, RoPE, RMSNorm, SwiGLU, all in pure Rust
  • Two converters: SafeTensors → OOM (recommended) and GGUF → OOM
  • GPU acceleration via CUDA/Candle with KV-cache (turbo mode)
  • Python bindings for easy integration

Quick Start

from oomllama import OomLlama

llm = OomLlama("/path/to/model.oom")
response = llm.generate("What is the meaning of life?")
print(response)

Installation

pip install oomllama

Or build from source (Rust):

cargo build --release
# Binaries: oomllama, safetensors2oom, gguf2oom

Converting Models

SafeTensors → OOM (Recommended)

Convert any HuggingFace model directly from SafeTensors format. This is the recommended path because it performs a single bf16→Q8 quantization step, preserving maximum accuracy.

Python converter:

python safetensors2oom.py /path/to/model/ output.oom

Rust converter (faster):

safetensors2oom /path/to/model/ output.oom         # Default: Q8
safetensors2oom /path/to/model/ output.oom --q4     # Q4 (smaller)
safetensors2oom /path/to/model/ output.oom --q2     # Q2 (smallest)

Supported source models: Qwen2.5, LLaMA, Phi, Mistral - any model using SafeTensors format.

GGUF → OOM

gguf2oom model.gguf model.oom
gguf2oom --info model.gguf          # Show GGUF metadata

Note: The GGUF path applies a second quantization on top of GGUF's existing quantization (e.g., Q3_K → Q8), which can compound errors through deep networks. The SafeTensors path is preferred for best quality.
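The effect described in the note can be illustrated with a toy experiment. This is a minimal sketch, not the converter's code: it uses a simple per-block min/scale quantizer, and the 3-bit first stage is only a stand-in for Q3_K (GGUF's actual K-quant scheme is more elaborate). The function names are hypothetical.

```python
import random

def quantize_roundtrip(values, bits):
    """Affine-quantize one block to `bits` bits, then dequantize it again."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]

def rms_error(a, b):
    """Root-mean-square difference between two equally long lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # one block of fake weights

direct = quantize_roundtrip(weights, 8)   # single step: original -> 8-bit
coarse = quantize_roundtrip(weights, 3)   # stand-in for an already-quantized GGUF
double = quantize_roundtrip(coarse, 8)    # second step: coarse -> 8-bit

print(f"single-step 8-bit error: {rms_error(weights, direct):.6f}")
print(f"3-bit then 8-bit error:  {rms_error(weights, double):.6f}")
```

The 8-bit step cannot recover information the coarse step already discarded, so the second figure stays close to the 3-bit error rather than the 8-bit one.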

The .oom Binary Format

+--------------------------------------------------+
| Magic: "OOML" (4 bytes)                          |
| Version: u32 (currently 1)                       |
| Num Tensors: u32                                 |
+--------------------------------------------------+
| For each tensor:                                 |
|   Name Length: u32                               |
|   Name: UTF-8 bytes                              |
|   Quant Type: u8 (0=F32, 1=Q8, 2=Q4, 3=Q2)       |
|   Num Blocks: u32                                |
|   Total Values: u32                              |
|   For each block of 256 values:                  |
|     Scale: f32                                   |
|     Min: f32                                     |
|     Data Length: u32                             |
|     Quantized bytes                              |
+--------------------------------------------------+
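The layout above maps directly onto Python's struct module. The following is a sketch under stated assumptions, not the project's reader: it assumes little-endian integers and floats with no padding between fields, and it assumes every tensor (including F32) uses the same block framing. The byte order is not specified above. The names write_oom and read_oom_index are hypothetical.

```python
import io
import struct

MAGIC = b"OOML"
QUANT_NAMES = {0: "F32", 1: "Q8", 2: "Q4", 3: "Q2"}

def write_oom(stream, tensors):
    """Write tensors given as (name, quant_type, total_values, blocks),
    where blocks is a list of (scale, minimum, payload_bytes)."""
    stream.write(MAGIC)
    stream.write(struct.pack("<II", 1, len(tensors)))  # version, tensor count
    for name, quant_type, total_values, blocks in tensors:
        encoded = name.encode("utf-8")
        stream.write(struct.pack("<I", len(encoded)))
        stream.write(encoded)
        stream.write(struct.pack("<BII", quant_type, len(blocks), total_values))
        for scale, minimum, payload in blocks:
            stream.write(struct.pack("<ffI", scale, minimum, len(payload)))
            stream.write(payload)

def read_oom_index(stream):
    """Return (version, [(name, quant, num_blocks, total_values), ...]),
    skipping over the quantized payloads."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an .oom file")
    version, num_tensors = struct.unpack("<II", stream.read(8))
    index = []
    for _ in range(num_tensors):
        (name_len,) = struct.unpack("<I", stream.read(4))
        name = stream.read(name_len).decode("utf-8")
        quant_type, num_blocks, total_values = struct.unpack("<BII", stream.read(9))
        for _ in range(num_blocks):
            _scale, _minimum, data_len = struct.unpack("<ffI", stream.read(12))
            stream.seek(data_len, io.SEEK_CUR)  # skip the block payload
        index.append((name, QUANT_NAMES[quant_type], num_blocks, total_values))
    return version, index

# Round-trip a single tiny Q8 tensor through an in-memory buffer.
buf = io.BytesIO()
write_oom(buf, [("demo.weight", 1, 4, [(0.5, -1.0, bytes([0, 1, 2, 3]))])])
buf.seek(0)
print(read_oom_index(buf))
```

Reading only the index this way is cheap because each block carries its own data length, so payloads can be skipped without decoding them.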

Quantization levels:

Level  Bits/Weight  Block Size  Quality   Size (7B)
Q8     8 bits       256         Best      ~8 GB
Q4     4 bits       256         Good      ~4 GB
Q2     2 bits       256         Usable    ~2.5 GB
F32    32 bits      N/A         Lossless  ~28 GB
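The sizes in the table can be sanity-checked from the block layout. This is a rough estimate, assuming each 256-value block stores a tightly packed payload plus 12 bytes of header (scale, min, data length) and taking 7e9 as a round parameter count. It ignores tensor names and any tensors kept in F32, so real files will be somewhat larger.

```python
BLOCK_SIZE = 256
BLOCK_OVERHEAD_BYTES = 4 + 4 + 4  # scale (f32) + min (f32) + data length (u32)

def bytes_per_weight(bits):
    """Average storage per weight, including the per-block header."""
    payload = BLOCK_SIZE * bits / 8
    return (payload + BLOCK_OVERHEAD_BYTES) / BLOCK_SIZE

def estimate_gb(num_params, bits):
    """Approximate size in GB (10^9 bytes) for a model of num_params weights."""
    return num_params * bytes_per_weight(bits) / 1e9

for label, bits in [("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: {bytes_per_weight(bits):.3f} bytes/weight, "
          f"~{estimate_gb(7e9, bits):.1f} GB for 7e9 parameters")
```

The header overhead is fixed per block, so it matters most at Q2, where it adds 12 bytes to a 64-byte payload.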

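Dequantization: value = q * scale + min, where q is the unsigned quantized integer for that weight (a full byte for Q8, a 4-bit or 2-bit field for Q4 and Q2) and scale and min are taken from the weight's block.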

Norms and biases are always stored as F32 for numerical stability.
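To make the dequantization formula concrete, here is a minimal sketch of quantizing and dequantizing one block. The bit-packing order (lowest bits hold the first value) is an assumption for illustration. The actual .oom packing order is not stated above, and the function names are hypothetical.

```python
def quantize_block(values, bits):
    """Return (scale, minimum, payload) for one block of floats."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    qs = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    per_byte = 8 // bits
    payload = bytearray()
    for i in range(0, len(qs), per_byte):
        byte = 0
        for slot, q in enumerate(qs[i:i + per_byte]):
            byte |= q << (slot * bits)  # assumed order: first value in the low bits
        payload.append(byte)
    return scale, lo, bytes(payload)

def unpack(payload, bits, count):
    """Unpack `count` unsigned integers of `bits` bits each."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    out = []
    for byte in payload:
        for slot in range(per_byte):
            out.append((byte >> (slot * bits)) & mask)
    return out[:count]  # drop padding slots in the final byte

def dequantize_block(payload, bits, scale, minimum, count):
    """Apply value = q * scale + min to every unpacked integer."""
    return [q * scale + minimum for q in unpack(payload, bits, count)]

values = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]
for bits in (8, 4, 2):
    scale, minimum, payload = quantize_block(values, bits)
    restored = dequantize_block(payload, bits, scale, minimum, len(values))
    worst = max(abs(a - b) for a, b in zip(values, restored))
    print(f"{bits}-bit: {len(payload)} bytes, max error {worst:.4f}")
```

With round-to-nearest, the error per weight is bounded by half the block's scale, which is why narrower bit widths (larger scale) lose more precision.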

Inference Engine

Architecture

The inference engine implements a complete transformer decoder. The dimensions below are those of Qwen2.5-7B:

  1. Token Embedding - Vocabulary lookup (152K tokens for Qwen)
  2. 28 Decoder Layers, each with:
    • RMSNorm (pre-attention + pre-FFN)
    • Grouped Query Attention (28 Q-heads, 4 KV-heads, head_dim=128)
    • SwiGLU Feed-Forward Network (hidden=3584 → intermediate=18944)
    • Rotary Position Embeddings (RoPE, θ=1,000,000)
  3. Final RMSNorm → LM Head → logits → token selection
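The building blocks named above are small enough to sketch in plain Python. This is illustrative only, not the engine's Rust code. The eps value, the weight layout (one row per output feature), and all function names are assumptions. The head counts are the ones listed above.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def kv_head_for(q_head, num_q_heads=28, num_kv_heads=4):
    """Grouped Query Attention: which KV head a given query head reads from."""
    return q_head // (num_q_heads // num_kv_heads)

x = rms_norm([3.0, 4.0], [1.0, 1.0])
print(x)
print(swiglu_ffn(x, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]))
print([kv_head_for(h) for h in (0, 6, 7, 27)])
```

With 28 query heads and 4 KV heads, each KV head is shared by 7 query heads, which is what shrinks the key/value tensors relative to full multi-head attention.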

Lazy Layer Loading

Only one decoder layer's weights are in memory at a time:

Forward Pass:
  Embed tokens
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer 27: Load → Compute → Unload
  LM Head → next token

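Because only one layer's weights are resident at a time, peak memory for decoder weights is roughly one layer's worth rather than the whole model. This enables running 7B models with very little RAM and 70B models on a 24 GB GPU.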
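The load, compute, unload cycle can be sketched as a simple loop. This is a toy model of the control flow, not the engine's implementation: load_layer and run_layer are hypothetical callbacks, and the resident counter is bookkeeping added here only to show that at most one layer is held at once.

```python
class LazyLayerRunner:
    """Runs decoder layers one at a time, holding a single layer's weights."""

    def __init__(self, num_layers, load_layer, run_layer):
        self.num_layers = num_layers
        self.load_layer = load_layer  # hypothetical: layer index -> weights
        self.run_layer = run_layer    # hypothetical: (weights, hidden) -> hidden
        self.peak_resident = 0        # most layers held at the same time

    def forward(self, hidden):
        resident = 0
        for index in range(self.num_layers):
            weights = self.load_layer(index)           # Load
            resident += 1
            self.peak_resident = max(self.peak_resident, resident)
            hidden = self.run_layer(weights, hidden)   # Compute
            del weights                                # Unload (drop the reference)
            resident -= 1
        return hidden

# Toy layers: layer i adds (i + 1) to every element of the hidden state.
runner = LazyLayerRunner(
    num_layers=28,
    load_layer=lambda i: [float(i + 1)],
    run_layer=lambda w, h: [v + w[0] for v in h],
)
print(runner.forward([0.0, 1.0]), runner.peak_resident)
```

The trade-off in a design like this is that every forward pass re-reads each layer from storage, so throughput depends on how quickly a layer can be loaded.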

GPU Turbo Mode

When a CUDA GPU is available, OomLlama uses:

  • KV-Cache: Keys and values from earlier tokens are cached per layer and reused at every autoregressive generation step
  • Candle CUDA kernels: Matrix multiplication on GPU
  • Flash-style attention: Efficient attention computation
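As a rough illustration of what a KV-cache stores and what it costs, here is a minimal sketch. The KVCache class is hypothetical and keeps plain Python lists. The memory estimate uses the Qwen2.5-7B dimensions listed in the Architecture section and assumes 2 bytes per cached value (fp16 or bf16), which is not confirmed above.

```python
class KVCache:
    """Per-layer lists of key and value vectors, one entry per token seen so far."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        """Store the new token's key/value and return the full history for the layer."""
        self.keys[layer].append(key)
        self.values[layer].append(value)
        return self.keys[layer], self.values[layer]

    def seq_len(self, layer=0):
        return len(self.keys[layer])

def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4, head_dim=128,
                   bytes_per_value=2):
    """Keys plus values, for every layer, KV head, and cached position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

cache = KVCache(num_layers=2)
for step in range(3):
    # Each step only computes K/V for the newest token; older ones are reused.
    keys, values = cache.append(0, [float(step)], [float(step) * 2])
print(cache.seq_len(), keys)
print(kv_cache_bytes(1), "bytes per token;",
      kv_cache_bytes(4096) / 2**20, "MiB at 4096 tokens")
```

Because only 4 KV heads are cached per layer instead of 28, the cache under these assumptions is one seventh of what full multi-head attention would need.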

RoPE Variants

OomLlama supports both RoPE styles:

  • LLaMA-style: Split at half_dim [x0:half, x1:half]
  • Qwen-style (interleaved): Even/odd pairs [x0, x1, x0, x1, ...]

Auto-detected based on model architecture.
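The two layouts differ only in which elements of a head vector are rotated together. The sketch below shows both pairings for a single vector. It uses the standard RoPE frequency schedule with the θ value from the Architecture section, and the function names are hypothetical.

```python
import math

def rope_angles(position, head_dim, theta=1_000_000.0):
    """One rotation angle per pair of dimensions: position * theta^(-2i/d)."""
    half = head_dim // 2
    return [position * theta ** (-2.0 * i / head_dim) for i in range(half)]

def rope_split_half(x, position, theta=1_000_000.0):
    """Pair element i with element i + half_dim."""
    half = len(x) // 2
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def rope_interleaved(x, position, theta=1_000_000.0):
    """Pair element 2i with element 2i + 1."""
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

x = [1.0, 0.0, 0.0, 0.0]
print(rope_split_half(x, position=1))   # rotation lands in elements 0 and 2
print(rope_interleaved(x, position=1))  # rotation lands in elements 0 and 1
```

Both variants apply the same rotations and preserve vector norms, but they are not interchangeable: weights trained with one pairing produce wrong attention scores under the other, which is why the layout has to match the source model.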

Verified Models

Model                Source              Quantization  Output Quality
Qwen2.5-7B-Instruct  SafeTensors (bf16)  Q8            Correct
Qwen2.5-7B-Instruct  GGUF (Q3_K)         Q8            Degraded*

* GGUF path applies double quantization. Use SafeTensors source for best results.

Project Structure

src/
  oomllama.rs          # Core inference engine (CPU)
  oomllama_turbo.rs    # GPU inference with KV-cache
  quant.rs             # Q2/Q4/Q8/F32 dequantization
  gguf2oom.rs          # GGUF→OOM converter + OomWriter
  safetensors2oom.rs   # SafeTensors→OOM converter (Rust)
  lib.rs               # Library exports
  bin/
    oomllama.rs        # CLI inference binary
    gguf2oom.rs        # CLI GGUF converter
    safetensors2oom.rs # CLI SafeTensors converter
safetensors2oom.py     # Python SafeTensors converter
python/
  oomllama/__init__.py # Python bindings

Credits

  • Engine + Format: Root AI & Jasper (Humotica AI Lab)
  • Quantization Research: Gemini IDD & Root AI
  • Interleaved RoPE Fix: Root AI & Jasper
  • Base Models: Alibaba (Qwen), Meta (LLaMA)

License

  • OomLlama Code: MIT License
  • Model Weights: Subject to original model licenses


One Love, One fAmIly

Built by Humotica AI Lab - Jasper, Claude, Gemini
