OomLlama

Efficient LLM inference engine with .oom format - compact, fast, from-scratch Rust

What is OomLlama?

OomLlama is a from-scratch LLM inference engine written in Rust. It includes:

  • Custom binary model format (.oom) with Q2/Q4/Q8 quantization
  • Full transformer inference - attention, RoPE, RMSNorm, SwiGLU, all in pure Rust
  • Two converters: SafeTensors → OOM (recommended) and GGUF → OOM
  • GPU acceleration via CUDA/Candle with KV-cache (turbo mode)
  • Python bindings for easy integration

Quick Start

from oomllama import OomLlama

llm = OomLlama("/path/to/model.oom")
response = llm.generate("What is the meaning of life?")
print(response)

Installation

pip install oomllama

Or build from source (Rust):

cargo build --release
# Binaries: oomllama, safetensors2oom, gguf2oom

Converting Models

SafeTensors → OOM (Recommended)

Convert any HuggingFace model directly from SafeTensors format. This is the recommended path because it performs a single bf16→Q8 quantization step, preserving maximum accuracy.

Python converter:

python safetensors2oom.py /path/to/model/ output.oom

Rust converter (faster):

safetensors2oom /path/to/model/ output.oom         # Default: Q8
safetensors2oom /path/to/model/ output.oom --q4     # Q4 (smaller)
safetensors2oom /path/to/model/ output.oom --q2     # Q2 (smallest)

Supported source models: Qwen2.5, LLaMA, Phi, Mistral - any model using SafeTensors format.

GGUF → OOM

gguf2oom model.gguf model.oom
gguf2oom --info model.gguf          # Show GGUF metadata

Note: The GGUF path applies a second quantization on top of GGUF's existing quantization (e.g., Q3_K → Q8), which can compound errors through deep networks. The SafeTensors path is preferred for best quality.
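
To make the double-quantization concern concrete, the sketch below compares a single 8-bit pass against a coarse pass followed by an 8-bit pass on one block of synthetic weights. It uses a plain per-block scale/min quantizer as a stand-in: the real Q3_K scheme in GGUF is more elaborate, and the function names here are hypothetical, not part of OomLlama.

```python
import random

def quantize_dequantize(values, bits):
    """Affine-quantize values to `bits` bits with one scale/min, then reconstruct."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]

def rms_error(a, b):
    """Root-mean-square difference between two equally long lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # one block of synthetic weights

single = quantize_dequantize(weights, 8)                          # stand-in for bf16 -> Q8
double = quantize_dequantize(quantize_dequantize(weights, 3), 8)  # stand-in for coarse -> Q8

err_single = rms_error(weights, single)
err_double = rms_error(weights, double)
print(f"single pass RMS error: {err_single:.6f}")
print(f"double pass RMS error: {err_double:.6f}")
```

In this toy setup most of the loss comes from the first coarse pass, which a later Q8 pass cannot undo. Starting from the bf16 weights avoids that.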

The .oom Binary Format

+--------------------------------------------------+
| Magic: "OOML" (4 bytes)                          |
| Version: u32 (currently 1)                       |
| Num Tensors: u32                                 |
+--------------------------------------------------+
| For each tensor:                                 |
|   Name Length: u32                               |
|   Name: UTF-8 bytes                              |
|   Quant Type: u8 (0=F32, 1=Q8, 2=Q4, 3=Q2)       |
|   Num Blocks: u32                                |
|   Total Values: u32                              |
|   For each block of 256 values:                  |
|     Scale: f32                                   |
|     Min: f32                                     |
|     Data Length: u32                             |
|     Quantized bytes                              |
+--------------------------------------------------+
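
The layout above can be read with Python's struct module. The following is a minimal sketch, not the project's own reader: it assumes little-endian fields and that Data Length counts bytes, neither of which the diagram states, and `read_oom` is a hypothetical helper name.

```python
import io
import struct

MAGIC = b"OOML"
QUANT_NAMES = {0: "F32", 1: "Q8", 2: "Q4", 3: "Q2"}

def read_u32(f):
    return struct.unpack("<I", f.read(4))[0]  # little-endian is an assumption

def read_oom(f):
    """Parse the layout from the diagram into a version and a list of tensor dicts."""
    if f.read(4) != MAGIC:
        raise ValueError("not an .oom file")
    version = read_u32(f)
    tensors = []
    for _ in range(read_u32(f)):
        name = f.read(read_u32(f)).decode("utf-8")
        quant_type = f.read(1)[0]
        num_blocks = read_u32(f)
        total_values = read_u32(f)
        blocks = []
        for _ in range(num_blocks):
            scale, minimum = struct.unpack("<ff", f.read(8))
            data = f.read(read_u32(f))  # assumed to be a byte count
            blocks.append((scale, minimum, data))
        tensors.append({"name": name, "quant": QUANT_NAMES[quant_type],
                        "total_values": total_values, "blocks": blocks})
    return version, tensors

# Build a tiny in-memory file: one Q8 tensor holding a single 4-value block.
buf = io.BytesIO()
name = b"demo.weight"
buf.write(MAGIC + struct.pack("<II", 1, 1))           # version, tensor count
buf.write(struct.pack("<I", len(name)) + name)
buf.write(struct.pack("<BII", 1, 1, 4))               # quant type, blocks, values
buf.write(struct.pack("<ffI", 0.5, -1.0, 4) + bytes([0, 1, 2, 3]))
buf.seek(0)

version, tensors = read_oom(buf)
print(version, tensors[0]["name"], tensors[0]["quant"], tensors[0]["blocks"])
```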

Quantization levels:

Level   Bits/Weight   Block Size   Quality    Size (7B)
Q8      8 bits        256          Best       ~8 GB
Q4      4 bits        256          Good       ~4 GB
Q2      2 bits        256          Usable     ~2.5 GB
F32     32 bits       N/A          Lossless   ~28 GB
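
As a rough cross-check of the sizes in the table, the sketch below adds the per-block overhead implied by the format diagram (scale, min, and data length, 12 bytes per 256 values) to the raw payload. It assumes a nominal 7 billion parameters all stored in quantized blocks and ignores tensor names and tensors kept at F32, so it will not match the table exactly. `estimated_size_gb` is a hypothetical helper.

```python
BLOCK_SIZE = 256            # values per block, from the table
BLOCK_OVERHEAD_BYTES = 12   # scale (f32) + min (f32) + data length (u32)

def estimated_size_gb(num_params, bits_per_weight):
    """Rough on-disk size if every weight were stored in quantized blocks."""
    num_blocks = -(-num_params // BLOCK_SIZE)  # ceiling division
    payload_bytes = num_params * bits_per_weight / 8
    return (payload_bytes + num_blocks * BLOCK_OVERHEAD_BYTES) / 1e9

for level, bits in (("Q8", 8), ("Q4", 4), ("Q2", 2)):
    print(level, round(estimated_size_gb(7_000_000_000, bits), 2), "GB")
```

The fixed 12 bytes per block weigh more at lower bit widths, which is one reason Q2 is not simply a quarter of the Q8 size.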

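Dequantization: value = q * scale + min, where q is the stored integer for that weight (a full byte for Q8, a 4-bit or 2-bit field for Q4 and Q2) and scale and min come from the weight's block.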

Norms and biases are always stored as F32 for numerical stability.
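
The dequantization formula above can be exercised with a small round trip. This is a sketch under assumptions: the bit-packing order (lowest bits first) and the way scale and min are chosen are guesses for illustration, and the function names are hypothetical rather than taken from quant.rs.

```python
def quantize_block(values, bits):
    """Map floats to integers in [0, 2**bits - 1] with one scale/min per block."""
    levels = (1 << bits) - 1
    minimum = min(values)
    span = max(values) - minimum
    scale = span / levels if span > 0 else 1.0
    q = [min(levels, max(0, round((v - minimum) / scale))) for v in values]
    return scale, minimum, q

def pack(q, bits):
    """Pack small integers into bytes, lowest bits first (assumed order)."""
    per_byte = 8 // bits
    out = bytearray()
    for i in range(0, len(q), per_byte):
        byte = 0
        for j, v in enumerate(q[i:i + per_byte]):
            byte |= v << (j * bits)
        out.append(byte)
    return bytes(out)

def dequantize_block(data, bits, scale, minimum, count):
    """Apply value = q * scale + min to each unpacked integer."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    q = [(data[i // per_byte] >> ((i % per_byte) * bits)) & mask for i in range(count)]
    return [v * scale + minimum for v in q]

block = [i / 255 - 0.5 for i in range(256)]  # one block of 256 evenly spaced values
for bits in (8, 4, 2):
    scale, minimum, q = quantize_block(block, bits)
    packed = pack(q, bits)
    restored = dequantize_block(packed, bits, scale, minimum, len(block))
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"Q{bits}: {len(packed)} bytes, max error {worst:.4f}")
```

With this scheme the worst-case error per weight is half a quantization step, so it grows as the bit width shrinks.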

Inference Engine

Architecture

The inference engine implements a complete transformer decoder. The dimensions below are those of Qwen2.5-7B:

  1. Token Embedding - Vocabulary lookup (152K tokens for Qwen)
  2. 28 Decoder Layers, each with:
    • RMSNorm (pre-attention + pre-FFN)
    • Grouped Query Attention (28 Q-heads, 4 KV-heads, head_dim=128)
    • SwiGLU Feed-Forward Network (hidden=3584 → intermediate=18944)
    • Rotary Position Embeddings (RoPE, θ=1,000,000)
  3. Final RMSNorm → LM Head → logits → token selection
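
The building blocks named in the list can be sketched in a few lines of plain Python. This illustrates the standard definitions, not OomLlama's Rust code: the epsilon value, the toy dimensions, and the rule that consecutive query heads share a KV head are assumptions, and all names are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def kv_head_for(q_head, num_q_heads=28, num_kv_heads=4):
    """Grouped query attention: consecutive query heads share one KV head (assumed)."""
    return q_head // (num_q_heads // num_kv_heads)

x = [1.0, -2.0, 3.0, -4.0]
normed = rms_norm(x, [1.0] * 4)
# Toy sizes (hidden=4, intermediate=2) stand in for 3584 and 18944.
w_gate = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
w_up = [[0.5, -0.5, 0.5, -0.5], [0.25, 0.25, 0.25, 0.25]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
ffn_out = swiglu_ffn(normed, w_gate, w_up, w_down)
print(ffn_out)
print([kv_head_for(h) for h in range(28)])
```

With 28 query heads and 4 KV heads, each KV head serves 7 query heads, so the stored keys and values are one seventh the size they would be with one KV head per query head.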

Lazy Layer Loading

Only one decoder layer's weights are in memory at a time:

Forward Pass:
  Embed tokens
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer 27: Load → Compute → Unload
  LM Head → next token

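Because only about one layer's weights are resident at any moment, peak weight memory is set by the largest layer rather than by the whole model. This is what allows 7B models to run with little RAM and 70B models to run on a 24 GB GPU.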
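
The load, compute, unload cycle can be sketched as a loop that never holds more than one layer. This is an illustration only: `LazyLayerRunner` and `fake_load_layer` are hypothetical names, and the per-layer "compute" is a toy affine update standing in for attention and the feed-forward network.

```python
class LazyLayerRunner:
    """Keeps at most one layer's weights resident while walking the stack."""

    def __init__(self, num_layers, load_layer):
        self.num_layers = num_layers
        self.load_layer = load_layer   # hypothetical callback: layer index -> weights
        self.resident = 0
        self.peak_resident = 0

    def forward(self, hidden):
        for index in range(self.num_layers):
            weights = self.load_layer(index)                       # Load
            self.resident += 1
            self.peak_resident = max(self.peak_resident, self.resident)
            hidden = [h * weights["gain"] + weights["bias"] for h in hidden]  # Compute (toy)
            del weights                                            # Unload
            self.resident -= 1
        return hidden

def fake_load_layer(index):
    # Stand-in for reading one layer's tensors from an .oom file.
    return {"gain": 1.0, "bias": float(index)}

runner = LazyLayerRunner(num_layers=28, load_layer=fake_load_layer)
out = runner.forward([0.0, 1.0])
print(out, runner.peak_resident)
```

The trade-off is that every forward pass pays the cost of reloading each layer, so the approach exchanges speed for a smaller memory footprint.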

GPU Turbo Mode

When a CUDA GPU is available, OomLlama uses:

  • KV-Cache: Keys and values are cached per layer across decoding steps, so each new token reuses earlier results instead of recomputing them
  • Candle CUDA kernels: Matrix multiplication on GPU
  • Flash-style attention: Efficient attention computation
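
The idea behind the KV-cache can be shown with a single attention head on the CPU. This is a minimal sketch, not the turbo-mode implementation: `KVCache` and `attend` are hypothetical names, and the queries, keys, and values are hand-picked toy vectors.

```python
import math

class KVCache:
    """Per-layer lists of keys and values that grow by one entry per generated token."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        self.keys[layer].append(key)
        self.values[layer].append(value)

def attend(query, keys, values):
    """Single-head scaled dot-product attention over everything cached so far."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

cache = KVCache(num_layers=1)
steps = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),   # (query, key, value) for token 0
         ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])]   # token 1
for query, key, value in steps:
    cache.append(0, key, value)       # only the new token's key/value are computed
    out = attend(query, cache.keys[0], cache.values[0])
    print(len(cache.keys[0]), out)
```

Without the cache, every step would recompute keys and values for all earlier tokens. With it, each step adds one entry per layer and attends over the stored ones.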

RoPE Variants

OomLlama supports both RoPE styles:

  • LLaMA-style: Split at half_dim [x0:half, x1:half]
  • Qwen-style (interleaved): Even/odd pairs [x0, x1, x0, x1, ...]

Auto-detected based on model architecture.
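
The two layouts apply the same rotation but pair different dimensions, as the sketch below shows. It follows the usual RoPE definition with θ=1,000,000 as in the architecture list. The function names are hypothetical and this is not the engine's own code.

```python
import math

def rope_angles(position, head_dim, theta=1_000_000.0):
    """One rotation angle per pair of dimensions."""
    half = head_dim // 2
    return [position / (theta ** (2 * i / head_dim)) for i in range(half)]

def rope_split_half(x, position, theta=1_000_000.0):
    """Pair dimension i with dimension i + half_dim."""
    half = len(x) // 2
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out

def rope_interleaved(x, position, theta=1_000_000.0):
    """Pair dimension 2i with dimension 2i + 1."""
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), theta)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0, 2.0, 3.0, 4.0]
split = rope_split_half(x, position=3)
inter = rope_interleaved(x, position=3)
print(split)
print(inter)
```

Both variants preserve the vector's length, but they produce different outputs for the same input. Applying the wrong pairing to a model's weights therefore yields incorrect attention scores, which is why the style has to match the model.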

Verified Models

Model                 Source               Quantization   Output Quality
Qwen2.5-7B-Instruct   SafeTensors (bf16)   Q8             Correct
Qwen2.5-7B-Instruct   GGUF (Q3_K)          Q8             Degraded*

* GGUF path applies double quantization. Use SafeTensors source for best results.

Project Structure

src/
  oomllama.rs          # Core inference engine (CPU)
  oomllama_turbo.rs    # GPU inference with KV-cache
  quant.rs             # Q2/Q4/Q8/F32 dequantization
  gguf2oom.rs          # GGUF→OOM converter + OomWriter
  safetensors2oom.rs   # SafeTensors→OOM converter (Rust)
  lib.rs               # Library exports
  bin/
    oomllama.rs        # CLI inference binary
    gguf2oom.rs        # CLI GGUF converter
    safetensors2oom.rs # CLI SafeTensors converter
safetensors2oom.py     # Python SafeTensors converter
python/
  oomllama/__init__.py # Python bindings

Credits

  • Engine + Format: Root AI & Jasper (Humotica AI Lab)
  • Quantization Research: Gemini IDD & Root AI
  • Interleaved RoPE Fix: Root AI & Jasper
  • Base Models: Alibaba (Qwen), Meta (LLaMA)

License

  • OomLlama Code: MIT License
  • Model Weights: Subject to original model licenses


One Love, One fAmIly

Built by Humotica AI Lab - Jasper, Claude, Gemini
