Efficient LLM inference engine with .oom format - from-scratch Rust, SafeTensors/GGUF converters

OomLlama

Efficient LLM inference engine with .oom format - compact, fast, from-scratch Rust

What is OomLlama?

OomLlama is a from-scratch LLM inference engine written in Rust. It includes:

Custom binary model format (.oom) with Q2/Q4/Q8 quantization
Full transformer inference - attention, RoPE, RMSNorm, SwiGLU, all in pure Rust
Two converters: SafeTensors → OOM (recommended) and GGUF → OOM
GPU acceleration via CUDA/Candle with KV-cache (turbo mode)
Python bindings for easy integration

Quick Start

from oomllama import OomLlama

llm = OomLlama("/path/to/model.oom")
response = llm.generate("What is the meaning of life?")
print(response)

Installation

pip install oomllama

Or build from source (Rust):

cargo build --release
# Binaries: oomllama, safetensors2oom, gguf2oom

Converting Models

SafeTensors → OOM (Recommended)

Convert a Hugging Face model directly from its SafeTensors checkpoint. This path is recommended because the weights are quantized only once (bf16→Q8 by default), which preserves the most accuracy.

Python converter:

python safetensors2oom.py /path/to/model/ output.oom

Rust converter (faster):

safetensors2oom /path/to/model/ output.oom         # Default: Q8
safetensors2oom /path/to/model/ output.oom --q4     # Q4 (smaller)
safetensors2oom /path/to/model/ output.oom --q2     # Q2 (smallest)

Supported source models: Qwen2.5, LLaMA, Phi, Mistral - any model using SafeTensors format.

GGUF → OOM

gguf2oom model.gguf model.oom
gguf2oom --info model.gguf          # Show GGUF metadata

Note: The GGUF path applies a second quantization on top of GGUF's existing quantization (e.g., Q3_K → Q8), which can compound errors through deep networks. The SafeTensors path is preferred for best quality.

The .oom Binary Format

+--------------------------------------------------+
| Magic: "OOML" (4 bytes)                          |
| Version: u32 (currently 1)                       |
| Num Tensors: u32                                 |
+--------------------------------------------------+
| For each tensor:                                 |
|   Name Length: u32                               |
|   Name: UTF-8 bytes                              |
|   Quant Type: u8 (0=F32, 1=Q8, 2=Q4, 3=Q2)       |
|   Num Blocks: u32                                |
|   Total Values: u32                              |
|   For each block of 256 values:                  |
|     Scale: f32                                   |
|     Min: f32                                     |
|     Data Length: u32                             |
|     Quantized bytes                              |
+--------------------------------------------------+
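
To make the layout concrete, here is a minimal Python sketch that walks the fields in the order shown and lists the tensors in a file. It is not the project's reader: the byte order is not stated in the layout, so little-endian is an assumption, and the helper names (`read_exact`, `read_oom_index`, `demo_bytes`) are made up for this example.

```python
import io
import struct

MAGIC = b"OOML"
QUANT_NAMES = {0: "F32", 1: "Q8", 2: "Q4", 3: "Q2"}


def read_exact(f, n):
    """Read exactly n bytes or raise if the file is truncated."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def read_oom_index(f):
    """Walk the layout and return (version, list of tensor summaries).

    Assumption: integers and floats are little-endian ("<" below).
    """
    if read_exact(f, 4) != MAGIC:
        raise ValueError("not an .oom file (bad magic)")
    version, num_tensors = struct.unpack("<II", read_exact(f, 8))
    tensors = []
    for _ in range(num_tensors):
        (name_len,) = struct.unpack("<I", read_exact(f, 4))
        name = read_exact(f, name_len).decode("utf-8")
        quant, num_blocks, total_values = struct.unpack("<BII", read_exact(f, 9))
        payload = 0
        for _ in range(num_blocks):
            _scale, _min, data_len = struct.unpack("<ffI", read_exact(f, 12))
            f.seek(data_len, io.SEEK_CUR)  # skip the quantized bytes
            payload += data_len
        tensors.append({
            "name": name,
            "quant": QUANT_NAMES.get(quant, f"unknown({quant})"),
            "blocks": num_blocks,
            "values": total_values,
            "payload_bytes": payload,
        })
    return version, tensors


def demo_bytes():
    """Build a one-tensor, one-block Q8 file in memory for the demo."""
    name = b"demo.weight"
    data = bytes([0, 128, 255])
    out = MAGIC + struct.pack("<II", 1, 1)
    out += struct.pack("<I", len(name)) + name
    out += struct.pack("<BII", 1, 1, len(data))
    out += struct.pack("<ffI", 0.5, -1.0, len(data)) + data
    return out


version, tensors = read_oom_index(io.BytesIO(demo_bytes()))
print(version, tensors)
```

Because every block carries its own data length, a reader can skip payloads without knowing how each quantization level packs its values.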

Quantization levels:

Level	Bits/Weight	Block Size	Quality	Size (7B)
Q8	8 bits	256	Best	~8 GB
Q4	4 bits	256	Good	~4 GB
Q2	2 bits	256	Usable	~2.5 GB
F32	32 bits	N/A	Lossless	~28 GB
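
The sizes in the table can be sanity-checked with a back-of-envelope estimate. The sketch below assumes quantized values are tightly packed (8, 4, or 2 bits each) and that every block of 256 values adds the 12 bytes of scale, min, and data length from the layout above. It ignores tensor names and F32 tensors, and `estimated_bytes` is a hypothetical helper, not part of OomLlama.

```python
BLOCK_VALUES = 256
BLOCK_HEADER_BYTES = 4 + 4 + 4  # scale (f32) + min (f32) + data length (u32)


def estimated_bytes(num_weights, bits_per_weight):
    """Estimate stored size, assuming quantized values are tightly packed."""
    num_blocks = -(-num_weights // BLOCK_VALUES)  # ceiling division
    payload = num_weights * bits_per_weight // 8
    return payload + num_blocks * BLOCK_HEADER_BYTES


for name, bits in [("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    size_gb = estimated_bytes(7_000_000_000, bits) / 1e9
    print(f"{name}: ~{size_gb:.2f} GB")
```

For exactly 7 billion weights this gives figures somewhat below the table. That is expected under these assumptions, since nominal "7B" models often have more than 7 billion parameters and some tensors stay in F32.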

Dequantization: value = quantized_byte * scale + min

Norms and biases are always stored as F32 for numerical stability.
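
As an illustration of the dequantization formula, the sketch below round-trips one Q8 block. Only `value = quantized_byte * scale + min` comes from the text above. The quantizer shown (a plain min/max affine mapping onto 0..255) is an assumption about how scale and min might be chosen, not a description of the actual converters.

```python
def quantize_block_q8(values):
    """Map floats to bytes 0..255 with a per-block scale and min (assumed scheme)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant blocks
    q = bytes(min(255, max(0, round((v - lo) / scale))) for v in values)
    return scale, lo, q


def dequantize_block(scale, lo, q):
    """Apply value = quantized_byte * scale + min to every byte."""
    return [b * scale + lo for b in q]


original = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, lo, q = quantize_block_q8(original)
restored = dequantize_block(scale, lo, q)
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(f"scale={scale:.6f} min={lo} max_error={max_error:.6f}")
```

With this scheme the round-trip error per value is bounded by half the scale, which is why a narrow value range within a block gives a more accurate reconstruction.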

Inference Engine

Architecture

The inference engine implements a complete transformer decoder. The dimensions below are those of Qwen2.5-7B:

Token Embedding - Vocabulary lookup (152K tokens for Qwen)
28 Decoder Layers, each with:
- RMSNorm (pre-attention + pre-FFN)
- Grouped Query Attention (28 Q-heads, 4 KV-heads, head_dim=128)
- SwiGLU Feed-Forward Network (hidden=3584 → intermediate=18944)
- Rotary Position Embeddings (RoPE, θ=1,000,000)
Final RMSNorm → LM Head → logits → token selection
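
The grouped query attention numbers above imply that several query heads share each key/value head. The sketch below shows that mapping, assuming the common convention of contiguous groups (query heads 0-6 use KV head 0, and so on). The grouping order is an assumption for illustration, and `kv_head_for` is a hypothetical helper.

```python
NUM_Q_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
HIDDEN = 3584


def kv_head_for(q_head):
    """Return the KV head shared by this query head (contiguous groups assumed)."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 7 query heads per KV head
    return q_head // group_size


# The query projection spans the full hidden size; K and V are 7x narrower.
q_proj_width = NUM_Q_HEADS * HEAD_DIM
kv_proj_width = NUM_KV_HEADS * HEAD_DIM
print(q_proj_width, kv_proj_width)
print([kv_head_for(h) for h in range(NUM_Q_HEADS)])
```

The narrower K and V projections are what shrink the KV-cache relative to standard multi-head attention.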

Lazy Layer Loading

Only one decoder layer's weights are in memory at a time:

Forward Pass:
  Embed tokens
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer 27: Load → Compute → Unload
  LM Head → next token

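This keeps peak memory for decoder weights close to the size of a single layer, which is what allows 7B models to run with very little RAM and 70B models to run on a 24 GB GPU.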
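
The load, compute, unload cycle can be sketched as a simple loop. This is a toy model of the idea rather than the engine's code: `LazyLayerRunner`, `load_layer`, and `apply_layer` are hypothetical names, and the "weights" here are plain integers so the example stays self-contained.

```python
class LazyLayerRunner:
    """Runs layers one at a time, holding at most one layer's weights."""

    def __init__(self, num_layers, load_layer):
        self.num_layers = num_layers
        self.load_layer = load_layer  # hypothetical callable: index -> weights
        self.resident = 0
        self.peak_resident = 0

    def forward(self, hidden, apply_layer):
        for i in range(self.num_layers):
            weights = self.load_layer(i)           # Load
            self.resident += 1
            self.peak_resident = max(self.peak_resident, self.resident)
            hidden = apply_layer(hidden, weights)  # Compute
            del weights                            # Unload
            self.resident -= 1
        return hidden


# Toy stand-ins: layer i has "weights" i + 1, and a layer adds them to the state.
runner = LazyLayerRunner(28, lambda i: i + 1)
result = runner.forward(0, lambda hidden, weights: hidden + weights)
print(result, runner.peak_resident)
```

The trade-off is that every forward pass reloads every layer, so this design exchanges throughput for a smaller memory footprint.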

GPU Turbo Mode

When a CUDA GPU is available, OomLlama uses:

KV-Cache: Keys and values for earlier tokens are cached in every layer, so each autoregressive generation step only processes the new token
Candle CUDA kernels: Matrix multiplication on GPU
Flash-style attention: Efficient attention computation
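
To show what a KV-cache stores, here is a minimal sketch in plain Python together with a per-token size estimate based on the architecture numbers above. The class and function names are hypothetical, and the 16-bit cache entries are an assumption. The real turbo mode keeps its cache in GPU tensors.

```python
class KVCache:
    """Per-layer lists of keys and values, one entry per processed position."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        """Store the new token's key/value and return the full history."""
        self.keys[layer].append(k)
        self.values[layer].append(v)
        return self.keys[layer], self.values[layer]

    def seq_len(self):
        return len(self.keys[0])


def kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_value=2):
    """Keys plus values for one token across all layers (16-bit values assumed)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


cache = KVCache(num_layers=2)
cache.append(0, [1.0], [2.0])
keys, values = cache.append(0, [3.0], [4.0])
print(keys, values, cache.seq_len(), kv_bytes_per_token())
```

Under these assumptions each cached token costs 57,344 bytes, so the cache grows linearly with context length.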

RoPE Variants

OomLlama supports both RoPE styles:

LLaMA-style: Split at half_dim [x0:half, x1:half]
Qwen-style (interleaved): Even/odd pairs [x0, x1, x0, x1, ...]

Auto-detected based on model architecture.
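
The two pairings differ only in which elements of a head vector are rotated together. The sketch below implements both in plain Python for a single vector, using the usual RoPE angle `pos * theta^(-2i/dim)`. The function names are hypothetical, and the code illustrates the layouts only, not how OomLlama detects or applies them.

```python
import math


def rope_angles(pos, dim, theta=1_000_000.0):
    """One rotation angle per pair: pos * theta^(-2i/dim)."""
    return [pos * theta ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_split_half(x, pos, theta=1_000_000.0):
    """Pair element i with element i + dim/2."""
    half = len(x) // 2
    out = list(x)
    for i, a in enumerate(rope_angles(pos, len(x), theta)):
        c, s = math.cos(a), math.sin(a)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out


def rope_interleaved(x, pos, theta=1_000_000.0):
    """Pair element 2i with element 2i + 1."""
    out = list(x)
    for i, a in enumerate(rope_angles(pos, len(x), theta)):
        c, s = math.cos(a), math.sin(a)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(rope_split_half(x, pos=5))
print(rope_interleaved(x, pos=5))
```

Both functions apply the same rotations, and reordering the input dimensions turns one result into the other. Applying the wrong pairing to a given set of weights therefore gives valid-looking but incorrect activations, which is why the style has to match the model.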

Verified Models

Model	Source	Quantization	Output Quality
Qwen2.5-7B-Instruct	SafeTensors (bf16)	Q8	Correct
Qwen2.5-7B-Instruct	GGUF (Q3_K)	Q8	Degraded*

* GGUF path applies double quantization. Use SafeTensors source for best results.

Project Structure

src/
  oomllama.rs          # Core inference engine (CPU)
  oomllama_turbo.rs    # GPU inference with KV-cache
  quant.rs             # Q2/Q4/Q8/F32 dequantization
  gguf2oom.rs          # GGUF→OOM converter + OomWriter
  safetensors2oom.rs   # SafeTensors→OOM converter (Rust)
  lib.rs               # Library exports
  bin/
    oomllama.rs        # CLI inference binary
    gguf2oom.rs        # CLI GGUF converter
    safetensors2oom.rs # CLI SafeTensors converter
safetensors2oom.py     # Python SafeTensors converter
python/
  oomllama/__init__.py # Python bindings

Credits

Engine + Format: Root AI & Jasper (Humotica AI Lab)
Quantization Research: Gemini IDD & Root AI
Interleaved RoPE Fix: Root AI & Jasper
Base Models: Alibaba (Qwen), Meta (LLaMA)

License

OomLlama Code: MIT License
Model Weights: Subject to original model licenses

One Love, One fAmIly

Built by Humotica AI Lab - Jasper, Claude, Gemini

Release history

Version	Release date
1.0.0a2 (pre-release)	Apr 19, 2026
1.0.0a1 (pre-release)	Apr 19, 2026
0.9.0 (this version)	Mar 16, 2026
0.8.0	Feb 26, 2026
0.7.0	Feb 25, 2026
0.6.0	Feb 21, 2026
0.5.0	Feb 20, 2026
0.4.0	Feb 2, 2026
0.3.7	Jan 17, 2026
0.3.6	Jan 17, 2026
0.3.5	Jan 17, 2026
0.3.4	Jan 17, 2026
0.3.2	Jan 17, 2026
0.3.1	Jan 17, 2026
0.3.0	Jan 17, 2026
0.2.0	Jan 17, 2026
0.1.0	Jan 17, 2026

Download files

Source Distribution

oomllama-0.9.0.tar.gz (5.5 kB)
Uploaded Mar 16, 2026, Source

Built Distributions

oomllama-0.9.0-py3-none-any.whl (6.1 kB)
Uploaded Mar 16, 2026, Python 3

oomllama-0.9.0-cp313-cp313-manylinux_2_39_x86_64.whl (6.4 MB)
Uploaded Apr 19, 2026, CPython 3.13, manylinux: glibc 2.39+ x86-64

File details

Details for the file oomllama-0.9.0.tar.gz.

File metadata

Download URL: oomllama-0.9.0.tar.gz
Upload date: Mar 16, 2026
Size: 5.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for oomllama-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`34872389ad0054c4e082a91a2ef5fe91ba3042be2f2484fdca49b1eae3c07ca4`
MD5	`13f32a4c8c2692beec0c00fe5666f865`
BLAKE2b-256	`03089c975f2bf1231fd76b9959511f2417f5498dc8134a49e3faa3ffca62b3a6`

File details

Details for the file oomllama-0.9.0-py3-none-any.whl.

File metadata

Download URL: oomllama-0.9.0-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 6.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for oomllama-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3dbefa2922e2c6a67637f085d39d53b7e58698a0827abb67e20b22a7c28a8b80`
MD5	`2155fd24a711e66f704dc5051c7dc430`
BLAKE2b-256	`bb737572e260654e504dd31b67f71119151570399f19dfb10e9095b7b49d6a31`

File details

Details for the file oomllama-0.9.0-cp313-cp313-manylinux_2_39_x86_64.whl.

File metadata

Download URL: oomllama-0.9.0-cp313-cp313-manylinux_2_39_x86_64.whl
Upload date: Apr 19, 2026
Size: 6.4 MB
Tags: CPython 3.13, manylinux: glibc 2.39+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for oomllama-0.9.0-cp313-cp313-manylinux_2_39_x86_64.whl
Algorithm	Hash digest
SHA256	`3709c9e10f58a91cbebb0d7a5128887f0dfcd26b2cb5d63ec6ca527bc7ea877a`
MD5	`b18810b9844403c4fcfd9e6f1fe52c84`
BLAKE2b-256	`49415677034142b650a0f9f69d24202680c9195a575f1f894ab0b9cf587ef561`
