Efficient LLM inference with the .oom format: about 2x smaller than Q4 GGUF

OomLlama

Quick Start

from oomllama import OomLlama

# Load a model by name (see Available Models)
llm = OomLlama("humotica-7b")

# Generate a completion for the prompt and print it
response = llm.generate("What is the meaning of life?")
print(response)

Why OomLlama?

Feature	GGUF (Q4)	OOM (Q2)
70B Model Size	~40 GB	~20 GB
32B Model Size	~20 GB	~10 GB
RAM Usage	High	Lazy Loading
Format	Open	Open (MIT)

OomLlama uses Q2 quantization with lazy layer loading to run large models on consumer hardware.
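The sizes in the table can be roughly reproduced from block sizes alone. The sketch below is a back-of-the-envelope estimate, not part of the oomllama API: `model_size_gb` is a hypothetical helper, the OOM figure uses the 68 bytes per 256-weight block given in the format description, and the GGUF figure assumes the Q4_0 layout (18 bytes per 32 weights), which may differ from the Q4 variant the table refers to. Metadata and any tensors kept at higher precision are ignored.

```python
def model_size_gb(n_params: float, bytes_per_block: int, weights_per_block: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a block-quantized model."""
    bits_per_weight = bytes_per_block * 8 / weights_per_block
    return n_params * bits_per_weight / 8 / 1e9


# OOM Q2: 68 bytes per 256 weights  -> 2.125 bits per weight
# GGUF Q4_0 (assumed): 18 bytes per 32 weights -> 4.5 bits per weight
for name, n_params in [("70B", 70e9), ("32B", 32e9)]:
    q4 = model_size_gb(n_params, 18, 32)
    q2 = model_size_gb(n_params, 68, 256)
    print(f"{name}: Q4_0 ~{q4:.1f} GB, OOM Q2 ~{q2:.1f} GB, ratio {q4 / q2:.2f}x")
```

Under these assumptions the ratio comes out slightly above 2x, which is in line with the table.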

Installation

pip install oomllama

Features

Q2 Quantization: 2-bit weights with per-block scale/min
Lazy Layer Loading: Only the active layer is held in memory
Interleaved RoPE: Proper Qwen model support (no gibberish!)
HuggingFace Integration: Download models directly
GPU Inference: CUDA support via Candle

Available Models

Model	Parameters	Size (.oom)	HuggingFace
humotica-7b	7B	~2.5 GB	Link
humotica-32b	32B	~10 GB	Link
LlamaOhm-70B	70B	~20 GB	Link

The .oom Format

OOM (OomLlama Model) is a compact model format:

+--------------------------------------+
| Header: OOML (magic) + metadata      |
+--------------------------------------+
| Tensors: Q2 quantized (2 bits/weight)|
| - Scale + Min per 256-weight block   |
| - 68 bytes per block                 |
+--------------------------------------+
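The 68-byte figure is consistent with 256 two-bit codes (64 bytes) plus a 4-byte scale/min pair. The following is a minimal sketch of one such block, not the actual on-disk layout: storing scale and min as two little-endian float16 values ahead of the codes, and packing four codes per byte starting from the low bits, are both assumptions made for illustration. `pack_block` and `unpack_block` are hypothetical names.

```python
import struct

BLOCK_WEIGHTS = 256
PACKED_BYTES = BLOCK_WEIGHTS * 2 // 8  # 256 weights * 2 bits = 64 bytes


def pack_block(codes, scale, minimum):
    """Pack 256 two-bit codes plus scale/min into one 68-byte block (assumed layout)."""
    if len(codes) != BLOCK_WEIGHTS or any(not 0 <= c <= 3 for c in codes):
        raise ValueError("expected 256 codes in the range 0..3")
    header = struct.pack("<ee", scale, minimum)  # two float16 values = 4 bytes
    body = bytearray(PACKED_BYTES)
    for i, code in enumerate(codes):
        body[i // 4] |= code << (2 * (i % 4))  # four codes per byte, low bits first
    return header + bytes(body)


def unpack_block(block):
    """Inverse of pack_block: returns (codes, scale, minimum)."""
    scale, minimum = struct.unpack_from("<ee", block, 0)
    codes = [(block[4 + i // 4] >> (2 * (i % 4))) & 0b11 for i in range(BLOCK_WEIGHTS)]
    return codes, scale, minimum


block = pack_block([i % 4 for i in range(BLOCK_WEIGHTS)], 0.5, -1.0)
print(len(block))  # 68
```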

CLI Usage

# Run inference
oomllama generate "Hello, world!"

# Convert GGUF to OOM
gguf2oom model.gguf model.oom --quant q2

# Check model info
oomllama info model.oom

Technical Details

Q2 Quantization

Each weight is stored as 2 bits (0, 1, 2, or 3) with per-block scale and minimum:

weight = q2_value * scale + min
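As an illustration of the formula, here is a simple min/max quantizer and its inverse for a single block. This is a sketch under assumptions: the way scale and min are chosen here (spreading the four levels evenly between the block's minimum and maximum) is one straightforward option and not necessarily what gguf2oom does. The function names are hypothetical.

```python
def quantize_block(weights):
    """Map each weight to a 2-bit code (0..3) using the block's min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 if hi > lo else 1.0  # 3 steps between the 4 levels
    codes = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, minimum):
    """Apply weight = q2_value * scale + min to every code."""
    return [q * scale + minimum for q in codes]


weights = [-1.0, -0.5, 0.0, 0.5]
codes, scale, minimum = quantize_block(weights)
print(codes)                                     # [0, 1, 2, 3]
print(dequantize_block(codes, scale, minimum))   # [-1.0, -0.5, 0.0, 0.5]
```

With only four levels per block, the reconstruction error for this scheme is at most half a step (scale / 2) per weight; the example above is exact only because the inputs already sit on the four levels.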

Interleaved RoPE (Qwen Fix)

OomLlama supports both LLaMA-style and Qwen-style RoPE:

LLaMA-style: Splits the vector at half_dim and rotates element i together with element i + half_dim
Qwen-style (interleaved): Rotates adjacent even/odd pairs (x0, x1), (x2, x3), ...

This fix prevents the "Chinese characters / gibberish" issue with Qwen models.
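The two pairing conventions can be sketched in plain Python as follows. This is an illustrative reimplementation, not OomLlama's code: the function names, the base of 10000, and the operation on a single head vector are assumptions. Both variants rotate pairs of elements by the same position-dependent angles and differ only in which elements form a pair, so applying one convention to weights laid out for the other scrambles the positional signal.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """One rotation angle per pair of elements."""
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]


def rope_split_half(x, pos):
    """Split-half pairing: element i is rotated with element i + dim/2."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_interleaved(x, pos):
    """Interleaved pairing: element 2i is rotated with element 2i + 1."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(rope_split_half(x, 3))
print(rope_interleaved(x, 3))  # differs from the line above for pos > 0
```

The two functions agree up to a fixed permutation of the input and output (even indices first, then odd), which is why a checkpoint can be converted between conventions by reordering rows of the query/key projections.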

Lazy Layer Loading

Forward Pass:
  Layer 0: Load -> Compute -> Unload
  Layer 1: Load -> Compute -> Unload
  ...
  Layer N: Load -> Compute -> Unload

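Because only one layer's weights are resident at a time, peak memory depends on the largest layer rather than on the whole model. This enables running 70B models on 24GB GPU RAM.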
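The load/compute/unload loop can be sketched as below. This is a toy model of the idea, not OomLlama's implementation: `LazyLayerRunner` and `make_loader` are hypothetical names, each "layer" is just an elementwise multiply, and the loaders return in-memory lists where a real runner would read tensors from the .oom file.

```python
class LazyLayerRunner:
    """Runs layers in order while holding at most one layer's weights."""

    def __init__(self, loaders):
        self.loaders = loaders      # one zero-argument callable per layer
        self.resident = 0           # number of weights currently loaded
        self.peak_resident = 0      # high-water mark across the forward pass

    def forward(self, x):
        for load in self.loaders:
            weights = load()                              # Load
            self.resident += len(weights)
            self.peak_resident = max(self.peak_resident, self.resident)
            x = [xi * w for xi, w in zip(x, weights)]     # Compute (toy layer)
            self.resident -= len(weights)                 # Unload
            del weights
        return x


def make_loader(value, size):
    """Stand-in for reading one layer's weights from disk."""
    return lambda: [value] * size


runner = LazyLayerRunner([make_loader(2.0, 4), make_loader(3.0, 4), make_loader(0.5, 4)])
print(runner.forward([1.0] * 4))  # [3.0, 3.0, 3.0, 3.0]
print(runner.peak_resident)       # 4, one layer's worth, not 12
```

The trade-off is that every forward pass re-reads each layer, so throughput depends on how fast weights can be streamed from storage.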

Credits

Model Format: Gemini IDD & Root AI (Humotica AI Lab)
Quantization: OomLlama.rs by Humotica
Interleaved RoPE Fix: Root AI & Jasper
Base Models: Meta (Llama), Alibaba (Qwen)

License

OomLlama Code: MIT License
Model Weights: Subject to original model licenses

One Love, One fAmIly

Built by Humotica AI Lab - Jasper, Claude, Gemini
