Efficient LLM inference with the .oom format - about 2x smaller than Q4 GGUF

OomLlama

Quick Start

from oomllama import OomLlama

llm = OomLlama("humotica-7b")
response = llm.generate("What is the meaning of life?")
print(response)

Why OomLlama?

Feature          GGUF (Q4)   OOM (Q2)
---------------  ----------  ------------
70B Model Size   ~40 GB      ~20 GB
32B Model Size   ~20 GB      ~10 GB
RAM Usage        High        Lazy Loading
Format           Open        Open (MIT)

OomLlama uses Q2 quantization with lazy layer loading to run large models on consumer hardware.
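The sizes in the table can be sanity-checked from the block layout given under "The .oom Format" (68 bytes per 256-weight block). The sketch below is a back-of-the-envelope estimate, not OomLlama code: `bits_per_weight` and `estimate_size_gb` are hypothetical helpers, and the estimate ignores the header, metadata, and any tensors that might be kept at higher precision, so real files will likely be somewhat larger.

```python
BLOCK_WEIGHTS = 256  # weights per block, from the format description
BLOCK_BYTES = 68     # bytes per block, from the format description


def bits_per_weight(block_bytes: int = BLOCK_BYTES,
                    block_weights: int = BLOCK_WEIGHTS) -> float:
    """Effective storage cost per weight, including per-block scale/min."""
    return block_bytes * 8 / block_weights


def estimate_size_gb(n_params: float, bpw: float) -> float:
    """Approximate tensor payload in GB (10^9 bytes) for n_params weights."""
    return n_params * bpw / 8 / 1e9


if __name__ == "__main__":
    bpw = bits_per_weight()
    for name, n_params in [("7B", 7e9), ("32B", 32e9), ("70B", 70e9)]:
        size = estimate_size_gb(n_params, bpw)
        print(f"{name}: ~{size:.1f} GB at {bpw} bits/weight")
```

At 2.125 bits per weight, a 70B model comes to roughly 18.6 GB of quantized tensors, which is consistent with the ~20 GB figure in the table.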

Installation

pip install oomllama

Features

  • Q2 Quantization: 2-bit weights with per-block scale/min
  • Lazy Layer Loading: Only the active layer is held in memory
  • Interleaved RoPE: Proper Qwen model support (no gibberish!)
  • HuggingFace Integration: Download models directly
  • GPU Inference: CUDA support via Candle

Available Models

Model          Parameters   Size (.oom)   HuggingFace
-------------  -----------  ------------  -----------
humotica-7b    7B           ~2.5 GB       Link
humotica-32b   32B          ~10 GB        Link
LlamaOhm-70B   70B          ~20 GB        Link

The .oom Format

OOM (OomLlama Model) is a compact model format:

+--------------------------------------+
| Header: OOML (magic) + metadata      |
+--------------------------------------+
| Tensors: Q2 quantized (2 bits/weight)|
| - Scale + Min per 256-weight block   |
| - 68 bytes per block                 |
+--------------------------------------+
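The 68-byte figure follows from the numbers in the diagram: 256 weights at 2 bits each take 64 bytes, leaving 4 bytes for the scale and min. The sketch below packs and unpacks one such block. The field order (scale, then min, then packed weights), the use of little-endian float16 for scale and min, and the bit order within each byte are all assumptions made for illustration; the real .oom layout may differ. The helper names are hypothetical.

```python
import struct

MAGIC = b"OOML"                          # magic bytes named in the header
BLOCK_WEIGHTS = 256
PACKED_BYTES = BLOCK_WEIGHTS * 2 // 8    # 64 bytes of 2-bit values
BLOCK_BYTES = 4 + PACKED_BYTES           # assumed: 2 x float16 + 64 = 68


def has_oom_magic(data: bytes) -> bool:
    """Check for the OOML magic, assuming it sits at offset 0."""
    return data[:4] == MAGIC


def pack_block(scale: float, minimum: float, q: list) -> bytes:
    """Pack 256 two-bit values (0..3) plus scale/min into 68 bytes."""
    if len(q) != BLOCK_WEIGHTS:
        raise ValueError("a block holds exactly 256 weights")
    if any(not 0 <= v <= 3 for v in q):
        raise ValueError("Q2 values must be in 0..3")
    packed = bytearray(PACKED_BYTES)
    for i, v in enumerate(q):
        # Four values per byte, lowest bits first (assumed bit order).
        packed[i // 4] |= v << (2 * (i % 4))
    return struct.pack("<ee", scale, minimum) + bytes(packed)


def unpack_block(block: bytes):
    """Inverse of pack_block: returns (scale, minimum, q)."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("expected a 68-byte block")
    scale, minimum = struct.unpack_from("<ee", block, 0)
    q = [(block[4 + i // 4] >> (2 * (i % 4))) & 0b11
         for i in range(BLOCK_WEIGHTS)]
    return scale, minimum, q
```

A round trip such as `unpack_block(pack_block(0.5, -1.0, [i % 4 for i in range(256)]))` returns the original values, since 0.5 and -1.0 are exactly representable in float16.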

CLI Usage

# Run inference
oomllama generate "Hello, world!"

# Convert GGUF to OOM
gguf2oom model.gguf model.oom --quant q2

# Check model info
oomllama info model.oom

Technical Details

Q2 Quantization

Each weight is stored as 2 bits (0, 1, 2, or 3) with per-block scale and minimum:

weight = q2_value * scale + min
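As an illustration of the formula above, here is a minimal sketch of quantizing and dequantizing one block. Only the dequantization formula comes from this README; how scale and min are chosen is an assumption (a plain min/max affine fit), and `quantize_block` and `dequantize_block` are hypothetical names, not OomLlama's API.

```python
def quantize_block(weights):
    """Map floats to 2-bit codes (0..3) with a per-block scale and min.

    Assumed scheme: min is the smallest weight, and scale spreads the
    four codes evenly between the smallest and largest weight.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 if hi > lo else 1.0  # avoid dividing by zero
    q = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    return scale, lo, q


def dequantize_block(scale, minimum, q):
    """Apply weight = q2_value * scale + min to every code."""
    return [v * scale + minimum for v in q]


if __name__ == "__main__":
    scale, minimum, q = quantize_block([0.0, 0.1, 0.2, 0.9])
    print(q, dequantize_block(scale, minimum, q))
```

With only four levels per block, each reconstructed weight can differ from the original by up to half of `scale`, which is the accuracy cost of storing 2 bits per weight.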

Interleaved RoPE (Qwen Fix)

OomLlama supports both LLaMA-style and Qwen-style RoPE:

  • LLaMA-style (split-half): dimension i is paired with dimension i + half_dim
  • Qwen-style (interleaved): adjacent even/odd dimensions are paired: (x0, x1), (x2, x3), ...

This fix prevents the "Chinese characters / gibberish" issue with Qwen models.
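The difference between the two pairing conventions can be sketched in plain Python. This is an illustration of the general technique, not OomLlama's implementation: the function names and the default base of 10000 are assumptions, and the labels follow the naming used in this README. Which convention a given checkpoint expects depends on how its weights were exported.

```python
import math


def _angles(pos: int, dim: int, base: float = 10000.0):
    """Rotation angle for each of the dim/2 pairs at position pos."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_split_half(x, pos, base=10000.0):
    """Rotate pairs (x[i], x[i + half]): the split-half convention."""
    dim, half = len(x), len(x) // 2
    out = [0.0] * dim
    for i, theta in enumerate(_angles(pos, dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def rope_interleaved(x, pos, base=10000.0):
    """Rotate pairs (x[2i], x[2i + 1]): the interleaved convention."""
    dim = len(x)
    out = [0.0] * dim
    for i, theta in enumerate(_angles(pos, dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Both functions apply the same rotations and differ only in which dimensions are paired. Applying the wrong one to a model's query and key vectors rotates the wrong pairs, which corrupts attention scores and shows up as nonsensical output.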

Lazy Layer Loading

Forward Pass:
  Layer 0: Load -> Compute -> Unload
  Layer 1: Load -> Compute -> Unload
  ...
  Layer N: Load -> Compute -> Unload

This enables running 70B models on a GPU with 24 GB of memory.
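The load/compute/unload loop above can be sketched as follows. This is a toy model of the idea, not OomLlama's code: `lazy_forward` and `ResidencyTracker` are hypothetical names, and each "layer" here is just a Python callable returned by a loader function, standing in for reading and dequantizing one layer's tensors from disk.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


class ResidencyTracker:
    """Counts how many layers are resident at once (for illustration)."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def acquire(self) -> None:
        self.live += 1
        self.peak = max(self.peak, self.live)

    def release(self) -> None:
        self.live -= 1


def lazy_forward(loaders: Sequence[Callable[[], Layer]],
                 x: Vector,
                 tracker: Optional[ResidencyTracker] = None) -> Vector:
    """Run layers in order, keeping only one loaded at a time."""
    for load in loaders:
        layer = load()                 # Load
        if tracker:
            tracker.acquire()
        try:
            x = layer(x)               # Compute
        finally:
            del layer                  # Unload: drop the only reference
            if tracker:
                tracker.release()
    return x


if __name__ == "__main__":
    # Three toy layers; layer k adds k to every element.
    loaders = [lambda k=k: (lambda v: [e + k for e in v]) for k in range(3)]
    tracker = ResidencyTracker()
    print(lazy_forward(loaders, [0.0], tracker), tracker.peak)
```

The trade-off is that peak memory is bounded by the largest single layer plus activations, while every forward pass pays the cost of reloading each layer.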

Credits

  • Model Format: Gemini IDD & Root AI (Humotica AI Lab)
  • Quantization: OomLlama.rs by Humotica
  • Interleaved RoPE Fix: Root AI & Jasper
  • Base Models: Meta (Llama), Alibaba (Qwen)

License

  • OomLlama Code: MIT License
  • Model Weights: Subject to original model licenses

One Love, One fAmIly

Built by Humotica AI Lab - Jasper, Claude, Gemini
